Jenkins Monitoring Integration
Integration¶
- Instructions: https://apps.sematext.com/ui/howto/Jenkins/overview
Jenkins Alerts¶
As soon as you create a Jenkins App, you will receive a set of default alert rules. These pre-configured rules will notify you of important events that may require your attention, as shown below.
Health check score anomaly¶
This alert rule continuously monitors the health check score (the ratio of successful health checks to total health checks) of Jenkins instances using anomaly detection. When anomalies are detected, it triggers warnings (WARN priority).
Let's say you have a Jenkins server that typically has a health check score of around 0.9, indicating a healthy system. However, due to a sudden increase in system load or a misconfiguration, the health check score drops significantly to 0.5 within a short period. Upon detecting the anomaly (in this case, a sudden drop in the health check score), the alert rule triggers a warning.
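The scenario above can be sketched in a few lines of Python. This is only an illustration: the anomaly detection algorithm behind the alert rule is not described here, so a simple standard-deviation test stands in for it, and the check names, baseline values, and `threshold` parameter are assumptions made for the example.

```python
from statistics import mean, stdev


def health_check_score(results):
    """Return the ratio of passing health checks to all health checks (0.0 to 1.0)."""
    if not results:
        raise ValueError("no health checks reported")
    return sum(1 for healthy in results.values() if healthy) / len(results)


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` when it lies more than `threshold` standard deviations
    from the mean of `history`. A stand-in for real anomaly detection."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Illustrative check results: two of four checks pass, so the score is 0.5
checks = {"disk-space": True, "plugins": True,
          "temporary-space": False, "thread-deadlock": False}
score = health_check_score(checks)

# Assumed recent scores hovering around 0.9
baseline = [0.9, 0.92, 0.88, 0.9, 0.91, 0.89]
print(score, is_anomalous(baseline, score))
```

With a baseline this stable, a drop to 0.5 is far outside the usual range and is flagged, while a score of 0.9 is not.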
Actions to take¶
- Examine system metrics such as CPU usage, memory, disk I/O, and network traffic to find any spikes that may have contributed to the drop in the health check score
- Review Jenkins logs (system log, build logs, and any plugin-specific logs) to get insights into any errors or warnings in the Jenkins environment
- Review recent changes to Jenkins configuration, including plugin updates, job configurations, and system settings
Response with Server Error Code¶
This alert rule continuously monitors the count of HTTP 500 server errors in the Jenkins master Web UI. When the count exceeds zero within the last 5 minutes, it triggers a warning (WARN priority). The minimum delay between consecutive notifications triggered by this alert rule is set to 10 minutes.
Suppose the Jenkins master Web UI typically operates smoothly, but due to a misconfiguration or software bug, it starts responding with HTTP 500 errors. The alert rule checks for occurrences of HTTP 500 errors within the last 5 minutes and is triggered as soon as a single HTTP 500 error occurs.
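The rule's logic (a count above zero in a 5-minute window, with at least 10 minutes between notifications) can be sketched as follows. This is a minimal model for illustration, not the actual alerting implementation; the class name, the `observe` method, and the use of timestamps in seconds are assumptions.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60      # look-back window used by the rule
MIN_NOTIFY_DELAY = 10 * 60   # minimum gap between consecutive notifications


class ServerErrorRule:
    """Hypothetical model of a 'count > 0 in the last 5 minutes' alert rule."""

    def __init__(self):
        self.samples = deque()     # (timestamp, error_count) pairs
        self.last_notified = None  # timestamp of the last notification sent

    def observe(self, timestamp, error_count):
        """Record a sample; return True if a notification should be sent."""
        self.samples.append((timestamp, error_count))
        # Drop samples that have fallen out of the look-back window
        while self.samples and self.samples[0][0] <= timestamp - WINDOW_SECONDS:
            self.samples.popleft()
        if sum(count for _, count in self.samples) <= 0:
            return False  # no errors in the window, nothing to report
        if (self.last_notified is not None
                and timestamp - self.last_notified < MIN_NOTIFY_DELAY):
            return False  # still inside the notification delay
        self.last_notified = timestamp
        return True


rule = ServerErrorRule()
for ts, errors in [(0, 0), (60, 1), (120, 2), (700, 1)]:
    print(ts, rule.observe(ts, errors))
```

In this run the first error at 60 seconds triggers a notification, the errors at 120 seconds are suppressed by the 10-minute delay, and the error at 700 seconds triggers a new notification because the delay has elapsed.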
Actions to take¶
- Review Jenkins logs and server configurations to find the cause of the HTTP 500 errors. This may involve checking for misconfigurations, software bugs, or issues with dependencies
- If recent changes were made to Jenkins or its dependencies, consider rolling back those changes to restore stability
Server unavailable response code¶
This alert rule continuously monitors the count of HTTP 503 server unavailable errors in the Jenkins master Web UI. When the count exceeds zero within the last 5 minutes, it triggers a warning (WARN priority). The minimum delay between consecutive notifications triggered by this alert rule is set to 10 minutes.
Suppose the Jenkins master Web UI experiences a sudden surge in traffic or encounters issues with backend services, leading to an increase in HTTP 503 errors. When this happens, the alert rule checks for occurrences of HTTP 503 errors within the last 5 minutes and is triggered as soon as a single HTTP 503 error occurs.
Actions to take¶
- Investigate the status and health of backend services that Jenkins depends on, such as databases, application servers, or external APIs
- Check Jenkins configuration settings, including connection settings to external integrations, resource allocation, and plugin configurations
- Monitor resource usage on the Jenkins server, including CPU, memory, disk I/O, and network bandwidth
- If Jenkins is experiencing high traffic, consider scaling up the infrastructure by adding more Jenkins nodes
You can create additional alerts on any metric.
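As a rough illustration of what conditions on other metrics might look like, the sketch below evaluates a few threshold rules against a snapshot of metric values. The rule format, thresholds, and `evaluate` helper are hypothetical and are not the alert rule syntax used in the product; only the metric keys come from the Metrics section.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

# Hypothetical rules: (metric key, comparison, threshold)
RULES = [
    ("jenkins.queue.stuck", ">", 0),
    ("jenkins.plugins.failed", ">", 0),
    ("jenkins.nodes.offline", ">=", 2),
]


def evaluate(rules, snapshot):
    """Return the rules whose condition holds for the given metric snapshot."""
    fired = []
    for metric, op, threshold in rules:
        value = snapshot.get(metric)
        # Metrics missing from the snapshot are skipped rather than treated as zero
        if value is not None and OPS[op](value, threshold):
            fired.append((metric, op, threshold))
    return fired


# Assumed sample values, for illustration only
snapshot = {
    "jenkins.queue.stuck": 3,
    "jenkins.plugins.failed": 0,
    "jenkins.nodes.offline": 1,
}
print(evaluate(RULES, snapshot))
```

Here only the stuck-queue rule fires, since no plugins failed and fewer than two nodes are offline.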
Metrics¶
Metric Key (Type) (Unit) | Description |
---|---|
jenkins.health.checks (long counter) | The count of health checks associated with the HealthCheckRegistry defined within the Metrics Plugin |
jenkins.health.checks.time (long counter) (ms) | The duration of all health check runs |
jenkins.health.check.score (long gauge) | The ratio of health checks reporting success to the total number of health checks, a value between 0 and 1 inclusive. Larger values indicate better health as measured by the health checks |
jenkins.http.requests.active (long counter) | The count of currently active requests against the Jenkins master Web UI |
jenkins.http.requests (long counter) | The count of Jenkins master Web UI requests |
jenkins.http.requests.time (long counter) (ms) | The time the Jenkins master spends processing Web UI requests and generating the corresponding responses |
jenkins.http.response.code.bad_request (long counter) | The count of Jenkins master Web UI responses with an HTTP/400 status code |
jenkins.http.response.code.created (long counter) | The count of Jenkins master Web UI responses with an HTTP/201 status code |
jenkins.http.response.code.forbidden (long counter) | The count of Jenkins master Web UI responses with an HTTP/403 status code |
jenkins.http.response.code.no_content (long counter) | The count of Jenkins master Web UI responses with an HTTP/204 status code |
jenkins.http.response.code.not_found (long counter) | The count of Jenkins master Web UI responses with an HTTP/404 status code |
jenkins.http.response.code.not_modified (long counter) | The count of Jenkins master Web UI responses with an HTTP/304 status code |
jenkins.http.response.code.ok (long counter) | The count of Jenkins master Web UI responses with an HTTP/200 status code |
jenkins.http.response.code.server_error (long counter) | The count of Jenkins master Web UI responses with an HTTP/500 status code |
jenkins.http.response.code.server_unavailable (long counter) | The count of Jenkins master Web UI responses with an HTTP/503 status code |
jenkins.http.response.code.other (long counter) | The count of Jenkins master Web UI responses with a non-informational status code other than HTTP/200, HTTP/201, HTTP/204, HTTP/304, HTTP/400, HTTP/403, HTTP/404, HTTP/500, or HTTP/503 |
jenkins.nodes.builds (long counter) | The count of build nodes available to Jenkins |
jenkins.nodes.builds.time (long counter) (ms) | The time nodes spend building |
jenkins.nodes.offline (long gauge) | The count of build nodes available to Jenkins but currently offline |
jenkins.nodes.online (long gauge) | The count of build nodes available to Jenkins and currently online |
jenkins.queue.blocked (long gauge) | The count of jobs that are in the Jenkins build queue and currently in the blocked state |
jenkins.queue.buildable (long gauge) | The count of jobs that are in the Jenkins build queue and currently in the buildable state |
jenkins.queue.pending (long gauge) | The count of jobs that are in the Jenkins build queue and currently in the pending state |
jenkins.queue.size (long gauge) | The count of jobs that are in the Jenkins build queue |
jenkins.queue.stuck (long gauge) | The count of jobs that are in the Jenkins build queue and currently in the stuck state |
jenkins.plugins.active (long gauge) | The count of plugins in the Jenkins instance that started successfully |
jenkins.plugins.with_update (long gauge) | The count of plugins in the Jenkins instance that have a newer version reported as available in the current Jenkins update center metadata held by Jenkins. This value does not indicate an issue with Jenkins, but high values can serve as a trigger to review the plugins with updates and check whether those updates contain fixes for issues that could be affecting your Jenkins instance |
jenkins.plugins.inactive (long gauge) | The count of plugins in the Jenkins instance that are not currently enabled |
jenkins.plugins.failed (long gauge) | The count of plugins in the Jenkins instance that failed to start. A value other than 0 typically indicates a potential issue within the Jenkins installation that can be resolved either by explicitly disabling the plugin(s) or by resolving the plugin dependency issues |
jenkins.executors.free (long gauge) | The count of executors available to Jenkins that are not currently in use |
jenkins.executors.in_use (long gauge) | The count of executors available to Jenkins that are currently in use |
jenkins.runs.success (long counter) | The count of job runs that completed successfully |
jenkins.runs.unstable (long counter) | The count of job runs that were unstable |
jenkins.runs.failure (long counter) | The count of job runs that failed |
jenkins.runs.not_built (long counter) | The count of job runs that were not built |
jenkins.runs.aborted (long counter) | The count of aborted job runs |
jenkins.jobs (long counter) | The total count of jobs |
jenkins.jobs.scheduled (long counter) | The count of jobs scheduled. If a job is already in the queue and an identical request for scheduling the job is received, Jenkins coalesces the two requests. This metric gives a reasonably pure measure of the load requirements of the Jenkins master, as it is unaffected by the count of executors available to the system |
jenkins.jobs.queuing (long counter) | The count of queued jobs |
jenkins.jobs.blocked (long counter) | The count of jobs in the build queue that enter the blocked state |
jenkins.jobs.buildable (long counter) | The count of jobs in the build queue that enter the buildable state |
jenkins.jobs.execution.time (long counter) (ms) | The amount of time jobs spend in the execution state |
jenkins.jobs.queuing.time (long counter) (ms) | The total time jobs spend in the build queue |
jenkins.jobs.blocked.time (long counter) (ms) | The amount of time jobs in the build queue spend in the blocked state |
jenkins.jobs.buildable.time (long counter) (ms) | The amount of time jobs in the build queue spend in the buildable state |
jenkins.jobs.waiting.time (long counter) (ms) | The total amount of time that jobs spend in their quiet period |
jenkins.jobs.total.time (long counter) (ms) | The time jobs spend from entering the build queue to completing the build |
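
Several of these metrics are easier to interpret when combined. The sketch below derives a few ratios from a single snapshot of values. The `derive` helper, the derived field names, and the sample numbers are assumptions for illustration, and the average queuing time is only meaningful if both counters cover the same time interval.

```python
def ratio(numerator, denominator):
    """Divide safely, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derive(m):
    """Compute illustrative derived values from a dict of metric key to value."""
    in_use = m["jenkins.executors.in_use"]
    free = m["jenkins.executors.free"]
    total_runs = sum(
        m[key]
        for key in (
            "jenkins.runs.success",
            "jenkins.runs.unstable",
            "jenkins.runs.failure",
            "jenkins.runs.not_built",
            "jenkins.runs.aborted",
        )
    )
    return {
        # Share of executors that are busy
        "executor_utilization": ratio(in_use, in_use + free),
        # Average time a job spends in the build queue, in ms
        "avg_queuing_ms": ratio(m["jenkins.jobs.queuing.time"], m["jenkins.jobs.queuing"]),
        # Share of job runs that failed
        "failure_ratio": ratio(m["jenkins.runs.failure"], total_runs),
    }


# Assumed sample values, for illustration only
sample = {
    "jenkins.executors.in_use": 6,
    "jenkins.executors.free": 2,
    "jenkins.jobs.queuing.time": 12000,
    "jenkins.jobs.queuing": 4,
    "jenkins.runs.success": 17,
    "jenkins.runs.unstable": 1,
    "jenkins.runs.failure": 2,
    "jenkins.runs.not_built": 0,
    "jenkins.runs.aborted": 0,
}
print(derive(sample))
```

For the sample values, executor utilization is 0.75, the average queuing time is 3000 ms, and the failure ratio is 0.1.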