Kubernetes Focused Prometheus Queries¶
Queries¶
A collection of queries for inspecting data stored in a Prometheus server. Most of these queries should work in vanilla Prometheus; however, a small number may contain a few “grafana-isms”.
The following variables are used as examples
Name | Description |
---|---|
$interval | A time interval |
Nodes¶
Queries relating to the state of the “physical” hardware that is hosting the cluster
The following metrics are used in the examples
Name | Description |
---|---|
node_cpu_seconds_total | CPU time broken down by mode |
node_load1, node_load5, node_load15 | 1m, 5m, 15m CPU load averages. |
CPU¶
CPU Utilisation, broken down by node
sum(rate(node_cpu_seconds_total{mode!="idle", mode!="iowait"}[$interval])) by (instance)
CPU load average (1m), broken down by node and normalised by the number of cores, expressed as a percentage
sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance) * 100
Kubernetes¶
A collection of queries specific to kubernetes instances
Pods¶
Useful for alerts, this query returns the pods that are not ready, i.e. pods in which no container reports ready
sum(kube_pod_container_status_ready) by (pod) < 1
This returns the number of times each container in a pod has restarted
kube_pod_container_status_restarts_total
Useful for alerts, this query will return the pods that are waiting and the reason for the delay
sum(kube_pod_container_status_waiting_reason) by (pod, reason) > 0
Services¶
A collection of queries for metrics useful for monitoring services
The following metrics will be used as examples
Name | Description |
---|---|
service_requests | A counter that counts the number of requests received |
service_requests_latency | A histogram that records the response times for a request |
service_info | A counter that encodes information about the service in the metric labels |
The number of requests received (usually per second)
sum(rate(service_requests[$interval]))
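To make the "per second" part concrete, here is a minimal Python sketch of the idea behind rate() on a counter: add up the increases between consecutive samples, treat a drop in value as a counter reset, and divide by the elapsed time. The function name and the sample data are made up for illustration, and this is a simplification: the real rate() also extrapolates to the edges of the interval, which is omitted here.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in
    time order. Hypothetical helper, not part of any Prometheus client.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed to compute a rate
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter went backwards, so assume it restarted from zero.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up scrape data: one sample every 15s, with a reset before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))
```

Wrapping the result in sum(), as the query does, then adds up these per-series rates across all matching series.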
The number of internal server errors (5xx responses), again usually per second
sum(rate(service_requests{code=~"5.*"}[$interval]))
The number of user errors (4xx responses), per second
sum(rate(service_requests{code=~"4.*"}[$interval]))
Each of the above can be broken down by service (assuming the existence of a service label in the scraped data) as follows
sum(rate(service_requests[$interval])) by (service)
Request latencies are recorded as a histogram and so we can only collect aggregate values. As far as I can tell the usual way to do this is to generate the following datapoints
95th/99th percentile
50th percentile / median
average response time
The Xth percentile tells us the upper bound on the amount of time in which X% of requests are processed. E.g. if the 95th percentile is 500ms, then 95% of requests are handled in 500ms or less.
Using these metrics we can get a feel for the distribution of the request times:
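If the median coincides with the mean then we can infer that the response times are roughly symmetrically distributed (a normal distribution is one example of this, but not the only one)
If the mean < median then the distribution has a tail towards zero, i.e. a minority of unusually fast requests pulls the mean down, while most requests take longer than the mean
If the mean > median then the distribution has a tail towards high values, i.e. a minority of slow requests pulls the mean up, while most requests are processed quicker than the mean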
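The mean/median comparison above can be tried out on raw numbers. The following is a small Python sketch using made-up latency samples; the helper name and the 5% tolerance for "coincides" are arbitrary choices for illustration, not anything defined by Prometheus.

```python
from statistics import mean, median


def skew_direction(values, tolerance=0.05):
    """Compare mean and median and describe where the tail lies.

    `tolerance` is the relative difference below which the two are
    treated as coinciding (an arbitrary, illustrative threshold).
    """
    m, med = mean(values), median(values)
    if abs(m - med) <= tolerance * max(abs(m), abs(med)):
        return "roughly symmetric"
    return "tail towards high values" if m > med else "tail towards zero"


# Made-up response times in milliseconds: mostly fast, two slow outliers.
latencies_ms = [80, 90, 100, 110, 120, 130, 140, 150, 900, 1200]
print(mean(latencies_ms), median(latencies_ms))
print(skew_direction(latencies_ms))
```

In this sample the two slow requests drag the mean well above the median, even though eight of the ten requests finish faster than the mean.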
To generate the Xth percentile
histogram_quantile(0.X, sum(rate(service_requests_latency_bucket[$interval])) by (le))
Note
This will only work if the bucket label is called le
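For intuition about what histogram_quantile is doing with the le buckets, here is a simplified Python sketch of the estimate: find the bucket that contains the requested rank and interpolate linearly inside it. The bucket boundaries and counts are made up, the function is a hypothetical stand-in rather than Prometheus code, and edge cases handled by the real function (such as invalid quantiles) are left out. It assumes the input contains at least one finite bucket plus the +Inf bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, the
    same shape as the `le` label and value of the *_bucket series.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: the best available answer
                # is the highest finite upper bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up buckets: upper bounds in seconds with cumulative request counts.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, buckets))
```

Because the result is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies of interest.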
To get the average
sum(rate(service_requests_latency_sum[$interval])) / sum(rate(service_requests_latency_count[$interval]))
To get a count of the number of services
count(sum(service_info) by (service))
Assuming the existence of a service label