Kubernetes-Focused Prometheus Queries

Queries

A collection of queries for inspecting data stored in a Prometheus server. Most of these queries should work in vanilla Prometheus; however, a small number may contain a few “Grafana-isms”.

The following variables are used in the examples

Name	Description
`$interval`	A time interval, e.g. `1m`, `5m`, `1h`, etc.

Nodes

Queries relating to the state of the “physical” hardware that is hosting the cluster

The following metrics are used in the examples

Name	Description
`node_cpu_seconds_total`	CPU time in seconds, broken down by the `mode` label, e.g. `idle`, `system`, `user`
`node_load1`, `node_load5`, `node_load15`	1m, 5m and 15m load averages.

CPU

CPU utilisation, broken down by node. The result is the number of cores' worth of CPU time in use, not a percentage

sum(rate(node_cpu_seconds_total{mode!="idle", mode!="iowait"}[$interval])) by (instance)
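To get a feel for what this query adds up, here is a minimal Python sketch of the per-second rate of a counter. It is a simplification: the real `rate()` also handles counter resets and extrapolates to the edges of the window. The function name `simple_rate` and all sample values are made up for illustration.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples of node_cpu_seconds_total for one node over a 60s window
user = [(0, 100.0), (60, 130.0)]    # mode="user": 30 CPU-seconds used
system = [(0, 50.0), (60, 56.0)]    # mode="system": 6 CPU-seconds used

# Summing the non-idle rates gives the number of cores kept busy (about 0.6 here)
busy_cores = simple_rate(user) + simple_rate(system)
```

A value of 0.6 means the node was using roughly 60% of one core during the window, regardless of how many cores it has.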

Load average, broken down by node and normalised by the number of cores, as a percentage

sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance) * 100
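The `count(...)` part of this query works because there is one `node_cpu_seconds_total{mode="system"}` series per core. A small Python sketch of the same arithmetic, where the instance names, core counts and load values are all hypothetical:

```python
from collections import Counter

# Hypothetical label sets of node_cpu_seconds_total{mode="system"}: one series per core
series = [
    {"instance": "node-a", "cpu": "0"},
    {"instance": "node-a", "cpu": "1"},
    {"instance": "node-a", "cpu": "2"},
    {"instance": "node-a", "cpu": "3"},
    {"instance": "node-b", "cpu": "0"},
    {"instance": "node-b", "cpu": "1"},
]

# Hypothetical node_load1 values per instance
load1 = {"node-a": 3.0, "node-b": 1.0}

# Counting series per instance yields the number of cores
cores = Counter(s["instance"] for s in series)

# Load divided by core count, expressed as a percentage
normalised = {inst: load1[inst] / cores[inst] * 100 for inst in load1}
```

With these numbers `node-a` (4 cores) sits at 75% and `node-b` (2 cores) at 50%.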

Kubernetes

A collection of queries specific to Kubernetes clusters

Pods

Useful for alerts, this query returns the pods that have no ready containers

sum(kube_pod_container_status_ready) by (pod) < 1
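Because the query sums the per-container readiness values for each pod before comparing against 1, a pod with at least one ready container is not returned. A rough Python sketch of that behaviour, using made-up pod and container names:

```python
from collections import defaultdict

# Hypothetical samples of kube_pod_container_status_ready: (pod, container, value)
samples = [
    ("api-1", "app", 1),
    ("api-1", "sidecar", 0),   # partially ready pod
    ("db-0", "postgres", 0),   # no container ready
    ("web-2", "nginx", 1),
]

# Equivalent of sum(...) by (pod)
ready_per_pod = defaultdict(int)
for pod, _container, value in samples:
    ready_per_pod[pod] += value

# Equivalent of the "< 1" filter: keep only pods with zero ready containers
not_ready = sorted(pod for pod, total in ready_per_pod.items() if total < 1)
```

Here only `db-0` is reported; `api-1` slips through even though its sidecar is not ready, which may or may not be what you want from an alert.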

This returns the number of times each container in a pod has restarted

kube_pod_container_status_restarts_total

Useful for alerts, this query will return the pods that are waiting and the reason for the delay

sum(kube_pod_container_status_waiting_reason) by (pod, reason) > 0

Services

A collection of queries for metrics useful for monitoring services

The following metrics will be used as examples

Name	Description
`service_requests`	A counter that counts the number of requests received
`service_requests_latency`	A histogram that records the response times for a request
`service_info`	A counter that encodes information about the service in the metric labels

The rate of requests received, per second

sum(rate(service_requests[$interval]))

The rate of server errors (5xx responses), per second

sum(rate(service_requests{code=~"5.*"}[$interval]))

The rate of user errors (4xx responses), per second

sum(rate(service_requests{code=~"4.*"}[$interval]))
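Regex matchers in Prometheus are fully anchored, so `5.*` only matches values that start with `5` rather than values that merely contain one. A rough Python illustration, assuming `re.fullmatch` behaves like Prometheus's regex engine for these simple patterns; the helper `matches` and the list of status codes are hypothetical:

```python
import re


def matches(pattern, value):
    """Approximate a Prometheus =~ matcher, which is anchored at both ends."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical values of the code label
codes = ["200", "404", "500", "503", "150"]

server_errors = [c for c in codes if matches("5.*", c)]
client_errors = [c for c in codes if matches("4.*", c)]
```

Note that `150` is not picked up as a server error even though it contains a `5`.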

Each of these queries can be broken down by service (assuming the existence of a `service` label in the scraped data) as follows

sum(rate(service_requests[$interval])) by (service)

Request latencies are recorded as a histogram, so we can only derive aggregate values. As far as I can tell, the usual approach is to generate the following data points

95th/99th percentile
50th percentile / median
average response time

The Xth percentile tells us the upper bound on the amount of time in which X% of requests are processed. E.g. if the 95th percentile is 500ms, then 95% of requests are handled in 500ms or less.

Using these metrics we can get a feel for the distribution of the request times:

If the median roughly coincides with the mean then the response times are likely distributed symmetrically (though not necessarily normally)
If the mean < median then the distribution is skewed towards zero, i.e. a tail of unusually fast requests pulls the mean down, and at least half of the requests take longer than the mean
If the mean > median then the distribution is skewed high, i.e. a tail of unusually slow requests pulls the mean up, and at least half of the requests are processed quicker than the mean
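As a quick sanity check of the mean vs. median reasoning, here is a small Python sketch using a made-up set of response times with a couple of slow outliers:

```python
from statistics import mean, median

# Hypothetical response times in milliseconds: mostly fast, two slow outliers
latencies_ms = [80, 90, 100, 100, 110, 120, 130, 900, 1500]

avg = mean(latencies_ms)    # pulled upwards by the two outliers
mid = median(latencies_ms)  # the middle value, unaffected by how slow the outliers are

# Count how many requests were actually slower than the mean
slower_than_mean = sum(1 for x in latencies_ms if x > avg)
```

Here the mean (about 348ms) is well above the median (110ms), and only 2 of the 9 requests are slower than the mean.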

To generate the Xth percentile (replace `0.X` with the quantile, e.g. `0.95` for the 95th percentile)

histogram_quantile(0.X, sum(rate(service_requests_latency_bucket[$interval])) by (le))

Note

This will only work if the bucket label is called `le`
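For intuition, the following Python sketch mimics how a quantile can be estimated from cumulative buckets using linear interpolation, which is the approach `histogram_quantile` is documented to take. It is a simplified sketch, not the Prometheus implementation: it assumes `0 < q <= 1`, positive bucket bounds sorted in ascending order, a final `+Inf` bucket and at least one observation. The bucket values are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: fall back to the highest finite bound
                return prev_bound
            # Assume observations are spread evenly within the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical cumulative counts per le bound (in seconds)
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p50 = estimate_quantile(0.50, buckets)  # lands on the upper edge of the first bucket
p95 = estimate_quantile(0.95, buckets)  # interpolated inside the 0.5..1.0 bucket
p99 = estimate_quantile(0.99, buckets)  # falls in +Inf, so capped at 1.0
```

The takeaway is that the result is only as precise as the bucket boundaries: anything beyond the highest finite bucket is reported as that bucket's bound.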

To get the average

sum(rate(service_requests_latency_sum[$interval])) / sum(rate(service_requests_latency_count[$interval]))
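The arithmetic behind this query can be sketched in Python. The window length cancels out in the division, so the result is simply the time spent divided by the number of requests within the window. The helper `per_second` and the counter values are hypothetical:

```python
def per_second(start, end, window_seconds):
    """Per-second increase of a counter over the window (ignores counter resets)."""
    return (end - start) / window_seconds


window = 300  # a 5m interval, in seconds

# Hypothetical values of the _sum and _count series at the start and end of the window
sum_rate = per_second(1000.0, 1012.5, window)  # 12.5s of total latency added
count_rate = per_second(4000, 4050, window)    # 50 requests added

# Average latency in seconds of the requests seen during the window (about 0.25)
average_latency = sum_rate / count_rate
```

Because only the increase inside the window is used, this is the average over recent requests rather than over the whole lifetime of the service.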

To get a count of the number of services

count(sum(service_info) by (service))

Assuming the existence of a `service` label