Kubernetes Focused Prometheus Queries
=====================================

Queries
-------

A collection of queries for inspecting data stored in a Prometheus server.
Most of these queries should work in vanilla Prometheus; however, a small
number may contain a few "Grafana-isms".

The following variables are used as examples

============= ===========
Name          Description
============= ===========
``$interval`` A time interval e.g. ``1m``, ``5m``, ``1h`` etc.
============= ===========

Nodes
^^^^^

Queries relating to the state of the "physical" hardware that is hosting the
cluster.

The following metrics are used in the examples

========================== ===========
Name                       Description
========================== ===========
``node_cpu_seconds_total`` CPU time broken down by ``mode`` e.g. ``idle``, ``system``, ``user``
``node_load1/5/15``        1m, 5m, 15m CPU load averages.
========================== ===========

CPU
"""

CPU utilisation, broken down by node

.. code-block:: none

    sum(rate(node_cpu_seconds_total{mode!="idle", mode!="iowait"}[$interval])) by (instance)

CPU load average, broken down by node and normalised by the number of cores

.. code-block:: none

    sum(node_load1) by (instance) / count(node_cpu_seconds_total{mode="system"}) by (instance) * 100

Kubernetes
^^^^^^^^^^

A collection of queries specific to Kubernetes instances.

Pods
""""

Useful for alerts, this query returns the pods that have no ready containers

.. code-block:: none

    sum(kube_pod_container_status_ready) by (pod) < 1

This returns the number of times each container in a pod has restarted

.. code-block:: none

    kube_pod_container_status_restarts_total

Useful for alerts, this query returns the pods that are waiting and the
reason for the delay

.. code-block:: none

    sum(kube_pod_container_status_waiting_reason) by (pod, reason) > 0

Services
""""""""

A collection of queries for metrics useful for monitoring services.

The following metrics will be used as examples

============================ ===========
Name                         Description
============================ ===========
``service_requests``         A counter that counts the number of requests received
``service_requests_latency`` A histogram that records the response times for a request
``service_info``             A counter that encodes information about the service in the metric labels
============================ ===========

The number of requests received (usually per second)

.. code-block:: none

    sum(rate(service_requests[$interval]))

The number of internal server errors (5xx responses), again usually per second

.. code-block:: none

    sum(rate(service_requests{code=~"5.*"}[$interval]))

The number of user errors (4xx responses), per second

.. code-block:: none

    sum(rate(service_requests{code=~"4.*"}[$interval]))

Each of the above can be broken down by service (assuming the existence of a
``service`` label in the scraped data) as follows

.. code-block:: none

    sum(rate(service_requests[$interval])) by (service)

Request latencies are recorded as a histogram and so we can only collect
aggregate values. As far as I can tell, the usual way to do this is to
generate the following datapoints

- 95th/99th percentile
- 50th percentile / median
- average response time

The Xth percentile tells us the upper bound on the amount of time X% of
requests are processed in. E.g. if the 95th percentile is 500ms, then 95% of
requests are handled in 500ms or less.

Using these metrics we can get a feel for the distribution of the request
times:

- If the median coincides with the mean then the response times are roughly
  symmetrically distributed (a normal distribution is one such case).
- If the mean < median then the distribution typically has a tail towards
  zero, i.e. a minority of unusually fast requests pulls the mean down while
  most requests take longer than the mean.
- If the mean > median then the distribution typically has a tail towards
  high values, i.e. a minority of slow requests pulls the mean up while most
  requests are processed quicker than the mean.

To generate the Xth percentile

.. code-block:: none

    histogram_quantile(0.X, sum(rate(service_requests_latency_bucket[$interval])) by (le))

.. note::

    This will only work if the bucket label is called ``le``

To get the average

.. code-block:: none

    sum(rate(service_requests_latency_sum[$interval])) / sum(rate(service_requests_latency_count[$interval]))

To get a count of the number of services

.. code-block:: none

    count(sum(service_info) by (service))

Assuming the existence of a ``service`` label.
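
As a concrete instance of the service query templates above, the sketches
below fill in the placeholders with assumed values: a ``5m`` window in place
of ``$interval``, ``0.95`` for the quantile, and the same ``service`` and
``code`` labels that the earlier queries assume exist. Adjust all of these to
match what your services actually expose.

95th percentile latency per service (``le`` must stay in the grouping so that
the buckets survive the aggregation)

.. code-block:: none

    histogram_quantile(0.95, sum(rate(service_requests_latency_bucket[5m])) by (le, service))

Fraction of requests per service that returned a 5xx response

.. code-block:: none

    sum(rate(service_requests{code=~"5.*"}[5m])) by (service) / sum(rate(service_requests[5m])) by (service)

A service with no 5xx series in the window drops out of the second result
rather than reporting zero, because the division only matches label sets
present on both sides.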