How does Prometheus query work? - Part 1, Step, Query and Range
Prometheus is an open-source time series database commonly used to collect and compute monitoring metrics. This article explains how a query works through the /query_range API.
Start Prometheus
A Prometheus server has been started by following the Prometheus documentation, and it listens at localhost:9090.
The Prometheus browser is a web UI for running test queries against the metrics; it is served at http://localhost:9090/graph. Two APIs are used to query metrics: /query returns the metric value at a specified point in time, and /query_range returns the metric values over a period.
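To make the difference between the two endpoints concrete, here is a minimal sketch that only builds the request URLs without sending them. It assumes the server address used in this article and the /api/v1 path prefix that appears in the full URL later on; the helper names are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"  # server address used in this article


def instant_query_url(query, time=None):
    """Build a /api/v1/query URL (value at a single point in time)."""
    params = {"query": query}
    if time is not None:
        params["time"] = time  # Unix timestamp; omitted means "now"
    return f"{BASE_URL}/api/v1/query?{urlencode(params)}"


def range_query_url(query, start, end, step):
    """Build a /api/v1/query_range URL (values over a period)."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?{urlencode(params)}"


print(instant_query_url("up", time=1590830727.588))
print(range_query_url("up", 1590830727.588, 1590834327.588, 14))
```

The instant query needs at most one timestamp, while the range query always needs start, end and step.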
Metric types
Before we start analyzing a query, we need to know the metric types Prometheus provides:
Counter
A counter is a cumulative number of some observable, for example the total number of HTTP requests. Its value increases monotonically. For instance, http_request_total records the total number of HTTP requests the server has served.
Gauge
A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. An instance's CPU usage, for example, changes all the time.
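The contrast between the counter and the gauge described above can be sketched with two toy classes. These are not the Prometheus client library; the class names and the cpu_usage gauge are hypothetical and only illustrate the "only goes up" versus "goes up and down" behaviour.

```python
class Counter:
    """Toy counter: the value can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Toy gauge: the value can be set to anything at any time."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


http_request_total = Counter()
cpu_usage = Gauge()  # hypothetical gauge name

for _ in range(3):
    http_request_total.inc()  # three requests served
cpu_usage.set(0.75)
cpu_usage.set(0.20)           # usage dropped again

print(http_request_total.value)  # 3.0
print(cpu_usage.value)           # 0.2
```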
Histogram
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
The Prometheus histogram documentation gives an example, http_request_duration_seconds. Suppose we have an SLO requiring 95% of requests to respond within 300ms. The straightforward approach is to split the durations into several buckets, for example 300ms, 1s, 5s, 10s and +Inf, where each bucket has its own counter metric that counts the total number of requests falling within that bucket:
http_request_duration_bucket
Note that every bucket starts from 0 seconds, which means the [0 - 1s] bucket includes everything in the [0 - 300ms] bucket. As mentioned, each bucket metric is a counter: it records the total number of requests whose response time falls within that bound. There is also a total count metric called http_request_duration_seconds_count.
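A small sketch may help show what "every bucket starts from 0" means in practice. It uses the bucket bounds from the example above and a handful of made-up request durations; the function name is hypothetical.

```python
import math

# Upper bounds ("le") of the buckets from the example, in seconds.
BOUNDS = [0.3, 1.0, 5.0, 10.0, math.inf]


def observe_all(durations):
    """Count observations into cumulative buckets, plus a total count."""
    buckets = {le: 0 for le in BOUNDS}
    for d in durations:
        for le in BOUNDS:
            if d <= le:  # every bucket whose bound covers d is incremented
                buckets[le] += 1
    return buckets, len(durations)


# Made-up request durations in seconds.
buckets, count = observe_all([0.1, 0.25, 0.7, 2.0, 12.0])
print(buckets)  # {0.3: 2, 1.0: 3, 5.0: 4, 10.0: 4, inf: 5}
print(count)    # 5
```

Because the buckets are cumulative, the +Inf bucket always equals the total count.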
To calculate the share of requests served within 300ms, we can use http_request_duration_seconds_bucket{le="0.3"}/http_request_duration_seconds_count. This divides the raw counter values at a single moment, so it covers every request since the counters started and says little about recent behaviour. It is better to compare how much both counters grew over a recent window: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))/sum(rate(http_request_duration_seconds_count[5m])).
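The arithmetic behind the windowed expression can be sketched as follows. The counter values are made up, and the rate helper is a deliberate simplification: the real rate function also handles counter resets and extrapolates to the window edges.

```python
def rate(first, last, window_seconds):
    """Simplified per-second increase of a counter between two samples."""
    return (last - first) / window_seconds


WINDOW = 300  # 5 minutes, matching the [5m] selector

# Made-up counter values at the start and at the end of the window.
bucket_le_03 = (9_000, 9_570)   # ..._bucket{le="0.3"}
total_count = (10_000, 10_600)  # ..._count

fraction = rate(*bucket_le_03, WINDOW) / rate(*total_count, WINDOW)
print(f"{fraction:.2%} of requests in the window finished within 300ms")
```

With these numbers, 570 of the 600 requests in the window were served within 300ms, so the SLO of 95% is just met.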
Summary
Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
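As a rough illustration of "quantiles over a sliding time window", the toy class below keeps the observations from the last window and sorts them on demand. This is only a sketch under that assumption: real client libraries use streaming estimators instead of storing every sample, and the ToySummary name is made up.

```python
from collections import deque
import math


class ToySummary:
    """Keeps a total count and sum, plus recent samples for quantiles."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value), oldest first
        self.count = 0
        self.sum = 0.0

    def observe(self, now, value):
        self.samples.append((now, value))
        self.count += 1
        self.sum += value
        # Drop samples that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def quantile(self, q):
        values = sorted(v for _, v in self.samples)
        if not values:
            return math.nan
        rank = max(0, math.ceil(q * len(values)) - 1)
        return values[rank]


s = ToySummary(window_seconds=60)
for t, v in [(0, 5.0), (30, 0.1), (50, 0.2), (70, 0.3)]:
    s.observe(t, v)

print(s.count)          # 4: the count covers all observations
print(s.quantile(0.5))  # 0.2: the slow request at t=0 has left the window
```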
How does query work?
When we use Prometheus to evaluate a query, we normally use the query_range functionality: once a time range and a step are specified, the query is evaluated at every step and each evaluation adds one point to the result.
Let's look at a complete query API call:
http://localhost:9090/api/v1/query_range?query=prometheus_target_interval_length_seconds&start=1590830727.588&end=1590834327.588&step=14
It might look a little messy, so let's break it down into parameters:
- query
The metric expression that needs to be evaluated.
- Time range (start and end)
The start and end parameters specify the period the metrics should cover, for example the last 30 minutes or the last 7 days. Both are Unix timestamps.
- step
The interval between data points, in seconds, which in turn decides how many data points are returned.
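To see how these parameters translate into data points, the sketch below parses the example URL and lists the timestamps at which the query would be evaluated, assuming one evaluation at start and then one every step up to and including end. It does not contact the server.

```python
from urllib.parse import urlparse, parse_qs

URL = ("http://localhost:9090/api/v1/query_range"
       "?query=prometheus_target_interval_length_seconds"
       "&start=1590830727.588&end=1590834327.588&step=14")

params = {k: v[0] for k, v in parse_qs(urlparse(URL).query).items()}

# Work in integer milliseconds to avoid floating-point drift.
start_ms = round(float(params["start"]) * 1000)
end_ms = round(float(params["end"]) * 1000)
step_ms = round(float(params["step"]) * 1000)

# One evaluation at start, then one every step, up to and including end.
eval_times = [ms / 1000 for ms in range(start_ms, end_ms + 1, step_ms)]

print(params["query"])
print(len(eval_times))  # 258 evaluation points for a 3600s range, 14s step
print(eval_times[:3])
```

The range here is one hour (3600 seconds), so a 14-second step yields 258 evaluation points per series.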
The full result is too long, so I will only paste part of it:
{
- status indicates whether the calculation succeeded
- result.metric shows the original metric
- result.values holds the actual data points we need; in each pair, the left value is the timestamp and the right one is the metric value.
The interval between consecutive result values is exactly the step we set, 14 seconds in the previous example. In other words, the query is evaluated every 14 seconds and each evaluation produces one data point.
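The fields described above can be read with a few lines of Python. The JSON below is a hand-written sample in the shape of a query_range response, not output captured from the server; the label and sample values are made up for illustration.

```python
import json

# Hand-written sample; label values and sample values are made up.
RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "prometheus_target_interval_length_seconds",
          "quantile": "0.5"
        },
        "values": [
          [1590830727.588, "14.99"],
          [1590830741.588, "15.00"],
          [1590830755.588, "15.01"]
        ]
      }
    ]
  }
}
"""

body = json.loads(RESPONSE)
assert body["status"] == "success"

for series in body["data"]["result"]:
    timestamps = [ts for ts, _ in series["values"]]
    values = [float(v) for _, v in series["values"]]  # values arrive as strings
    # Distance between consecutive points, which should equal the step.
    gaps = [round(b - a, 3) for a, b in zip(timestamps, timestamps[1:])]
    print(series["metric"]["__name__"], values, gaps)
```

In this sample, every gap is 14.0 seconds, matching the step parameter of the request.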
Conclusion
I explored the query_range API a bit in this article. In the next article, I'll explore some frequently used Prometheus functions such as rate, irate and histogram_quantile.