
An introduction to monitoring with Prometheus

Metrics are the primary way to represent both the overall health of your system and any other specific information you consider important for monitoring, alerting, or observability. Prometheus is a leading open source metric instrumentation, collection, and storage toolkit built at SoundCloud starting in 2012. Since then, it has graduated from the Cloud Native Computing Foundation and become the de facto standard for Kubernetes monitoring. It has been covered in some detail in several previous articles.

However, none of those articles focus on how to use Prometheus on Kubernetes. This article:

  • Describes the Prometheus architecture and data model to help you understand how it works and what it can do
  • Provides a tutorial on setting Prometheus up in a Kubernetes cluster and using it to monitor clusters and applications

Architecture

While understanding how Prometheus works may not be essential to using it effectively, it can be helpful, especially if you're considering using it in production. The Prometheus documentation provides a diagram of and details about the essential elements of Prometheus and how the pieces connect.

For most use cases, you should understand three major components of Prometheus:

  1. The Prometheus server scrapes and stores metrics. Note that it uses a persistence layer, which is part of the server and not expressly mentioned in the documentation. Each node of the server is autonomous and does not rely on distributed storage. I'll revisit this later when looking at using a dedicated time-series database to store Prometheus data, rather than relying on the server itself.
  2. The web UI allows you to access, visualize, and chart the stored data. Prometheus provides its own UI, but you can also configure other visualization tools, like Grafana, to access the Prometheus server using PromQL (the Prometheus Query Language).
  3. Alertmanager sends alerts from client applications, especially the Prometheus server. It has advanced features for deduplicating, grouping, and routing alerts and can route through other services like PagerDuty and OpsGenie.

The key to understanding Prometheus is that it fundamentally relies on scraping, or pulling, metrics from defined endpoints. This means that your application needs to expose an endpoint where metrics are available and configure the Prometheus server to scrape it (this is covered in the tutorial below). There are exporters for many applications that do not have an easy way to add web endpoints, such as Kafka and Cassandra (using the JMX exporter).
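
To illustrate the pull model, here is a minimal sketch of a Node.js service exposing a scrape endpoint with the express and prom-client packages (the same stack the tutorial's sample application uses later); the port and path are conventional choices, not requirements:

const express = require('express');
const prometheus = require('prom-client');

const app = express();
// Collect standard process-level metrics (CPU, memory, event loop lag, and so on).
prometheus.collectDefaultMetrics();

// The Prometheus server pulls from this endpoint on its configured scrape interval.
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', prometheus.register.contentType);
    res.end(await prometheus.register.metrics());
});

app.listen(8080);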

Data model

Now that you understand how Prometheus works to scrape and store metrics, the next thing to learn is the kinds of metrics it supports. Some of the following information (noted with quotation marks) comes from the metric types section of the Prometheus documentation.

Counters and gauges

The two simplest metric types are counter and gauge. When getting started with Prometheus (or with time-series monitoring more generally), these are the easiest types to understand because it's easy to connect them to values you can imagine monitoring, like how much system resources your application is using or how many events it has processed.

“A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.”

Because you cannot decrease a counter, it can and should be used only to represent cumulative metrics.

“A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. Gauges are typically used for measured values like [CPU] or current memory usage, but also ‘counts’ that can go up and down, like the number of concurrent requests.”
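
To make the distinction concrete, here is a minimal sketch using the Node.js prom-client library (the same client library used by the tutorial's sample application); the metric names are illustrative and not part of the sample app:

const prometheus = require('prom-client');

// A counter only ever goes up (or resets to zero when the process restarts).
const tasksProcessed = new prometheus.Counter({
    name: 'tasks_processed_total',
    help: 'Total number of tasks processed'
});
tasksProcessed.inc(); // one more task processed

// A gauge moves up and down to reflect a current value.
const queueDepth = new prometheus.Gauge({
    name: 'task_queue_depth',
    help: 'Number of tasks currently waiting in the queue'
});
queueDepth.inc();   // a task was enqueued
queueDepth.dec();   // a task was dequeued
queueDepth.set(42); // or set it directly from a measurement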

Histograms and summaries

Prometheus supports two more complex metric types: histograms and summaries. There is ample opportunity for confusion here, given that they both track the number of observations and the sum of observed values. One of the reasons you might choose to use them is that you need to calculate an average of the observed values. Note that they create multiple time series in the database; for example, they each create a sum of the observed values with a _sum suffix.

“A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.”

This makes it an excellent candidate to track things like latency that might have a service level objective (SLO) defined against it. From the documentation:

You might have an SLO to serve 95% of requests within 300ms. In that case, configure a histogram to have a bucket with an upper limit of 0.3 seconds. You can then directly express the relative amount of requests served within 300ms and easily alert if the value drops below 0.95. The following expression calculates it by job for the requests served in the last 5 minutes. The request durations were collected with a histogram called http_request_duration_seconds.

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job)
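
Because the documentation mentions alerting when this ratio drops below 0.95, here is a sketch of what a corresponding Prometheus alerting rule could look like; the rule name, duration, and labels are hypothetical:

groups:
- name: latency-slo
  rules:
  - alert: LatencySLOBurn
    # fires when fewer than 95% of requests over the last 5 minutes
    # completed within 0.3 seconds, per job
    expr: |
      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
        / sum(rate(http_request_duration_seconds_count[5m])) by (job) < 0.95
    for: 10m
    labels:
      severity: page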

 

Returning to definitions:

“Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.”

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts, and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.
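
For example, using the http_request_duration_seconds histogram from the documentation excerpt above, a server-side estimate of the 95th-percentile request duration per job looks something like this:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))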

If you are still confused, I suggest taking the following approach:

  • Use gauges most of the time for straightforward time-series metrics.
  • Use counters for things you know increase monotonically, e.g., if you are counting the number of times something happens.
  • Use histograms for latency measurements with simple buckets, e.g., one bucket for “under SLO” and another for “over SLO.”

This should be sufficient for the overwhelming majority of use cases, and you should rely on a statistical analysis expert to help you with more advanced scenarios.

Now that you have a basic understanding of what Prometheus is, how it works, and the kinds of data it can collect and store, you're ready to begin the tutorial.

Prometheus and Kubernetes hands-on tutorial

This tutorial covers the following:

  • Installing Prometheus in your cluster
  • Downloading the sample application and reviewing the code
  • Building and deploying the app and generating load against it
  • Accessing the Prometheus UI and reviewing the basic metrics

This tutorial assumes:

  • You already have a Kubernetes cluster deployed.
  • You have configured the kubectl command-line utility for access.
  • You have the cluster-admin role (or at least sufficient privileges to create namespaces and deploy applications).
  • You are running a Bash-based command-line interface. Adjust this tutorial if you run other operating systems or shell environments.

If you don't have Kubernetes running yet, this Minikube tutorial is an easy way to set it up on your laptop.

If you're ready now, let's go.

Install Prometheus

In this section, you'll clone the sample repository and use Kubernetes' configuration files to deploy Prometheus to a dedicated namespace.

  1. Clone the sample repository locally and use it as your working directory:

    $ git clone https://github.com/yuriatgoogle/prometheus-demo.git
    $ cd  prometheus-demo
    $ WORKDIR=$(pwd)

  2. Create a dedicated namespace for the Prometheus deployment:
    $ kubectl create namespace prometheus
  3. Give your namespace the cluster reader role:

    $ kubectl apply -f $WORKDIR/kubernetes/clusterRole.yaml
    clusterrole.rbac.authorization.k8s.io/prometheus created
    clusterrolebinding.rbac.authorization.k8s.io/prometheus created

  4. Create a Kubernetes configmap with scraping and alerting rules (a sketch of a typical scrape configuration appears after this list):

    $ kubectl apply -f $WORKDIR/kubernetes/configMap.yaml -n prometheus
    configmap/prometheus-server-conf created

  5. Deploy Prometheus:

    $ kubectl create -f prometheus-deployment.yaml -n prometheus
    deployment.extensions/prometheus-deployment created

  6. Validate that Prometheus is running:

    $ kubectl get pods -n prometheus
    NAME                                     READY   STATUS    RESTARTS   AGE
    prometheus-deployment-78fb5694b4-lmz4r   1/1     Running   0          15s
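
The configmap applied in step 4 is what tells the Prometheus server what to scrape. As a rough illustration of the kind of configuration it carries (the repository's configMap.yaml is the authoritative version), a Kubernetes pod scrape configuration typically looks something like this:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that carry the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true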

Review basic metrics

In this section, you'll access the Prometheus UI and review the metrics being collected.

  1. Use port forwarding to enable web access to the Prometheus UI locally:
    Note: Your prometheus-deployment will have a different name than this example. Review and replace the name of the pod from the output of the previous command.

    $ kubectl port-forward prometheus-deployment-7ddb99dcb-fkz4d 8080:9090 -n prometheus
    Forwarding from 127.0.0.1:8080 -> 9090
    Forwarding from [::1]:8080 -> 9090

  2. Go to http://localhost:8080 in a browser:

    You are now ready to query Prometheus metrics!

  3. Some basic machine metrics (like the number of CPU cores and memory) are available right away. For example, enter machine_memory_bytes in the expression field, switch to the Graph view, and click Execute to see the metric charted:
  4. Containers running in the cluster are also automatically monitored. For example, enter rate(container_cpu_usage_seconds_total[1m]) as the expression and click Execute to see the rate of CPU usage by Prometheus:

Now that you know how to install Prometheus and use it to measure some out-of-the-box metrics, it's time for some real monitoring.

Golden signals

As described in the “Monitoring Distributed Systems” chapter of Google's SRE book:

“The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.”

The book offers thorough descriptions of all four, but this tutorial focuses on the three signals that most easily serve as proxies for user happiness:

  • Traffic: How many requests you're receiving
  • Error rate: How many of those requests you can successfully serve
  • Latency: How quickly you can serve successful requests

As you probably realize by now, Prometheus does not measure any of these for you; you'll need to instrument any application you deploy to emit them. Following is an example implementation.

Open the $WORKDIR/node/golden_signals/app.js file, which is a sample application written in Node.js (recall we cloned yuriatgoogle/prometheus-demo and exported $WORKDIR earlier). Start by reviewing the first section, where the metrics to be recorded are defined:

// total requests - counter
const nodeRequestsCounter = new prometheus.Counter({
    name: 'node_requests',
    help: 'total requests'
});

The first metric is a counter that will be incremented for each request; this is how the total number of requests is counted:

// failed requests - counter
const nodeFailedRequestsCounter = new prometheus.Counter({
    name: 'node_failed_requests',
    help: 'failed requests'
});

The second metric is another counter that increments for each error to track the number of failed requests:

// latency - histogram
const nodeLatenciesHistogram = new prometheus.Histogram({
    name: 'node_request_latency',
    help: 'request latency by path',
    labelNames: ['route'],
    buckets: [100, 400]
});

The third metric is a histogram that tracks request latency. Working with a very basic assumption that the SLO for latency is 100ms, you'll create two buckets: one for 100ms and the other for 400ms latency.

The next section handles incoming requests, increments the total requests metric for each one, increments failed requests when there is an (artificially induced) error, and records a latency histogram value for each successful request. I've chosen not to record latencies for errors; that implementation detail is up to you.
app.get('/', (req, res) => { /* ... */ })
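
A rough sketch of what that handler could look like, using the metrics defined above, is shown below; the error probability, response bodies, and simulated delay are illustrative assumptions rather than the repository's exact code:

// Sketch only -- see app.js in the sample repository for the real implementation.
app.get('/', (req, res) => {
    // count every incoming request
    nodeRequestsCounter.inc();

    // artificially induce an error on a small fraction of requests
    if (Math.random() < 0.01) {
        nodeFailedRequestsCounter.inc();
        res.status(500).send('error!');
        return;
    }

    // simulate 0-1,000ms of work, then record the latency of the successful request
    const start = Date.now();
    const delay = Math.floor(Math.random() * 1000);
    setTimeout(() => {
        nodeLatenciesHistogram.labels(req.route.path).observe(Date.now() - start);
        res.send('hello!');
    }, delay);
});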

Test locally

Now that you've seen how to implement Prometheus metrics, see what happens when you run the application.

  1. Install the required packages:

    $ cd $WORKDIR/node/golden_signals
    $ npm install --save

  2. Launch the app:
    $ node app.js
  3. Open two browser tabs: one to http://localhost:8080 and another to http://localhost:8080/metrics.
  4. When you go to the /metrics page, you can see the Prometheus metrics being collected and updated every time you reload the home page:

You're now ready to deploy the sample application to your Kubernetes cluster and test your monitoring.

Deploy monitoring to Prometheus on Kubernetes

Now it's time to see how metrics are recorded and represented in the Prometheus instance deployed in your cluster by:

  • Building the application image
  • Deploying it to your cluster
  • Generating load against the app
  • Observing the metrics recorded

Build the application image

The sample application provides a Dockerfile you'll use to build the image. This section assumes that you have:

  • Docker installed and configured locally
  • A Docker Hub account
  • Created a repository

If you're using Google Kubernetes Engine to run your cluster, you can use Cloud Build and the Google Container Registry instead.

  1. Switch to the application directory:
    $ cd $WORKDIR/node/golden_signals
  2. Build the image with this command:
    $ docker build . --tag=<Docker username>/prometheus-demo-node:latest
  3. Make sure you're logged in to Docker Hub:
    $ docker login
  4. Push the image to Docker Hub using this command:
    $ docker push <username>/prometheus-demo-node:latest
  5. Verify that the image is available:
    $ docker images

Deploy the application

Now that the application image is in Docker Hub, you can deploy it to your cluster and run the application.

  1. Modify the $WORKDIR/node/golden_signals/prometheus-demo-node.yaml file to pull the image from Docker Hub:

    spec:
      containers:
      - image: docker.io/<Docker username>/prometheus-demo-node:latest

  2. Deploy the image:

    $ kubectl apply -f $WORKDIR/node/golden_signals/prometheus-demo-node.yaml
    deployment.extensions/prometheus-demo-node created

  3. Verify that the application is running:

    $ kubectl get pods
    NAME                                    READY   STATUS    RESTARTS   AGE
    prometheus-demo-node-69688456d4-krqqr   1/1     Running   0          65s

  4. Expose the application using a load balancer:

    $ kubectl expose deployment prometheus-demo-node --type=LoadBalancer --name=prometheus-demo-node --port=8080
    service/prometheus-demo-node exposed

  5. Confirm that your service has an external IP address:

    $ kubectl get services
    NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
    kubernetes             ClusterIP      10.39.240.1     <none>           443/TCP          23h
    prometheus-demo-node   LoadBalancer   10.39.248.129   35.199.186.110   8080:31743/TCP   78m

Generate load to test monitoring

Now that your service is up and running, generate some load against it using Apache Bench.

  1. Get the IP address of your service as a variable:
    $ export SERVICE_IP=$(kubectl get svc prometheus-demo-node -ojson | jq -r '.status.loadBalancer.ingress[].ip')
  2. Use ab to generate some load. You may want to run this in a separate terminal window.
    $ ab -c 3 -n 1000 http://$SERVICE_IP:8080/

Review metrics

While the load is running, access the Prometheus UI in the cluster again and confirm that the “golden signal” metrics are being collected.

  1. Establish a connection to Prometheus:
    $ kubectl get pods -n prometheus
    NAME                                     READY   STATUS    RESTARTS   AGE
    prometheus-deployment-78fb5694b4-lmz4r   1/1     Running   0          15s

    $ kubectl port-forward prometheus-deployment-78fb5694b4-lmz4r 8080:9090 -n prometheus
    Forwarding from 127.0.0.1:8080 -> 9090
    Forwarding from [::1]:8080 -> 9090

    Note: Make sure to replace the name of the pod in the second command with the output of the first.

  2. Open http://localhost:8080 in a browser:
  3. Use this expression to measure the request rate:
    rate(node_requests[1m])
  4. Use this expression to measure your error rate:
    rate(node_failed_requests[1m])
  5. Finally, use this expression to validate your latency SLO. Remember that you set up two buckets, 100ms and 400ms. This expression returns the percentage of requests that meet the SLO:
    sum(rate(node_request_latency_bucket{le="100"}[1h])) / sum(rate(node_request_latency_count[1h]))

About 10% of the requests are within SLO. This is what you should expect since the code sleeps for a random number of milliseconds between 0 and 1,000. As a result, roughly 90% of the time a request takes more than 100ms, and this graph shows that you can't meet the latency SLO.

Summary

Congratulations! You've completed the tutorial and hopefully have a much better understanding of how Prometheus works, how to instrument your application with custom metrics, and how to use it to measure your SLO compliance. The next article in this series will look at another metric instrumentation approach using OpenCensus.
