My reaction when I first came across the terms counter and gauge and the graphs with colors and numbers labeled “mean” and “upper 90” was one of avoidance. It’s like I saw them, but I didn’t care, because I didn’t understand them or how they might be useful. Since my job didn’t require me to pay attention to them, they remained ignored.
That was about two years ago. As I progressed in my career, I wanted to understand more about our network applications, and that’s when I started learning about metrics.
The three stages of my journey to understanding monitoring (so far) are:
- Stage 1: What? (Looks elsewhere)
- Stage 2: Without metrics, we’re really flying blind.
- Stage 3: How do we keep from doing metrics wrong?
I’m currently in Stage 2 and will share what I’ve learned so far. I’m moving gradually toward Stage 3, and I’ll offer some of my resources on that part of the journey at the end of this article.
Let’s get started!
Software prerequisites
All the demos discussed in this article are available on my GitHub repo. You will need to have docker and docker-compose installed to play with them.
Why should I monitor?
The top reasons for monitoring are:
- Understanding normal and abnormal system and service behavior
- Doing capacity planning, scaling up or down
- Assisting in performance troubleshooting
- Understanding the effect of software/hardware changes
- Changing system behavior in response to a measurement
- Alerting when a system exhibits unexpected behavior
Metrics and metric types
For our purposes, a metric is an observed value of a certain quantity at a given point in time. The total number of hits on a blog post, the total number of people attending a talk, the number of times data was not found in the caching system, the number of logged-in users on your website: these are all examples of metrics.
They broadly fall into three categories:
Counters
Consider your personal blog. You just published a post and want to keep an eye on how many hits it gets over time, a number that can only increase. This is an example of a counter metric. Its value starts at 0 and increases over the lifetime of your blog post. Graphically, a counter looks like a line that only ever climbs upward over time.
Gauges
Instead of the total number of hits on your blog post over time, let’s say you want to track the number of hits per day or per week. This metric is called a gauge, and its value can go up or down. Graphically, a gauge looks like a line that rises and falls over time.
A gauge’s value usually has a ceiling and a floor within a certain time window.
Histograms and timers
A histogram (as Prometheus calls it) or a timer (as StatsD calls it) is a metric to track sampled observations. Unlike a counter or a gauge, the value of a histogram metric doesn’t necessarily show an up or down pattern. I know that doesn’t make a lot of sense and may not seem different from a gauge. What’s different is what you expect to do with histogram data compared to a gauge. Therefore, the monitoring system needs to know that a metric is a histogram type to let you do those things.
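To make the distinction more concrete, here is a small, self-contained sketch (not code from the demos) that simulates the three kinds of values you might record for a blog:

# Illustrative only: simulated metric values for one blog post
hits_total = 0      # counter: can only go up
hits_today = 0      # gauge: goes up and down as days roll over
latencies_ms = []   # histogram/timer: a sample of observed values

def record_hit(latency_ms, new_day=False):
    global hits_total, hits_today
    if new_day:
        hits_today = 0               # a gauge can drop back down
    hits_total += 1                  # a counter never decreases
    hits_today += 1
    latencies_ms.append(latency_ms)  # a histogram keeps sampled observations

record_hit(12.5)
record_hit(48.0)
record_hit(7.3, new_day=True)
print(hits_total, hits_today, latencies_ms)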
Demo 1: Calculating and reporting metrics
Demo 1 is a basic web application written using the Flask framework. It demonstrates how we can calculate and report metrics.
The src directory has the application in app.py, with src/helpers/middleware.py containing the following:
from flask import request

import csv
import time

def start_timer():
    request.start_time = time.time()

def stop_timer(response):
    # convert this into milliseconds for statsd
    resp_time = (time.time() - request.start_time)*1000
    with open('metrics.csv', 'a', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow([str(int(time.time())), str(resp_time)])
    return response

def setup_metrics(app):
    app.before_request(start_timer)
    app.after_request(stop_timer)
When setup_metrics() is called from the application, it configures the start_timer() function to be called before a request is processed and the stop_timer() function to be called after a request is processed but before the response has been sent. In stop_timer(), we write the timestamp and the time it took (in milliseconds) for the request to be processed.
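The repo’s app.py isn’t reproduced here, but wiring the middleware into the application is a one-liner. A minimal sketch (the route below is hypothetical, not necessarily what the demo defines) might look like this:

from flask import Flask

from helpers.middleware import setup_metrics  # path assumed from the repo layout

app = Flask(__name__)
setup_metrics(app)  # registers start_timer/stop_timer as request hooks

@app.route('/test/')  # hypothetical endpoint, for illustration only
def test():
    return 'hello'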
When we run docker-compose up in the demo1 directory, it starts the web application, then a client container that makes a number of requests to the web application. You will see a src/metrics.csv file that has been created with two columns: timestamp and request_latency.
Looking at this file, we can infer two things:
- There is a lot of data that has been generated
- No observation of the metric has any characteristic associated with it
Without a characteristic associated with a metric observation, we cannot say which HTTP endpoint this metric was associated with or which node of the application this metric was generated from. Hence, we need to qualify each metric observation with the appropriate metadata.
Statistics 101
If we think back to high school mathematics, there are a few statistics terms we should all recall, even if vaguely, including mean, median, percentile, and histogram. Let’s briefly recap them without judging their usefulness, just like in high school.
Mean
The mean, or the average of a list of numbers, is the sum of the numbers divided by the cardinality of the list. The mean of 3, 2, and 10 is (3+2+10)/3 = 5.
Median
The median is another type of average, but it is calculated differently; it is the center numeral in a list of numbers ordered from smallest to largest (or vice versa). In our list above (2, 3, 10), the median is 3. The calculation is not always straightforward; it depends on the number of items in the list.
Percentile
The percentile is a measure that gives us a value below which a certain (k) percentage of the numbers lie. In some sense, it gives us an idea of how a value is doing relative to the k percentage of our data. For example, the 95th percentile score of the above list is 9.29999. The percentile measure varies from 0 to 100 (non-inclusive). The zeroth percentile is the minimum score in a set of numbers. Some of you may recall that the median is the 50th percentile, which turns out to be 3.
Some monitoring systems refer to the percentile measure as upper_X, where X is the percentile; upper 90 refers to the value at the 90th percentile.
Quantile
The q-quantile is a measure that ranks qN in a set of N numbers. The value of q ranges between 0 and 1 (both inclusive). When q is 0.5, the value is the median. The relationship between the quantile and percentile is that the measure at q quantile is equivalent to the measure at 100q percentile.
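You can check these numbers yourself with numpy; this small sketch (assuming numpy’s default linear interpolation for percentiles) reproduces the values above:

import numpy as np

data = [2, 3, 10]
print(np.mean(data))            # 5.0
print(np.median(data))          # 3.0
print(np.percentile(data, 95))  # ~9.3, the 95th percentile
print(np.quantile(data, 0.5))   # 3.0, the 0.5-quantile is the median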
Histogram
The metric histogram, which we learned about earlier, is an implementation detail of monitoring systems. In statistics, a histogram is a graph that groups data into buckets. Let’s consider a different, contrived example: the ages of people reading your blog. If you got a handful of this data and wanted a rough idea of your readers’ ages by group, plotting a histogram would show the count of readers in each age bucket.
Cumulative histogram
A cumulative histogram is a histogram where each bucket’s count includes the count of the previous bucket, hence the name cumulative. A cumulative histogram for the same dataset would show the running total of readers up to and including each age bucket.
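If you want to draw these plots yourself, matplotlib can produce both; the ages below are made-up sample data, not anything from the demos:

import matplotlib.pyplot as plt

ages = [21, 23, 25, 26, 28, 31, 33, 35, 36, 42, 45, 51]  # made-up reader ages
bins = [20, 30, 40, 50, 60]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(ages, bins=bins)                   # plain histogram
ax1.set_title('Histogram')
ax2.hist(ages, bins=bins, cumulative=True)  # cumulative histogram
ax2.set_title('Cumulative histogram')
plt.show()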
Why do we need statistics?
In Demo 1 above, we saw that a lot of data is generated when we report metrics. We need statistics when working with metrics because there are just too many of them. We don’t care about individual values, rather overall behavior. We expect the behavior the values exhibit to be a proxy of the behavior of the system under observation.
Demo 2: Adding characteristics to metrics
In our Demo 1 application above, when we calculate and report a request latency, it refers to a specific request uniquely identified by a few characteristics. Some of these are:
- The HTTP endpoint
- The HTTP method
- The identifier of the host/node where it is running
If we attach these characteristics to a metric observation, we have more context around each metric. Let’s explore adding characteristics to our metrics in Demo 2.
The src/helpers/middleware.py file now writes multiple columns to the CSV file when writing metrics:
# note: this version of the file also needs "import random" in addition to the imports shown earlier
node_ids = ['10.0.1.1', '10.1.3.4']

def start_timer():
    request.start_time = time.time()

def stop_timer(response):
    # convert this into milliseconds for statsd
    resp_time = (time.time() - request.start_time)*1000
    node_id = node_ids[random.choice(range(len(node_ids)))]
    with open('metrics.csv', 'a', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow([
            str(int(time.time())), 'webapp1', node_id,
            request.endpoint, request.method, str(response.status_code),
            str(resp_time)
        ])
    return response
Since this is a demo, I have taken the liberty of reporting random IPs as the node IDs when reporting the metric. When we run docker-compose up in the demo2 directory, it will result in a CSV file with multiple columns.
Analyzing metrics with pandas
We’ll now analyze this CSV file with pandas. Running docker-compose up will print a URL that we will use to open a Jupyter session. Once we upload the Analysis.ipynb notebook into the session, we can read the CSV file into a pandas DataFrame:
import pandas as pd
metrics = pd.read_csv('/data/metrics.csv', index_col=0)
The index_col specifies that we want to use the timestamp as the index.
Since each characteristic we add is a column in the DataFrame, we can perform grouping and aggregation based on these columns:
import numpy as np
metrics.groupby(['node_id', 'http_status']).latency.aggregate(np.percentile, 99.999)
Please refer to the Jupyter notebook for more example analysis of the data.
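The notebook itself isn’t reproduced here, but assuming the column names used above (and an http_method column from Demo 2), a couple of other aggregations worth trying are:

# mean latency per endpoint and HTTP method (column names assumed)
metrics.groupby(['endpoint', 'http_method']).latency.mean()

# 95th percentile latency per node
metrics.groupby('node_id').latency.quantile(0.95)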
What should I monitor?
A software system has a number of variables whose values change during its lifetime. The software is running in some kind of operating system, and operating system variables change as well. In my opinion, the more data you have, the better it is when something goes wrong.
Key operating system metrics I recommend monitoring are:
- CPU usage
- System memory usage
- File descriptor usage
- Disk usage
Other key metrics to monitor will differ depending on your software application.
Network applications
If your software is a network application that listens to and serves client requests, the key metrics to measure are:
- Number of requests coming in (counter)
- Unhandled errors (counter)
- Request latency (histogram/timer)
- Queued time, if there is a queue in your application (histogram/timer)
- Queue size, if there is a queue in your application (gauge)
- Worker processes/threads usage (gauge)
If your network application makes requests to other services in the context of fulfilling a client request, it should have metrics to record the behavior of communications with those services. Key metrics to monitor include the number of requests, request latency, and response status.
HTTP web application backends
HTTP applications should monitor all of the above. In addition, they should keep granular data about the count of non-200 HTTP statuses, grouped by all the other HTTP status codes. If your web application has user signup and login functionality, it should have metrics for those as well.
Long-running processes
Long-running processes, such as RabbitMQ consumers or task-queue workers, although not network servers, work on the model of picking up a task and processing it. Hence, we should monitor the number of requests processed and the request latency for these processes.
No matter the application type, each metric should have appropriate metadata associated with it.
Integrating monitoring in a Python application
There are two components involved in integrating monitoring into Python applications:
- Updating your application to calculate and report metrics
- Setting up a monitoring infrastructure to house the application’s metrics and allow queries to be made against them
The basic idea of recording and reporting a metric is:
def work():
    requests += 1
    # report counter

    start_time = time.time()
    # < do the work >

    # calculate and report latency
    work_latency = time.time() - start_time
    ...
Considering the above pattern, we often take advantage of decorators, context managers, and middleware (for network applications) to calculate and report metrics. In Demo 1 and Demo 2, we used decorators in a Flask application.
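For code paths outside a web framework, a timing decorator built on the same pattern might look like this sketch (the print stands in for whatever metric-reporting client you use):

import time
from functools import wraps

def report_latency(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            latency_ms = (time.time() - start_time) * 1000
            # replace this print with a call to your metrics client
            print(f'{func.__name__} took {latency_ms:.2f} ms')
    return wrapper

@report_latency
def work():
    time.sleep(0.1)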
Pull and push models for metric reporting
Essentially, there are two patterns for reporting metrics from a Python application. In the pull model, the monitoring system “scrapes” the application at a predefined HTTP endpoint. In the push model, the application sends the data to the monitoring system.
An example of a monitoring system working in the pull model is Prometheus. StatsD is an example of a monitoring system where the application pushes the metrics to the system.
Integrating StatsD
To integrate StatsD into a Python application, we would use the StatsD Python client, then update our metric-reporting code to push data into StatsD using the appropriate library calls.
First, we need to create a client instance:
statsd = statsd.StatsClient(host='statsd', port=8125, prefix='webapp1')
The prefix keyword argument will add the specified prefix to all the metrics reported via this client.
Once we have the client, we can report a value for a timer using:
statsd.timing(key, resp_time)
To increment a counter:
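With the same client, this is done via the incr method (a one-line sketch, since the original snippet isn’t included here):

statsd.incr(key)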
To associate metadata with a metric, a key is defined as metadata1.metadata2.metric, where each metadataX is a field that allows aggregation and grouping.
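For example, a key for the request latency of a specific endpoint and method could be built like this (the field order here is illustrative, not necessarily what the demo uses):

key = '.'.join([node_id, request.endpoint, request.method, 'latency'])
statsd.timing(key, resp_time)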
The demo application StatsD is a complete example of integrating a Python Flask application with statsd.
Integrating Prometheus
To use the Prometheus monitoring system, we will use the Prometheus Python client. We will first create objects of the appropriate metric class:
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency',
['app_name', 'endpoint']
)
The third argument in the above statement is the labels associated with the metric. These labels are what define the metadata associated with a single metric value.
To record a specific metric observation:
REQUEST_LATENCY.labels('webapp', request.path).observe(resp_time)
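A counter works the same way with the Prometheus client. For instance, a request counter (not shown in the excerpt above, so treat the names as illustrative) could be defined and incremented like this:

from prometheus_client import Counter

REQUEST_COUNT = Counter('request_count', 'Total request count',
    ['app_name', 'endpoint']
)

REQUEST_COUNT.labels('webapp', request.path).inc()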
The next step is to define an HTTP endpoint in our application that Prometheus can scrape. This is usually an endpoint called /metrics:
@app.route('/metrics')
def metrics():
    return Response(prometheus_client.generate_latest(), mimetype=CONTENT_TYPE_LATEST)
The demo application Prometheus is a complete example of integrating a Python Flask application with prometheus.
Which is better: StatsD or Prometheus?
The natural next question is: should I use StatsD or Prometheus? I have written a few articles on this topic, and you may find them useful:
Ways to use metrics
We’ve learned a bit about why we want to set up monitoring in our applications, but now let’s look deeper into two specific uses of metrics: alerting and autoscaling.
Using metrics for alerting
A key use of metrics is creating alerts. For example, you may want to send an email or pager notification to relevant people if the number of HTTP 500s over the past five minutes increases. What we use for setting up alerts depends on our monitoring setup. For Prometheus we can use Alertmanager, and for StatsD, we use Nagios.
Using metrics for autoscaling
Not only can metrics help us understand whether our current infrastructure is over- or under-provisioned, they can also help implement autoscaling policies in a cloud infrastructure. For example, if worker process usage on our servers routinely hits 90% over the past five minutes, we may need to scale horizontally. How we would implement scaling depends on the cloud infrastructure. AWS Auto Scaling, by default, allows scaling policies based on system CPU usage, network traffic, and other factors. However, to use application metrics for scaling up or down, we must publish custom CloudWatch metrics.
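As a rough sketch of what publishing such a custom metric from Python could look like with boto3 (the namespace and metric name below are made up):

import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='WebApp1',                    # hypothetical namespace
    MetricData=[{
        'MetricName': 'WorkerUtilization',  # hypothetical application metric
        'Value': 92.5,
        'Unit': 'Percent',
    }],
)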
Application monitoring in a multi-service architecture
When we go beyond a single application architecture, such that a client request can trigger calls to multiple services before a response is sent back, we need more from our metrics. We need a unified view of latency metrics so we can see how much time each service took to respond to the request. This is enabled with distributed tracing.
You can see an example of distributed tracing in Python in my blog post Introducing distributed tracing in your Python application via Zipkin.
Points to remember
In summary, make sure to keep the following things in mind:
- Understand what a metric type means in your monitoring system
- Know in what unit of measurement the monitoring system wants your data
- Monitor the most critical components of your application
- Monitor the behavior of your application in its most critical stages
The above assumes you don’t have to manage your monitoring systems. If that’s part of your job, you have a lot more to think about!
Other resources
Following are some of the resources I found very useful along my monitoring education journey:
General
StatsD/Graphite
Prometheus
Avoiding mistakes (i.e., Stage 3 learnings)
As we learn the basics of monitoring, it’s important to keep an eye on the mistakes we don’t want to make. Here are some insightful resources I have come across:
To learn more, attend Amit Saha’s talk, Counter, gauge, upper 90—Oh my!, at PyCon Cleveland 2018.