Quiet log noise with Python and machine learning

Continuous integration (CI) jobs can generate massive volumes of data. When a job fails, figuring out what went wrong can be a tedious process that involves investigating logs to discover the root cause, which is often found in a fraction of the total job output. To make it easier to separate the most relevant data from the rest, the Logreduce machine learning model is trained using previous successful job runs to extract anomalies from failed runs’ logs.

This principle can also be applied to other use cases, for example, extracting anomalies from Journald or other systemwide regular log files.

Using machine learning to reduce noise

A typical log file contains many nominal events (“baselines”) along with a few exceptions that are relevant to the developer. Baselines may contain random elements such as timestamps or unique identifiers that are difficult to detect and remove. To remove the baseline events, we can use a k-nearest neighbors pattern recognition algorithm (k-NN).

Log events must be converted to numeric values for k-NN regression. Using the generic feature extraction tool HashingVectorizer enables the process to be applied to any type of log. It hashes each word and encodes each event in a sparse matrix. To further reduce the search space, tokenization removes known random words, such as dates or IP addresses.
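The tokenization and hashing steps can be sketched with the Python standard library alone. This is a simplified stand-in for a feature-hashing vectorizer, not Logreduce's actual code: the regular expressions, the placeholder names, the bucket count, and the function names `tokenize` and `hash_vector` are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter

# Hypothetical patterns for "known random words"; a real tool would need more.
RANDOM_WORDS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "IP"),        # IPv4 addresses
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "DATE"),            # ISO dates
    (re.compile(r"\b\d{2}:\d{2}:\d{2}(?:\.\d+)?\b"), "TIME"),  # clock times
    (re.compile(r"\b[0-9a-f]{8,}\b"), "HEX"),                  # long hex identifiers
]

N_FEATURES = 2 ** 16  # number of hash buckets (assumed value)


def tokenize(line):
    """Lowercase the line, replace random words with placeholders, keep only words."""
    line = line.lower()
    for pattern, placeholder in RANDOM_WORDS:
        line = pattern.sub(placeholder, line)
    # Remaining digits and punctuation are dropped; placeholders stay uppercase.
    return re.findall(r"[A-Za-z_]+", line)


def hash_vector(line):
    """Encode one log event as a sparse vector: {bucket index: word count}."""
    buckets = Counter()
    for word in tokenize(line):
        buckets[zlib.crc32(word.encode("utf-8")) % N_FEATURES] += 1
    return dict(buckets)


first = "2018-11-13 10:02:33 sshd[1234]: Accepted password for root from 192.168.0.12"
second = "2018-11-14 23:59:01 sshd[998]: Accepted password for root from 10.0.0.7"
print(tokenize(first))
print(hash_vector(first) == hash_vector(second))
```

Because the date, time, process ID, and IP address are normalized away, both events produce the same sparse vector, which is what lets a baseline match later runs.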

Once the model is trained, the k-NN search tells us the distance of each new event from the baseline.
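As a rough illustration of that distance search, the sketch below uses a brute-force nearest-neighbor lookup with cosine distance in place of a real search tree, and a plain bag-of-words vector in place of hashed features. The class `BaselineModel` and the helper names are hypothetical, and the sample log lines are made up.

```python
import math
import re
from collections import Counter


def vectorize(line):
    """Bag-of-words vector; digits are dropped as a crude stand-in for tokenization."""
    return Counter(re.findall(r"[a-z_]+", line.lower()))


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0


class BaselineModel:
    def __init__(self, baseline_lines):
        # "Training" here just stores one vector per baseline event.
        self.vectors = [vectorize(line) for line in baseline_lines]

    def distance(self, line):
        """Distance from a new event to its nearest baseline neighbor (k = 1)."""
        target = vectorize(line)
        return min(cosine_distance(target, vector) for vector in self.vectors)


model = BaselineModel([
    "10:01 session opened for user alice",
    "10:02 session closed for user alice",
])
for event in ["11:45 session opened for user alice", "11:46 segfault in worker process"]:
    print(f"{model.distance(event):.3f} | {event}")
```

An event that matches a baseline line scores close to zero, while an event sharing no words with any baseline line scores 1.0. Lines whose distance exceeds a chosen threshold are the ones worth reporting.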

This Jupyter notebook demonstrates the method and graphs the sparse matrix vectors.

Introducing Logreduce

The Logreduce Python software transparently implements this process. Logreduce’s initial goal was to assist with Zuul CI job failure analyses using the build database, and it is now integrated into the Software Factory development forge’s job logs process.

At its simplest, Logreduce compares files or directories and removes lines that are similar. Logreduce builds a model for each source file and outputs any of the target’s lines whose distances are above a defined threshold by using the following syntax: distance | filename:line-number: line-content.

$ logreduce diff /var/log/audit/audit.log.1 /var/log/audit/audit.log
INFO  logreduce.Classifier - Training took 21.982s at 0.364MB/s (1.314kl/s) (8.000 MB - 28.884 kilo-lines)
0.244 | audit.log:19963:        type=USER_AUTH acct="root" exe="/usr/bin/su" hostname=managesf.sftests.com
INFO  logreduce.Classifier - Testing took 18.297s at 0.306MB/s (1.094kl/s) (5.607 MB - 20.015 kilo-lines)
99.99% reduction (from 20015 lines to 1)
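If you want to post-process this output in a script, the anomaly lines can be picked out with a regular expression. The sketch below is based only on the syntax described above; the names `Anomaly` and `parse_anomaly` are hypothetical, the pattern assumes file names without colons, and the exact output may differ between Logreduce versions.

```python
import re
from typing import NamedTuple, Optional

# distance | filename:line-number: line-content
LINE_RE = re.compile(
    r"^(?P<distance>\d+\.\d+) \| (?P<filename>[^:]+):(?P<lineno>\d+):\s*(?P<content>.*)$"
)


class Anomaly(NamedTuple):
    distance: float
    filename: str
    lineno: int
    content: str


def parse_anomaly(line: str) -> Optional[Anomaly]:
    """Return the parsed anomaly, or None for status lines such as INFO messages."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return Anomaly(
        float(match["distance"]),
        match["filename"],
        int(match["lineno"]),
        match["content"],
    )


sample = '0.244 | audit.log:19963:        type=USER_AUTH acct="root" exe="/usr/bin/su" hostname=managesf.sftests.com'
print(parse_anomaly(sample))
```

Status lines and the final reduction summary do not match the pattern, so they come back as None and can be skipped.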

A more advanced Logreduce use can train a model offline to be reused. Many variants of the baselines can be used to fit the k-NN search tree.

$ logreduce dir-train audit.clf /var/log/audit/audit.log.*
INFO  logreduce.Classifier - Training took 80.883s at 0.396MB/s (1.397kl/s) (32.001 MB - 112.977 kilo-lines)
DEBUG logreduce.Classifier - audit.clf: written
$ logreduce dir-run audit.clf /var/log/audit/audit.log
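The train-once, reuse-later workflow can be illustrated in a few lines of Python. This is a conceptual sketch only: it replaces the k-NN search tree with an exact-match set of normalized events, and the functions `train`, `save`, `load`, and `run` as well as the pickle file are assumptions that do not reflect Logreduce's API or its model file format.

```python
import pickle
import re
import tempfile
from pathlib import Path


def normalize(line):
    """Keep only lowercase words so timestamps and counters do not matter."""
    return tuple(re.findall(r"[a-z_]+", line.lower()))


def train(baseline_files):
    """Fit one model from many baseline variants: the set of normalized events."""
    model = set()
    for path in baseline_files:
        with open(path, encoding="utf-8", errors="replace") as handle:
            model.update(normalize(line) for line in handle)
    return model


def save(model, path):
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load(path):
    # Only unpickle model files that you created yourself.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(model, target_file):
    """Yield (line number, line) for target lines with no match in the model."""
    with open(target_file, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if normalize(line) not in model:
                yield number, line.rstrip("\n")


# Demo with throwaway files; the log contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "baseline-1.log")
    first.write_text("01 service started\n02 service ready\n", encoding="utf-8")
    second = Path(tmp, "baseline-2.log")
    second.write_text("03 service started\n04 cache warmed\n", encoding="utf-8")
    target = Path(tmp, "target.log")
    target.write_text(
        "09 service started\n10 disk failure detected\n11 cache warmed\n",
        encoding="utf-8",
    )

    model_path = Path(tmp, "model.pkl")
    save(train([first, second]), model_path)         # offline training step
    anomalies = list(run(load(model_path), target))  # later: reuse the saved model

print(anomalies)
```

Training on several baseline files means an event only has to appear in one of them to count as nominal, which is why adding more baseline variants tends to cut false positives.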

Logreduce also implements interfaces to discover baselines for Journald time ranges (days/weeks/months) and Zuul CI job build histories. It can also generate HTML reports that group anomalies found in multiple files in a simple interface.

Managing baselines

The key to using k-NN regression for anomaly detection is to have a database of known good baselines, which the model uses to detect lines that deviate too far. This method relies on the baselines containing all nominal events, as anything that isn’t found in the baseline will be reported as anomalous.

CI jobs are great targets for k-NN regression because the job outputs are often deterministic and previous runs can be automatically used as baselines. Logreduce features Zuul job roles that can be used as part of a failed job post task in order to issue a concise report (instead of the full job’s logs). This principle can be applied to other cases, as long as baselines can be constructed in advance. For example, a nominal system’s SoS report can be used to find issues in a defective deployment.

Anomaly classification service

The next version of Logreduce introduces a server mode to offload log processing to an external service where reports can be further analyzed. It also supports importing existing reports and requests to analyze a Zuul build. The services run analyses asynchronously and feature a web interface to adjust scores and remove false positives.

Reviewed reports can be archived as a standalone dataset with the target log files and the scores for anomalous lines recorded in a flat JSON file.

Project roadmap

Logreduce is already being used effectively, but there are many opportunities for improving the tool. Plans for the future include:

  • Curating many annotated anomalies found in log files and producing a public domain dataset to enable further research. Anomaly detection in log files is a challenging topic, and having a common dataset to test new models would help identify new solutions.
  • Reusing the annotated anomalies with the model to refine the distances reported. For example, when users mark lines as false positives by setting their distance to zero, the model could reduce the score of those lines in future reports.
  • Fingerprinting archived anomalies to detect when a new report contains an already known anomaly. Thus, instead of reporting the anomaly’s content, the service could notify the user that the job hit a known issue. When the issue is fixed, the service could automatically restart the job.
  • Supporting more baseline discovery interfaces for targets such as SOS reports, Jenkins builds, Travis CI, and more.
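The fingerprinting idea is still only a plan, so the following is just one possible approach, not how Logreduce does or will do it. It hashes the words of an anomalous line after dropping digits and punctuation, so the same failure with a different timestamp or process ID maps to the same fingerprint. The names `fingerprint` and `KnownIssues`, the issue label, and the sample lines are hypothetical.

```python
import hashlib
import re


def fingerprint(line):
    """Hash the words of an anomalous line, ignoring digits and punctuation."""
    words = re.findall(r"[a-z_]+", line.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


class KnownIssues:
    def __init__(self):
        self._issues = {}  # fingerprint -> issue label

    def archive(self, line, label):
        """Record a reviewed anomaly under a human-readable label."""
        self._issues[fingerprint(line)] = label

    def lookup(self, line):
        """Return the label of a matching archived anomaly, or None if it is new."""
        return self._issues.get(fingerprint(line))


known = KnownIssues()
known.archive(
    "2018-11-01 10:00:01 kernel: Out of memory: Kill process 4242 (java)",
    "hypothetical-issue-1",
)
print(known.lookup("2018-11-09 03:12:45 kernel: Out of memory: Kill process 977 (java)"))
print(known.lookup("2018-11-09 03:12:46 disk quota exceeded"))
```

Exact hashing is brittle: a single changed word produces a different fingerprint. A production service would more likely compare anomalies by distance, as the model already does for baselines.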

If you are interested in getting involved in this project, please contact us on the #log-classify Freenode IRC channel. Feedback is always appreciated!


Tristan Cacqueray will present Reduce your log noise using machine learning at the OpenStack Summit, November 13-15 in Berlin.
