Ashmita Bohara

Analysis on Jaeger Dataset

As systems evolve, nearly every application is becoming a distributed application. Orchestration tools like Kubernetes make it easy to run applications across one or more datacenters by providing convenient abstractions. Similarly, modern databases are designed with some form of distribution in mind, such as replication or partitioning. This distributed nature provides better resilience against hardware and network failures, but it also brings a few inherent disadvantages.

Distributed Systems (DS) fail in unexpected ways:

  • One can never be fully sure that monitoring such systems catches every failure scenario. Distributed systems sometimes get so complex that it is very difficult to identify the failing component during an outage.
  • Given the large number of components, each potentially at a different version, it is hard to keep track of the security of the communication between them.
  • High tail latencies are a common issue in most DS, where certain requests take far longer than the 99th percentile of requests.
  • Software performance degrades over time for a variety of reasons. For example, in the database world, growth in data size causes non-indexed queries to take longer.

Ways distributed tracing can help

Distributed tracing solutions such as Jaeger give a complete view of the system and help us understand it as a whole. We can see a macro view of a transaction and how it interacted with dozens of microservices, while still being able to drill down into the details of a single service. Capturing and analyzing traces from each component of a DS helps quickly identify problematic services. Using Jaeger, we can trace the path of a request as it travels across a complex system, and even obtain the full dependency graph. Looking at the service dependency diagram, we can define fine-grained authorization rules, thus improving security. Using the traces generated by Jaeger, we tried to tackle some of these distributed-systems problems. This blog walks you through the repo where the trace data is generated, formatted, and fed to some machine learning algorithms to find insights.

About the repository


Goals:

  • Find ways to deduce interesting and useful information from Jaeger data.
  • Service Level Objective (SLO) validation: validate how often services breach their guaranteed SLO.
  • Web attacks: identifying malicious traffic in HTTP traffic.


Challenges:

  • Since we don’t have an open distributed tracing dataset from a production environment, we had to generate a similar dataset ourselves.
  • Every organization runs a different set of applications, so Jaeger traces can vary widely. Hence, we can’t train a supervised model on one particular dataset and expect it to work efficiently on every other dataset.

Overcoming Challenges

  • We created our dataset by running a few sample applications like Hot R.O.D (from Jaeger repository) and BookInfo (from Istio repository).
  • We evaluated unsupervised learning algorithms for our problem.

Data Generation

We used Locust to generate production-like traffic for these sample applications. Locust is mainly aimed at APIs, but there are multiple plugins for GUI-based applications. We generated both normal traffic and outlier traffic using the Locust framework; outliers make up a much smaller share of the traffic than normal requests, and this ratio can be controlled via Locust. Since Locust has a programmable interface, you can tune everything, such as the HTTP User-Agent, delays added to requests, and other scenarios that usually happen in production. The major drawback of this method is that both server and client are hosted on the same machine, whereas in production, clients can be located anywhere on the globe while servers sit in a few locations. Thus we weren’t able to capture request latencies anywhere close to those seen in actual production.
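As a sketch of the idea (the endpoint paths, weights, and User-Agent values below are illustrative assumptions, not our exact setup), a minimal locustfile mixing normal and outlier users might look like:

```python
# locustfile.py -- illustrative sketch; endpoint paths, weights, and
# User-Agent values are assumptions, not the exact configuration we used.
from locust import HttpUser, task, between


class NormalUser(HttpUser):
    weight = 9                     # ~90% of simulated users
    wait_time = between(1, 3)      # think time between requests, in seconds

    @task
    def browse(self):
        self.client.get(
            "/dispatch?customer=123",
            headers={"User-Agent": "Mozilla/5.0"},
        )


class OutlierUser(HttpUser):
    weight = 1                     # ~10% of users generate anomalous traffic
    wait_time = between(0.1, 0.5)  # unusually aggressive request rate

    @task
    def suspicious_query(self):
        # A crude SQL-injection-style payload to create outlier traces.
        self.client.get("/dispatch?customer=1;DROP+TABLE+users")
```

Running this with `locust -f locustfile.py --host <target>` drives the chosen traffic mix against the application, and the weights control the normal-to-outlier ratio.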

Data Extraction

In Jaeger, data can be pulled from multiple places: Kafka, the storage backend (Elasticsearch, Cassandra, or whichever storage engine is in use), and Jaeger’s query component. Messages in Kafka are in protobuf format, so they require extra work to convert into a consumable format, but the benefit is that they can be consumed as a stream, with no extra resource overhead on the Jaeger setup. Extracting data from the storage backend depends on the storage query syntax, which varies from CQL (Cassandra) to Lucene (Elasticsearch); hence this isn’t a very comfortable approach. Jaeger’s query component exposes HTTP endpoints for extracting data: we first query the /api/services endpoint to get a list of all services, and then, using those names, query /api/traces?service=<name> to get the traces for each service. This is a pull-based approach and hence puts some resource overhead on jaeger-query. To avoid reading the same trace ID multiple times, we make sliding-window queries by passing the start time in the API parameters. This approach is simple, so it is the one we followed.
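The polling loop can be sketched roughly as follows. The /api/services and /api/traces endpoints belong to jaeger-query's internal HTTP API; the base URL and the one-hour lookback window are illustrative assumptions:

```python
# Sketch of pulling traces from jaeger-query's HTTP API.
# The base URL and the window length are assumptions for illustration.
import requests

JAEGER_QUERY = "http://localhost:16686"  # assumed jaeger-query address


def trace_query_url(base, service, start_us):
    """Build a sliding-window trace query: passing `start` (microseconds
    since epoch) avoids re-reading trace IDs from earlier windows."""
    return f"{base}/api/traces?service={service}&start={start_us}"


def fetch_traces(base, start_us):
    """Fetch all services, then pull each service's traces for the window."""
    services = requests.get(f"{base}/api/services").json()["data"]
    traces = {}
    for service in services:
        resp = requests.get(trace_query_url(base, service, start_us))
        traces[service] = resp.json()["data"]
    return traces


if __name__ == "__main__":
    import time
    window_start = int((time.time() - 3600) * 1_000_000)  # last hour
    fetch_traces(JAEGER_QUERY, window_start)
```

Advancing `start_us` on each iteration gives the sliding window described above.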

Data Storage

We stored the Jaeger data in flat .json files. This flat-file structure is easy to consume for our machine-learning training.
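A minimal sketch of this step, assuming one file per service (the per-service naming scheme is an assumption for illustration):

```python
# Sketch of persisting pulled traces as flat JSON files.
# The per-service file-naming scheme is an assumption for illustration.
import json
from pathlib import Path


def store_traces(traces_by_service, out_dir="traces"):
    """Write each service's traces to its own flat JSON file."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for service, traces in traces_by_service.items():
        with open(out / f"{service}.json", "w") as f:
            json.dump(traces, f, indent=2)


store_traces({"frontend": [{"traceID": "abc123", "spans": []}]})
```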

Applying Data Science

We applied basic algorithms such as Isolation Forest to achieve our goals. 

Anomaly Detection

The goal is to apply machine learning to identify anomalous data points in our tracing dataset. Anomaly detection is the process of finding outliers in a dataset: data points that don’t belong, that deviate from normal behavior. Anomalies are often indications of something failing or going wrong, and should be detected and tackled as soon as possible. We tried applying anomaly detection to the HotROD dataset using an unsupervised machine learning algorithm (Isolation Forest). Before applying any data science algorithm, we first need to identify which fields in the trace data are useful. We also need to select fields that are common to all traces in general, since we don’t want our code to work only on a certain kind of dataset. Hence, we decided to start with HTTP headers, which are available as tags in each trace and are present in every HTTP-based application. Here is an example of an HTTP GET request with most of the headers, captured from one of the open-sourced datasets.

Out of all the HTTP headers, we found the URL path (e.g. /tienda1/index.jsp) and query parameters (e.g. query=DROP+TABLE+users) to be the most useful pieces of information. These can help us distinguish normal from malicious HTTP requests. A few examples of malicious queries include SQL injection, XSS, CSRF, etc. Although most modern web frameworks, in almost all programming languages, prevent such attacks, they remain the most common attacks according to OWASP. Hence we thought it would be great to find such traffic and capture it via an anomaly detection framework.
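Splitting a request URL into these two features is straightforward with the standard library. A hedged sketch (the example URL reuses the path and query shown above; how the URL is keyed in the trace tags depends on the instrumentation):

```python
# Sketch: extracting the URL path and query parameters from a request URL
# found in a span's HTTP tags. The host name here is a made-up example.
from urllib.parse import urlsplit, parse_qs


def url_features(url):
    """Split a request URL into the path and query parameters that we
    feed into the anomaly detector."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parse_qs(parts.query)}


features = url_features("http://shop.example/tienda1/index.jsp?query=DROP+TABLE+users")
# features["path"] holds the URL path, features["query"] the decoded parameters.
```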

Data Formatting

All the tracing data was stored in JSON files, which can’t be used as-is for machine learning. We first need to convert this data into a pandas DataFrame, which is essentially a two-dimensional array with each row corresponding to one element. After this conversion, we did some data preprocessing (removing null values, converting categorical fields into numerical data), then applied the Isolation Forest algorithm to find the anomalous data points, and finally used a scatter plot to visualize them.
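The pipeline can be sketched along these lines; the column names, the toy rows, and the contamination ratio are illustrative assumptions, not our actual feature set:

```python
# Sketch of the preprocessing + Isolation Forest step.
# Column names, toy data, and contamination are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy stand-in for rows flattened out of the trace JSON files.
df = pd.DataFrame({
    "duration_us": [1200, 1100, 1300, 90000, 1250, 1150],
    "http_method": ["GET", "GET", "POST", "GET", "GET", "POST"],
})

# Preprocessing: drop nulls, turn categories into numeric codes.
df = df.dropna()
df["http_method"] = df["http_method"].astype("category").cat.codes

# Fit Isolation Forest; predict() returns -1 for outliers, 1 for inliers.
model = IsolationForest(contamination=0.2, random_state=42)
df["anomaly"] = model.fit_predict(df[["duration_us", "http_method"]])
```

The rows labelled -1 are then the candidates to plot and inspect.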

Future Possibilities

  • Right now, this is built as a batch processing system. We can improve it by turning it into a stream processing system.

  • AWS has built a better anomaly detection algorithm called RCF (Random Cut Forest), which has advantages over Isolation Forest (IF). RCF works on both batch and stream data, while IF works only on batch data. RCF also performs better on high-dimensional datasets.

SLO Validation

Every application provides some guarantees about its health and functionality to its consumers, but there isn’t an easy way to make sure the application fulfills those promised guarantees. We can use trace data to calculate this. We calculated two metrics from the Jaeger data for SLO validation:

  • Response Duration (per operation per service)
  • Error Rate Percentage

SLOs are generally expressed as 50th, 90th, and 99th percentiles, so we calculated response duration in the same format, for every operation of each service. The error percentage is calculated from the “logs” field of a trace, counting entries where the key is set to “error”.
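As a rough sketch of both calculations (the column names and toy values are assumptions about how the flattened span data might look):

```python
# Sketch of computing SLO metrics from flattened span data.
# Column names and values are assumptions about the flattened trace format.
import pandas as pd

spans = pd.DataFrame({
    "service":     ["frontend"] * 5 + ["driver"] * 5,
    "operation":   ["HTTP GET /dispatch"] * 5 + ["FindNearest"] * 5,
    "duration_us": [100, 120, 150, 400, 900, 50, 60, 55, 70, 800],
    "has_error":   [False, False, False, False, True,
                    False, False, False, True, False],
})

# Response-duration percentiles, per operation per service.
latency_slo = (
    spans.groupby(["service", "operation"])["duration_us"]
         .quantile([0.50, 0.90, 0.99])
         .unstack()
)

# Error-rate percentage, per operation per service.
error_rate = spans.groupby(["service", "operation"])["has_error"].mean() * 100
```

Comparing these computed percentiles and error rates against the promised targets tells us how often each operation breaches its SLO.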

Future Possibilities

  • Right now, this is built as a batch processing system. We can improve it by turning it into a stream processing system.
  • This can be used to determine the maturity level of an application.


The work can be found here: