Continuous profiling of JVM

Atlassian’s mission is to unleash the potential in every team. Our products help teams organize, discuss, and complete their work. Atlassian offers products and services such as Jira, Confluence, Trello, and Bitbucket. The products we build at Atlassian, a mixture of monolithic applications and microservices, have hundreds of developers working on them.

When an incident occurs in production, it can be hard to diagnose the root cause due to the high rate of change within the codebases. It is distinctly nontrivial to triage an incident and narrow its impact radius down to the most relevant contributing cause(s).

We always pursue techniques and tools that enable our engineers to apply performance engineering in their context, such as diagnosing a service in production. Profiling can speed up root-cause diagnosis significantly and is an effective technique for identifying runtime hot spots and latency contributors within an application. Without profiling, it is often necessary to implement custom, ad-hoc latency instrumentation in the code, which can be error-prone or cause other side effects. In addition, profiling allows exploration of unknowns at runtime, whereas instrumentation requires planned work prior to runtime.

Profiling a JVM at Runtime

One of the questions asked most frequently during a diagnosis is (in one form or another):

What was the JVM doing during a particular time (from T1 to T2)?

When an incident occurs or an investigation kicks off, one of the first steps is to collect observations and facts about the running system. For a running JVM, there are well-established tools that allow us to look into the JVM runtime: the process, thread states, memory, I/O, and more.

In addition, profiling provides a precise way to understand the code running in a JVM (via stack frames). There is also a well-established set of tools for profiling a JVM, for example, VisualVM.

One of the most effective tools for answering this question for a JVM is a flame graph. A flame graph is a visualization of the code running inside a process that helps identify the most frequent code paths (for example, Java stack traces) accurately.
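Conceptually, a flame graph is built by repeatedly sampling stack traces and counting identical code paths. The sketch below illustrates the idea using plain JDK thread sampling and the "collapsed stack" text format that flame graph tooling (and speedscope) can consume; real profilers such as async-profiler sample far more efficiently and without safepoint bias, so this is illustrative only.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollapsedStacks {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Record one sample of the given thread: the stack is flattened to
    // "root;caller;...;leaf", the line format consumed by flame graph tools.
    void sample(Thread thread) {
        List<String> frames = Stream.of(thread.getStackTrace())
                .map(f -> f.getClassName() + "." + f.getMethodName())
                .collect(Collectors.toList());
        Collections.reverse(frames);                 // getStackTrace() is leaf-first
        counts.merge(String.join(";", frames), 1L, Long::sum);
    }

    // Emit "stack count" lines, ready for flame graph rendering or a speedscope import.
    void dump() {
        counts.forEach((stack, n) -> System.out.println(stack + " " + n));
    }
}
```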

The Journey of Profiling Tools

We started our journey to evolve our tooling for profiling JVM applications a few years ago. Like many other projects at Atlassian, the first step and building block was completed during an innovation iteration, a practice that is greatly nurtured at Atlassian.

Initially, the goal was to create a simple tool, such as a CLI, that an engineer could use to profile a running JVM in production and visualize the results as a flame graph.

To develop the tool, the following elements were implemented:

This was a great first success: an engineer could run a CLI to profile a target JVM application at runtime in production, capture the profiling data, and visualize it as flame graphs.
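As a rough sketch of what such a run can look like (the wrapper class, installation path, and output location here are hypothetical; async-profiler, mentioned below, provides the profiler.sh launcher and the -d, -e, and -f options used here), the CLI essentially drives a profiler against a target process ID and collects a flame graph:

```java
import java.io.IOException;
import java.nio.file.Path;

public class ProfileCli {
    public static void main(String[] args) throws IOException, InterruptedException {
        long targetPid = Long.parseLong(args[0]);          // PID of the JVM to profile
        Path output = Path.of("/tmp/flamegraph-" + targetPid + ".html");

        // Assumed install location of async-profiler; adjust to your environment.
        Process profiler = new ProcessBuilder(
                "/opt/async-profiler/profiler.sh",
                "-d", "60",                                // profile for 60 seconds
                "-e", "cpu",                               // sample CPU (wall, alloc, lock are other events)
                "-f", output.toString(),                   // write an HTML flame graph
                Long.toString(targetPid))
                .inheritIO()
                .start();

        int exitCode = profiler.waitFor();
        System.out.println("Profiler exited with " + exitCode + ", flame graph at " + output);
    }
}
```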

We continued to develop the tool and expand its functionality and usage across our teams. While the tooling was highly valuable, our approach had some limitations:

These limitations led us to look into continuous profiling; that is, running the profiling tools against a set of target JVM applications on a schedule and storing the profiling data. Building a reliable and efficient continuous profiling solution involved some interesting challenges and trade-offs:

We started to build an in-house continuous profiling solution. We decided to continue to use Linux perf and async-profiler, which came with several advantages, including:

Our initial continuous profiling solution comprised:

We defined a schema in AWS Glue allowing developers to query the profile data for a particular service using AWS Athena. Athena empowered developers to write complex SQL queries to filter profile data on dimensions like time range and stack frames. We also started building a UI to run the Athena queries and render the results as flame graphs using SpeedScope.
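As a flavor of that workflow, here is a minimal sketch of submitting such a query with the AWS SDK for Java v2. The Glue database, table, and column names, the service name, and the S3 output location are all hypothetical, since the actual schema isn't shown here:

```java
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class ProfileQuery {
    // Hypothetical Glue table and columns; filter by service, time range, and stack frame.
    private static final String QUERY = """
            SELECT stack_frame, SUM(sample_count) AS samples
            FROM profiles.jvm_stack_samples
            WHERE service_name = 'my-service'
              AND captured_at BETWEEN timestamp '2021-06-01 00:00:00'
                                  AND timestamp '2021-06-01 01:00:00'
            GROUP BY stack_frame
            ORDER BY samples DESC
            LIMIT 50
            """;

    public static void main(String[] args) {
        try (AthenaClient athena = AthenaClient.create()) {
            StartQueryExecutionRequest request = StartQueryExecutionRequest.builder()
                    .queryString(QUERY)
                    .queryExecutionContext(QueryExecutionContext.builder()
                            .database("profiles")                                // hypothetical Glue database
                            .build())
                    .resultConfiguration(ResultConfiguration.builder()
                            .outputLocation("s3://example-bucket/athena-results/") // placeholder bucket
                            .build())
                    .build();
            String executionId = athena.startQueryExecution(request).queryExecutionId();
            System.out.println("Started Athena query: " + executionId);
        }
    }
}
```

The returned query execution ID can then be polled and its results fetched for rendering, for example as a flame graph.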

Adopting Amazon CodeGuru Profiler

Meanwhile, the announcement of Amazon CodeGuru Profiler caught our attention — the service offering was highly relevant to us and largely overlapped with our existing capability.

We foster an agile culture in our engineering practices at Atlassian. The ultimate goal was to bring the value of continuous profiling as a performance engineering tool to every engineer across development teams. We ran a spike to evaluate and experiment with Amazon CodeGuru Profiler. The results looked promising in terms of integration, data capture, and analysis via visualization. However, we also learned that the service has some limitations in contrast with Linux perf or async-profiler. A few of those differences and limitations are as follows:

Even with the effort we had already put into our in-house solution, we still had significant work ahead of us to make it as user-friendly as Amazon CodeGuru Profiler. The team needed to make a decision: continue developing our own solution or adopt Amazon CodeGuru Profiler. Weighing the trade-offs, we decided to adopt Amazon CodeGuru Profiler.

We have now integrated Amazon CodeGuru Profiler at a platform level, enabling any Atlassian service team to easily take advantage of this capability.
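At a high level, onboarding a JVM service means creating a profiling group and starting the CodeGuru Profiler agent inside the service (or attaching it via -javaagent). A minimal sketch of the in-process approach follows, with a placeholder profiling group name; in practice, our platform-level integration handles this for service teams:

```java
import software.amazon.codeguruprofilerjavaagent.Profiler;

public class ServiceMain {
    public static void main(String[] args) {
        // Start the CodeGuru Profiler agent in-process. "MyServiceProfilingGroup"
        // is a placeholder for a profiling group created in the AWS account;
        // credentials and region come from the usual AWS SDK defaults.
        Profiler.builder()
                .profilingGroupName("MyServiceProfilingGroup")
                .build()
                .start();

        // ... start the actual service here ...
    }
}
```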

Using Continuous Profiling for Latency in JVM Thread States

One of the first ways we utilized Amazon CodeGuru Profiler was to identify code paths that show obvious or well-known problems in terms of CPU utilization or latency. We searched for different forms of synchronization in the profiled data; for example, we searched for methods containing the synchronized sub-string. Another useful feature of Amazon CodeGuru Profiler is its HotSpot view, in which the user can see the most frequent code paths, categorized by thread state. We therefore chose a random (relatively small) time frame and used the HotSpot view to find any BLOCKED thread state.
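Independently of the profiler, a quick way to confirm monitor contention at runtime is the JDK's ThreadMXBean. This is not part of the CodeGuru workflow described above, just a minimal in-process cross-check:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Dump all live threads of this JVM, without locked monitor/synchronizer details.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s is BLOCKED on %s (held by %s)%n",
                        info.getThreadName(),
                        info.getLockName(),
                        info.getLockOwnerName());
            }
        }
    }
}
```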

One interesting case was an EnumMap that was wrapped in a Collections.synchronizedMap. The following screenshot shows the thread states of the stack frames in this code path for a span of 24 hours.

Although the involved stack trace consumes less than 0.5% of all runtime, when we visualized the latency of thread states, we saw that it spent twice as much time in a BLOCKED state as in a RUNNABLE state. To increase the ratio of time spent in a RUNNABLE state, we moved away from the synchronized EnumMap to an instance of ConcurrentHashMap.
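In code, the change is conceptually along the following lines (a simplified sketch; the actual map key and usage are Jira-internal):

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FeatureCache {
    enum Feature { SEARCH, DASHBOARD, WORKFLOW }   // placeholder enum

    // Before: every read and write synchronizes on a single monitor,
    // so contended threads show up as BLOCKED in the profile.
    private static final Map<Feature, Object> BLOCKING_CACHE =
            Collections.synchronizedMap(new EnumMap<>(Feature.class));

    // After: ConcurrentHashMap allows lock-free reads and fine-grained
    // locking on writes, keeping threads RUNNABLE under contention.
    private static final ConcurrentMap<Feature, Object> CONCURRENT_CACHE =
            new ConcurrentHashMap<>();
}
```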

The following screenshot shows a profile of a similar 24-hour period. After we implemented the change, the relevant stack trace is now all in a RUNNABLE state.

In a more complicated context, Amazon CodeGuru Profiler helped us verify how similar code paths in the same stack trace might be performing different operations, such as database writes or reads. Among the core metrics observed for Jira are the latency of a request and the latency of DB operations per HTTP endpoint. The following shows an observed regression in both request and DB latency on a specific endpoint, although the load on the endpoint shows no change or spike:

1️⃣ and 2️⃣ show the latency and count for the HTTP request, while 3️⃣ and 4️⃣ show the latency and count for the database layer serving the same HTTP request. The spikes occur in the latency panels (1️⃣ and 3️⃣), while the corresponding counts (2️⃣ and 4️⃣) show no noticeable change. How could we explain this?

In our observability stack, we have built another useful tool that captures and maintains a set of metrics mapping HTTP endpoints to the SQL statements they execute. In a simplified form, the following was observed from these metrics while looking for the most costly SQL statements on the HTTP endpoints that showed spikes in request and DB latency:

Having the SQL hash, we map it to the class that runs the query. Zooming into the same time window, we observe the following in Amazon CodeGuru Profiler for the class running this SQL. In this case, we search for the class and then visualize its latency per thread state (TIMED_WAITING, WAITING, and RUNNABLE).
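As an aside, the post does not detail how the SQL hash is derived; one simple approach is to normalize the statement and hash the result, roughly like this sketch (the normalization rules and class name are illustrative only):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class SqlHash {
    // Produce a stable identifier for a SQL statement by stripping literals
    // and collapsing whitespace before hashing, so the same query shape maps
    // to the same hash regardless of bind values.
    public static String of(String sql) {
        String normalized = sql
                .toLowerCase()
                .replaceAll("'[^']*'", "?")     // string literals -> placeholder
                .replaceAll("\\b\\d+\\b", "?")  // numeric literals -> placeholder
                .replaceAll("\\s+", " ")
                .trim();
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest).substring(0, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```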

Why is the highlighted class spending its runtime mostly on WAITING and TIMED_WAITING?

Interestingly, the thread states appear in roughly the same proportions before the stack trace reaches this class at runtime. Therefore, the cause might lie in a frame earlier in the stack trace. We moved up the stack trace, hunting for any interesting observations around the issue HTTP endpoints.

There are two interesting observations from the above stack trace snapshot:

  1. The 1️⃣ and 3️⃣ subsections of the parent stack trace do not contain any TIMED_WAITING state ✅; only 2️⃣ contains the TIMED_WAITING thread state.
  2. Both write and read operations seem to be executing in the same stack trace but in different threads.

When we move a bit further up the stack trace on the 2️⃣ and 3️⃣ paths:

The above observation implies that different types of issue operations could be running concurrently. This was confirmed by looking at database metrics and finding a specific metric that shows a matching pattern:

Going back to the starting question of understanding the short-lived regression in request and database latency without a change in request load:

A few tenants on Jira started to synchronize their issue data using scheduled jobs and plugins. The operations involved both reads and writes, including HTTP PUT (create issue) and HTTP DELETE (delete issue) requests. To ensure the consistency and integrity of data, a locking mechanism is applied before deleting an issue, which explains the TIMED_WAITING: the service was waiting to obtain a lock to be able to delete issues or create new ones. On the other side, there were issue searches retrieving issue data, which explains the WAITING.
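As a simplified illustration of why those two states show up (this is not Jira's actual locking code), a timed lock acquisition parks a thread in TIMED_WAITING, while blocking indefinitely for data parks it in WAITING:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ThreadStateDemo {
    private static final ReentrantLock ISSUE_LOCK = new ReentrantLock();
    private static final BlockingQueue<String> RESULTS = new LinkedBlockingQueue<>();

    // A writer deleting or creating issues waits for the lock with a timeout:
    // while parked here, the thread is reported as TIMED_WAITING.
    static void deleteIssue(String key) throws InterruptedException {
        if (ISSUE_LOCK.tryLock(5, TimeUnit.SECONDS)) {
            try {
                // ... perform the delete ...
            } finally {
                ISSUE_LOCK.unlock();
            }
        }
    }

    // A searcher blocking indefinitely for issue data is reported as WAITING.
    static String awaitSearchResult() throws InterruptedException {
        return RESULTS.take();
    }
}
```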

✅ In this investigation, in a nutshell:

⭐️ Bonus Insight:

You may have noticed that, in the Amazon CodeGuru Profiler screenshots, the top-right corner of each black box shows a very small percentage, below 1.0% or even 0%, which is the proportion of that stack trace element's runtime relative to the total runtime of the service.

It is fascinating that such a small portion of runtime is able to cause a noticeable regression in request and database latency. In a matter of sub-seconds, the database becomes saturated with I/O operations, requests start to queue up, and latency degrades.

It takes a noticeable amount of effort to perform such an investigation, understand the root causes, and then improve the products and services. That is something we never hesitate to do for our customers’ experience.

It has been an amazing journey to evolve our performance engineering tooling. The practice of continuous profiling of JVM applications has shown its great value to many teams across Atlassian. And the journey continues. In future posts, we will share our experience with tooling that addresses some of Amazon CodeGuru Profiler's limitations, such as capturing native or OS-specific call stacks. In addition, we are now looking at applying continuous profiling to technology stacks other than the JVM. Finally, continuous profiling of JVM applications for memory usage and allocation is one of the next interesting challenges we are looking into.

Stay tuned for future stages of our performance engineering journey.
