Scaling React server-side rendering in Jira Cloud

For the past 18 months, we’ve been iterating on and improving a React server-side rendering service to support the frontend in our cloud-hosted Jira offering. During this time, the service has grown fairly organically from a side-project spiking a proof-of-concept, to a performance-critical service with 24/7 on-call support.

This isn’t the story of an ideal SSR implementation, and it’s not a recommendation for how you should adopt SSR in your specific application or use case. This is just the story of how we adapted non-SSR-friendly code in a largely legacy frontend architecture into a multi-tenanted service that scales to almost every page in Jira. We were working within constraints that led us to this solution, and it improved our Time to First Meaningful Render (TTR) by about three seconds across the board.

An example of the TTR improvements from SSR in Jira, specifically focussed on the Side Navigation component.

Modernising a legacy frontend

Jira started life in 2002 – 17 years ago at the time of writing. Needless to say, web development has changed a bit along the way.

Jira’s backend has gone through major architectural shifts too – we’ve transformed Jira from a single-tenanted, single-JVM Tomcat webapp hosted in-house by customers, to a composition of multi-tenanted web services operated by Atlassian on AWS (one of the largest and most complex projects in the history of the company).

In terms of introducing React and SSR (or more generally, just modernising the frontend stack), the things we’re worrying about are how to introduce modern code into a legacy codebase incrementally, how to make code that was never written with SSR in mind run safely on a server, and how to scale the resulting service to Jira levels of traffic.

In this blog, we’ll try to explain how we’ve addressed each of these problems. The more general question around why or when to undertake a large scale modernisation and tech debt reduction effort like this is out of scope.

Getting started

One of the easiest ways to get started introducing modern frontend code into a legacy codebase is what we refer to as an inside-out model. You start by creating a small island of modern code (e.g. a small feature written in React), and slowly expand it outwards allowing it more and more responsibility over the page.

This is a great way to get started with React quickly, but the complexity increases as the new code starts to replace more and more of the existing experience. We’ll explain this approach in more detail, as well as the alternative outside-in model, in a future blog.

We implemented this initially with a system we’ll refer to as fragments. We started in a new repository, introduced React, and emitted two kinds of artifacts: static, pre-generated HTML fragments, and the JavaScript and CSS assets needed to bring them to life in the browser.

We deploy these artifacts by uploading the fragments in place to an S3 bucket behind a CloudFront CDN. When we go to render a page in our existing application, we make a request to our CDN to fetch a fragment, compose it with the legacy JVM-rendered HTML we already have, and return it to the browser.
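As a rough illustration of that flow (the real composition happens inside the legacy JVM application; the URL and placeholder convention below are invented):

```js
// Illustrative only – fetch a pre-generated fragment from the CDN and splice it
// into the HTML the legacy application has already rendered.
const fetch = require('node-fetch');

async function composePage(legacyHtml, fragmentName) {
  const response = await fetch(`https://fragments.example-cdn.com/${fragmentName}.html`);
  const fragmentHtml = await response.text();

  // The legacy template leaves a placeholder comment where the fragment belongs.
  return legacyHtml.replace(`<!-- fragment:${fragmentName} -->`, fragmentHtml);
}
```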

Building from scratch in a new repository gave us the space to set up a frontend-focused development pipeline using tools familiar to frontend developers, and allowed us to develop and deploy new frontend features independently of the existing monolithic Jira codebase, which was a big dev speed win.

Shortly after we had migrated a few larger features to the new service, people started asking about SSR. The new system worked well, but one of its main limitations was that it was completely static – we download a pre-generated HTML file into part of the page, we send it to the user’s browser, then the JavaScript is downloaded and executed.

We needed a JavaScript runtime somewhere between us and the user’s browser to support server-side rendering, so we added a new Node-based service on top of our existing fragments system. The API was largely the same, but now, in addition to our existing fragments and assets, we also emitted a special SSR-compatible JavaScript bundle (which we’ll explain in more detail in the next section). We first call out to our SSR service to try to render a fragment server-side using that bundle, and fall back to our existing static fragments if it fails or times out.
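A simplified sketch of that call-with-fallback flow (the endpoints and timeout value are invented, and in practice this logic lives in the existing Jira application):

```js
// Illustrative only – try the SSR service first, fall back to the static fragment.
const fetch = require('node-fetch');

async function renderFragment(fragmentName, timeoutMs = 500) {
  try {
    // Ask the SSR service to render the fragment, bounded by a timeout.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    const response = await fetch(`https://ssr.internal.example/render/${fragmentName}`, {
      signal: controller.signal,
    });
    clearTimeout(timer);
    if (response.ok) {
      return await response.text();
    }
  } catch (err) {
    // The render failed or timed out – fall through to the static fragment below.
  }
  const fallback = await fetch(`https://fragments.example-cdn.com/${fragmentName}.html`);
  return fallback.text();
}
```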

Getting any code to even render initially was a challenge. Our frontend code wasn’t necessarily written with SSR compatibility in mind, which meant we hadn’t been mindful of things like safe, platform-agnostic access to APIs like window or document, or avoiding global side-effects such as scheduling timers during a render.

One solution to this compatibility problem – arguably, the better solution – is to refactor all frontend code to support both server-side and client-side rendering, but that was a difficult cost to justify when we weren’t even sure what the benefits of SSR would be for our use case. We decided instead to provide a compatibility layer that could adapt our existing non-SSR-friendly code to an SSR environment.

Polyfilling SSR compatibility

There are two main parts to our compatibility layer: changes we make to the SSR bundle at build time, and a controlled runtime environment we provide when that bundle executes on the server.

There are plenty of good resources online about the differences between running React in a browser and running it on a server. The typical recommendation is to write your components in such a way that they’re aware of the differences between environments and guard against them (e.g. checking for the existence of browser globals like window and document before trying to use them). The alternative is to produce a separate JavaScript output specifically for server-side execution, with some key global variables redefined and some modules replaced. The obvious drawback with the latter approach is that your server-side code diverges further from your client-side code, which can lead to obscure bugs, but the benefit is that it lets you retrofit large amounts of existing code and its dependencies – which were never intended to work with SSR – relatively easily.
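The first approach looks something like this generic (non-Jira) example:

```js
// Guard browser-only APIs so the same component code can run on the server.
const canUseDOM =
  typeof window !== 'undefined' && typeof document !== 'undefined';

function scrollSidebarIntoView() {
  if (!canUseDOM) {
    return; // no-op during server-side rendering
  }
  document.querySelector('[data-sidebar]').scrollIntoView();
}
```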

Some changes we make at build time, where Webpack’s NormalModuleReplacementPlugin is your best friend: it lets us swap modules that only make sense in a browser for server-safe replacements in the SSR bundle.
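A minimal sketch of what that looks like (the module being replaced and the stub path are invented, not the actual replacements used in Jira):

```js
// webpack.ssr.config.js – build the SSR-compatible bundle with some
// browser-only modules swapped out.
const path = require('path');
const webpack = require('webpack');

module.exports = {
  target: 'node',
  entry: './src/fragment-entry.js',
  output: {
    path: path.resolve(__dirname, 'dist-ssr'),
    filename: 'fragment.ssr.js',
    libraryTarget: 'commonjs2',
  },
  plugins: [
    // Replace a hypothetical browser-only module with a server-safe stub.
    new webpack.NormalModuleReplacementPlugin(
      /browser-only-analytics/,
      path.resolve(__dirname, 'src/ssr-stubs/analytics.js')
    ),
  ],
};
```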

We provide the remaining mechanisms at runtime. Each script is executed in a separate Node VM context. A Node VM is essentially a fancy eval function with some special error handling and a separate global scope. It’s not a security mechanism for running untrusted code in a sandbox. We’re running our own apps, so we trust ourselves not to do anything deliberately malicious, but we do want to guard against mistakes or bugs compromising data isolation. We also need to be able to reliably cancel render jobs if they’re running too long or otherwise misbehaving. We provide a few mechanisms for this through the context object we inject as the global state for each render VM.
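As a rough sketch of the kind of thing that context provides (the real context does more, and these names are invented):

```js
// Build a per-render global context for a Node VM, with timers we can cancel.
const vm = require('vm');

function createRenderContext() {
  const pendingTimers = new Set();

  const globals = {
    console,
    // Wrap timers so a misbehaving render can't leave work scheduled after
    // we've cancelled it or it has timed out.
    setTimeout(fn, ms, ...args) {
      const id = setTimeout(fn, ms, ...args); // resolves to the host setTimeout
      pendingTimers.add(id);
      return id;
    },
    clearTimeout(id) {
      pendingTimers.delete(id);
      clearTimeout(id); // resolves to the host clearTimeout
    },
    // Server-safe stand-ins for browser globals can live here too.
    window: undefined,
    document: undefined,
  };

  return {
    context: vm.createContext(globals),
    cancel() {
      pendingTimers.forEach(clearTimeout);
      pendingTimers.clear();
    },
  };
}
```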

We use a completely new VM and runtime for every render request. This is good for isolation (we get a clean global scope, and a clean scheduler context), but bad for performance (instantiating a new context can take ~200ms, which is too slow to do as part of every request). To mitigate this, we maintain a pool of available VMs for each render job to draw from. We know which scripts we need to render ahead of time, so we can instantiate VMs before we need them and precompile the scripts we need.
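A simplified sketch of the pooling and precompilation idea, using Node's vm.Script (pool sizing, names, and the render() convention are illustrative):

```js
// Pre-warm VM contexts and precompile SSR bundles so per-request cost stays low.
const vm = require('vm');

const compiledBundles = new Map(); // fragment name -> vm.Script
const contextPool = [];

function precompile(fragmentName, bundleSource) {
  compiledBundles.set(
    fragmentName,
    new vm.Script(bundleSource, { filename: `${fragmentName}.ssr.js` })
  );
}

function warmPool(size) {
  // Creating a context can take ~200ms, so do it ahead of time, not per request.
  while (contextPool.length < size) {
    contextPool.push(vm.createContext({ console }));
  }
}

function renderFragment(fragmentName, props) {
  const context = contextPool.pop() || vm.createContext({ console });
  compiledBundles.get(fragmentName).runInContext(context);
  // Assumes the bundle assigns a render(props) function to its global scope.
  return context.render(props);
}
```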

This was enough to get going. Our next challenge was onboarding more experiences into the service and scaling it up to support Jira levels of traffic.

Scaling up

We went through a few architecture iterations quickly early on, but eventually settled on a fleet of web servers to accept incoming render requests, a separate fleet of workers to perform the renders, and a shared Redis-backed job queue connecting the two.

We deployed this stack 1:1 in each of the AWS regions used by Jira, and it hummed along nicely! Separating the web servers and workers made sense intuitively – they had very different compute requirements (the workers had expensive render operations to perform, but the web servers just had to accept and respond to simple HTTP requests), and we liked that we could configure different EC2 instance sizes and scaling rules for each of them. The job queue between the two provided an extra layer of load balancing, which allowed renders to be distributed fairly consistently across the available compute resources.

Results were good, and we weren’t even close to our scalability limits. Web servers in our busiest shards usually topped out around three instances, and workers would sometimes scale up to eight or nine instances during periods of high traffic.

We ran into problems when we started to onboard a new fragment into SSR – the navigation sidebar. Since the sidebar is present on every page in Jira, this would effectively increase the amount of traffic into SSR by more than 100 percent (+1 render request for every page that already had an SSR’d fragment, and +1 for pages that previously had no SSR’d fragments at all)! We knew it would be a challenge, but initial performance testing was successful, and we had the capability to increase or decrease the rollout on a per-request basis if we needed to.

We pushed the rollout along without incident. A few hours after hitting 100 percent, we started to run into problems. We started getting alerts for spikes in request duration, and an increase in jobs failing due to timeout. We decreased the rollout for the navigation fragment (decreasing load), and the service quickly recovered.

After some investigation, we found an interesting correlation between load, network I/O on the job queue, and render performance.

With the additional load from the navigation sidebar fragment, network I/O on the job queue grew far faster than the number of incoming render requests.

The nature of our queueing solution means that all job results are sent to all available web servers, so the bandwidth required between the web servers and the queue is the number of concurrently completing jobs, multiplied by the size of each job result (the rendered HTML string), multiplied by the number of active web server nodes. Since both the job count and the web server count grow with traffic, that bandwidth is effectively quadratic in load!
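To make that concrete with purely illustrative numbers: at 50 KB per rendered fragment, 500 jobs completing per second, and three web servers, the queue is already pushing out roughly 50 KB × 500 × 3 ≈ 75 MB of results per second. Increase the traffic and you increase both the completing-job count and, via autoscaling, the web server count, so the required bandwidth grows much faster than the traffic itself.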

Eventually, we start to run into our max network-out bandwidth allowance of 10GB/sec for our Redis instance, and get (rightfully) throttled by AWS. With nothing left to connect incoming requests with worker nodes, render requests start to time out and fail. Our CPU-utilisation-based scaling rules, which add more web server nodes to deal with the increasing number of incoming requests, actually make the problem worse by further saturating the network between the web servers and the queue instance!

Most people would agree that this is sub-optimal.

Despite our horizontal scaling and load balancing capability for web servers and workers, we were being let down by the shared queue in the middle. Fortunately, we were able to reduce customer impact in the immediate term by limiting the rollout of the server-side rendered navigation fragment. We mitigated in the short term by vertically scaling our Redis instance to allow for more generous network bandwidth allowances, but it still represented a single point of failure in the architecture that we needed to address.

Our longer-term solution was to keep the general architecture the same but flatten everything into one scaling group. So instead of scaling web servers and workers independently around a single shared queue, one EC2 instance runs a fixed number of web server processes, a fixed number of worker processes, and their own mini job queue between them in a separate Redis container. When we scale up (still based on CPU utilisation), we create another completely isolated group of web servers, workers, and a queue as one unit.

There are many alternative inter-process communication techniques we could have used to connect the web servers and workers, but this approach allowed us to keep the internals of the service mostly untouched and experiment only with the deployment model.

This architecture yielded more stable render times, was cheaper (we end up running more EC2 nodes in total, but we save money on the big Redis instance, which was under-utilised in terms of CPU & memory), and handled the increased load from the navigation fragment without issue.

Was it worth it?

SSR, in general, is a trade-off between time-to-render (TTR), the time taken to display meaningful content to the user (e.g. display relevant issue details), and time-to-interactive (TTI), the time taken for the page to become clickable (e.g. a user can click through comments, or collapse & expand the sidebar). SSR typically improves TTR at the cost of TTI, so it’s only really useful for pages where an initial read-only view is more relevant to a user than an interactive experience.

We think the navigation sidebar specifically is a good candidate for this: it appears on every page in Jira, and its server-rendered content is useful for reading and orienting yourself well before the JavaScript that makes it interactive has loaded.

So for our specific use-case – yes, I think it was worth it! In addition to the performance benefits, we also have a frontend service that can act as the foundation for future decomposition efforts.
