Introduction

Bitbucket Cloud is a Git-based code hosting and collaboration solution. It serves both HTTPS and SSH traffic over the internet.

Our engineering team recently completed a major architectural update of Bitbucket’s public-facing edge load balancing and proxy solution. Licensing and bandwidth were the primary motivators for this transition: Bitbucket Cloud had begun approaching, and often exceeding, its maximum allocated bandwidth.

Licensing bandwidth

Bitbucket Cloud was also leveraging AWS Global Accelerator (AGA): for requests coming from the public internet, clients first resolved DNS to an AWS edge location and were then routed by Global Accelerator over the AWS backbone to the Bitbucket edge. This was becoming very costly.

As of today, Bitbucket Cloud is served using Envoy Proxy, a high-performance, open-source edge and service proxy originally built at Lyft and now hosted by the Cloud Native Computing Foundation.

Previous solution

Our previous solution had a number of shortcomings we wanted to address:

  • Auto-scaling was not supported
  • Infrastructure was constrained to a single region

This meant we were unable to respond to regional traffic patterns; the platform was cost-inefficient and ultimately presented a reliability risk.

Network bandwidth across the solution was fixed, and increases relied on negotiating additional license capacity with our vendor. This increased reliability risk: traffic could be throttled, and bandwidth could not be scaled up quickly when needed.

Bitbucket Cloud’s edge configuration was also becoming increasingly complex. Request management was implemented in a Perl-based scripting language, and having a language that made it easy to manipulate requests proved counterproductive: over the years, a lot of business logic made its way into the network edge, where it did not belong. This created significant technical debt and risk.

New solution

Envoy enters the scene

Our new Envoy-based architecture overcomes all the previous solution’s shortcomings. It provides a huge feature set and highly extensible platform for the Bitbucket edge while offering high performance and scalability to support user growth.

From Envoy’s “What is Envoy” documentation:

  • Out-of-process architecture
    • Works with any application language
    • Allows for quick and transparent upgrades
  • L3/L4 filter architecture and HTTP L7 filter architecture
    • Pluggable filter chain mechanism to perform different TCP/UDP proxy tasks
    • HTTP filters can be plugged into the HTTP connection management subsystem that performs different tasks such as buffering, rate limiting, routing/forwarding
  • First class HTTP/2 and HTTP/3 support (HTTP/3 support for Bitbucket Cloud TBA)
    • Envoy can operate as a transparent HTTP/1.1 to HTTP/2 proxy in both directions. This means that any combination of HTTP/1.1 and HTTP/2 clients and target servers can be bridged.
  • HTTP L7 routing
    • When operating in HTTP mode, Envoy supports a routing subsystem that is capable of routing and redirecting requests based on path, authority, content type, runtime values, etc.
  • Service discovery and dynamic configuration
    • Envoy optionally consumes a layered set of dynamic configuration APIs for centralized management.
  • Health checking, outlier detection as well as advanced load balancing and observability
    • Envoy includes a health-checking subsystem that can optionally perform active health checking of upstream service clusters.

The Envoy-based solution is fully integrated into the Atlassian ecosystem and was developed to work across all Atlassian products.

It already serves many Atlassian products, including Trello, Opsgenie, Jira, and Confluence.

How we migrated Bitbucket Cloud to Envoy Proxy

Before we embarked on our migration, we needed to ensure that we had a plan to address some of the key challenges.

Caching

Our previous solution offered a native content caching capability that was highly configurable using its proprietary scripting language. Bitbucket relies on caching extensively, serving 570K requests per hour from cache on average.

If your application is AWS-hosted, the first inclination when looking for a caching solution would be to adopt Amazon CloudFront. However, Bitbucket Cloud requires HTTP/80, HTTPS/443, and SSH/22 on the same domain (bitbucket.org). Since DNS cannot split traffic by port and CloudFront cannot expose raw TCP sockets, Bitbucket Cloud was not able to use CloudFront.

After doing some research, we decided to introduce self-hosted Varnish Cache as our caching solution. The introduction of self-hosted Varnish Cache instances provided us with several advantages:

  • Placing Varnish Cache behind Envoy and in front of our backend servers allowed us to send only HTTP traffic to it and direct SSH elsewhere.
  • Varnish Cache is highly configurable. By utilizing its Varnish Configuration Language (VCL), a domain-specific language, we were able to make backend routing and caching decisions at request processing time. This let us achieve optimizations like caching certain unsuccessful mutating requests and file browse requests.
    By default, Varnish does not cache mutating requests, but we found that caching certain unsuccessful ones is useful. This can be achieved with VCL along the lines of the sketch below.
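
The snippet below is an illustrative sketch rather than Bitbucket’s production VCL: the endpoint path, backend address, and TTLs are placeholders, and a complete implementation would also need to preserve the request method and body on backend fetches (for example with vmod_bodyaccess). It shows the general technique of letting selected mutating requests go through cache lookup and giving selected unsuccessful responses a short TTL.

    vcl 4.1;

    backend default {
        .host = "127.0.0.1";    # placeholder backend, not the real origin
        .port = "8080";
    }

    sub vcl_recv {
        # Hypothetical example: allow POSTs to one rate-limited endpoint to be
        # looked up in cache instead of always being passed to the backend.
        if (req.method == "POST" && req.url ~ "^/example/rate-limited-endpoint") {
            return (hash);
        }
    }

    sub vcl_backend_response {
        # Cache selected unsuccessful responses briefly to shield the backends
        # from repeated identical failures.
        if (beresp.status == 429 || beresp.status == 403) {
            set beresp.ttl = 10s;
            set beresp.uncacheable = false;
            return (deliver);
        }
    }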

Routing

Bitbucket Cloud request management scripts contained thousands of lines of code implementing business, routing, and security logic. The scripts were used to modify requests and responses, to set session persistence parameters, and to decide how to route requests to various upstream services.

To ensure that this logic was accurately translated to the new architecture, we carefully mapped rules to the upstreams they routed traffic to and associated each rule with every virtual server (domain) that executed it. We started by parsing the Terraform infrastructure-as-code files we use to manage our edge solution to build this map. We then translated the rules into Envoy and Varnish configurations while carefully preserving execution order.

For example, a path-based routing rule in the old Traffic Manager scripts would be translated into an equivalent entry in Envoy’s route configuration.
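
As an illustrative sketch (the path prefixes and cluster names below are placeholders, not the actual Bitbucket Cloud configuration), such a translated rule becomes an ordinary entry in Envoy’s route configuration:

    route_config:
      virtual_hosts:
        - name: bitbucket
          domains: ["bitbucket.org"]
          routes:
            # Placeholder: send one path prefix to a dedicated upstream cluster.
            - match:
                prefix: "/example/internal-path"
              route:
                cluster: example_internal_service
                timeout: 30s
            # Everything else goes to the default backend.
            - match:
                prefix: "/"
              route:
                cluster: bitbucket_backend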

To validate that we did not miss anything and to gain confidence in the translated configuration, we used a combination of unit tests, Post-Deployment Verification tests, and an Atlassian-only release, a.k.a. “dogfooding.”

Atlassian vs. public traffic during dogfooding

In the end, we managed to migrate all scripted rules to a combination of Envoy configuration and Varnish Configuration Language hooks that are called at certain points in the request lifecycle.

  • Before:
    • 2,500 lines of routing scripts
    • ~ 500 if blocks
  • After:
    • Sequential routing rules that are simple to read and understand
    • < 1,000 configuration lines
    • Golang templates to merge route rules into a base configuration and output a unique service configuration for each environment – development, staging, and production
    • 21 if blocks

Rate limiting

To prevent potential abuse, Bitbucket Cloud rate limits inbound traffic based on metrics such as IP address or user agent. To maintain feature parity, we contributed code to Envoy Proxy 1.27 that added support for specifying a domain in RateLimitPerRoute. This allowed us to enable global rate limiting on our multi-tenant Envoy cluster, whilst allowing tenants to set unique policies on their particular hosts and routes.
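
A rough sketch of what this enables is shown below. The domain name, descriptors, and route are placeholders rather than Bitbucket’s actual policies, but the structure shows the per-route override that a domain in RateLimitPerRoute makes possible:

    routes:
      - match:
          prefix: "/example/api/"
        route:
          cluster: example_api_backend
        typed_per_filter_config:
          envoy.filters.http.ratelimit:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimitPerRoute
            # Tenant-specific rate limit domain for this route (placeholder value).
            domain: example-tenant-domain
            rate_limits:
              - actions:
                  - remote_address: {}              # descriptor: client IP
                  - request_headers:
                      header_name: user-agent
                      descriptor_key: user_agent    # descriptor: user agent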

SSH Support

Envoy’s main strength is as an HTTP proxy, but it offers raw TCP proxying as well. Since Bitbucket Cloud supports both HTTP and SSH, we used TCP proxying alongside PROXY protocol configuration to provide SSH connectivity.
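
A minimal sketch of this arrangement, assuming a hypothetical upstream cluster name and placeholder addresses, is an Envoy TCP listener on port 22 whose upstream cluster sends the PROXY protocol header so the original client IP is preserved:

    static_resources:
      listeners:
        - name: ssh_listener
          address:
            socket_address: { address: "0.0.0.0", port_value: 22 }
          filter_chains:
            - filters:
                # Raw TCP proxying: no HTTP processing, just forward the stream.
                - name: envoy.filters.network.tcp_proxy
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
                    stat_prefix: ssh
                    cluster: ssh_backend
      clusters:
        - name: ssh_backend
          connect_timeout: 5s
          type: LOGICAL_DNS
          load_assignment:
            cluster_name: ssh_backend
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address: { address: ssh.internal.example, port_value: 22 }
          # Wrap the upstream connection so a PROXY protocol header is sent first.
          transport_socket:
            name: envoy.transport_sockets.upstream_proxy_protocol
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.proxy_protocol.v3.ProxyProtocolUpstreamTransport
              config:
                version: V2
              transport_socket:
                name: envoy.transport_sockets.raw_buffer
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.transport_sockets.raw_buffer.v3.RawBuffer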

IPv6

Bitbucket Cloud has clients using IPv6 and we needed to ensure that our solution accommodates IPv6 traffic. We’ve added support for the termination of IPv6 connections at our Envoy edge for Bitbucket, and this is now a feature that we’ll be able to utilize for other Atlassian products in the future.
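
One way Envoy can terminate IPv6 alongside IPv4, shown as a minimal sketch with a placeholder listener name and port, is a dual-stack listener bound to the IPv6 wildcard address:

    listeners:
      - name: https_dual_stack
        address:
          socket_address:
            address: "::"        # bind to the IPv6 wildcard address
            port_value: 443
            ipv4_compat: true    # also accept IPv4 clients on the same socket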

Scale

Bitbucket Cloud runs at a large scale, serving billions of requests and petabytes of data per day, which we had to take into consideration when planning our migration. Our goal was a zero-downtime migration that did not negatively affect Bitbucket Cloud customers during or after the switch. To achieve this, we dogfooded the transition by switching Atlassian-originated traffic first. Because Bitbucket Cloud workspaces are not domain-based, we had to devise a way of selecting certain users and workspaces and directing their traffic to the new infrastructure.

During the dogfooding stage of the migration our old solution sent traffic to newly created Bitbucket Cloud Network Load Balancers through a VPC PrivateLink. We then used request management rules to determine whether requests should route to Envoy based on request characteristics such as client IP and destination Bitbucket Cloud workspace.

After we resolved all the issues discovered during the migration of Atlassian-originated traffic, we gradually shifted traffic using weighted Route53 DNS records and performed a switch on a domain-by-domain basis.

Performance and reliability gains

Scalability

The Envoy-based edge is a highly scalable platform with elastic bandwidth. We are able to rapidly scale our edge up and down to follow daily traffic patterns and spikes.

Lower latency

Prior to our migration we utilized Amazon Global Accelerator to improve performance by allowing the TCP 3-way handshake to occur close to the client. The TLS handshake still had to be handled at the network edge, which resulted in 2 round trips to the US regardless of the client location.

We now have edge regions closer to users, which means reduced latency and round-trip times for both TCP and TLS handshakes. Improved request latency is especially evident for non-US-based clients.

For example, the P95 duration for requests that can be handled at the network edge dropped from 200ms (Asia Pacific) and 100ms (E) to 5ms.

p90 and p95 of requests after the migration


No more TCP resets due to anycast shift

Anycast is the routing methodology employed by Amazon Global Accelerator, where a single IP address is assigned to multiple devices in more than one location. Routers decide which location is closest to the user, usually based on the lowest number of BGP network hops. This sometimes results in TCP resets, which occur when there are multiple equal-cost paths towards the destination and a router directs packets to different locations within the same TCP session.

We are now able to avoid this problem because our new architecture does not use Global Accelerator and is therefore not susceptible to Anycast Shift. All Bitbucket Cloud IPs are now single-homed.

Conclusion

Bitbucket is now using a highly scalable edge solution that is fully integrated into the Atlassian platform. We are able to take advantage of multi-regional presence, granular rate limiting, and scalable bandwidth.

Looking back, it was very important to plan carefully and then execute according to that plan, resisting the temptation to migrate traffic faster. Coupled with switching Atlassian-originated traffic first, this allowed us to avoid customer interruptions.

Looking forward, we can improve our solution even more by fine-tuning caching and optimizing cross-regional traffic flow.
