Why Atlassian uses an internal PaaS to regulate AWS access

Atlassian has an internal Platform-as-a-Service that we call Micros. It is a set of tools, services, and environments that enable Atlassian engineers to deploy and operate services in AWS as quickly, easily, and safely as possible.

The platform hosts over 1,000 services that range from experiments built during our ShipIt hackathons, to internal tooling supporting our company processes, to public-facing, critical components of our flagship products. The majority of Atlassian Cloud products are either partly or fully hosted on Micros.

Despite its crucial responsibilities, Micros is a relatively simple platform. The inputs it takes to deploy a service are just a Docker image containing the service logic, and a YAML file – the Service Descriptor – that describes the backing resources that the service needs (databases, queues, caches, etc.) as well as various configuration settings, such as the service’s autoscaling characteristics. The system takes care of expanding those inputs into production-ready services, ensuring operational functionality (e.g. log aggregation, monitoring, alerting) and best practices (e.g. cross Availability-Zone resiliency, consistent backup/restore/retention configuration, sensible security policies) are present out-of-the-box.
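For a sense of scale, here is a minimal sketch of what such a Service Descriptor could look like. Note that this is illustrative only: the field names and resource types below are hypothetical, not Micros’ actual schema.

```yaml
# Hypothetical Service Descriptor (illustrative field names, not the real schema)
name: issue-search
description: Search API for issues
compute:
  autoscaling:
    min: 3         # enough instances to span 3 Availability Zones
    max: 12
    targetCpu: 60  # scale out when average CPU utilisation exceeds 60%
resources:
  - type: dynamo-db
    name: issue-index
  - type: sqs
    name: reindex-queue
```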

We haven’t invented much here: nearly everything Micros offers is achieved by using standard AWS features. With this in mind, it is common for engineers to question the need for such a platform: couldn’t we simply open up plain, direct AWS access to all teams, so that they can use AWS’s full and rapidly-expanding functionality?

This is a great question, which we’ll explore throughout the rest of this post.

The benefits of consistent infrastructure

AWS’s breadth of Cloud infrastructure features is extensive to say the least, and only a subset is made available to Atlassian engineers via Micros. This discrepancy isn’t due to Micros being unable to “keep up” with AWS. The limitation exists primarily because we believe in the value of reducing technical sprawl, especially at the bottom of our stacks. Fantastic benefits, chiefly economies of scale, manifest when teams across Atlassian use infrastructure in a consistent manner.

These economies of scale are a consequence of using a consistent, controlled interface to provision and manage services in AWS, combined with sensible bounds on the vast array of AWS features available. Both would likely be degraded if direct access to AWS were the norm.

How the PaaS helps

Let’s take a closer look at how Micros achieves some of the above benefits.

Services on the PaaS can be represented in a simple and homogeneous way: highly available compute combined with whichever backing resources the service needs. However, if you look beneath the covers, you’ll see that the platform has provisioned much more, thereby enforcing sensible defaults for autoscaling, alerting, security, backup policies and more.
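As a conceptual illustration (again with hypothetical names), a single backing-resource entry in the descriptor stands in for a much larger set of provisioned pieces:

```yaml
# What the service owner declares (illustrative):
resources:
  - type: dynamo-db
    name: issue-index

# Roughly what the platform provisions around it (conceptual, not exhaustive):
#   - the DynamoDB table itself, tagged for ownership and cost attribution
#   - IAM policies granting access to this service's compute role only
#   - alarms wired into the central monitoring and alerting pipeline
#   - backup/restore/retention configuration per company policy
```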

The trade-offs of a PaaS – and some mitigations

So, we’ve chosen a strategy which favours a controlled, consistent and hardened subset of AWS features, over free-form access to AWS accounts. This strategy has trade-offs: engineers can’t pick up every new AWS feature the moment it ships, and the platform’s boundaries will occasionally get in the way of a team’s specific use case.

Let’s look at some mitigations for these trade-offs.

Free-for-all access in a controlled Training Account

If teams would like to experiment with AWS features to validate they fit their use cases, or just to learn more about them, we generally point them to the Training Account.

This account has some notable restrictions: it is not connected to the rest of our internal network, and it is purged on a weekly basis. Nonetheless, it’s an ideal “playground” in which to experiment, validate assumptions, and build simple proofs of concept.

Extending the PaaS

The above isolated experimentation is valuable but can only go so far. Fortunately, there’s a range of ways in which the PaaS can be extended.

Many of the resources that Micros provides are integrated into the platform via a curated set of CloudFormation templates. Teams can branch our repository of templates and add their own, which can immediately be referenced by Service Descriptors and therefore be provisioned by services in development environments. This allows the resource to be tried and tested, and lets us examine in detail what would be required to make it available to all teams in production. See the sketch below for what such a contribution might look like.
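Here is a simplified, hypothetical CloudFormation template a team could add to the repository – in this case a Kinesis stream that their Service Descriptor could then reference in development environments:

```yaml
# Hypothetical team-contributed template (simplified, for evaluation only)
AWSTemplateFormatVersion: "2010-09-09"
Description: Kinesis stream backing resource, contributed for evaluation
Parameters:
  ServiceName:
    Type: String
  ShardCount:
    Type: Number
    Default: 1
Resources:
  EventStream:
    Type: AWS::Kinesis::Stream
    Properties:
      Name: !Sub "${ServiceName}-events"
      ShardCount: !Ref ShardCount
      StreamEncryption:
        EncryptionType: KMS
        KeyId: alias/aws/kinesis
Outputs:
  StreamArn:
    Value: !GetAtt EventStream.Arn
```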

Acknowledging that not everything one may wish to provision alongside a service is best represented with a CloudFormation template, Micros also accepts extensions in the form of service brokers that implement the Open Service Broker API. In other words, teams can contribute services which may themselves become part of the provisioning flow for subsequent services, by deploying and managing new types of assets defined in the Service Descriptor. Building and running such services is no small undertaking, and we take care to ensure this extension point isn’t used as a vector to pump out PaaS features that don’t have a high level of ongoing operational support.

In practice, we have used this functionality primarily to decompose the core of the PaaS, and to scale the PaaS team. For example, this extension point enabled us to spin out an independent sub-team that owns all the platform’s data stores and their associated tooling (including the components that provide their provisioning logic), and to give our Networking team greater autonomy in the implementation and ownership of the platform’s integration with our company-wide public edge.
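To make the broker contract concrete: the Open Service Broker API is a small REST contract in which a broker advertises what it can provision via a catalog endpoint, and the platform calls back to it to create and bind instances. Below is a sketch of a catalog entry that a hypothetical data-stores broker might expose (the real wire format is JSON over HTTP; it’s rendered as YAML here to match the other sketches, with GUIDs elided):

```yaml
# GET /v2/catalog – advertised by a hypothetical team-owned broker
services:
  - name: postgres-cluster
    id: <service-guid>
    description: Managed PostgreSQL, owned by the data-stores team
    bindable: true
    plans:
      - name: standard
        id: <plan-guid>
        description: Multi-AZ, daily backups, standard retention
```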

Aside from expanding the range of resource types that are available to services, some teams need to extend the PaaS by adding to the components that run alongside their service on their compute nodes. To this end, the platform offers a concept of Sidecars – shareable binaries that service owners can add to the set of processes that are spun up when their service starts. These have been used to provide additional diagnostics functionality, local caching for performance and resilience, a standardised implementation of staff authentication, and more.
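Declaring a sidecar might look something like the following in a Service Descriptor (again, hypothetical field names):

```yaml
# Hypothetical sidecar declarations (illustrative only)
sidecars:
  - name: local-cache  # caches hot data next to the service for latency and resilience
    version: "2.x"
  - name: staff-auth   # standardised staff authentication
    version: "1.x"
```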

“Bending the rules” of the PaaS

While we value consistent infrastructure, we understand that sometimes the boundaries of the PaaS are at odds with other factors that put teams under roadmap pressure. Therefore, we sometimes bend the rules of the platform to unblock teams.

All such cases start with a ticket raised on the Micros team’s Service Desk, and often involve a face-to-face discussion so that we can align on the cost/benefit, risks and ramifications of the exception. Once implemented, most exceptions – especially those that could be risky if used without fully considering the implications – are kept behind a feature toggle so that only specific teams or services can make use of them. Examples include sticky sessions (which we discourage by default to avoid the resilience issues brought about by unnecessary statefulness) and the ability to target a specific Availability Zone to achieve affinity with a database for latency gains (thereby trading off the resilience benefit of being spread across 3 Availability Zones in production).
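For illustration, once the platform team enables the toggle for a given service, the exception might surface as an otherwise-unavailable descriptor setting. The field below is hypothetical:

```yaml
# Hypothetical descriptor setting, only honoured once the platform team
# has enabled the corresponding feature toggle for this service:
compute:
  placement:
    availabilityZone: us-east-1a  # AZ affinity with the database; trades away AZ spread
```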

We keep track of all exceptions, and periodically review them to ensure they don’t stay in place longer than necessary. In many cases, the requests represent temporary relaxations of the platform rules for specific services, so that service owners can get stuff shipped. In other cases, the requests are an indication that the platform boundaries need to shift – and they therefore evolve into broader feature requests.

Working off-platform

There is no mandate at Atlassian that says services must run on Micros. In fact, there is a well-established channel for teams to obtain their own, separate AWS accounts that they manage independently.

This flexibility comes with additional responsibilities and considerations, which echo the areas where Micros helps. Before going off-platform, teams must consider how they will handle for themselves concerns such as monitoring and alerting, security policies, backup and retention, cross-AZ resiliency, and the discoverability of their services.

All services deployed to the PaaS automatically get a record in our internal service directory “Microscope”, which presents essential information about services in a concise and discoverable manner.

Our PaaS is not a wall between our engineers and AWS

It’s worth noting that while Micros adds some functionality and integrations around plain AWS, I avoid referring to it as an “abstraction”, because it is intentionally leaky: Micros very deliberately exposes the details of the underlying AWS infrastructure that it provisions and manages.

For example, the Service Descriptor contains fields very similar to those you’d find in an equivalent CloudFormation template. Once resources are deployed, you can examine them directly via the AWS Console. You can get tokens to interact with them via the standard AWS CLI. From your code, you can use the standard AWS SDKs. We use standard AWS constructs to enforce security policies.

Because engineers use most AWS features directly, many enhancements to existing resources (such as DynamoDB Transactions) become available as soon as they land in AWS.

By and large, everything works as documented in AWS docs. It is rare that we add layers of in-house invention between Atlassians and AWS.

(An arguable trade-off is that this coupling with AWS would make a theoretical shift to another cloud provider more difficult. We believe that the concrete benefits we achieve now by avoiding heavy abstractions outweigh the hypothetical efforts we’d need for such a future migration.)

Our PaaS won’t stop evolving

The above helps explain why free-form, direct access to AWS is not Atlassian’s current platform strategy, and why having an internal PaaS is valuable… even if it sometimes feels like it gets in the way!

However, while the PaaS is valuable today in its current form, it cannot stand still. There are two main drivers that will continuously push our platform forwards: the evolving needs of Atlassian engineers, and those of our Cloud customers.

The first driver, the needs of Atlassian engineers, means we will keep improving the developer experience we provide, and increasing the speed at which engineers can innovate & get their job done. This involves reducing the operational burden on developers, and polishing the dev loop. We’re implementing a range of features on that front today, including improved Lambda support to reduce the amount of boilerplate code required for services that primarily react to AWS events, adding Kubernetes as a compute environment to speed up deployment & scale-out time, supporting more pre-configured combinations of backing resources that solve common use cases (such as DynamoDB tables pre-configured to stream to Elasticsearch), and externalising a range of common service-to-service communication concerns to a service mesh layer.

The second driver, the needs of Cloud customers, means the platform has a key role to play in Atlassian’s duty to continuously strengthen our Cloud products’ security, reliability and performance. These cross-cutting concerns cannot be delivered at the top of the stack alone. The platform will assist by delivering monitoring improvements that maintain our visibility into our systems as they grow ever more sophisticated, more consistent and centrally observable rules for service-to-service authorization, and better mechanisms for data classification across our fleet of services.

Tips for getting started

Even if you only have a few services for now, it isn’t too early to start thinking about how a service platform can help your fleet evolve and scale. Drawing on the experience above, a few points are worth thinking about early on:

- Keep the platform’s interface small and declarative: a container image plus a descriptor of backing resources goes a long way.
- Bake operational best practices (monitoring, alerting, backup/restore, security policies, cross-AZ resiliency) into the defaults, rather than leaving them to each team.
- Provide sanctioned escape hatches (an isolated sandbox account, extension points, a reviewed exceptions process) so that the platform bends before teams route around it.
- Track every exception and review it periodically; recurring exceptions are often feature requests in disguise.
- Prefer a deliberately leaky abstraction: exposing the underlying cloud provider keeps its documentation, tooling and new features directly usable.
