Atlassian Engineering’s handbook: a guide for autonomous teams

At Atlassian, we’ve found that the key ingredients to empower autonomous teams are ownership, trust, and a common language.

We want every team to own our customers’ success. As opposed to a static command and control model, we start from a position of trust in every team and continually work together to verify the quality, security, and reliability of a product. We then scale this practice by ensuring our engineers have a common language. This includes a set of principles, rituals, and expectations to which they can align. The outcome is a loosely coupled and highly aligned system that empowers our teams to move fast, make decisions, and come up with innovative solutions that propel us forward.

We want to share the way we work with everyone, to help improve teams across industries. Hopefully, our learnings will help you build autonomous teams and ingrain ownership in your own organization.

True to our value of Be the Change You Seek, we’re always after new ways of thinking to better our teams. If you have ideas to supercharge this rocketship, join us and make it happen. Visit our Careers page to find out more about becoming an Atlassian engineer.

Abstract

As mentioned in our Cloud Engineering Overview, Atlassian engineering builds upon common foundations in order to develop, ship, and run highly secure, reliable, and compliant software at scale. This guide outlines widely used rituals, practices, processes, and operational tools for our engineering organization. It is a resource for new staff to understand how we operate and is an ongoing reference for existing staff that provides a comprehensive overview of how we work.

For each topic, the engineering handbook aims to include:

While each team has likely adapted specialized rituals for its daily work, those rituals should be closely aligned with the contents of this document.

Engineering philosophy and priorities

As we pursue our mission to unleash the potential of every team, we’ve documented our shared philosophy in the hope that it will guide our actions in the engineering organization, just as our company values continue to serve us. We have a team of motivated engineers who deserve a high degree of ownership, and this philosophy supports that ownership.

Customers are our lifeblood; we act accordingly

Everything we do creates value for our customers, either by reducing the pain they experience with our products or delivering new innovations to unleash their potential. We regularly interact with customers and we make decisions as if they are in the room, without assuming we know what they need. We value a “bias for action” and we act with a pragmatic, incremental approach because that’s what our customers want.

We expect our teams to have consistent, regular interactions with customers.

Radically autonomous, aligned teams

We are loosely coupled but highly aligned. We prune coordination overhead and dependencies whenever possible. Decisions are made quickly with the right people involved and no more. Teams are empowered to decide and act with autonomy, yet they align with common practices, consistent with our company values and engineering philosophy. Radical autonomy is a two-way street. Every team contributes to our culture of autonomy by empowering other teams to move fast.

We expect our teams to own the outcomes they deliver, document their work, and empower other teams as a result.

Engineers, not hackers

We build with pride and rigor because it leads to better outcomes for our customers and a sustainable pace of innovation for the future. We avoid and eliminate unnecessary complexity, pay off technical debt, and reduce toil through automation. We earn and maintain our customers’ trust by ensuring we build and maintain well-architected, cloud-native infrastructure and applications that are secure, high-performing, resilient, and efficient to operate. We are innovators who deliver the quality that our customers expect.

We expect our teams to measure the metrics that matter, set challenging objectives to improve them, take action if we’re not meeting our baselines, and innovate with quality for the future.

Stand on the shoulders of giants

We leverage the wealth of our rich history and use what we’ve learned and created to power the future of teamwork. We believe a common platform underpinning all of our products will allow our customers to have a consistent experience and will create a competitive advantage for the long term. We invest in high-value, consistently built, reusable components to reduce duplication and wasted effort.

We expect our teams to use, contribute to, and enhance our common services, libraries, patterns, and guidelines.

Guidelines for prioritization

In the course of our daily work, teams will need to make prioritization decisions as they choose to focus on one task or project above another. These priorities ensure we apply sufficient time and attention to the following list of activities, in the order in which they are listed.

  1. Incidents: Resolve and prevent future incidents, because our products and services are critical to our customers’ ability to work. If multiple high-priority incidents are competing for our attention and resources, we prioritize using the following guidance: security incidents first, then reliability incidents, then performance incidents (which we consider part of reliability), followed by functionality incidents and all other categories of incidents.
  2. Build/release failures: Resolve and prevent build/release failures, because our ability to ship code and deploy updates is critical. If we’re unable to build and/or release new software, we’ll be unable to do anything else.
  3. SLO regressions: Triage, mitigate, then resolve out-of-bounds conditions with existing service level objectives (SLOs), because without meeting our SLOs we’re losing the trust of our customers. If multiple SLO regressions are competing for attention, security regressions are always given priority. The prioritization for SLO regressions includes (in priority order): security, reliability, performance, and bug SLOs.
  4. Functionality regressions: Fix regressions in functionality that have been reviewed and approved by your product manager.
  5. High-priority bugs and support escalations: Resolve bugs in agreement with your product manager and resolve customer escalations because they impact customer happiness.
  6. All other projects: Everything else, including roadmap deliverables, new objectives, and OKRs (objectives and key results, described further below).

Section 1: How we work

A combination of OKRs and SLOs

What it is

Objectives and Key Results are a goal-setting method we use across the company. OKRs can be set at different levels such as company, department, or team. OKRs represent goals for achieving something new, like increased customer engagement, or new levels in performance or reliability.

Service level objectives (SLOs) are used to maintain all that we’ve achieved in the past, preventing regressions and taking swift and sufficient action when we backslide. Examples include maintaining all of our product capabilities, established levels of performance, security controls, compliance certifications, and our customers’ happiness and trust. When we want a significant step change in an SLO, we create a new project for it and track the achievement of the new goal as an OKR.

Using a metaphor from airplane flight, you can think of SLOs as maintaining a smooth, level altitude and OKRs as gaining new heights. As mentioned above, we always seek to maintain our SLOs and build on that foundation to achieve new goals defined by our OKRs.

Why they exist

How they are intended to be used

Our Product Delivery Framework

What it is

Our product delivery framework is how we address and improve products and services used across Atlassian. It provides a common approach, language, and toolbox of practices to help us collaborate and create impact for our customers.

Why this exists

As we have continued to grow, it’s become more difficult to communicate across teams and to introduce new team members to our ways of working at scale.

To address this challenge, we introduced a common approach for teams to tackle projects. Our customers benefit from this standardized approach. It allows us to prioritize customer needs, allocate more time to innovation, and allows us to continue improving our products instead of duplicating existing processes and practices across teams.

How it’s intended to be used

We celebrate a healthy combination of what already works in our teams, The Atlassian Team Playbook, and elements of other frameworks that have proven to work.

When new projects are conceived and started, four components help us align on our ways of working from idea to impact: principles, phases, checkpoints, and plays.

Principles

The principles guide our thinking and decision-making as we work in our teams. They are intentionally not binary and use a ‘this over that’ format, to help us favor one behavior over another.

  1. Are we optimizing for customer value over feature delivery?
  2. Do we prioritize delivering an outcome rather than simply shipping on time?
  3. Do we have the confidence to make informed bets rather than spinning in analysis paralysis?
  4. Do we have a flow and rhythm in how we work across teams rather than being too reactive?

Phases

Our ways of working are shaped around four phases, from having an idea to delivering customer value:

Checkpoints

We are using intentional moments of reflection to help ensure continued clarity about the outcomes we are working toward, the problem we are solving, and whether we are still heading in the right direction.

Checkpoints enable our teams to orient where we are and prioritize what to do next, while ensuring we are addressing the right concerns at the right time. Are we confident enough to move on?

Plays

Plays are our way of putting these ideas into practice. Some are well defined in The Atlassian Team Playbook; others are newer, and we’re still learning how to make the most of them.

The Atlassian Team Playbook

atlassian.com/team-playbook

What it is

Team Playbooks are resources we’ve developed for addressing common team challenges and starting important conversations. They’re used extensively within our company and have also been published publicly for use by any team in the world.

Some of the most popular Plays include:

How they are intended to be used

Each Play is designed to be a simple workshop-style tool that will help your team address a common problem or start a conversation about something new. Each Play is used with the following steps:

  1. Do some prep work, schedule a meeting, and share the materials.
  2. Run the Play by facilitating the workshop with your team.
  3. Leave with a plan by documenting insights and assigning action items.

Project Communications

Team Central

Team Central BETA | Organize your projects’ updates

The Loop: Project Communications for Teams-of-Teams

What it is

Team Central is an Atlassian product that helps organizations share the status of projects and goals among their teams. We use it internally to openly share work-in-progress across teams, functions, locations, and levels. Teams are required to use our internal instance of Team Central to communicate project status and foster an open, transparent culture of work. Engineering and cross-functional leaders use Team Central as their source of truth to effortlessly discover and give feedback on the status of the projects, goals, and teams they care about, without having to sift through the noise of traditional status spreadsheets.

Why this exists

It gives everyone access to real-time progress, potential problems, and priorities, in a digestible feed they’ll actually read.

How it’s intended to be used

DACI Decisions

DACI: a decision-making framework | Atlassian Team Playbook

What it is

A decision-making framework used extensively throughout the company and the industry when a team faces a complex decision. Internally, we document our decisions with a standard template in Confluence. DACI is an acronym for Driver, Approver, Contributor, and Informed.

Why this exists

DACIs are used to make effective and efficient group decisions. They spread knowledge, reduce surprises, highlight background research, and increase openness and visibility within our organization. Writing down the details creates accountability, fosters collaboration, and almost always leads to better decisions.

How it’s intended to be used

Request for Comments (RFCs)

What it is

An RFC is a short, written document describing how something will be implemented. RFCs are used for any nontrivial design. Some companies refer to these as design documents or architectural decision records. Whatever they’re called, they’re very similar: they’re used to solicit feedback and increase knowledge sharing within an engineering organization.

Why this exists

It spreads knowledge, reduces surprises, highlights tech debt and other challenges, and increases openness and visibility within our organization. As with DACIs, writing down the details and sharing them with others creates accountability and almost always leads to better outcomes. RFCs are slightly different from DACIs in that they’re used to communicate the chosen solution while also soliciting feedback to further refine the approach.

How it’s intended to be used

Global engineering metrics (GEM)

What it is

A set of engineering metrics that can be consistently applied across all engineering teams, inspired by industry-standard DevOps metrics. Teams should augment these metrics with additional measures that make sense for their specific team or organization.

Why this exists

Having a shared set of metrics allows organizations and teams to benchmark themselves relative to their peers, identify bottlenecks/common roadblocks, and practice continuous improvement.

How it’s intended to be used

By harvesting data from Jira, Bitbucket, and other sources, GEM automatically populates a dashboard for each team that includes the following metrics.

Pull request cycle time
Measurement: The time from when a pull request is opened until it’s deployed to a production environment, or until a successful build for codebases with no production deployments.
What it means: Shorter pull request cycle times indicate fewer bottlenecks in our continuous delivery process.

Story cycle time
Measurement: The time to transition from an In Progress status category to a Done status category.
What it means: Shorter cycle times mean our work is delivered to customers faster and in smaller increments, leading to improved iteration.

Successful deployments
Measurement: How often the team deploys new code to a production environment. For pipelines that don’t deploy directly to a production environment, we look at successful builds on the main branch.
What it means: Higher deployment frequency allows us to reduce the batch size (fewer changes in each deployment), reducing the likelihood of conflicts and making problem resolution easier.

We expect teams to use the global engineering metrics dashboard to establish a baseline, identify ways to improve their effectiveness, share the lessons with others, and look to amplify the impact across teams.
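
To make the arithmetic behind these measures concrete, here is a minimal sketch of how pull request cycle time and deployment frequency could be computed from harvested timestamps. The records, numbers, and structure below are hypothetical illustrations, not the actual GEM implementation.

```python
from datetime import datetime
from statistics import median

# Hypothetical pull request records: (opened_at, deployed_at). In practice, GEM
# harvests this data from Jira, Bitbucket, and the deployment pipelines.
pull_requests = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 15, 30)),
    (datetime(2024, 3, 2, 10, 0), datetime(2024, 3, 3, 11, 0)),
    (datetime(2024, 3, 4, 14, 0), datetime(2024, 3, 4, 18, 45)),
]

# Pull request cycle time: from PR opened until deployed to production
# (or until a successful main-branch build where there is no prod deployment).
cycle_hours = [
    (deployed - opened).total_seconds() / 3600 for opened, deployed in pull_requests
]
print(f"Median PR cycle time: {median(cycle_hours):.1f} hours")

# Successful deployments: how often new code reaches a production environment.
deployments_this_week = 12
print(f"Deployment frequency: {deployments_this_week / 7:.1f} per day")
```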

Section 2: How to stay informed

Below are the guidelines and resources for staying informed within our engineering org.

Engineering-wide communication

We share information within the engineering organization in the following ways:

Engaging with customers and ecosystem

Our primary means for engaging with our customers and our developer ecosystem is our community forums. We have a community for customers at Atlassian Community and a community for our developer ecosystem at community.developer.atlassian.com. These resources complement our support tool and developer documentation by enabling our employees to engage with our community. All of our employees are encouraged to join conversations there and help answer questions or share tips on how to get the most out of our products.

Section 3: How we develop software

Cloud engineering overview guide

All of Atlassian engineering builds upon common foundations in order to develop, ship, and run highly secure, reliable, and compliant software at scale. The document located at Atlassian’s Cloud Engineering Overview – Atlassian Engineering introduces our shared services, components, and tools.

Architecture principles

What it is

We’ve created a set of architecture principles to keep in mind as you design and evolve software. They go hand in hand with our engineering philosophy and they are intended to serve as a reminder of what’s essential in our approach for designing software and components.

Why this exists

When developing software, it’s easy to get bogged down in the details and day-to-day challenges of getting work done when working under time pressure.

How it’s intended to be used

These principles are designed to establish a north star for anyone who is designing a system or component. Additionally, when designs are being reviewed through an RFC process, the reviewers can use these principles to “pressure test” a design to ensure it’s aligned with our principles.

The principles are summarized below but the document linked above contains more context and details.

Development lifecycle

In addition to the development lifecycle outlined in Our Product Delivery Framework (see above), there are some common development lifecycle steps that most engineering teams follow.

Outcome-driven development and Agile process

There is no prescribed process that all engineering teams must use. Most teams use a variation of Agile, each refining their own process through cycles of iteration, retrospection, and adaptation. We encourage all teams to deliver measurable outcomes over simply delivering output and projects. However, there are some common traits that most teams’ processes include, such as:

Engineering efficiency formula

Our engineering efficiency score is a metric we use at Atlassian to measure the percentage of engineering effort that moves the company forward. It’s the percentage of time an org spends on developing new customer value vs. time spent on “keeping the lights on (KTLO)” or maintaining the current products and service level objectives.

Our engineering efficiency formula is defined as:

Engineering_Efficiency = Move_Atlassian_Forward / (Move_Atlassian_Forward + KTLO)

Here are our definitions of the terms in the equation above:

Moving Atlassian forward

Move_Atlassian_Forward = Org_Development + Internal_and_Customer_Improvements + Innovation_Time

Moving Atlassian forward is defined as the time we spend developing ourselves and our organization, improving our products, and innovating. Some examples of these activities are:

Keeping the lights on (KTLO)

KTLO = Tech_Entropy + Service_Operations + Administration_Work

Keeping the lights on refers to the time we spend addressing routine tasks that fall outside of improving our products and our organization. This includes maintaining what we’ve built, engaging in daily operations tasks, and attending to routine administrative work. Some examples of these activities are:

When we see problems with efficiency, we explore the root causes and address the problems in order to bring the efficiency score back into an acceptable range.
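
As a worked example of the formula above, here is a minimal sketch using hypothetical time allocations; the numbers and variable names are illustrative only, not a real Atlassian measurement.

```python
# Hypothetical time allocations (in engineer-weeks) for one quarter.
org_development = 30                      # training, hiring, mentoring
internal_and_customer_improvements = 200  # product and internal improvements
innovation_time = 50                      # innovation and experimentation

tech_entropy = 60                         # tech debt, upgrades, migrations
service_operations = 50                   # on-call, incidents, SLO maintenance
administration_work = 20                  # routine administrative tasks

move_atlassian_forward = (
    org_development + internal_and_customer_improvements + innovation_time
)
ktlo = tech_entropy + service_operations + administration_work

engineering_efficiency = move_atlassian_forward / (move_atlassian_forward + ktlo)
print(f"Engineering efficiency: {engineering_efficiency:.0%}")  # -> 68%
```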

Developing high-quality software

Tech stack policy

What it is

Our policy is to use standardized tech stacks for building cloud services, web applications, and mobile applications. The tech stacks include languages, libraries, frameworks, and other components.

Why this exists

This allows us to be more effective at scaling our organization through consistency and shared understanding. Being consistent with a small number of tech stacks increases our flexibility when resource shifts are required, while still allowing us to make an appropriate selection. It also enables us to achieve greater leverage with shared libraries, frameworks, testing tools, and supporting functions like the security teams, release engineering teams, platform teams, and others.

How it’s intended to be used

When browsing the tech stacks, you’ll see the following terms:

Consistent architecture and reusable patterns policy

What it is

Our policy is to use standard architectural design patterns across all our products, services, and components, whenever possible. Our architectural patterns describe the preferred approach to solve common engineering challenges.

Why this exists

There are many requirements to develop, ship, and run highly secure, reliable, and compliant software at scale. To scale efficiently, we need to be able to reuse patterns and best practices across the company. This is where standardization helps. Standardization unlocks scale by promoting reusability, consistency, and shared understanding. This leads to predictable results and reduces unnecessary complexity and cost.

How it’s intended to be used

Our catalog of standardized architectural design patterns:

Cloud platform adoption guidelines

What it is

Why this exists

We invest in shared platforms to:

Peer review and green build (PRGB) policy

What it is

All changes to our products and services, including their configurations and the tools that support the provisioning of them, must be reviewed and approved by someone other than the author of the change. In addition, the change must pass a set of build tests to confirm it works as expected.

PRGB is also a critical process used for our compliance certifications, covered later in this doc.
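
As an illustration of the control itself (not the actual Bitbucket configuration), here is a minimal sketch of the PRGB rule: a change may only proceed if it has been approved by someone other than its author and its build is green. The types and helper function are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Change:
    """A proposed change (e.g., a pull request) to a repository."""
    author: str
    approvers: list[str]
    build_green: bool

def satisfies_prgb(change: Change) -> bool:
    """Peer Review + Green Build: at least one approver other than the author,
    and a passing build for the change."""
    peer_reviewed = any(approver != change.author for approver in change.approvers)
    return peer_reviewed and change.build_green

# A change approved only by its author, or with a failing build, is blocked.
assert satisfies_prgb(Change("alice", ["bob"], build_green=True))
assert not satisfies_prgb(Change("alice", ["alice"], build_green=True))
assert not satisfies_prgb(Change("alice", ["bob"], build_green=False))
```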

Why this exists

For our own peace of mind, customer confidence, and compliance requirements, we need to show our customers, auditors, regulators, and anyone else who wants to know that the changes we’ve made have been peer-reviewed and tested.

How it’s intended to be used

Repository owners must ensure compliance settings on new repositories are enabled on either Bitbucket Data Center or Bitbucket Cloud and that they’re using a build plan or pipeline that includes the PRGB controls.

Once the PRGB repository settings are enabled, changes are made using the following process:

Testing recommendations

We don’t have specific guidelines or requirements for testing. Each team should make those decisions for themselves. However, it’s recommended that all changes include unit, functional, and regression test coverage before merging.

In addition, some back-end services should also include:

Developer documentation standard

What it is

All teams are required to expose their service and API documentation on internal pages on developer.atlassian.com. That site is our central resource for all of our internal developer documentation for our shared resources, components, services, and platform. Because it is such a critical resource for all of our teams, we require that every team allocate some portion of time to writing and maintaining high-quality documentation.

Why this exists

We want to reduce the cost associated with unnecessary overhead around service integration and adoption. When our APIs aren’t documented well, are hard to discover, or are difficult to integrate with, we create inefficiencies and friction within our organization. This results in more meetings between teams. Because we have a large number of products, dependencies, and teams operating in different time zones, this can eat up a lot of time.

How it’s intended to be used

If you build a service, API, component, or library that is intended to be used by other teams within Atlassian, then it should have high-quality documentation including:

These documents must be written so that they can easily be understood by an outside developer. We refer to this mindset as “third-party first” thinking, whereby we build our internal components and documentation to the same level of quality we would use for external (or third-party) developers.

Developing accessible software

What it is

Accessibility is part of our DNA, and in most circumstances it’s a required part of building our products. We have guidelines on how to make our products more accessible and a detailed “definition of done for accessibility” that describes our standards for delivering software. We require our products to undergo periodic accessibility assessments, as we aim to meet the needs of all of our customers.

Why this exists

Our mission is to help unleash the potential of every team. We simply cannot achieve this mission unless our products are accessible to everyone. Our products need to be usable by people who have permanent or temporary disabilities so they can do the best work of their lives when they use our products.

Not only do we stand to lose business if we fail to keep up with our competitors and with accessibility standards; accessibility is also a moral and ethical obligation to our customers and shareholders, not a feature request or a legal checkbox. We cannot be proud to ship unless we ship accessible software.

How it’s intended to be used

Working with open-source software

What it is

We have guidelines for working with open-source software that everyone must follow. The guidelines cover:

Why this exists

There are legal, intellectual property, security, privacy, and compliance risks associated with open-source projects. We developed a set of guidelines for everyone to follow in order to reduce these risks while unlocking all of the benefits associated with open-source software.

How it’s intended to be used

Note: We will share our open-source guidelines and philosophy in the coming months.

Developing secure software

The following are a series of activities, practices, and required policies that exist to ensure we build security into our products, services, and applications.

Secure Development Culture and Practices (SDCP)

What it is

The SDCP encapsulates the security practices we use during development. All of our development teams partner closely with our security engineers to ensure we design and build security into our products.

Why this exists

Secure development culture and practices are the cornerstones of secure software. You cannot pen-test your way to secure software; you must build security in. The focus of this program is maintaining verifiable secure development practices so we can demonstrate, to ourselves and our customers, that security is a primary consideration in the design and development of our products.

How it’s intended to be used

There are several parts to our SDCP, including:

Security reviews

We include security risk consideration as part of our engineering processes. Higher risk projects warrant security verification activities such as threat modeling, design review, code review, and security testing. This is performed by the security team and/or third-party specialists, in conjunction with the product teams.

Embedded security engineers

Product Security engineers are assigned to each major product/platform engineering group in order to collaborate with engineering teams and encourage them to improve their security posture. In many situations, the product security engineer will be embedded within a single team when that team is working on a component that has a higher risk profile for potential security issues.

Security champions program

A security champion is a member of an engineering team who takes on an official role, dedicating time to helping their team make sure our products are secure. Security champions are trained to become subject matter experts and assist teams with tasks such as answering security questions and reviewing code, architecture, or designs for security risks.

Application security knowledge base

We have a secure development knowledge base that helps to share knowledge among our development teams. This knowledge base contains patterns and guides for developing secure applications.

Secure developer training

We offer various types of secure code and security training to all engineers as part of our effort to build secure products from the beginning.

Automated detection and testing

Our security team uses tools that automatically scan our repositories and operate as part of our standard build pipelines. These tools are used to detect common security problems. All engineers are required to resolve alerts or warnings coming from our security scanning tools and these alerts are tracked as one of our critical SLOs.

Section 4: How we ship

This section introduces practices, rituals, and processes that we use to ship new features, improvements, services, and entire products. We use the processes below as part of our preflight steps to guide how we release new functionality to our customers.

It includes the topics of:

Operational readiness checklist (Credo)

What it is

Credo is the code name for our operational readiness checklist and review process for launching new services or major upgrades for existing services. The checklist contains prelaunch reminders that include capacity planning, resilience testing, metrics, logging, backups, compliance standards, and more.

Why this exists

Anyone can forget a step resulting in a disastrous outcome. Using a checklist is a proven technique for avoiding mistakes and oversights. We use an operational readiness checklist when launching new services to ensure they all reach a certain level of service maturity before they launch so major problems can be avoided.

How it’s intended to be used

Before a new service or major component is released for the first time or after a major update in functionality or implementation, the team who built it is required to achieve a passing grade on the Credo checklist. This involves a process that includes:

Controlled deployments to customers

What it is

Our teams use an incremental and controlled release process whenever they introduce changes into our products or production environments. This ensures that the change performs as expected and our key metrics are monitored as the change is gradually introduced. This process is used for code changes, environment changes, config changes, and data schema changes.

Why this exists

This reduces the blast radius of problems introduced by any change.

How it’s intended to be used

There are several options for performing controlled releases to customers. Some of these techniques are often used in combination.

Canary deployments/Progressive rollouts

These are techniques where code and configuration changes are released to a subset of users or a subset of the production cluster and monitored for anomalies. The changes only continue to roll out to the wider population if they behave as expected.
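
Here is a minimal sketch of a progressive rollout loop. The stage percentages, bake time, and helper functions are hypothetical placeholders for whatever deployment and monitoring tooling a team actually uses.

```python
import random
import time

ROLLOUT_STAGES = [1, 5, 25, 50, 100]  # percent of users/nodes receiving the change
BAKE_TIME_SECONDS = 1                 # kept short for the example

def set_rollout_percentage(change_id: str, percent: int) -> None:
    print(f"{change_id}: rolling out to {percent}% of users")

def key_metrics_healthy(change_id: str) -> bool:
    # Stand-in for real monitoring: error rates, latency, saturation, and so on.
    return random.random() > 0.05

def roll_back(change_id: str) -> None:
    print(f"{change_id}: anomaly detected, rolling back")

def progressive_rollout(change_id: str) -> bool:
    for percent in ROLLOUT_STAGES:
        set_rollout_percentage(change_id, percent)
        time.sleep(BAKE_TIME_SECONDS)           # let the change bake at this stage
        if not key_metrics_healthy(change_id):  # only widen the rollout if healthy
            roll_back(change_id)
            return False
    return True  # the change is now live for the whole population

progressive_rollout("checkout-service-v42")
```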

Preproduction deployments

This is a practice to add a deployment stage that closely resembles the production environment before the release. Often, the preproduction environment is used by the team or the entire company as a testing ground for new functionality.

Blue-green deployments

Blue-green deployments involve running two versions of an application at the same time and moving production traffic from the old version to the new version. One version is taking production traffic and the other is idle. When a new release is rolled out, the changes are pushed to the idle environment followed by a switch where the environment containing the new release becomes the live environment. If something goes wrong, you can immediately roll back to the other environment (which doesn’t contain the new release). If all is well, the environments are brought to parity once more.
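
The mechanics can be sketched with a hypothetical traffic-routing abstraction; in a real deployment the switch happens at a load balancer or routing layer, not in application code.

```python
class BlueGreenRouter:
    """A toy model of two environments behind a traffic switch."""

    def __init__(self) -> None:
        self.environments = {"blue": "v1.0", "green": "v1.0"}
        self.live = "blue"                      # environment currently taking traffic

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        self.environments[self.idle] = version  # push the new release to the idle env

    def switch(self) -> None:
        self.live = self.idle                   # cut production traffic over

    def roll_back(self) -> None:
        self.switch()                           # the previous release is still running

router = BlueGreenRouter()
router.deploy("v1.1")   # green now runs v1.1 while blue still serves v1.0
router.switch()         # green becomes live; blue is the instant rollback target
print(router.live, router.environments)
```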

Fast five for data schema changes

“Fast five” is a process for safely making data schema changes and updating the associated code (or configuration) that depends on those changes. Our goal is to prevent situations where we have tight coupling between code and schema changes that prevent us from being able to recover to a healthy state if something goes wrong. Fast five can be summarized as:

The five stages are:

Feature flags

Feature flags are used to fence off new code changes and then the flags are incrementally enabled after the code has been deployed into production. We use LaunchDarkly, a third-party feature flag service provider. Teams are encouraged to use the “feature delivery process” (described below) in combination with feature flags.
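
The sketch below shows the general pattern of fencing new code behind a flag with deterministic, percentage-based bucketing. It is a generic illustration: in practice the lookup goes through the flag provider’s SDK (LaunchDarkly, in our case), and the flag names and in-memory rollout store here are invented.

```python
import hashlib

# Invented in-memory flag store; a real lookup would go through the provider's SDK.
ROLLOUT_FLAGS = {"new-editor": {"enabled_percent": 10}}

def _bucket(flag_key: str, user_id: str) -> int:
    # Deterministic bucketing: the same user always lands in the same bucket,
    # so their experience stays stable while the percentage is ramped up.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def flag_enabled(flag_key: str, user_id: str) -> bool:
    flag = ROLLOUT_FLAGS.get(flag_key)
    if flag is None:
        return False  # unknown flags default to "off"
    return _bucket(flag_key, user_id) < flag["enabled_percent"]

def render_editor(user_id: str) -> str:
    if flag_enabled("new-editor", user_id):
        return "new editor"     # new code path, fenced off by the flag
    return "legacy editor"      # existing behavior remains the default

print(render_editor("user-123"))
```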

Feature delivery process

What it is

Our feature delivery process is a recommended process for delivering every nontrivial product change and is used by many of our teams. This process leverages automation that connects feature flags to a workflow to ensure stakeholders are notified, metrics are associated with new flags, and no steps are missed in rolling out new features.

When a team uses feature flags, the recommended process is to associate a “feature delivery issue” with each feature flag. This ties the flag into a workflow that helps automate informing stakeholders, including product managers, designers, and the customer support team, about new changes that will impact customers as the feature flag is incrementally enabled for our user population.

Why this exists

It enables a workflow to prevent steps from being missed during the release of any change. The workflow also ensures collaboration between all of the participants in the process, including:

How it’s intended to be used


Section 5: How we operate and maintain

We have a series of rituals, practices, processes, and standards to ensure we’re able to maintain and uplift the operational health of our products.

Service tiers

What it is

Our policy is that all teams use standardized tiers of service for categorizing a service’s reliability and operational targets. These targets have been developed based on the usage scenarios commonly encountered for the services we build at Atlassian.

Why this exists

Tiers give us a way to split our services up into easily understood buckets. This helps our teams decide what level of engineering effort is appropriate when building their service. Once the service is live, it sets quality standards that should be maintained throughout its life in production. This allows teams to allocate their time appropriately to feature work vs. remedial tasks.

Tiers also allow other teams to understand what to expect from the services they depend on.

How it’s intended to be used

There are four tiers that all services must be bucketed into:

These tiers are used within our service catalog and also used to drive behaviors across many of our operational practices and policies.

Service level objectives (SLOs) and error budgets

What it is

We use SLOs and error budgets to communicate reliability targets for all of our products and services. When these targets are breached, alerts are raised to trigger the responsible team to take action. We require that all teams use our internal SLO management tool for managing, measuring, alerting, and reporting on our SLOs.
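
As a worked example of how an error budget falls out of an SLO, assume an illustrative 99.9% availability target over a 30-day window (actual targets vary by service and tier):

```python
slo = 0.999                                         # illustrative availability target
window_minutes = 30 * 24 * 60                       # 43,200 minutes in a 30-day window

error_budget_minutes = (1 - slo) * window_minutes   # 43.2 minutes of allowed downtime
observed_downtime_minutes = 12.5                    # hypothetical measurement

budget_consumed = observed_downtime_minutes / error_budget_minutes
budget_remaining = error_budget_minutes - observed_downtime_minutes

print(f"Error budget: {error_budget_minutes:.1f} min")
print(f"Consumed: {budget_consumed:.0%}, remaining: {budget_remaining:.1f} min")
# When the budget is exhausted, alerts trigger the responsible team to take action,
# and reliability work takes priority until the service is back within its SLO.
```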

Why this exists

We need to ensure we don’t lose the trust of our customers when it comes to reliability and performance. We’ve built a tool to manage SLOs and raise alerts.

How it’s intended to be used

Global on-call policy

What it is

We need to ensure we are able to support the operations of our business 24/7. Designated teams will be required to be on-call outside of normal business hours, including nights and weekends, to respond to alerts and incidents as quickly as possible. To compensate employees for the burden of being on-call, we have developed a specific compensation policy.

Why this exists

We need to ensure we don’t lose the trust of our customers, but we also need to treat our employees fairly for the extra burden of being on-call outside of normal business hours.

How it’s intended to be used

Teams use OpsGenie to manage their on-call rotations. We have several processes tied to the on-call schedule within OpsGenie, including our compensation process.

Service catalog (Microscope)

What it is

Microscope is our service catalog that records all of the metadata associated with our services. It’s integrated with Micros, our service and infrastructure provisioning system, as well as many other tools and processes we use. It also has an associated Service Linter tool that performs automatic checks on services to help us meet our operational health goals.

Why this exists

A single service catalog tool helps us have a consistent mechanism for exploring, tracking, and operating all of our microservices.

How it’s intended to be used

Service owners/admins are required to use Microscope to create and maintain the metadata associated with their services.

Operational security

Security Practices | Atlassian

The following series of activities, practices, and required policies exist to ensure we maintain security for ourselves and our customers, across all the products, services, and applications we maintain.

Security vulnerability resolution policy

What it is

We’ve made a public commitment to our customers that we’ll fix security vulnerabilities quickly once they’re reported to us and that commitment is backed up by our company policy.

Why this exists

Customers choose our cloud products largely because they trust that we can run our products more reliably and more securely than they could. Maintaining this customer trust is crucial to our business and is the reason security is our top priority.

How it’s intended to be used

Security scorecards

What it is

Product Security Scorecard is a process to measure the security posture of all products at Atlassian. Specifically, it’s an automated daily data snapshot of a variety of criteria set by the security team. You can then review your score and plan actions to improve.

Why this exists

The goal is the continual improvement of every product through the closing of existing gaps and by pushing for the adoption of emerging security improvements.

How it’s intended to be used

Reliability standards and practices

Operational maturity model (ServiceQuest)

What it is

ServiceQuest is the operational maturity model we use to measure and improve our operations. It’s essentially a scorecard that covers five critical areas for operational maturity and a process for periodically reviewing that scorecard to determine areas for improvement and/or investment.

Why this exists

Building great services and running them efficiently is critical to the success of our cloud business. ServiceQuest is built on the industry-standard concept of an “operational maturity model” and enables us to measure, monitor, and track our operational maturity over time. This leads to fewer unaddressed regressions in our operational health and improvements over time. By using a standardized set of measures, we can compare the health of our services and also look at how we are trending across a multitude of services.

How it’s intended to be used

Weekly operations review, learnings, and discussion (WORLD)

What it is

WORLD is a weekly operational review that cuts across all of the cloud teams in the company. It’s intended for Engineering Managers, architects, de facto architects, and Principal Engineers. The goal is to create a forum to spar on operational challenges, learn from each other, and increase mindshare on topics like resilience, incident resolution, operating costs, security, deployments, and other operational issues.

Why this exists

WORLD focuses on the “learn” aspect of the Build-Measure-Learn feedback loop. Build-Measure-Learn is a framework for establishing, and continuously improving, the effectiveness of products, services, and ideas quickly and cost-effectively. These meetings ensure senior staff within engineering are equipped to make sound, evidence-based business decisions about what to do next to make continuous improvement on operational health concerns.

How it’s intended to be used

The meeting brings together senior staff (managers and engineers) from all cloud teams and includes a cross-company operational review of incidents and dashboards. It also includes a rotating deep dive into one of the teams’ scorecards and practices, with the intent of creating mindshare around operational health that cuts across all cloud teams. Any Atlassian is welcome to attend.

TechOps process

What it is

A weekly meeting and process used by teams that operate a service or own responsibility for a shared component. The meeting and related activities are intended for the engineering manager and engineers who are on-call for a service or component.

Why this exists

The process aims to build a culture of reliability and help a team meet its operational goals through deliberate practice. It keeps operational metrics top of mind and helps to drive work to address any reported health issues of a service or component.

How it’s intended to be used

  1. Establish your operational goals. Write them down and ensure they’re being measured.
  2. While on-call, take notes about events and anomalies that occur.
  3. Upon completing your on-call rotation, prepare a TechOps report that includes events, anomalies, measurements against your objective, and follow-up actions.
  4. Conduct a weekly TechOps meeting with the on-call staff and engineering manager to review the reports and drive the follow-up actions.

TORQ process

What it is

TORQ is a quarterly operational maturity review with the CTO leadership team. The review is attended by org leaders, Heads of Engineering, Product leaders, and senior engineering staff. Its focus is to raise awareness of operational excellence concerns, establish operational OKRs, and influence product roadmaps and resource allocation decisions across all departments to ensure we don’t regress or under-invest in operational maturity.

Why this exists

Historically, it’s been easy to focus too heavily on building new features at the expense of operational maturity. The TORQ process is designed to highlight important operational health metrics and trends to influence the organization and product leaders to ensure they’re balancing operational health alongside product development activities.

How it’s intended to be used

Leaders attend a quarterly review meeting where results from the previous quarter and year are discussed. The results include SLOs, security scorecard measures, and infrastructure spending. The department leaders share plans and important learnings with each other, creating accountability for operational excellence.

Bug resolution and support escalations policy

Bug fix SLOs

We have a public bug fix policy that helps customers understand our approach to handling bugs in our products. We have SLOs in place to meet our customers’ expectations and engineering managers must ensure these SLOs are met by assigning adequate staff and regularly reviewing the SLO dashboard. Most teams use rotating roles where engineers are assigned to fixing bugs according to the priorities indicated in the bug fix tool.

Support escalation SLOs

There are occasions where customers are blocked and a Support Engineer needs your help to get them unblocked. The developer-on-support (DoS) mission is to understand why the customer is blocked and provide the Support Engineer with the information they need to get them moving again. You do this by accepting the support case, investigating the problem, providing information in comments on the case, and then returning the case to support.

We have SLOs to track how quickly we respond to customers.

Why this exists

One of our company values is “Don’t #@!% the customer. Customers are our lifeblood. Without happy customers, we’re doomed. So considering the customer perspective—collectively, not just a handful—comes first.” More specifically, this policy ensures engineering managers prioritize bug fixing and support escalation activities for their team.

How it’s intended to be used

The process most teams use for bug fixing is summarized as follows:

Managing costs

What it is

All cloud teams are responsible for managing their costs and ensuring we’re deriving adequate business value from the associated costs. This includes adhering to our tagging policy and using our FinOps (financial operations) reporting tools to review costs periodically. Engineering Managers and Principal Engineers and above are accountable for cost-efficiency within their teams.

Why this exists

Cloud FinOps includes a combination of systems, best practices, and culture to understand our cloud costs and make tradeoffs. By managing our costs, we can make our products more accessible and achieve our mission to unleash the potential in all teams.

How it’s intended to be used

Incident management and response

Below are guidelines, practices, and resources that help us detect and respond to incidents and then investigate and fix the root causes to ensure the incidents don’t reoccur.

Incident severities

What it is

We have guidelines on determining the severity of an incident, used across all of our cloud teams. 

Why this exists

These guidelines provide a basis for decision-making when working through incidents, following up on incidents, or examining incident trends over time. They also foster a consistent culture between teams of how we identify, manage, and learn from incidents.

How it’s intended to be used

We use four severity levels for categorizing incidents:

In addition to these severity levels, we have detailed guidelines for determining the severity of an incident that consider the service tier of the affected service.

Our incident values

We have defined a set of values that helps empower our staff to make decisions during and after incidents as well as establish a culture around incident prevention and response.

Atlassian knows before our customers do
What it means: A balanced service includes enough monitoring and alerting to detect incidents before our customers do. The best monitoring alerts us to problems before they even become incidents.

Escalate, escalate, escalate
What it means: Nobody will mind getting woken up for an incident it turns out they aren’t needed for. But they will mind if they don’t get woken up for an incident when they should have been. We won’t always have all the answers, so “don’t hesitate to escalate.”

Shit happens, clean it up quickly
What it means: Our customers don’t care why their service is down, only that we restore service as quickly as possible. Never hesitate in getting an incident resolved quickly so we can minimize the impact on our customers.

Always blameless
What it means: Incidents are part of running services. We improve services by holding teams accountable, not by apportioning blame.

Don’t have the same incident twice
What it means: Identify the root cause and the changes that will prevent that entire class of incidents from occurring again. Commit to delivering specific changes by specific dates.

How we handle incidents

What it is

We have specific guidelines and training available to all engineers on how to handle incidents. These guidelines are based on years of real-world experience and common practices used across our industry.

Why this exists

The goal of incident management is to restore service and/or mitigate the security threat to users as soon as possible and capture data for post-incident review.

How it’s intended to be used

Security incident response

Atlassian Security Incident Management Process | Atlassian

Our security team has processes and systems in place to detect malicious activity targeting Atlassian and its customers. Security incidents are handled in the same way any other incidents are handled.

Post-incident review (PIR) policy

How we run incident postmortems | Atlassian

What it is

A post-incident review (PIR) is a process we use to create a written record of an incident detailing its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions taken to prevent the incident from recurring. PIRs are required for all incidents. We follow the blameless approach for the root cause and preventive action evaluation.

Our policy for PIRs is that they must be completed within 10 business days of the incident:

Why this exists

Post-incident reviews help us explore all contributing root causes, document the incident for future reference, and enact preventive actions to reduce the likelihood or impact of recurrence. PIRs are a primary mechanism for preventing future incidents.

How it’s intended to be used

A PIR has five steps, which are described more thoroughly in the document linked above.

  1. A PIR is created during an incident and is used to capture data.
  2. After the incident is resolved, the PIR form is filled out with relevant information, the most important being the root cause(s), contributing cause(s), system gap(s), and learnings.
  3. The draft PIR is reviewed and improved by the team who participated in the incident.
  4. A PIR meeting is held to determine the priority of the necessary follow-up actions.
  5. An approver will review the document, primarily to ensure the root and contributing causes were identified and that they’re willing to commit sufficient resources and time to complete the required follow-up actions.

Publishing public PIRs

What it is

For customer-facing incidents, the Head of Engineering who owns the faulty service should publish a public post-incident review to the Statuspage for the service within six business days of the incident. There are guidelines and templates available for writing a public version of a PIR, but they should include the root cause analysis, remediations, and a commitment to completing them to prevent similar incidents in the future.

Why this exists

Public PIRs make us more transparent and accountable to our customers. Most customers understand that incidents can occur. Our timely communications and commitments to preventing similar incidents in the future help restore trust.

How it’s intended to be used

Risk, compliance, and disaster preparedness

Compliance at Atlassian | Atlassian

We aim to maintain a balance between calculated risks and expected benefits. We have a dedicated team that establishes policies for managing risks, helps to achieve industry certifications, coordinates the audit processes associated with our certifications (ISO 27001 & 27018, PCI DSS, SOX, SOC2 & SOC3, GDPR, FedRAMP, etc.), and helps to design solutions and practices that are resilient to disasters. This team will work with engineering teams from time to time, as described below.

Risk management

Atlassian’s Risk Management Program | Atlassian

What it is

We have a program to manage risks associated with our company strategy and business objectives. There is a tool for cataloging risks where they are all tracked, accepted (or mitigated), and reviewed annually.

Why this exists

Integrating risk management throughout the company improves decision-making in governance, strategy, objective-setting, and day-to-day operations. It helps enhance performance by more closely linking strategy and business objectives to both risk and opportunity. The diligence required to integrate enterprise risk management provides us with a clear path to creating, preserving, and realizing value.

How it’s intended to be used

The risk and compliance team uses a process for dealing with risks that:

As it relates to engineering, we are sometimes required to participate in an interview or provide details on a risk ticket as a subject matter expert for our products, technology, and/or operations. Within the engineering organization, we have embedded Risk & Compliance managers who help and provide guidance around risks, policies, and controls. They also manage company-wide initiatives, including achieving certifications and capabilities.

In addition, we may be required to participate in a control, a recurring process to ensure that risks are mitigated, by verifying some data or taking a specific action. A control is a process (manual, automated, or a combination) designed to ensure that we manage the risk within the agreed upon boundaries. The Risk & Compliance team works closely with engineers to design and implement necessary controls, and to manage internal and external audits. The purpose of these audits is to provide a high level of assurance that the controls are properly designed and operated. We have a strong bias toward automated controls. The control owner is responsible for the design, operation, and for providing evidence the control is working as intended.

Policy central

What it is

It’s a space within our Confluence instance for all of our corporate policies, including those covering technology, people, legal, finance, tax, and procurement.

Why this exists

All employees should be familiar with the policies that apply to their role. Within engineering, you should specifically be up-to-date on our technology policies at a minimum.

How it’s intended to be used

Our policies are made available internally to all of our teams to ensure they understand the bar they are expected to meet. The policies are updated whenever necessary and are reviewed annually, at a minimum.

Disaster recovery (DR)

What it is

Disaster Recovery involves a set of policies, tools, and procedures to enable the recovery or continuation of vital technology infrastructure, data, and systems following a disaster.

We require all engineers to use existing architecture patterns whenever possible and follow our policies to ensure:

Why this exists

When recovering our products during a disaster, all that matters is whether our customers are able to use our products again without broken experiences. Our products and experiences depend on an ever-growing collection of interdependent components and services owned by product and platform teams. To make sure we can recover fully functioning experiences, we rely on recovery patterns, tools, tests, and processes that understand and can manage the recovery of the systems behind the customer journey as a whole.

How it’s intended to be used

All engineers are required to read and follow our policy for disaster recovery.

Maintaining existing certifications and assisting an audit

Compliance at Atlassian

What it is

Engineers are required to help maintain our existing compliance certifications.

Why this exists

Our customers deserve and rely on our ability to prove we meet our control obligations. To do this, we obtain third-party certifications via independent audit firms.

How it’s intended to be used

Turn ideas into action

Hopefully, this handbook has given you insight into the ways Atlassian’s engineering organization works. To learn more about our shared services, components, and tools, read our Cloud Engineering Overview. If you’re interested in joining our engineering rocket ship, make sure to visit our Careers page to learn more about our open roles.
