Project Roomba

The underlying storage engine that powers Trello’s data environment was unhealthy and sluggish, so we invested time into fixing it with the hope that our future selves could then focus on the data and the people who need it. So far it’s working (for example, we reduced the overall data size by 62%), but it won’t work forever.

Background

When Trello’s data engineering team got started in December 2017, it was met with a large backlog of requests for improvements to our existing data pipelines and for surfacing new sources of information.

However, there were signs that all was not well with Trello’s overall data environment. After some investigation, we decided to invest in the health of the Redshift database that powers our Mode reports before addressing the data request backlog.

Unhealthy database: symptoms

Some of the signs that Trello’s Redshift database cluster wasn’t in good health:

  1. Frequent long-running queries and reports that locked database resources, causing a pileup of work
  2. Complex queries that exceeded the default 45-minute time limit, causing Redshift to cancel them
  3. An influx of requests to “pre-calculate” long-running queries (translation: adding code to our data pipelines and having to update it whenever these queries change)
  4. A steady stream of query alerts emitted by Redshift

Unhealthy database: diagnosis

Redshift, How Does It Work?

When you’re running reports or writing queries, Redshift looks like a traditional relational database. Under the hood, however, there are some important differences:

  1. Data is stored by column rather than by row, so a query only reads the columns it actually touches.
  2. Tables are split across multiple compute nodes, and every query runs in parallel on all of them.
  3. Each table can be physically sorted on disk by one or more sort keys.
  4. Each column can be compressed with its own encoding, which shrinks the data read from disk and sent between nodes.

Once you grok these differences, Amazon’s Redshift best practices start to make sense (there’s a table definition sketch right after this list):

  1. Distribute data evenly between nodes: maximizes the benefit of parallel work.
  2. Store related data on the same node: prevents shuffling data between nodes during table joins.
  3. Compress data: reduces I/O when you can’t avoid shuffling data between nodes.
  4. Sort for your workload: reduces the amount of data that a query needs to access (for example, sorting by event date means that queries looking for yesterday’s events can ignore the older information).
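
To make these practices concrete, here is a minimal sketch of a table definition that follows all four. The table and column names are hypothetical (not Trello’s actual schema); DISTKEY, SORTKEY, and ENCODE are standard Redshift options.

    -- Hypothetical events table illustrating the four practices above.
    -- KEY distribution on board_id keeps a board's rows (and joins on board_id)
    -- on one node; ENCODE compresses each column; SORTKEY orders rows by date.
    CREATE TABLE events (
        event_id    BIGINT       ENCODE zstd,
        board_id    BIGINT       ENCODE zstd,
        event_type  VARCHAR(64)  ENCODE zstd,
        event_date  TIMESTAMP    ENCODE raw   -- leading sort key left uncompressed
    )
    DISTKEY (board_id)
    SORTKEY (event_date);

Picking the distribution key is the judgment call: it should be a column you join on often, with values spread evenly enough that no single node ends up holding a disproportionate share of the rows.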

After some sleuthing, we determined that we were doing a good job with data distribution (#1 and #2 above). Compression and sorting, however, needed some love.
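
That sleuthing can be done with Redshift’s own system views. Here’s a sketch of the kind of query involved (not our exact query): SVV_TABLE_INFO reports, per table, the distribution style, row skew across slices, whether any column compression is defined, and how much of the table is unsorted.

    -- Cluster-wide health check on the largest tables.
    -- "encoded" is N when no column compression is defined;
    -- "unsorted" is the percentage of rows not in sort-key order;
    -- "skew_rows" compares the slice with the most rows to the one with the fewest.
    SELECT "table",
           size AS size_mb,   -- size in 1 MB blocks
           diststyle,
           skew_rows,
           encoded,
           unsorted
    FROM svv_table_info
    ORDER BY size DESC
    LIMIT 20;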

The examination

As of January 2018:

A closer look at that 36 gigabyte (GB) events table:

These findings also applied to many other large events-related tables, resulting in table scans, unnecessary shuffling of data, and slow queries that had a cascading effect on overall Redshift and Mode performance.
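
For a single suspect table, a per-column view helps too. This sketch uses PG_TABLE_DEF, a standard Redshift catalog view, with a hypothetical table name; an encoding of “none” on big columns means compression was never set up, and the distkey/sortkey columns show which keys the table has.

    -- Per-column check: encoding of 'none' means the column is uncompressed;
    -- sortkey > 0 marks sort-key columns, distkey marks the distribution key.
    -- Note: pg_table_def only shows tables in schemas on your search_path.
    SELECT "column",
           type,
           encoding,
           distkey,
           sortkey
    FROM pg_table_def
    WHERE tablename = 'events'   -- hypothetical table name
    ORDER BY sortkey DESC, "column";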

Unhealthy database: cure

Once we understood the underlying problems, the steps for fixing them were straightforward. We code-named this work “Project Roomba” because the database process that re-sorts data is called a VACUUM.

  1. Give ourselves room to work by doubling the size of the Redshift cluster to eight 16 TB nodes.
  2. Replace problematic tables with new versions that use high data compression encoding (a sketch of this deep-copy pattern follows the list). The process of copying data to the new tables also re-sorts the data, solving both problems.
  3. Keep tables sorted on an ongoing basis.
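
Steps 2 and 3 map to a standard Redshift pattern. Here’s a rough sketch, with table and column names carried over from the hypothetical example earlier: create the new table with the compression and sort settings you want, bulk-insert the old data (which re-sorts it), swap names, and then keep the table sorted with VACUUM going forward.

    -- New version of the table with the compression and sort settings we want.
    CREATE TABLE events_new (
        event_id    BIGINT       ENCODE zstd,
        board_id    BIGINT       ENCODE zstd,
        event_type  VARCHAR(64)  ENCODE zstd,
        event_date  TIMESTAMP    ENCODE raw
    )
    DISTKEY (board_id)
    SORTKEY (event_date);

    -- Deep copy: the bulk insert re-sorts the data as it loads.
    INSERT INTO events_new SELECT * FROM events;

    -- Swap the new table into place.
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE events_new RENAME TO events;
    DROP TABLE events_old;

    -- Keep it sorted on an ongoing basis, e.g. after each nightly load.
    VACUUM events;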

As of April 2018, when Project Roomba finished:

Results and the road ahead

For lots of nerdy reasons, Project Roomba took several months. And there’s more to do.

That said, Project Roomba paid off:

Happy querying!
