Watch the talk - you’ll need to create a free account.
Get the slides (PDF)
Moving fast shouldn’t mean breaking things, but Facebook’s old mantra often reflects the realities of a rapidly evolving codebase and infrastructure. The Guardian’s development culture is designed to allow us to move quickly, deploy a dozen times a day and get statistically-significant A/B test results within hours.
Sometimes, though, that leads to the odd slip-up.
Join Gareth for this talk as he lays bare just a few of the mistakes they made and, more importantly, the lessons learned and the remediation steps they took to avoid those problems in the future.
Here’s a short write-up of the process and tools that I mention in my talk.
Our process is like many others:
We’re 14 minutes from merge to production and, just as importantly, 3.5 minutes from rolling back to a previous version (it’s one deploy from RiffRaff).
There are four pillars to stable and resilient continuous deployment environments:
We should check even the smallest of changes.
The Guardian does PR builds; you can see the repo here: GitHub - guardian/prbuilds: Runtime regression testing infra
Those PR builds run:
“Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.”
-– Rob Ewaschuk
Read the paper
Users don’t care about the causes of problems: a database being down, data being dropped somewhere, an endpoint being inaccessible. Our users do care about errors, slow pages, pages that never return, correct data and working features.
Your dashboards and alerting should reflect the symptoms.
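To make the symptom-vs-cause distinction concrete, here’s a minimal sketch of a symptom-based alert check. The metric names and thresholds are hypothetical, not the Guardian’s actual rules: it alerts on what users experience (error rate, latency), never directly on causes like CPU load or database connections.

```python
# Sketch of symptom-based alerting. Metric names and thresholds are
# hypothetical; the point is that every condition here is something a
# user would actually notice.

def should_alert(window_metrics):
    """Return a list of symptom alerts for one monitoring window."""
    alerts = []
    total = window_metrics["requests"]
    if total == 0:
        # Silence is itself a symptom: the site may be unreachable.
        return ["no traffic at all: the site may be unreachable"]
    error_rate = window_metrics["5xx"] / total
    if error_rate > 0.01:  # more than 1% of users see an error page
        alerts.append(f"error rate {error_rate:.1%} above 1%")
    if window_metrics["p99_latency_ms"] > 2000:  # slow pages
        alerts.append("p99 latency above 2s")
    return alerts

# Example: 2% of requests failing in this window triggers an alert.
print(should_alert({"requests": 1000, "5xx": 20, "p99_latency_ms": 350}))
```

A dashboard built on the same few symptom metrics answers “are users hurting right now?” at a glance, leaving cause-based data (DB health, queue depth) for the drill-down views.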
Follow your PRs out to live: GitHub - guardian/prout: Looks after your pull requests, tells you when they’re live
Kibana: Explore, Visualize, Discover Data | Elastic
SpeedCurve: Monitor front-end performance
Alerts into slack:
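Routing alerts into Slack can be as simple as an incoming webhook, which accepts a JSON POST with a `text` field. A minimal sketch (the webhook URL below is a placeholder, not a real endpoint):

```python
# Send an alert into a Slack channel via an incoming webhook.
# Incoming webhooks accept a JSON body with a "text" field.
import json
import urllib.request

# Placeholder: real URLs come from Slack's Incoming Webhooks configuration.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_payload(message):
    """Build the JSON body Slack expects."""
    return json.dumps({"text": message}).encode("utf-8")

def post_alert(message):
    """POST the alert to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# post_alert("p99 latency above 2s")  # uncomment with a real webhook URL
```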
“we can get value from tests in production that go beyond the conventional merge-deploy-test pipeline”
– Jonathan Hare-Winton & Sam Cutler
Testing in Production: How we combined tests with monitoring | Info | The Guardian
Serving stale is a safety net
Remember the critical-situation role of your tools.
Throwing 500s taught us the value of making sure our tools not only do their 99%-of-the-time job, but also fulfil their 1% critical-situation role.
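As one illustration of serving stale, a caching proxy can be configured to fall back to the last good cached response when the origin errors or times out. A minimal nginx sketch (upstream name and cache path are hypothetical):

```nginx
proxy_cache_path /var/cache/nginx keys_zone=pages:10m inactive=24h;

server {
    listen 80;

    location / {
        proxy_pass http://origin-backend;   # hypothetical upstream
        proxy_cache pages;
        proxy_cache_valid 200 1m;
        # The safety net: if the origin times out or throws 5xx,
        # serve the stale cached copy instead of an error page.
        proxy_cache_use_stale error timeout updating
                              http_500 http_502 http_503 http_504;
    }
}
```

Users get slightly old content instead of a 500, which is exactly the 1% critical-situation role the paragraph above describes.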
There needs to be an open, honest, no-blame culture.
Communication needs work.
People need to feel they can admit their mistakes freely and talk about why they happened, without fear of repercussions.
The GitLab incident
It’s a true “what and why” environment, free of finger-pointing or arguing, laser-focused on finding remediation tasks to shore up the tools and systems that didn’t work as intended.
They provide a short description of what happened… pull together the timeline of events:
The questions that were raised:
Questions designed to tease out bullet points for the next section - Actions.
Actions will often take the form of investigations to answer the questions raised above.
Incident Retrospectives create a virtuous cycle.
Each postmortem teaches us how to build more robust, resilient systems that filter into the next product or feature we build.
Each thing that breaks improves the robustness of our tools, processes and infrastructure.