How We Broke the Guardian Website… And What We Learned

Watch the talk - you’ll need to create a free account.

Get the slides (PDF)

Moving fast shouldn’t mean breaking things, but Facebook’s old mantra often reflects the realities of a rapidly evolving codebase and infrastructure. The Guardian’s development culture is designed to allow us to move quickly, deploy a dozen times a day and get statistically-significant A/B test results within hours.

Sometimes, though, that leads to the odd slip up.

Join Gareth for this talk as he lays bare just a few of the mistakes the team made and, more importantly, the lessons learned and the remediation taken to avoid the same problems in the future.

Here’s a short write-up of the process and tools that I mention in my talk.

Our process is like many others:

We’re 14 minutes from merge to production and, just as importantly, 3.5 minutes away from rolling back to a previous version (it’s one deploy from RiffRaff).

There are four pillars to stable and resilient continuous deployment environments:

Pre-merge monitoring

We should check even the smallest of changes.

The Guardian runs PR builds; you can see the repo here: GitHub - guardian/prbuilds: Runtime regression testing infra

Those PR builds run a set of runtime regression checks against each pull request; the repo above lists the full set.
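As a flavour of what a pre-merge runtime check can look like, here is a minimal, hypothetical TypeScript sketch that hits a PR’s preview environment and fails the build on errors or a blown response-time budget. The URL, environment variable and budgets are invented for illustration; this is not the prbuilds implementation.

```typescript
// Hypothetical pre-merge runtime check: hit a PR's preview environment and
// fail the build if the page errors or blows its response-time budget.
// PREVIEW_URL and the thresholds are illustrative, not the real prbuilds config.
const PREVIEW_URL = process.env.PREVIEW_URL ?? "https://preview.example.com/uk";
const MAX_RESPONSE_MS = 2000;

async function checkPreview(): Promise<void> {
  const started = Date.now();
  const response = await fetch(PREVIEW_URL);
  const elapsed = Date.now() - started;

  if (!response.ok) {
    throw new Error(`Preview returned ${response.status} for ${PREVIEW_URL}`);
  }
  if (elapsed > MAX_RESPONSE_MS) {
    throw new Error(`Preview took ${elapsed}ms (budget ${MAX_RESPONSE_MS}ms)`);
  }
  console.log(`Preview OK in ${elapsed}ms`);
}

checkPreview().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit fails the PR build
});
```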

Post-deploy monitoring

“Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.”
– Rob Ewaschuk

Read the paper

Users don’t care about the causes of problems: a database being down, data being dropped somewhere, an endpoint being unreachable. They do care about errors, slow pages, pages that never return, correct data and working features.

Your dashboards and alerting should reflect the symptoms.
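To make “alert on symptoms” concrete, here is a minimal sketch of a symptom-based check over a window of request samples. The thresholds and the sample shape are assumptions for illustration, not our real alerting configuration.

```typescript
// Sketch of a symptom-based alert: page on what users feel (error rate, slow
// responses), not on underlying causes. Thresholds and the sample shape are
// illustrative assumptions.
interface RequestSample {
  status: number;     // HTTP status returned to the user
  durationMs: number; // time the user waited
}

function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

function symptomsToAlertOn(samples: RequestSample[]): string[] {
  if (samples.length === 0) return ["No request samples in the window"];

  const alerts: string[] = [];
  const errorRate = samples.filter((s) => s.status >= 500).length / samples.length;
  const p90 = percentile(samples.map((s) => s.durationMs), 90);

  if (errorRate > 0.01) alerts.push(`Error rate ${(errorRate * 100).toFixed(1)}% (budget 1%)`);
  if (p90 > 1500) alerts.push(`p90 response time ${p90}ms (budget 1500ms)`);
  return alerts; // an empty array means users are (probably) happy
}
```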

Prout

Follow your PRs out to live: GitHub - guardian/prout: Looks after your pull requests, tells you when they’re live

Kibana

Kibana: Explore, Visualize, Discover Data | Elastic
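Kibana sits on top of Elasticsearch, so the questions our dashboards answer can also be asked of the cluster directly. Here is a minimal sketch of one such question (how many 5xx responses in the last 15 minutes?); the index name and field names are assumptions for illustration.

```typescript
// Sketch of the kind of question a Kibana dashboard answers, expressed as a
// raw Elasticsearch count query. The index pattern and field names are assumed.
async function countRecentServerErrors(elasticsearchUrl: string): Promise<number> {
  const response = await fetch(`${elasticsearchUrl}/access-logs-*/_count`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: {
        bool: {
          filter: [
            { range: { "@timestamp": { gte: "now-15m" } } }, // last 15 minutes
            { range: { status: { gte: 500 } } },             // server errors only
          ],
        },
      },
    }),
  });
  const body = (await response.json()) as { count: number };
  return body.count;
}
```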

Our users’ experience relies on how our site performs.

SpeedCurve

SpeedCurve: Monitor front-end performance

Alerts go straight into Slack:
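As a sketch of how an alert can reach Slack, here is a minimal example using Slack’s incoming webhooks; the message wording and the SpeedCurve-style alert text are invented for illustration.

```typescript
// Sketch of pushing a performance alert into Slack via an incoming webhook.
// The webhook URL comes from Slack's "Incoming Webhooks" integration; the
// message format here is a minimal illustration.
async function alertSlack(webhookUrl: string, message: string): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `:rotating_light: ${message}` }),
  });
  if (!response.ok) {
    throw new Error(`Slack webhook failed with status ${response.status}`);
  }
}

// Example (hypothetical alert text):
// alertSlack(process.env.SLACK_WEBHOOK_URL!, "SpeedCurve: homepage start render up 40% since last deploy");
```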

Sentry

https://sentry.io/

GitHub - getsentry/raven-js: JavaScript client for Sentry
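For flavour, here is a minimal raven-js setup using the library’s documented config/install/captureException API; the DSN, release value and the example error are placeholders.

```typescript
// Minimal raven-js setup. The DSN is a placeholder; a real one comes from
// your Sentry project settings.
import Raven from "raven-js";

Raven.config("https://examplePublicKey@sentry.io/0", {
  release: "frontend@2017-03-01", // tie errors to a deploy (illustrative value)
}).install();

// Unhandled exceptions are now reported automatically; handled ones can be
// sent explicitly with extra context.
try {
  riskyFeature();
} catch (err) {
  Raven.captureException(err as Error, { extra: { feature: "riskyFeature" } });
}

function riskyFeature(): void {
  throw new Error("something went wrong"); // stand-in for real feature code
}
```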

Production monitoring

“we can get value from tests in production that go beyond the conventional merge-deploy-test pipeline”

– Jonathan Hare-Winton & Sam Cutler

Testing in Production: How we combined tests with monitoring | Info | The Guardian
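The article above has the details; as a sketch of the idea, here is a hypothetical scheduled check that asserts real production behaviour and reports the result to monitoring rather than gating a build. The URL check and the logging stand-in for a metrics sink are assumptions.

```typescript
// Sketch of a test in production: a scheduled check that asserts what a real
// user would see on the live site and feeds the result into monitoring.
async function checkLiveFrontPage(): Promise<boolean> {
  try {
    const response = await fetch("https://www.theguardian.com/uk");
    const html = await response.text();
    // A symptom a user would notice: the page loads and renders a body at all.
    return response.ok && html.includes("<body");
  } catch {
    return false;
  }
}

async function runProductionTest(): Promise<void> {
  const passed = await checkLiveFrontPage();
  // In a real setup this would push a metric that dashboards and alerts
  // consume; structured logging stands in for that here.
  console.log(JSON.stringify({ check: "front-page-loads", passed, at: new Date().toISOString() }));
}

runProductionTest();
```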

Smart and resilient tools

Serving stale is a safety net

Remember the critical-situation role of your tools.

Throwing 500s taught us the value of making sure our tools not only do their 99%-of-the-time job but also remember their 1% critical-situation role.
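To make “serving stale” concrete, here is a minimal sketch of the pattern, with an in-memory map standing in for a real cache (a CDN, Redis and so on); it is not our actual caching layer.

```typescript
// Sketch of "serving stale as a safety net": keep the last good response and
// fall back to it when the origin starts failing, instead of passing a 500 on
// to users. The in-memory map stands in for a real cache.
const lastGood = new Map<string, string>();

async function fetchWithStaleFallback(url: string): Promise<string> {
  try {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`Origin returned ${response.status}`);
    const body = await response.text();
    lastGood.set(url, body); // remember the 99%-of-the-time happy path
    return body;
  } catch (err) {
    const stale = lastGood.get(url);
    if (stale !== undefined) {
      // The 1% critical-situation role: stale content beats an error page.
      return stale;
    }
    throw err; // nothing to serve stale; surface the failure
  }
}
```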

Communication

Communication needs work.

There needs to be an open, honest, no-blame culture where people feel they can admit to their mistakes freely and talk about why they happened, without fear of repercussions.

The GitLab incident

Incident Retrospectives

They are the true “what and why” environment: free of finger-pointing or arguing, laser-focussed on finding remediation tasks to shore up the tools and systems that didn’t work as intended.

They provide a short description of what happened… and pull together the timeline of events:

The questions that were raised:

These questions are designed to tease out bullet points for the next section: Actions.

Actions often range from investigations to answer the questions above to concrete remediation tasks.

Incident Retrospectives create a virtuous cycle.

Each postmortem teaches us how to build more robust, resilient systems that filter into the next product or feature we build.

Recap

Each thing that breaks improves the robustness of our tools, processes and infrastructure.

Thanks!