Intro
I remember getting a SMS for a major incident. A critical functionality on a production application has stopped working. The Rapid Response team has been engaged. Everyone’s on the call, Engineering leads, Ops, Execs.
The start time of the incident is unclear. The different teams are looking at dashboards, reviewing alerts, analysing logs and reviewing changes. The team has found that the incident started 4 hours before the reported time.
After 2 more hours on the call, a change was found to be the cause of the incident. The change was performed on an infrastructure component and there was no linkage between it and the impacted application.
There were a lot of action items from the PIR (Post Incident Review). It was a blameless post-mortem. The critical item on the list was to find a solution to understand the blast radius of future changes.
What would it take to know the blast radius before the change?
