Once the dust has settled on an incident, it is worth the introspection to discover if there is anything the organisation can do to improve. Small steps can lead to big change, so any actionable insight, however small it may seem, can lead to great improvements.
At EndGame we try to hold an incident retro after every outage, and after the introduction of any major bug or feature discrepancy.
Incident retros should be blameless: you are there to state what happened and when, not to point out who did it. If it is material that a certain role performed an action, by all means record that. However, try not to use people’s names when talking about an action, because even with no implied guilt an outside observer can easily read it as blame.
It is important not to lay blame on another organisation or system. Once again, this is not helpful. An outside force may have taken an action that played some part in the incident, and this should be noted, but the point of the retro is not to find a convenient scapegoat; it is to bring about lasting change.
The purpose of the retro is to try to get to the bottom of what you could have done, if anything, to avoid the crisis.
We start by setting up the whiteboard with the following steps:
On a whiteboard, in the middle, place a When box. Write in the When box what the incident was and when it happened. This is the time that the incident started, not when it was detected (that’s the second step).
To the right (with a bit of space) of the When write up a Detection box. Write in this box how you detected the problem and when.
To the right of the Detection box write up a Fix box. Write in this box how you fixed the problem and when.
To the left of the When write up the Introduction box. Write in this box what introduced the issue, and when.
Now between each box write the time difference.
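If you would rather not do the time arithmetic by hand, the gaps between the boxes can be worked out with a few lines of Python. This is only a sketch: the timestamps and the `gaps` helper are made up for illustration, and the boxes are listed in the order they appear on the whiteboard from left to right.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for the four boxes, left to right on the whiteboard.
timeline = [
    ("Introduction", datetime(2024, 3, 4, 16, 30)),
    ("When", datetime(2024, 3, 5, 9, 0)),
    ("Detection", datetime(2024, 3, 5, 10, 0)),
    ("Fix", datetime(2024, 3, 5, 10, 45)),
]


def gaps(boxes):
    """Return (from_box, to_box, time difference) for each adjacent pair of boxes."""
    return [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(boxes, boxes[1:])
    ]


# Print the time difference to write between each pair of boxes.
for start, end, delta in gaps(timeline):
    print(f"{start} -> {end}: {delta}")
```

With these example times, the gap between When and Detection comes out at one hour, which is the number you would write between those two boxes.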
Once the whiteboard is set up, it is time to discuss the gaps between the boxes.
Now is a good time to make sure that people know the discussion is blameless. Reassert that, and tell people that in the discussion they should try to avoid finger pointing or using people’s names.
Tackle the gaps in this order:
When → Detection
Detection → Fix
Introduction → When
When discussing the gaps, don’t focus on what could have been. It may seem useful to discuss, but it is usually a red herring. Instead, focus on the questions “what caused the gap?” and “what could you do to close the gap?”
Don’t assume the closest problem is the one to solve; you may get more bang for your buck by going further down the chain.
A worked example in the detection gap:
The gap was 1 hour between When and Detection, and it was finally detected by the customer.
The answer to “what caused the gap?” was that you had no simple uptime monitoring in place.
For “what could we do to close the gap?”, a simple solution seems to be to put monitoring in place, but is there a deeper problem that you could look at?
Why wasn’t the monitoring in place? Well it isn’t part of the deploy script — OK so let’s put it in there, but let’s go deeper.
Why wasn’t it in the deploy script? The developer was only thinking of deploying the application and not the environment. This is where things get juicy, as now you are looking at process solutions rather than band-aids on the current project.
Typically you may go as deep as 5 whys. I usually stop at a point where a solution can still be achieved at the level of the people in the room.
The Why, “We don’t have enough money / time for this project”, is always good to know and should be fed up the chain, but the solutions might be difficult to put in place and could require governance and stakeholder changes.
The Why, “Human Error”, is not a good place to stop, as humans are always going to make errors. It is better to go one level further and ask what the human could have done to check their result, or what speed bump could have prompted the correct action / resolution. Process may not always be the best solution; a simple checklist or some automation will work wonders.
At the end of this you should have a list of things that will shorten the gap in each case. Then prioritise them by working out A, B and C (below) and multiplying them together; the lower the score, the sooner the action should be done:
A: how hard is this to implement
0 = takes less than 30 mins
1 = easy: could be a sprint card
2 = medium: could be a set of sprint cards
3 = hard: would need to be an epic
B: how close is this to the introduction of the issue
1 = before introduction
2 = at introduction
3 = before detection
4 = at detection
5 = before fix
C: whether the solution will fix just this project, or any project
1 = all projects
2 = some projects
3 = just this project
Explaining the scale:
If you can do anything in under 30 mins that will stop an incident from happening then do it NOW.
Anything that can fix problems before they are introduced is gold.
Anything that will affect all projects is better than just the one that had the incident.
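The scoring can be sketched in a few lines of Python. Treat this as an illustration rather than a tool we actually use: the `Action` class, the example actions and the scores given to them are all made up, and it assumes the action with the lowest product of A, B and C should be done first, which is how the scale above reads.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    effort: int     # A: 0 (under 30 mins) to 3 (would need an epic)
    proximity: int  # B: 1 (before introduction) to 5 (before fix)
    scope: int      # C: 1 (all projects) to 3 (just this project)

    @property
    def score(self) -> int:
        # Multiply the three ratings; assumption: lower means do it sooner.
        return self.effort * self.proximity * self.scope


# Hypothetical actions, loosely based on the monitoring example.
actions = [
    Action("Add uptime monitoring to this project", effort=1, proximity=4, scope=3),
    Action("Add monitoring to the shared deploy script", effort=2, proximity=3, scope=1),
    Action("Add an environment checklist to the deploy process", effort=0, proximity=2, scope=1),
]

# List the actions from lowest score to highest.
for action in sorted(actions, key=lambda a: a.score):
    print(action.score, action.description)
```

Note that anything rated 0 for A always scores 0 overall, whatever its B and C ratings, so the quick wins float straight to the top of the list.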
The final step is to publish the result to the affected parties. Do this by writing up the timeline and the notes on the gaps, then adding the actions that you are going to take. It is not worth telling people about things that are not going to happen, though those should still go into the backlog for later discussion.
Summing this all up
Have a retro for every incident if at all possible
Make sure that everyone in the retro knows that it is blameless
Draw up a timeline: “When”, “Detection”, “Fix” and then (before When) “Introduction”
Add times to gaps
Discuss the gaps, and how to make the time of the gap smaller (or remove it)
Prioritise the actions from the discussion based on how fast they are to do, how close they are to the introduction, and whether they affect just this project or all projects
Publish the results of the retro to those affected
Do the actions agreed
I hope that this helps you improve your process and take small steps towards greatness.