We will be defined by how we react in a crisis. Ideally we act with empathy and a level head, though that can be hard with things crashing down around you.
This is part of a training session I did with EndGame’s technical leaders, on how we should try to act in a crisis.
On learning of a crisis it is easy to spin out. Take some deep breaths.
You may be angry at yourself or others for causing the problem. Take a minute to acknowledge that internally, then use the energy to push forward.
Remember that blame will not fix this, so put those thoughts aside; they are not useful.
Tell your squad leader, and get them to be the bouncer while you fix the problem.
Communication is key.
Set a cadence at which you will report progress, and report at that cadence at all costs.
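If it helps to make the cadence concrete, here is a minimal sketch that works out when the next reports are due and whether one has been missed. The function names and the 30-minute interval are assumptions for illustration only; pick whatever interval suits your crisis.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def report_times(start: datetime, every: timedelta, count: int) -> List[datetime]:
    """Return the next `count` times a progress report is due after `start`."""
    return [start + every * i for i in range(1, count + 1)]


def is_overdue(last_report: datetime, every: timedelta, now: datetime) -> bool:
    """True if more than one full interval has passed since the last report."""
    return now - last_report > every


# Illustrative values only: a crisis found at 09:00 UTC, reporting every 30 minutes.
found_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
for due in report_times(found_at, timedelta(minutes=30), 3):
    print("Report due at", due.isoformat())
```

A report that says "no change yet" still counts; the point is that people hear from you when you said they would.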
Start a document, be it a paper notebook or anything else, and note down the date and time you found out about the crisis.
Write down what it is that you are fixing.
Start a checklist of things you can think of that will solve (or start to solve) the crisis. If this is a DR / BC (disaster recovery / business continuity) type crisis, hopefully someone has already made one of these for you, though be prepared to change it.
Grab a buddy. This person will be your sounding board; they don’t need to know the project or be a developer. They will keep you sane, bring you things, help you document, and keep you following the checklist.
It may not seem like it, but slow and steady wins the race.
Look for the simplest way of restoring service.
Churning through solutions is typical. Before you try something, write down what you are going to try and when you did it. Assess the fix, and carry on if the problem is not fixed. Yes, you will try the same thing multiple times.
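A paper notebook is fine, but if you prefer to keep the notes in code, the log described above could be sketched roughly like this. Every name here (`IncidentLog`, `Attempt`, `try_fix`, `assess`, `times_tried`) is hypothetical, and the incident details are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Attempt:
    """One thing you tried, when you tried it, and how it turned out."""
    description: str
    started_at: datetime
    outcome: Optional[str] = None  # filled in once the fix has been assessed


@dataclass
class IncidentLog:
    """Hypothetical stand-in for the notebook: what is broken, when, and what was tried."""
    problem: str
    discovered_at: datetime
    attempts: List[Attempt] = field(default_factory=list)

    def try_fix(self, description: str, now: Optional[datetime] = None) -> Attempt:
        # Write the attempt down before doing it, with a timestamp.
        attempt = Attempt(description, now or datetime.now(timezone.utc))
        self.attempts.append(attempt)
        return attempt

    def assess(self, attempt: Attempt, outcome: str) -> None:
        # Record the result of the attempt once you have checked it.
        attempt.outcome = outcome

    def times_tried(self, description: str) -> int:
        # Shows at a glance how often the same thing has already been tried.
        return sum(1 for a in self.attempts if a.description == description)


# Made-up example incident.
log = IncidentLog("Checkout page returns errors", datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))
first = log.try_fix("Restart app server")
log.assess(first, "no change")
second = log.try_fix("Restart app server")
print(log.times_tried("Restart app server"))
```

The same record doubles as the raw material for the hand-off and the incident report later on.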
Stop when it is fixed. Sometimes you will be tempted to try just one more thing. Don’t.
If the fix is temporary, that’s OK; just start planning for the proper fix. Make sure you let people know that the fix is temporary.
Watch the fix for a while before giving the all clear. Sometimes fixes need time to take.
Don’t be afraid to call for support, including from 3rd parties and the rest of the team. Read them in through the notes you have made and the things you have tried.
You may have to step outside your remit and make decisions around cost, loss of data, etc. Shunt these up the line, but set a time frame after which you will take action if you have not heard back. Once again, document what these decisions are and why you think they are appropriate.
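The "escalate, but with a deadline" idea can be written down as a small record. This is only a sketch under assumptions: the `Escalation` class, its fields, and the example decision are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Escalation:
    """A decision outside your remit that has been shunted up the line."""
    decision: str        # what you intend to do
    reason: str          # why you think it is appropriate
    raised_at: datetime  # when you escalated
    wait: timedelta      # how long you will wait for an answer
    answered_at: Optional[datetime] = None

    @property
    def act_at(self) -> datetime:
        # The time at which you will act if nobody has replied.
        return self.raised_at + self.wait

    def should_act(self, now: datetime) -> bool:
        # True once the deadline has passed with no answer from up the line.
        return self.answered_at is None and now >= self.act_at


# Made-up example: waiting 20 minutes for approval before acting.
raised = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
esc = Escalation(
    decision="Restore from last night's backup",
    reason="Simplest way of restoring service; loses today's data",
    raised_at=raised,
    wait=timedelta(minutes=20),
)
print("Will act at", esc.act_at.isoformat())
```

Stating the deadline when you escalate means nobody is surprised when you act on it.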
Don’t lose faith. You will try everything you can think of, then google till the end of the internet, and the problem will still not be fixed. This is normal. It does not mean that you are a failure. Keep the faith and push through.
Be prepared to hand off. You may need to take a break or do something else. The documentation you have kept will be enough to read the next person in and have them continue. Once again, timestamp the hand-off.
Communicate with your comms person on a regular basis. This person should be radiating this information as far as possible.
Do not pass the buck or blame other people, even in passing. Other people around you will also be in a heightened state, and giving them someone to latch on to will cause more problems. Even worse, it can cause key stakeholders to lose confidence in your ability.
Don’t “over reassure”. Things are bad, otherwise you wouldn’t be in crisis mode, so don’t pretend they aren’t. Tell people the facts of the situation. It is better to be able to say “it is less serious than we thought” than “it is more serious”.
Acknowledge uncertainty. If possible, give a time frame for re-establishing service, and think in terms of the worst case. Once again, it is better to revise your time frame down than up.
You may not be able to estimate the time frame. In that case, have a very good reason why you can’t. People don’t like to hear that you can’t estimate a time frame, so estimating a day may be better than not estimating at all, as long as you acknowledge the degree of uncertainty in the estimate.
Accept people’s emotions. While you can’t panic, that doesn’t stop people around you from panicking. It is quite typical that people will react angrily (or want someone to blame). Hopefully the comms buddy will be able to take this, but it will flow over. Don’t take it to heart; they, like you, are human. The best you can do is tell them the facts, give realistic estimates and communicate on a regular cadence. These things build trust and hopefully calm the people around you.
Another type of reaction you may get is the “armchair observer”. These people try to give you “useful” suggestions on what to do and want more information than is necessary. Once again, your comms buddy should be bouncing these people, but if you have to respond, give them the same information that you have given everyone else and tell them that you will take their suggestions on board. They may even, once in a while, have the right one.
Once it is all over, communicate this, along with the nature of the fix and whether it is temporary.
Try to get an incident report out as soon as possible. It doesn’t have to be immediate, though. People will understand if it is midnight and you just want to go to bed; just let them know the incident report is coming.
The incident report should be a non-technical version of your notes, with the times and actions taken. Remove all names and anything that might be a security issue. Include any next steps that will need to happen, including publishing the results of the retro at a later date. If you mention any actions in this report, ensure that you take them and report on them once they are complete.
In the aftermath of an incident you should run a retro.
The purpose of the retro is not to lay blame, or to try to pin responsibility to another organisation or system.
The purpose of the retro is to try to get to the bottom of what you could have done to avoid the crisis. There are quite a few ways of doing this. Don’t assume the closest problem is the one to solve; you may get more bang for your buck by going further down the chain.
See my blog post on how to run one of these.