Incident Management Protocol: An example

Incidents are unexpected events that can disrupt an organization's normal operations. They can range from minor issues, such as a software bug, to major crises, such as a data breach. It is therefore essential for organizations to have an incident management protocol in place so they can respond quickly and effectively.

In this article, we will discuss an example of an incident management protocol that can be used by organizations.

Incident Protocol

Before the incident

What is an incident?

An incident is any unplanned disruption or degradation of service that is actively affecting customers' ability to use our platform and our product(s).

Severity Levels

The first step is to decide what constitutes an incident. This section provides a generic classification by severity level, with lower numbers representing higher severity.

If you are unsure which level an incident is, treat it as the highest one (SEV-1). Don’t discuss the severity level during an incident; you can always review it during the postmortem. A minimal code sketch of this classification follows the SEV-3 list below.

SEV-1 (Critical incident) - Critical issue that warrants public notification and liaison with executive team

  • The system is in a critical state and is impacting a large number of customers.
  • The Infrastructure or Platform is down.
  • User-facing services are not available.
  • No data is shown at all.
  • Security vulnerability that exposes customer data has come to our attention.

SEV-2 (Major incident) - Major system issue actively impacting many customers’ ability to use the product

  • Data for one or more services is not showing, or is not showing correctly.
  • Issues in the platform that prevent substantial parts of the data from showing correctly or at all.
  • Issues in the platform that disable essential functionality without workaround.

SEV-3 (Minor incident, with low impact) - Stability or minor customer-impacting issues that require immediate attention from service owners

  • Everything else that is not SEV-1 or SEV-2 but still impacts the user experience and the ability to use our platform / product.
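
If it helps to make the classification concrete, here is a minimal sketch in Python. The Severity enum and the classify helper are hypothetical names used for illustration, not part of any existing tooling.

    # A minimal, hypothetical sketch of the severity levels described above.
    from enum import IntEnum
    from typing import Optional


    class Severity(IntEnum):
        """Lower numbers represent higher severity."""
        SEV_1 = 1  # Critical: public notification and liaison with the executive team
        SEV_2 = 2  # Major: many customers cannot use the product
        SEV_3 = 3  # Minor: customer-impacting, needs attention from service owners


    def classify(severity: Optional[Severity]) -> Severity:
        """Treat the incident as SEV-1 if the declarer is unsure of the level.

        The level is not debated during the incident; it can be reviewed in the postmortem.
        """
        return severity if severity is not None else Severity.SEV_1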

Roles and Responsibilities

Incident Commander (IC)

When an incident is first declared, the IC commands and coordinates the incident response, delegating roles as necessary. Initially, the IC assumes any roles that haven’t been delegated yet. Depending on who is best placed to solve the incident, the IC may hand off the IC role to someone else and assume the Operations Lead (OL) role, or delegate the OL role to someone else.

Communications Lead (CL)

This is the person responsible for providing periodic updates to the response team and to the stakeholders, as well as managing inquiries about the incident. The CL is the public face of the incident response team.

Operations Lead (OL)

The Ops Lead works with the IC to respond to the incident, applying operational tools to mitigate or resolve it. This role is also often referred to as the Subject Matter Expert. The Operations team should be the only group modifying the system during an incident.

Note: Both the CL and the OL may lead a team of people to help manage their specific areas of incident response, and these teams may contract or expand as necessary.

Communication in Slack/Chat

The IC is responsible for creating a separate Slack/IRC channel for the incident at hand. A dedicated channel makes it easier to scan the chat history when rebuilding the timeline for the post-mortem, and it also allows us to handle multiple incidents simultaneously. Other team members may join the channel of any incident they are particularly interested in.

Incident Calls

Incident calls should be recorded (if possible), so that we can refer to them later (a link to the call(s) should be available in the Post-mortem). It may be the case that the OL has a clear idea of how to mitigate or resolve the incident, in which case a call may not be necessary.

During the incident

Declaring an incident

Don’t panic! The person who declares an incident should take the following actions:

  • Declare on Slack that an incident has occurred.
  • Create a ticket of type Incident, fill in the details, and add links to the customer tickets, if any.

Create a separate Slack channel for that particular incident. The Slack channel should be named using the following naming convention: warroom-<short description>-<ticket number>. E.g. warroom-previews-broken-lpdev-12345, where previews-broken is the short description and lpdev-12345 is the ticket number.

Announce the newly created channel on the #developers Slack channel, so that other people can join if they want. Use @channel to notify everyone. Example message:

@channel A new incident channel was created #warroom-<short description>-<ticket number> because of an ongoing incident, feel free to join.

Update the Slack channel field in the ticket.
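
In case anyone wants to script this step, here is a small sketch of the naming convention and the announcement text. The warroom_channel_name and announcement helpers are hypothetical names, and actually creating the channel or posting the message is left out.

    import re


    def warroom_channel_name(short_description: str, ticket_number: str) -> str:
        """Build a channel name following warroom-<short description>-<ticket number>."""
        # Slack channel names are lowercase, with words separated by hyphens.
        slug = re.sub(r"[^a-z0-9]+", "-", short_description.lower()).strip("-")
        return f"warroom-{slug}-{ticket_number.lower()}"


    def announcement(channel_name: str) -> str:
        """Message to post in #developers so other people can join the incident channel."""
        return (
            f"@channel A new incident channel was created #{channel_name} "
            "because of an ongoing incident, feel free to join."
        )


    # Example from the protocol: previews broken, ticket LPDEV-12345.
    name = warroom_channel_name("previews broken", "LPDEV-12345")
    assert name == "warroom-previews-broken-lpdev-12345"
    print(announcement(name))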

Steps for the Incident Commander

Announce on the call and in Slack that you’re the Incident Commander.

If by any chance you’re the expert that knows how to fix the problem, then delegate your role of IC to someone else and assume the role of Operations Lead (OL).

Identify whether there is an obvious cause of the incident (recent deployment, spike in traffic, etc.), and delegate the investigation to the OL.

The OL will assist you in the analysis. Most of the time, the OL will be able to quickly confirm the cause, but that is not always the case. Confer with service owners and use their knowledge to help you.

Identify investigation and repair actions and delegate them to the OL. Some examples (the list is non-exhaustive; a small runbook sketch follows it):

  • Bad deployment: roll it back.
  • Event flood: validate automatic throttling is sufficient, adjust manually if not.
  • Degraded service behaviour: gather forensic data (heap dumps, logs, etc), and consider doing a rolling restart.

Listen for prompts from the OL regarding severity escalation, and decide whether we need to announce the incident publicly.
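
The examples above can also be kept as a lightweight runbook lookup. The sketch below is hypothetical (the RUNBOOK mapping and the suggested_repair_action helper are not existing tooling) and simply mirrors the list.

    # Hypothetical runbook lookup mirroring the examples above; entries are
    # suggested first repair actions, not a substitute for judgement.
    RUNBOOK = {
        "bad deployment": "Roll the deployment back.",
        "event flood": "Validate that automatic throttling is sufficient; adjust manually if not.",
        "degraded service behaviour": (
            "Gather forensic data (heap dumps, logs, etc.) and consider a rolling restart."
        ),
    }


    def suggested_repair_action(cause: str) -> str:
        """Return the suggested first repair action, or a prompt to keep investigating."""
        return RUNBOOK.get(cause.lower(), "No known repair action; keep investigating with the OL.")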

Steps for the Operations Lead (Subject Matter Expert)

You’re there to support the Incident Commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on the repair actions.

Investigate the incident and announce all the findings to the Incident Commander (if you’re not in an Incident Call, then make sure to communicate over Slack in the respective incident channel).

If you’re unsure of the cause, that’s OK. Simply state that you’re still investigating, but make sure to provide regular updates to the IC.

Announce all suggested resolutions to the Incident Commander and let them decide on the course of action (they may also ask for your opinion if they’re unsure). Do not take any action until it has been decided and announced.

Steps for the Communications Lead

You’re there to provide updates to stakeholders (both internal and external). The interested parties may vary depending on the severity level of the incident.

Be prepared to page other people as directed by the Incident Commander.

Provide regular status updates on Slack to the executive team (roughly every 30-45 minutes), giving an executive summary of the current status. Keep it short and to the point.
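
If it helps to keep these updates consistent, here is a minimal sketch of a structured update; StatusUpdate and format_status_update are hypothetical names, and the fields are only a suggestion.

    from dataclasses import dataclass


    @dataclass
    class StatusUpdate:
        """Fields for a short executive summary; adjust as needed."""
        severity: str          # e.g. "SEV-2"
        impact: str            # what customers currently cannot do
        current_action: str    # what the response team is doing right now
        next_update_minutes: int = 30


    def format_status_update(update: StatusUpdate) -> str:
        """Render a short, to-the-point update for the executive team on Slack."""
        return (
            f"[{update.severity}] Impact: {update.impact} | "
            f"Current action: {update.current_action} | "
            f"Next update in ~{update.next_update_minutes} min."
        )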

You may be required to update our status page (or instruct someone to do so).

You may occasionally have to relay information to the incident response team, for example if a customer reports other issues that we are not yet aware of.

After the incident

Postmortem

When the incident is over, the IC is responsible for starting a draft of the Post-mortem. The previously created Slack channel should contain (almost) all the information needed to build a detailed timeline and fill in (most of) the sections. When the IC finishes the draft, it should be shared on the Slack channel so that other team members can fill in any gaps, add comments / feedback, or ask for clarification. After a period of three days, once everyone has had time to give feedback, the Post-mortem should be ready to be published to Confluence.
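
One way to bootstrap the timeline is to export the incident channel's history. The sketch below uses the official slack_sdk client; the token and channel ID are placeholders for whatever configuration is actually in use, and the formatting will likely need adjusting for threads and attachments.

    from datetime import datetime, timezone

    from slack_sdk import WebClient  # pip install slack_sdk

    # The token and channel ID are assumptions; in practice they come from
    # configuration, and the channel ID is the incident's warroom channel.
    client = WebClient(token="xoxb-your-bot-token")
    response = client.conversations_history(channel="C0123456789", limit=200)

    # conversations_history returns newest messages first; sort chronologically.
    for message in sorted(response["messages"], key=lambda m: float(m["ts"])):
        timestamp = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
        print(f"{timestamp:%Y-%m-%d %H:%M} UTC  {message.get('text', '')}")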

Messaging

Once the incident is resolved and the post-mortem is published, we should inform both employees and customers.

Internal

This should be a simple follow-up to employees, after the post-mortem meeting (if one was scheduled) or once the post-mortem is published. Briefly summarize what happened and include a link to the post-mortem. We can eventually define a template for this kind of email.

External

This is what will be included on the website’s Status page regarding the incident. (Perhaps include a genuine apology to the customers?)

  • Summary
  • What happened
  • What are we doing about this

Incident communication

This section contains suggestions of email templates to send to clients and stakeholders on different stages of an incident. Take these templates as guidelines and adjust them as you see fit.

Discovery

If it has been assessed that the incident warrants informing clients, this should happen as soon as possible. Be clear, concise, and transparent. If there is no estimate of when the incident will be resolved, don’t make one up.

Subject: Website [incident]

Dear [client],

We are experiencing [issue/outage] in our platform today. At [time] we discovered that [description of what is happening] and noticed that this might also be the case in your Open DCO environment.

The impacted parts are [include parts], which means that at the moment you won’t be able to [add impact].

In the meantime, you can [include workaround if exists].

Our engineers are now investigating the issue and you’ll be informed as soon as we have more information.

Please don’t hesitate to contact us should you have any questions!

Update

Once we have a better idea of where we stand and how long it will take to resolve the incident, we can send another update. It makes sense to skip this email if the fix will take just a few minutes.

Subject: ODC [incident] update

Dear [client]

We are experiencing [issue/outage] in our platform today, [Month/day/year].

At [time] we discovered that [description of what is happening] and noticed that this might also be the case in your Open DCO environment.

The parts that are impacted are [add parts], which means that at the moment you are not able to [add impact].

In the meantime, you can [add workaround].

We are now investigating this issue to find the cause and a solution. Once we have a clear view of when we expect these functionalities to be available again, we will inform you once more.

If you have any questions, please don’t hesitate to let us know!

Resolved

An email should be sent to our clients once the incident has been resolved.

Subject: ODC [incident] has been resolved

Dear [client],

Earlier today we informed you about [issue], and we would like to let you know that it has now been fixed. Your platform should now be fully functional again.

What happened was [add cause]; we fixed this by [add fix].

We have learned from this and will [what will we do in the future to prevent this].

We apologize for any inconvenience this may have caused, and we want you to know that we take the performance and reliability of WPP Open DC very seriously. We will continue to keep you informed of the additional measures we’re taking concerning the stability and reliability of our platform.

If you have any questions, please don’t hesitate to let us know!
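
If it ever becomes useful to fill these templates programmatically, the sketch below assumes the simple [placeholder] convention used above; fill_template is a hypothetical helper, not existing tooling.

    import re

    DISCOVERY_TEMPLATE = (
        "Dear [client],\n\n"
        "We are experiencing [issue/outage] in our platform today. At [time] we discovered "
        "that [description of what is happening] and noticed that this might also be the "
        "case in your Open DCO environment.\n"
    )


    def fill_template(template: str, values: dict) -> str:
        """Replace [placeholder] tokens with provided values, leaving unknown ones intact."""
        return re.sub(r"\[([^\]]+)\]", lambda m: values.get(m.group(1), m.group(0)), template)


    print(fill_template(DISCOVERY_TEMPLATE, {"client": "Acme", "time": "09:42 UTC"}))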
