LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

The Art of Writing a Good Root Cause Analysis/Reason for Outage

Recently HXServers posted an RCA on LowEndTalk. RCA stands for “Root Cause Analysis”. RCAs are sometimes also called Reasons for Outage.

The HXServers RCA is excellent, but all too often, these announcements do not serve their purpose.  Done properly, they can be a marvelous tool for increasing customer trust.  Done poorly, they increase user anxiety.

Some examples of how to screw up an RCA:

  • Take too long to write it.  Customers want answers within a day or two, not weeks.
  • Don’t reveal the root cause.  It’s OK to say you’re still investigating, but if you say that, then you need to write a follow-up.
  • Drown things in technical details.  “Our Gonkulator X-2000 firmware 2023.1103.102 upgrade to patch 1059 failed, causing a cascading failure in our Merkel M-105 transducers…”
  • Act like it’s someone else’s fault instead of taking responsibility.

To be effective, an RCA should do the following:

Clearly state what happened.  At 12:05 on January 13th, this and that happened.  It affected  our Atlanta, New Brunswick, and Boise data centers.  It didn’t affect Nashville.

You noticed it immediately.  Your monitoring was on top of this or your engineers noticed the anomaly.  You were not caught unaware and you didn’t have to wait until angry customers pointed it out.

How you took action.  You immediately engaged team X and Y, and they sprang into action.  You’re trying to convey the urgency of your response, and how important resolving the issue was to you.  Your customers were feeling pain, and you worked as quickly as possible to fix it.

What you did to fix it.  The actions you took, and why they took so long.  Problems you had along the way.  You were always acting with urgency.

How you know it’s completely fixed.  You don’t want users thinking that the issue could reoccur at any time.  You’ve understood the problem and been thorough in resolving it.

What you’re doing to prevent it.  You’re committed to making sure this issue never happens again.  You’ve put additional monitoring in place, you’ve changed configurations, you’ve added capacity, etc.

You recognize the pain and you apologize.  You’re not making excuses.  You realize you fell short and you’re very sorry for that.

You’re happy to talk more at length, with priority.  Maybe there’s a customer still suffering – if so, you want to jump on that.  Here’s how to get ahold of us.
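The points above work as a checklist, and one way to use it is to check a draft RCA against it before publishing. The following is only a minimal sketch: the section keys, the `RCA_SECTIONS` list, and the `missing_sections` helper are hypothetical names made up for illustration, not part of any tool mentioned here.

```python
# Hypothetical checklist built from the points above. The keys are
# invented for this sketch and do not come from any real tool.
RCA_SECTIONS = [
    ("what_happened", "Clearly state what happened"),
    ("detection", "How you noticed it"),
    ("response", "How you took action"),
    ("fix", "What you did to fix it"),
    ("verification", "How you know it's completely fixed"),
    ("prevention", "What you're doing to prevent it"),
    ("apology", "Recognize the pain and apologize"),
    ("contact", "How customers can reach you"),
]


def missing_sections(draft):
    """Return descriptions of checklist items that are absent or blank in the draft."""
    return [
        description
        for key, description in RCA_SECTIONS
        if not draft.get(key, "").strip()
    ]


if __name__ == "__main__":
    # A deliberately incomplete draft, using the example wording from above.
    draft = {
        "what_happened": "At 12:05 on January 13th, this and that happened.",
        "apology": "We fell short and we are very sorry.",
    }
    for item in missing_sections(draft):
        print("Missing:", item)
```

A check like this only confirms that each section exists. Whether the wording is clear, prompt, and honest still needs a human read.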

That kind of RCA makes the customer feel like you’re a sharp team.  Everyone knows that IT is never perfect.  Google, Microsoft, and other tech giants have outages and problems.  It’s not so much that issues occur but rather how you respond.

raindog308
