Recently HXServers posted an RCA on LowEndTalk. RCA stands for “Root Cause Analysis”. These reports are sometimes also called a Reason for Outage (RFO).
The HXServers RCA is excellent, but all too often, these announcements do not serve their purpose. Done properly, they can be a marvelous tool for increasing customer trust. Done poorly, they increase user anxiety.
Some examples of how to screw up an RCA:
- Take too long to write it. Customers want answers within a day or two, not weeks.
- Don’t reveal the root cause. It’s OK to say you’re still investigating, but if you say that, then you need to write a followup.
- Drown things in technical details. “Our Gonkulator X-2000 firmware 2023.1103.102 upgrade to patch 1059 failed, causing a cascade failure in our Merkel M-105 transducers…”
- Act like it’s someone else’s fault instead of taking responsibility.
To be effective, an RCA should do the following:
- Clearly state what happened. At 12:05 on January 13th, this and that happened. It affected our Atlanta, New Brunswick, and Boise data centers. It didn’t affect Nashville.
- You noticed it immediately. Your monitoring was on top of this or your engineers noticed the anomaly. You were not caught unaware and you didn’t have to wait until angry customers pointed it out.
- How you took action. You immediately engaged teams X and Y, and they sprang into action. You’re trying to convey the urgency of your response, and how important resolving the issue was to you. Your customers were feeling pain, and you worked as quickly as possible to fix it.
- What you did to fix it. The actions you took, and why they took as long as they did. Problems you had along the way. You were always acting with urgency.
- How you know it’s completely fixed. You don’t want users thinking that the issue could reoccur at any time. You’ve understood the problem and been thorough in resolving it.
- What you’re doing to prevent it. You’re committed to making sure this issue never happens again. You’ve put additional monitoring in place, you’ve changed configurations, you’ve added capacity, etc.
- You recognize the pain and you apologize. You’re not making excuses. You realize you fell short and you’re very sorry for that.
- You’re happy to talk more at length, with priority. Maybe there’s a customer still suffering – if so, you want to jump on that. Here’s how to get ahold of us.
That kind of RCA makes the customer feel like you’re a sharp team. Everyone knows that IT is never perfect. Google, Microsoft, and other tech giants have outages and problems. It’s not so much that issues occur but rather how you respond.