An Incredibly Amazing Co-Incidence Of Doubled, Double Disk Failures!

Jul 12, 2022 @ 11:43 pm

A second failure occurred on the fallback server

Just recently we were introducing our newer Low Enders to Hacker News (“HN”). Now, suddenly, we have news of a doubled, double disk failure over at HN, which took the site down for about 8 hours.

What is a “doubled, double disk failure?” As explained in our Introduction to HN post, HN has been running on two servers rented from M5 Hosting. One machine is the primary and the other is the standby. Each machine has a pair of mirrored SSDs. Normally, only one machine is in use, serving HN’s approximately 6 million requests per day.

What seemed to happen on July 8 was that the primary server’s two disks went down, followed a few hours later by the demise of the standby server’s two disks. That seems to be a doubled, double disk failure! The failure count seems to be four, even though the Twitter screenshot shown with this post mentions the “second disk failure.”

The cause seems to be a defect in the SSDs themselves. Here are HN moderator dang and M5 owner mikiem responding to a suggestion by HN member kabdib that the SSD failures might have been caused by a manufacturing defect. Apparently a firmware bug caused all four SSDs to fail at approximately 40,000 hours of operation.
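To get a feel for what 40,000 hours means, here is a minimal Python sketch of the arithmetic. It assumes the drives are powered on continuously from the day they enter service. The in-service date below is purely hypothetical and is not HN's actual date, and the function names are my own.

```python
from datetime import datetime, timedelta

# Power-on hour threshold mentioned in the HN discussion.
FAILURE_HOURS = 40_000

def years_of_operation(hours: int) -> float:
    """Convert continuous power-on hours to years (365.25-day years)."""
    return hours / (24 * 365.25)

def projected_failure(in_service: datetime, hours: int = FAILURE_HOURS) -> datetime:
    """Date a drive reaches the threshold, assuming it is never powered off."""
    return in_service + timedelta(hours=hours)

# Hypothetical: two mirrored drives installed on the same day.
hypothetical_start = datetime(2020, 1, 1)
drive_a = projected_failure(hypothetical_start)
drive_b = projected_failure(hypothetical_start)

print(f"{years_of_operation(FAILURE_HOURS):.2f} years")
print(drive_a, drive_b, drive_a == drive_b)
```

Under these assumptions the threshold arrives after roughly four and a half years, and two drives installed together reach it at the same moment. That is why a mirror built from identical drives offers little protection against this kind of bug.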

Here is further discussion indicating that two disks failed on each server, making four failures in total (a doubled, double disk catastrophe). That discussion also links to a 2021 Field Notice, FN 70545 from Cisco, describing the defect as an “industry wide firmware index bug.”

It seems obvious that redundant equipment would provide additional safety. It formerly seemed much less obvious, at least to me, that two similarly manufactured items might fail simultaneously. I always imagined there was a very high degree of security added by a second set of equipment.
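The intuition behind that belief can be sketched in a few lines of Python. The failure probability used here is a made-up illustrative number, not a measured rate for any real drive, and the function names are my own.

```python
def both_fail_independent(p: float) -> float:
    """Chance that two drives both fail in a period if failures are independent."""
    return p * p

def both_fail_common_cause(p_trigger: float) -> float:
    """Chance that both drives fail if one shared defect takes out both at once."""
    return p_trigger

# Hypothetical per-period failure probability for a single drive.
p = 0.02

print(f"independent:  {both_fail_independent(p):.4%}")
print(f"common cause: {both_fail_common_cause(p):.4%}")
```

With independent failures the mirror looks vastly safer than a single drive. With a shared defect, such as the same firmware bug in both drives, the mirror is no safer than one drive against that particular failure.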

Some years back, a double disk failure occurred on a server providing shared hosting for one of my websites. Indeed, when I received the support email pictured below, I wondered how such a thing could be possible. However, the failure occurred at Interserver. Because of the super friendly, always helpful, and very capable guys who work there, Interserver will always remain my favorite hosting company. So I just chalked the double failure up to an incredible co-incidence, grabbed my backups, and reinstalled.

Now, thanks to what happened at HN and to the wonderful discussions there, I know better than to think that doubled hardware is sufficient security or that nearly simultaneous double disk failures are nearly impossible. Even doubled, double disk failures do happen! Now I have an increased desire for additional, differently formatted, and differently stored backups. Backups are good! Now might be a great time to make another backup! :)

Electrical fault caused problems with multiple drives motors

Not_Oles

It feels like just yesterday—but it was actually fifty years ago—that I stood in a doorway, watching yard after yard of ASCII art scroll out of a Teletype Model 33, surrounded by a group of laughing, wide-eyed tech enthusiasts. That was my first taste of computing, and I’ve been hooked ever since.

My Low End Adventure began right here on LowEndBox, when I stumbled across the perfect deal on a dedicated server from OVH a few years ago. That moment opened the door to a whole new world of tinkering, learning, and community.

Today, I’m the proud owner of Darkstar—a lovingly maintained antique server currently colocated in Dallas, Texas at LevelOneServers.com. She may be older than most, but she runs like a champ and has taught me more than a few lessons about hardware and humility.

In addition to writing for LowEndBox and helping out as a moderator on LowEndTalk, I spend my time exploring the worlds of programming, networking, and Linux system administration. There’s always something new to learn—and luckily, the LowEnd community is filled with brilliant, generous people who are more than willing to teach.

All these years later, I’m still learning, still experimenting, and still having a blast.

It’s very, very fun here on the Low End… isn’t it? 😊

3 Comments

BozoDaKlown:
This has got to be the same 40k hours SSD bug that HPE & Dell patched a couple of years back. Someone isn’t keeping on top of their scheduled maintenance….
July 13, 2022 @ 2:14 pm
Anindya:
Ah, it’s because of these line from my blog. Innit?
“Like I mentioned in the video description, they have real life time machines too. So I doubt that storage is an issue for them.”
🤣🤷‍♂️
July 14, 2022 @ 1:44 am
Gareth:
It is a well known principle in engineering safety critical systems that diversity is as important as redundancy, eg a mix of manufacturers for the disks would have avoided this, even if one manufacturer had significantly lower MTBF.
In redundant systems with degraded modes of operation it is easily shown that MTTR is far more important than MTBF for all reasonable MTBFs.
July 15, 2022 @ 11:12 pm