LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

An Incredibly Amazing Co-Incidence Of Doubled, Double Disk Failures!

A second failure occurred on the fallback server

Just recently we were introducing our newer Low Enders to Hacker News (“HN”). Now, suddenly, we have news of a doubled, double disk failure over there at HN. The doubled, double disk failure took HN down for about 8 hours.

What is a “doubled, double disk failure?” As explained in our Introduction to HN post, HN has been running on two servers rented from M5 Hosting. One machine is primary and the other is standby. Each machine has mirrored SSDs. Normally, only one machine is in use, serving HN’s approximately 6 million requests per day.

What seemed to happen on July 8 was that the primary server’s two disks went down, followed a few hours later by the demise of the secondary server’s two disks. That seems to be a doubled, double disk failure! The failure count seems to be four, even though the Twitter screenshot shown with this post mentions only the “second disk failure.”

The cause seems to be a defect built into the SSDs themselves. Here is HN moderator dang and M5 owner mikiem responding to a suggestion by HN member kabdib that the SSD failures might have been caused by a manufacturing defect. Apparently a firmware bug caused all four SSDs to self-destruct at approximately 40,000 hours of operation.
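To put the 40,000 hour figure in perspective, here is a minimal Python sketch of the arithmetic. The function names are hypothetical, and the power-on hours value is assumed to come from whatever drive health report you have available (for example, a SMART power-on hours counter). Only the 40,000 hour threshold comes from the discussion above.

```python
from datetime import datetime, timedelta

# Threshold quoted in the HN discussion; everything else here is illustrative.
FAILURE_THRESHOLD_HOURS = 40_000


def hours_remaining(power_on_hours: int,
                    threshold: int = FAILURE_THRESHOLD_HOURS) -> int:
    """Hours left before a drive reaches the threshold (0 if already past it)."""
    return max(threshold - power_on_hours, 0)


def projected_threshold_date(power_on_hours: int,
                             now: datetime,
                             threshold: int = FAILURE_THRESHOLD_HOURS) -> datetime:
    """Date on which a continuously powered drive would reach the threshold."""
    return now + timedelta(hours=hours_remaining(power_on_hours, threshold))


# 40,000 hours of continuous operation expressed in years.
threshold_years = FAILURE_THRESHOLD_HOURS / (24 * 365.25)

if __name__ == "__main__":
    print(f"Threshold is about {threshold_years:.2f} years of uptime")
    # Hypothetical drive that has been powered on for 39,000 hours.
    print(projected_threshold_date(39_000, datetime.now()))
```

Drives installed together and powered on continuously accumulate hours at the same rate, so under this kind of bug they would all be expected to reach the threshold at nearly the same moment.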

Here is further discussion that two disks seem to have failed on each server, making four failures in total (a doubled, double disk catastrophe). Also, this further discussion links to a 2021 Field Notice FN 70545 from Cisco describing the manufacturing defect as an “industry wide firmware index bug.”

It seems obvious that redundant equipment would provide additional safety. It formerly seemed much less obvious, at least to me, that two similarly manufactured items might fail simultaneously. I always imagined there was a very high degree of security added by a second set of equipment.
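The intuition that a second set of equipment adds a very high degree of safety rests on the failures being independent. A rough sketch of why a shared defect breaks that intuition follows. The probabilities are made-up illustrative numbers, not measurements, and the simple two-case model is an assumption of mine rather than anything from the HN discussion.

```python
def p_both_fail_independent(p_single: float) -> float:
    """Chance that both mirrored drives fail in the same window,
    assuming the two failures are completely independent."""
    return p_single * p_single


def p_both_fail_common_mode(p_single: float, p_shared_defect: float) -> float:
    """Chance that both drives fail in the same window when, with
    probability p_shared_defect, a shared defect (such as identical
    firmware) takes out both drives together. Otherwise the drives
    fail independently as above."""
    return p_shared_defect + (1 - p_shared_defect) * p_single ** 2


if __name__ == "__main__":
    p_single = 0.001        # hypothetical per-drive failure chance in a window
    p_shared_defect = 0.01  # hypothetical chance of a shared defect triggering
    print(p_both_fail_independent(p_single))
    print(p_both_fail_common_mode(p_single, p_shared_defect))
```

With these example numbers, the shared-defect term dominates the result, which is why identical drives with identical firmware and identical power-on hours give much less protection than the independent model suggests.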

Some years back, a double disk failure occurred on a server providing shared hosting for one of my websites. Indeed, when I received the support email pictured below, I wondered how such a thing could be possible. However, the failure occurred at Interserver. Because of the super friendly, always helpful, and very capable guys who work there, Interserver will always remain my favorite hosting company. So I just chalked the double failure up to an incredible co-incidence, grabbed my backups, and reinstalled.

Now, thanks to what happened at HN and to the wonderful discussions there, I know better than to think that doubled hardware is sufficient security or that nearly simultaneous double disk failures are nearly impossible. Even doubled, double disk failures do happen! Now I have an increased desire for additional, differently formatted, and differently stored backups. Backups are good! Now might be a great time to make another backup! :)

Electrical fault caused problems with multiple drives motors



Not_Oles

3 Comments

  1. BozoDaKlown:

    This has got to be the same 40k hours SSD bug that HPE & Dell patched a couple of years back. Someone isn’t keeping on top of their scheduled maintenance….

    July 13, 2022 @ 2:14 pm
  2. Anindya:

    Ah, it’s because of these line from my blog. Innit?

    “Like I mentioned in the video description, they have real life time machines too. So I doubt that storage is an issue for them.”

    🤣🤷‍♂️

    July 14, 2022 @ 1:44 am
  3. Gareth:

    It is a well known principle in engineering safety critical systems that diversity is as important as redundancy, eg a mix of manufacturers for the disks would have avoided this, even if one manufacturer had significantly lower MTBF.

    In redundant systems with degraded modes of operation it is easily shown that MTTR is far more important than MTBF for all reasonable MTBFs.

    July 15, 2022 @ 11:12 pm
