LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

How To Check ECC Memory On A Hetzner Dedicated Server


Recently we’ve been discussing getting started on a new VPS or dedicated server.

In today’s post we’re going to introduce the difference between Dynamic Random Access Memory (“DRAM” or just “RAM”) and Error Correction Code Memory (“ECC RAM” or “ECC memory”). Using an AX41-NVMe server, Hetzner’s cheapest Ryzen machine, we’re also going to test whether our optional ECC RAM is installed and functioning.

What Is ECC RAM?

Desktops, laptops, servers, and phones each contain several different types of memory. One important distinction between the memory types is whether the contents survive a power cycle. Memory which survives a power cycle is called “non-volatile,” and memory which “forgets” when power is lost is called “volatile.” The hard drives and SSDs which store our data are non-volatile. Our files remain available when we turn our device off and back on again.

It might seem lovely to have all the memory in our devices be non-volatile, but non-volatile storage is too slow for the working data of a running system. As our device is operating, we do not want to wait. Therefore, volatile memory is used for most computer operations because it is much faster. Typically, devices are provisioned with modules of Dynamic Random Access Memory (“DRAM”), which is volatile. (The “dynamic” in the name refers to the way DRAM cells must be refreshed continually to hold their contents, not to what happens when power is lost.) The programs running on our device can access these DRAM modules quickly.

DRAM is susceptible to radiation and other factors which can cause random changes in what is stored and “remembered.” These changes are called “bit flips.” When a bit is flipped, what is read back from the memory is different from what originally had been saved. The result of the unintended change depends on the exact circumstances, and it can range from nothing noticeable all the way to catastrophic.

Error Correction Code Memory (“ECC RAM”) comes to the rescue by adding extra memory chips to each module. The extra chips store check bits, which let the memory controller detect memory errors and report them or, in many cases, correct them.
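To make the idea of check bits concrete, here is a minimal Python sketch of a Hamming(7,4) code, which protects 4 data bits with 3 check bits and can repair any single flipped bit. This is only an illustration of the principle. The function names are my own, and the code that a real memory controller applies in hardware is wider and works differently in detail.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is unused so that list index == bit position
    code[3], code[5], code[6], code[7] = d
    # Each check bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code


def correct(code):
    """Return (repaired codeword, position of the flipped bit or 0)."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # a non-zero syndrome names the bad position
    return code, syndrome


def decode(code):
    """Extract the 4 data bits from a codeword."""
    return code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)


word = encode(0b1011)
damaged = list(word)
damaged[6] ^= 1  # simulate a single bit flip
fixed, position = correct(damaged)
print(position, decode(fixed) == 0b1011)
```

The sketch only handles a single flipped bit per codeword. Two flips in the same codeword would be "repaired" to the wrong value, which is why practical schemes add further check bits so that double errors can at least be detected.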

Do We Need ECC RAM?

Whether we need ECC RAM or not is debatable. Linus Torvalds said “ECC absolutely matters.” On the other hand, most of us use non-ECC DRAM all the time, with few or no noticeable adverse consequences. Non-ECC RAM is cheaper, and who cares about an extra system crash once in a while? It’s not like we’re using our cheap VPS or server to control an airplane or anything which requires similar precision and availability.

But, for professional websites, downtime and corrupted data can be highly problematic. That’s why “professional” setups come with ECC RAM. ECC doesn’t cost that much more.

Does Our Server Have ECC RAM?

If we are renting a server and the provider says it has ECC RAM, how can we tell whether that’s really true? One way involves the extra memory we mentioned as characteristic of ECC. We can use the utility dmidecode to check whether ECC RAM is installed.

root@hels ~# dmidecode -t memory
dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x000F, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 128 GB
Error Information Handle: 0x000E
Number Of Devices: 4

[ . . . ]

Handle 0x0019, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x000F
Error Information Handle: 0x0018
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 2667 MT/s
Manufacturer: Samsung
Serial Number: XXXXXXXX
Asset Tag: Not Specified
Part Number: M391A4G43AB1-CWE
Rank: 2
Configured Memory Speed: 2667 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None

[ . . . ]

root@hels ~#

The above output tells us that the system board is set up to handle “Error Correction Type: Multi-bit ECC.” Also, the “Total Width: 72 bits” line shows the additional memory present in this installed module to support ECC. The 72 bit total width contrasts with the “Data Width: 64 bits.” The 8 extra bits are for ECC.

This server has two memory modules installed. The second module also shows the 72 bit width. Therefore, we can conclude that the server seems ECC capable and also that ECC RAM actually is installed.
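The width comparison above can be automated. Below is a minimal Python sketch that parses the text printed by `dmidecode -t memory` and reports, for each populated module, how many bits wider the total width is than the data width. The function name `memory_devices` and the trimmed `SAMPLE` text are my own, and I am assuming that empty slots report non-numeric widths (such as "Unknown"), so the sketch skips them.

```python
import re


def memory_devices(dmidecode_text):
    """Parse `dmidecode -t memory` text into one dict per populated module."""
    devices = []
    # dmidecode separates its records with blank lines.
    for block in re.split(r"\n\s*\n", dmidecode_text):
        lines = block.splitlines()
        if not any(line.strip() == "Memory Device" for line in lines):
            continue
        fields = {}
        for line in lines:
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        total = re.match(r"(\d+) bits", fields.get("Total Width", ""))
        data = re.match(r"(\d+) bits", fields.get("Data Width", ""))
        if not (total and data):
            continue  # assumption: empty slots have no numeric widths
        devices.append({
            "locator": fields.get("Locator", "?"),
            "size": fields.get("Size", "?"),
            "check_bits": int(total.group(1)) - int(data.group(1)),
        })
    return devices


# Trimmed from the output shown above.
SAMPLE = """\
Handle 0x000F, DMI type 16, 23 bytes
Physical Memory Array
    Error Correction Type: Multi-bit ECC
    Number Of Devices: 4

Handle 0x0019, DMI type 17, 92 bytes
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 32 GB
    Locator: DIMM 1
"""

for device in memory_devices(SAMPLE):
    print(device["locator"], device["size"], device["check_bits"])
```

A module with `check_bits` of 0 would be plain non-ECC RAM. On a live server, the text could come from running dmidecode as root and capturing its output, for example with Python's `subprocess.run`.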

How Can We Know That Our ECC RAM Is Working?

The Linux kernel prints boot messages at startup. We can see the boot messages with the `dmesg` command. In the kernel, the subsystem for ECC is called Error Detection And Correction (“EDAC”). We can check and see that the kernel thinks ECC is working well enough for the kernel to initialize the EDAC.

root@hels ~ # dmesg | grep EDAC
[ 0.466382] EDAC MC: Ver: 3.0.0
[ 4.176346] EDAC amd64: MCT channel count: 2
[ 4.176542] EDAC MC0: Giving out device to module amd64_edac controller F17h_M70h: DEV 0000:00:18.3 (INTERRUPT)
[ 4.176669] EDAC amd64: F17h_M70h detected (node 0).
[ 4.176752] EDAC MC: UMC0 chip selects:
[ 4.176754] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.176832] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 4.176835] EDAC MC: UMC1 chip selects:
[ 4.176836] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.176836] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 4.176837] EDAC amd64: using x16 syndromes.
[ 4.176844] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 4.176845] AMD64 EDAC driver v3.5.0
root@hels ~ #
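The chip select lines in this output give us one more sanity check: the sizes that EDAC reports should add up to the memory that dmidecode says is installed. Here is a small Python sketch that sums them. The function name is my own, and the line format it matches is simply what this particular amd64_edac driver printed above, so other kernels or drivers may need a different pattern.

```python
import re


def edac_chip_select_mb(dmesg_text):
    """Sum the chip select sizes (in MB) from amd64_edac boot messages."""
    total = 0
    for line in dmesg_text.splitlines():
        if "EDAC amd64: MC:" not in line:
            continue
        # Matches pairs such as "2: 16384MB" and keeps the size.
        total += sum(int(mb) for mb in re.findall(r"\d+:\s*(\d+)MB", line))
    return total


# The chip select lines from the dmesg output shown above.
SAMPLE_DMESG = """\
[ 4.176752] EDAC MC: UMC0 chip selects:
[ 4.176754] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.176832] EDAC amd64: MC: 2: 16384MB 3: 16384MB
[ 4.176835] EDAC MC: UMC1 chip selects:
[ 4.176836] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 4.176836] EDAC amd64: MC: 2: 16384MB 3: 16384MB
"""

print(edac_chip_select_mb(SAMPLE_DMESG) // 1024, "GB")
```

For this server the sum is 65536 MB, or 64 GB, which agrees with the two 32 GB modules that dmidecode reported.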

How Well Is Our ECC Working?

This is a bit of a tricky question! How do we see random, infrequent errors, which, of course, might not yet have happened? So far I have been trying to monitor using edac-utils and rasdaemon. I haven’t been monitoring for a lengthy period of time, nor on a vastly large system. I haven’t yet tried stressing the memory with high load and overclocking. I’m unsure whether the lack of problems that I am seeing is simply that nothing bad is happening or, perhaps, there might be operator error in my monitoring technique.
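One simple way to look for errors, alongside edac-utils and rasdaemon, is to read the counters that the EDAC subsystem exposes through sysfs. The sketch below is a minimal Python example, not a monitoring solution. It assumes the counters live in `ce_count` (corrected errors) and `ue_count` (uncorrected errors) under `/sys/devices/system/edac/mc/mcN/`, which is where the kernel's EDAC documentation places them. The function name and the `base` parameter are my own, and `base` exists so that the function can be pointed at a test directory.

```python
from pathlib import Path


def edac_error_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}} from sysfs."""
    root = Path(base)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


for name, count in edac_error_counts().items():
    print(name, count)
```

Counters of zero are consistent with nothing bad having happened, but they do not prove that error reporting works end to end, so the uncertainty described above still applies.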

Offer Link

Here’s a convenient link to Hetzner’s AX41-NVMe page in case anybody wants to get one. Note that ECC RAM is a small extra cost option for this server.



Not_Oles
