As part of the remote data center management we do, one of the (rare) requests we get from clients who have purchased managed services is to replace faulty RAM. One indicator of faulty RAM is random freezing of the server during normal operation. You can check for Machine Check Exception (MCE) messages in kern.log or messages, depending on your OS.
Completely failed RAM would bring the entire server to a standstill until you take the culprit out. But it's not easy to find faulty RAM that only produces corrected (correctable) errors.
Error sample
EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
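When the log holds many such lines, it can help to tally them per DIMM. The following is a minimal sketch, assuming the log lines follow the format of the sample above ("EDAC MCn: <count> CE error on <label> (...)"). Other kernels word the message differently, so the pattern may need adjusting. The function name edac_ce_tally is only illustrative.

```shell
#!/bin/sh
# Tally EDAC correctable-error (CE) log lines per DIMM label.
# Reads log text on stdin and prints "<total> <label>", highest total first.
# Assumption: lines look like the sample in this article, e.g.
#   EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 ...)
edac_ce_tally() {
    # Extract "<count> <label>" from matching lines; drop everything else.
    sed -n 's/.*EDAC MC[0-9][0-9]*: \([0-9][0-9]*\) CE error on \([^ ]*\) .*/\1 \2/p' |
        # Add up the counts for each label.
        awk '{ sum[$2] += $1 } END { for (label in sum) print sum[label], label }' |
        # Largest total first, label as tie-breaker.
        sort -k1,1nr -k2,2
}
```

It could be fed from a log file, for example edac_ce_tally < /var/log/messages, or from the kernel ring buffer with dmesg | edac_ce_tally.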
The first step is to look at the EDAC counters. On newer systems (kernel 2.6.18+) with sysfs, the directory /sys/devices/system/edac/mc/mc0 holds the error counts. The file to check is ce_count. On the server I checked it is 2; anything above 24 is dangerous for a single DIMM bank. ue_count should be 0, because an uncorrected error means the module is faulty and should be replaced.
[root@server ~]# ls -s /sys/devices/system/edac/mc/mc0
total 0
0 ce_count         0 max_location  0 rank3           0 seconds_since_reset
0 ce_noinfo_count  0 mc_name       0 rank4           0 size_mb
0 csrow0           0 power         0 rank5           0 subsystem
0 csrow1           0 rank0         0 rank6           0 ue_count
0 csrow2           0 rank1         0 rank7           0 ue_noinfo_count
0 csrow3           0 rank2         0 reset_counters  0 uevent
ce_count : The total count of correctable errors that have occurred on this memory controller (attribute file).
ce_noinfo_count : The total count of correctable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).
mc_name : The type of memory controller being utilized (attribute file).
reset_counters : A write-only control file that zeroes out all of the statistical counters for correctable and uncorrectable errors on this memory controller and resets the timer indicating how long it has been since the last reset (counter zero). The basic command is echo <anything> > /sys/devices/system/edac/mc/mc0/reset_counters, where <anything> is literally anything (just use a 0 to make things easy).
sdram_scrub_rate : An attribute file that controls memory scrubbing. The scrubbing rate is set by writing a minimum bandwidth in bytes per second to the attribute file. The rate will be translated to an internal value that gives at least the specified rate. If the configuration fails or memory scrubbing is not implemented, the value of the attribute file will be -1.
seconds_since_reset : An attribute file that displays how many seconds have elapsed since the last counter reset. This can be used with the error counters to measure error rates.
size_mb : An attribute file that contains the size (MB) of memory that this memory controller manages.
ue_count : An attribute file that contains the total number of uncorrectable errors that have occurred on this memory controller.
ue_noinfo_count : The total count of uncorrectable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).
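As noted for seconds_since_reset, the counters can be combined into an error rate. Here is a rough sketch that reads ce_count and seconds_since_reset from one memory controller directory and prints correctable errors per day as a truncated integer. The function name edac_ce_per_day and the choice of "per day" as the unit are assumptions made for illustration.

```shell
#!/bin/sh
# Print correctable errors per day (integer, truncated) for one memory controller.
# $1: controller directory; defaults to the mc0 path used in this article.
edac_ce_per_day() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    ce=$(cat "$mc_dir/ce_count") || return 1
    secs=$(cat "$mc_dir/seconds_since_reset") || return 1
    # Right after a counter reset the elapsed time is 0; avoid dividing by zero.
    if [ "$secs" -eq 0 ]; then
        echo "n/a"
        return 0
    fi
    # 86400 seconds in a day.
    echo $(( ce * 86400 / secs ))
}
```

Calling edac_ce_per_day with no argument reads mc0; pass another directory such as /sys/devices/system/edac/mc/mc1 to check a second controller.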
[root@server ~]# ls -s /sys/devices/system/edac/mc/mc0/csrow0
total 0
0 ce_count        0 ch1_ce_count    0 edac_mode  0 size_mb    0 uevent
0 ch0_ce_count    0 ch1_dimm_label  0 mem_type   0 subsystem
0 ch0_dimm_label  0 dev_type        0 power      0 ue_count
ce_count : The total count of correctable errors that have occurred on this csrow (attribute file).
ch0_ce_count : The total count of correctable errors on this DIMM in channel 0 (attribute file).
ch0_dimm_label : The control file that labels this DIMM. This can be very useful for panic events to isolate the cause of the uncorrectable error. Note that DIMM labels must be assigned after booting, with information that correctly identifies the physical slot with its silk screen label on the board itself.
dev_type : An attribute file that will display the type of DRAM device being used on this DIMM. Typically this is x1 , x2 , x4 , or x8 .
edac_mode : An attribute file that displays the type of error detection and correction being utilized.
mem_type : An attribute file that displays the type of memory currently on a csrow.
size_mb : An attribute file that contains the size (MB) of memory a csrow contains.
ue_count : An attribute file that contains the total number of uncorrectable errors that have occurred on a csrow
[root@server ~]# cat /sys/devices/system/edac/mc/mc0/ce_count
2
[root@server ~]# cat /sys/devices/system/edac/mc/mc0/ue_count
0
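The two checks above can be wrapped into one small helper. This is only a sketch: the threshold of 24 is the figure quoted in this article for a single DIMM bank, while ce_count here is the total for the whole controller, so treat the result as a first hint rather than a verdict. The function name edac_mc_status and the message wording are made up for this example.

```shell
#!/bin/sh
# Classify one memory controller from its ce_count and ue_count files.
# $1: controller directory (default: mc0 path used in this article)
# $2: correctable-error limit (default: 24, the figure quoted in this article)
edac_mc_status() {
    mc_dir=${1:-/sys/devices/system/edac/mc/mc0}
    ce_limit=${2:-24}
    ce=$(cat "$mc_dir/ce_count") || return 2
    ue=$(cat "$mc_dir/ue_count") || return 2
    if [ "$ue" -gt 0 ]; then
        # Any uncorrected error means the module should be replaced.
        echo "CRITICAL: ue_count=$ue"
    elif [ "$ce" -gt "$ce_limit" ]; then
        echo "WARNING: ce_count=$ce exceeds $ce_limit"
    else
        echo "OK: ce_count=$ce ue_count=$ue"
    fi
}
```

On the server shown above (ce_count 2, ue_count 0) this would report the OK case.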
If the ue_count is more than 0, you have to narrow it down to find out which slot is faulty. The commands below list each row (DIMM) of a memory controller (mc) with its label and error count. There could be two or more memory controllers, identified as mc0, mc1, and so on.
[root@server ~]# cat /sys/devices/system/edac/mc/mc0/csrow*/ch0_dimm_label
mc#0csrow#0channel#0
mc#0csrow#1channel#0
mc#0csrow#2channel#0
mc#0csrow#3channel#0
This means I have 4 csrows (chip select rows) and 1 channel in each row.
[root@server ~]# cat /sys/devices/system/edac/mc/mc0/csrow*/ch0_ce_count
0
0
0
0
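The label and count listings can be paired up in one pass over every memory controller. Below is a minimal sketch, assuming the csrow-style sysfs layout shown in this article (mc*/csrow*/chN_ce_count next to chN_dimm_label). The function name edac_dimm_report is hypothetical.

```shell
#!/bin/sh
# Print "<ce_count> <dimm_label>" for every csrow/channel of every controller.
# $1: EDAC root directory (default: the sysfs path used in this article).
edac_dimm_report() {
    root=${1:-/sys/devices/system/edac/mc}
    for count_file in "$root"/mc*/csrow*/ch*_ce_count; do
        # If the glob matched nothing, the literal pattern is not a readable file.
        [ -r "$count_file" ] || continue
        # ch0_ce_count -> ch0_dimm_label in the same directory.
        label_file=${count_file%_ce_count}_dimm_label
        label=$(cat "$label_file" 2>/dev/null)
        # No label set: fall back to the relative path, e.g. mc0/csrow1/ch0.
        if [ -z "$label" ]; then
            label=${count_file#"$root"/}
            label=${label%_ce_count}
        fi
        printf '%s %s\n' "$(cat "$count_file")" "$label"
    done
}
```

Piping the result through sort -nr would put the channel with the most corrected errors at the top.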
Having said all this, you can also use edac-util, a program that reports EDAC (Error Detection and Correction) errors. It reads the information from the EDAC drivers in the kernel through the files they export in sysfs. You may need to install it separately.
The dmidecode output gives information about each DIMM slot and the size of the RAM in it. Another command that helps is lshw. You may need to install them if they are not present. If you run dmidecode as a normal user, you may get output like the one below, so switch to root first.
dmidecode
# dmidecode 3.0
Scanning /dev/mem for entry point.
/dev/mem: Permission denied
Running dmidecode without options shows all the hardware information; the -t option restricts it to one type of hardware. For memory devices, the type is 17.
[root@server ~]# dmidecode -t 17
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.6 present.

Handle 0x0056, DMI type 17, 28 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: 0x005A
    Total Width: 128 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: D104061E
    Asset Tag: 9876543210
    Part Number: 9965525-058.A00LF
    Rank: 2

Handle 0x005B, DMI type 17, 28 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: No Error
    Total Width: 128 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelA-DIMM1
    Bank Locator: BANK 1
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: CB040A1E
    Asset Tag: 9876543210
    Part Number: 9965525-058.A00LF
    Rank: 2

Handle 0x005C, DMI type 17, 28 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: 0x005F
    Total Width: 128 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelB-DIMM0
    Bank Locator: BANK 2
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: CE040A1E
    Asset Tag: 9876543210
    Part Number: 9965525-058.A00LF
    Rank: 2

Handle 0x0061, DMI type 17, 28 bytes
Memory Device
    Array Handle: 0x0057
    Error Information Handle: No Error
    Total Width: 128 bits
    Data Width: 64 bits
    Size: 8192 MB
    Form Factor: DIMM
    Set: None
    Locator: ChannelB-DIMM1
    Bank Locator: BANK 3
    Type: DDR3
    Type Detail: Synchronous
    Speed: 1333 MHz
    Manufacturer: Kingston
    Serial Number: CF04DC1D
    Asset Tag: 9876543210
    Part Number: 9965525-058.A00LF
    Rank: 2
In the above output there are 4 DIMM slots, each filled with 8 GB of memory. Fields such as Locator, Bank Locator and Serial Number are the ones to note when you need to identify a module physically. More details can be found at the URLs below. Even though the docs are a bit old, they are classics!
https://docs.oracle.com/cd/E19121-01/sf.x4440/820-3067-14/dimms.html
https://docs.oracle.com/cd/E19150-01/820-4213-11/dimms.html
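Going back to the dmidecode -t 17 output, the long listing can be condensed to one line per DIMM. This is a rough sketch that assumes the usual dmidecode layout, with one "Handle ..." line per device followed by indented "Field: value" lines. The function name dimm_summary and the " | " separator are illustrative choices only.

```shell
#!/bin/sh
# Condense "dmidecode -t 17" output (read from stdin) to one line per DIMM:
#   <Locator> | <Bank Locator> | <Size> | <Serial Number>
dimm_summary() {
    awk '
        # Return the text after the first "Field:" prefix of a line.
        function value(line) { sub(/^[^:]*:[ \t]*/, "", line); return line }
        # Print the device collected so far (if any) and start a new one.
        function emit() {
            if (loc != "") print loc " | " bank " | " size " | " serial
            loc = bank = size = serial = ""
        }
        /^Handle /              { emit() }
        /^[ \t]*Locator:/       { loc = value($0) }
        /^[ \t]*Bank Locator:/  { bank = value($0) }
        /^[ \t]*Size:/          { size = value($0) }
        /^[ \t]*Serial Number:/ { serial = value($0) }
        END                     { emit() }
    '
}
```

Run as root, it would be used as dmidecode -t 17 | dimm_summary, which makes it easier to match a serial number against the label on the physical module.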
System monitoring and administration are critical to the successful operation of most businesses. But don't worry: SupportSages is always happy to help.