Diagnosing Memory Errors on Linux

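First, a few concepts.
DRAM (Dynamic Random Access Memory) is the most common kind of system memory. ECC is usually expanded as "Error-Correcting Code" (also written "Error Checking and Correcting"); ECC memory is a memory module that implements this error checking and correction. EDAC stands for Error Detection And Correction.
Memory errors come in two types, CE and UE. CE is short for Correctable Error and UE for Uncorrectable Error. A CE is recovered by the hardware and does not immediately affect the running system; the module can be swapped at a convenient maintenance window. A UE cannot be recovered and usually brings the machine down.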
System messages log
[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socke
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.
Handle 0x0044, DMI type 1, 27 bytes
System Information
Manufacturer: LENOVO
Product Name: Lenovo System x3750 M4 -[8753IH5]-
Version: 03
Serial Number: 06FF367
UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
Wake-up Type: Other
SKU Number: XxXxXxX
Family: System X
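The EDAC line in the messages log above already names the controller, channel and slot. As a minimal sketch, assuming the line format shown in this transcript (other kernel versions and EDAC drivers may print it differently), the fields can be pulled out with a regular expression. The names `EDAC_LINE` and `parse_edac_line` are illustrative, not part of any tool.

```python
import re

# Matches the "EDAC MCn: N CE ... (channel:X slot:Y page:0x... offset:0x..." line
# as printed in the transcript above. This format is an assumption taken from
# that one sample; it is not guaranteed across kernels or drivers.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)


def parse_edac_line(line):
    """Return the fields of one EDAC log line as a dict, or None if it does not match."""
    m = EDAC_LINE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        "page": int(m["page"], 16),
        "offset": int(m["offset"], 16),
    }


sample = (
    "Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on "
    "CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 "
    "offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091"
)
print(parse_edac_line(sample))
```

For the sample line the parsed values (mc 0, channel 5, slot 0) line up with the one non-zero counter in the transcript, mc0/csrow0/ch5_ce_count.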
Here is the messages log from another machine:
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-t
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk 'NR>1 && int($5) > 80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
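In the amd64_edac lines above, the page and offset fields combine into the ERROR_ADDRESS printed one line earlier, and the MC, row and channel fields look like the components of a counter path under /sys/devices/system/edac/mc. The sketch below illustrates both, assuming 4 KiB pages (`PAGE_SHIFT = 12`) and assuming that "row N" corresponds to "csrowN"; `AMD_CE` and `locate` are hypothetical names.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on typical x86-64 kernels

# Matches the "EDAC MCn: CE page 0x..., offset 0x..., ..., row R, channel C" line
# from the log above; the format is taken from that sample only.
AMD_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), .*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def locate(line):
    """Return (physical_address, sysfs_counter_path) for a CE line, or None."""
    m = AMD_CE.search(line)
    if m is None:
        return None
    # page number shifted up by the page size, plus the offset inside the page
    address = (int(m["page"], 16) << PAGE_SHIFT) | int(m["offset"], 16)
    counter = (
        f"/sys/devices/system/edac/mc/mc{m['mc']}"
        f"/csrow{m['row']}/ch{m['channel']}_ce_count"
    )
    return address, counter


sample = (
    "Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, "
    'grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac'
)
address, counter = locate(sample)
print(hex(address), counter)
```

For the first error in the log, page 0x8de3b1 and offset 0x960 give 0x8de3b1960, the same value as the logged ERROR_ADDRESS and MC4_ADDR.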
Confirming the fault and locating the faulty memory slot
[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#
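count: any line whose count is not 0 indicates memory errors.
mc: the memory controller index, roughly one per CPU socket or node (the log above reports "node 2" as MC2).
csrow: the chip-select row behind that controller (slot:0 and row 5 in the logs above show up as csrow0 and csrow5).
ch*: the memory channel (channel:5 and channel 0 in the logs above show up as ch5 and ch0).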
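The grep above can also be done programmatically. The following is a rough sketch that walks the same sysfs layout and keeps only the counters above zero; the function name `nonzero_ce_counts` and its `root` parameter are illustrative, and the path pattern is assumed to match the mc*/csrow*/ch*_ce_count layout shown in the transcript.

```python
import glob
import os
import re


def nonzero_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, ch): count} for every per-channel CE counter above zero."""
    results = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        m = re.search(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$", path)
        if m is None:
            continue
        try:
            with open(path) as f:
                count = int(f.read().strip())
        except (OSError, ValueError):
            # unreadable or non-numeric counter: skip rather than guess
            continue
        if count > 0:
            results[tuple(int(g) for g in m.groups())] = count
    return results


if __name__ == "__main__":
    for (mc, csrow, ch), count in nonzero_ce_counts().items():
        print(f"mc{mc} csrow{csrow} ch{ch}: {count} corrected errors")
```

On the host above this would report a single entry, mc2 / csrow5 / ch0 with 294 corrected errors. Passing a different `root` makes the function easy to try against a copied directory tree.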
Installed memory
Memory Component      Status
Proc 1 DIMM 1A        16384 MB 1333 MHz
Proc 1 DIMM 2I        Not installed
Proc 1 DIMM 3E        Not installed
Proc 1 DIMM 4C        Not installed
Proc 1 DIMM 5K        Not installed
Proc 1 DIMM 6G        Not installed
Proc 1 DIMM 7B        16384 MB 1333 MHz
Proc 1 DIMM 8J        Not installed
Proc 1 DIMM 9F        Not installed
Proc 1 DIMM 10D       Not installed
Proc 1 DIMM 11L       Not installed
Proc 1 DIMM 12H       Not installed
Proc 2 DIMM 1A        16384 MB 1333 MHz
Proc 2 DIMM 2I        Not installed
Proc 2 DIMM 3E        Not installed
Proc 2 DIMM 4C        Not installed
Proc 2 DIMM 5K        Not installed
Proc 2 DIMM 6G        Not installed
Proc 2 DIMM 7B        16384 MB 1333 MHz
Proc 2 DIMM 8J        Not installed
Proc 2 DIMM 9F        Not installed
Proc 2 DIMM 10D       Not installed
Proc 2 DIMM 11L       Not installed
Proc 2 DIMM 12H       Not installed
Proc 3 DIMM 1A        16384 MB 1333 MHz
Proc 3 DIMM 2I        Not installed
Proc 3 DIMM 3E        Not installed
Proc 3 DIMM 4C        Not installed
Proc 3 DIMM 5K        Not installed
Proc 3 DIMM 6G        Not installed
Proc 3 DIMM 7B        16384 MB 1333 MHz
Proc 3 DIMM 8J        Not installed
Proc 3 DIMM 9F        Not installed
Proc 3 DIMM 10D       Not installed
Proc 3 DIMM 11L       Not installed
Proc 3 DIMM 12H       Not installed
Proc 4 DIMM 1A        16384 MB 1333 MHz
Proc 4 DIMM 2I        Not installed
Proc 4 DIMM 3E        Not installed
Proc 4 DIMM 4C        Not installed
Proc 4 DIMM 5K        Not installed
Proc 4 DIMM 6G        Not installed
Proc 4 DIMM 7B        16384 MB 1333 MHz
Proc 4 DIMM 8J        Not installed
Proc 4 DIMM 9F        Not installed
Proc 4 DIMM 10D       Not installed
Proc 4 DIMM 11L       Not installed
Proc 4 DIMM 12H       Not installed
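Using the EDAC tools to detect server memory faults
With virtualization and in-memory stores such as Redis and BDB now commonplace, more and more servers are configured with large amounts of memory. A Dell R620 with two CPUs, for example, has 24 DIMM slots and is quoted here as supporting up to 960 GB. Fault detection on error-correcting memory (ECC, registered ECC) is a real headache: because the errors are corrected, a failing module may keep running for months or even years, yet with bad luck it can take the machine down at any moment. Fortunately Linux provides edac-utils, a diagnostic tool that reports memory error-correction counts and can be used to spot latent memory faults on a server.
The following uses CentOS as the example to show how edac-utils is used.
Before using edac-utils you need to understand the server's hardware layout. Take the Dell R620 (other models such as the HP DL360p G8 and IBM x3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly similar). The relationship between its CPU memory controllers, channels and DIMM slots is shown below.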
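Processor 0 (one memory controller)
Channel 0: DIMM slots A1, A5 and A9
Channel 1: DIMM slots A2, A6 and A10
Channel 2: DIMM slots A3, A7 and A11
Channel 3: DIMM slots A4, A8 and A12
Processor 1 (one memory controller)
Channel 0: DIMM slots B1, B5 and B9
Channel 1: DIMM slots B2, B6 and B10
Channel 2: DIMM slots B3, B7 and B11
Channel 3: DIMM slots B4, B8 and B12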
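The layout above follows a regular pattern: the letter comes from the processor, and the number advances by one per channel and by the channel count per DIMM position. A small sketch of that rule, assuming the R620-style layout listed above (the function name `slot_label` and the `channels_per_cpu` parameter are illustrative; other boards label slots differently):

```python
def slot_label(cpu, channel, dimm, channels_per_cpu=4):
    """Map (cpu, channel, dimm) to a slot label such as 'A1' for the layout above.

    cpu, channel and dimm are zero-based, as in the EDAC labels
    (CPU_SrcID#0, Chan#0, DIMM#0).
    """
    if cpu < 0 or dimm < 0 or not 0 <= channel < channels_per_cpu:
        raise ValueError("cpu, channel or dimm out of range")
    bank = chr(ord("A") + cpu)  # processor 0 -> A, processor 1 -> B
    number = dimm * channels_per_cpu + channel + 1
    return f"{bank}{number}"


# Channel 0 of processor 0 holds A1, A5 and A9.
print([slot_label(0, 0, d) for d in range(3)])
```

Always check the result against the vendor's own slot diagram before pulling a module, since the rule is inferred from this one layout.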
1. Install edac-utils
yum install -y libsysfs edac-utils
2. Run the check command; it reports the corrected-error counts as follows
edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12
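where:
mc0 denotes memory controller 0;
CPU_SrcID#0 denotes source CPU 0;
Chan#0 denotes channel 0;
DIMM#0 denotes DIMM slot 0 on that channel;
Corrected Errors is the number of errors corrected so far.
Using the CPU channel and DIMM slot mapping listed earlier, each line returned by edac-utils can be labeled with its physical slot.
In the original run this gave 6312 corrected errors on slot A1, 6459 on slot B1 and 535 on slot B3 (the per-line counts are not reproduced in the listing above). Three DIMMs therefore had latent faults, and the next step is to contact the vendor for replacements.
Mapping for a 12-DIMM configuration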
mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6
Mapping for a 20-DIMM configuration
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
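With dozens of lines per host, it helps to filter the report down to the DIMMs that actually have corrected errors. The sketch below does that for text in the format shown above; it assumes the "mcN: csrowN: LABEL: N Corrected Errors" line shape from this listing, and the names `CE_LINE` and `corrected_errors` are illustrative. In practice the `report` string would be the captured output of `edac-util -v`.

```python
import re

# One per-DIMM line of the report, e.g.
# "mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors"
CE_LINE = re.compile(
    r"mc(?P<mc>\d+): csrow(?P<csrow>\d+): (?P<label>\S+): "
    r"(?P<count>\d+) Corrected Errors"
)


def corrected_errors(report):
    """Return [(label, count)] for DIMM lines with a non-zero count, highest first."""
    hits = []
    for line in report.splitlines():
        m = CE_LINE.search(line)
        if m and int(m["count"]) > 0:
            hits.append((m["label"], int(m["count"])))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


sample = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
"""
print(corrected_errors(sample))
```

Summary lines without a DIMM label ("with no DIMM info") and the Uncorrected Errors lines are deliberately skipped here; a real check would want to alert on any non-zero uncorrected count as well.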

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h

Published: 2024-09-23 10:30:19

Article link: https://www.17tex.com/xueshu/279316.html
