One node of an Oracle RAC cluster rebooted, and the kernel ring buffer (dmesg) shows the following messages.
[178763.197155] VFS: file-max limit 6815744 reached
[178764.964469] SysRq : Trigger a crash
[178764.964511] BUG: unable to handle kernel NULL pointer dereference at (null)
[178764.964536] IP: [] sysrq_handle_crash+0x16/0x30
[178764.964564] PGD 0
[178764.964575] Oops: 0002 [#1] SMP
After the Oracle RAC node reboots, the cluster alert.log file shows the following messages.
kernel: VFS: file-max limit 65536 reached
logger: Oracle clsomon failed with fatal status 13.
logger: Oracle CRS failure. Rebooting for cluster integrity.
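Diagnostic steps
Watch file descriptor consumption while the system is running to see whether it keeps growing and approaches the value set in fs.file-max. The following command shows the current file descriptor usage: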
[root@shizhanxia.com ]# watch cat /proc/sys/fs/file-nr
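The same check can be scripted. The following is a minimal sketch, assuming the usual three-field layout of /proc/sys/fs/file-nr (allocated handles, allocated but unused handles, maximum). The helper names and the 90% warning threshold are illustrative assumptions, not part of any Oracle or kernel tooling.

```python
from pathlib import Path

FILE_NR = Path("/proc/sys/fs/file-nr")


def parse_file_nr(text):
    """Split the file-nr line into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def usage_percent(allocated, maximum):
    """Share of fs.file-max currently taken by allocated handles."""
    return 100.0 * allocated / maximum


def near_limit(allocated, maximum, threshold=90.0):
    """True once usage reaches the warning threshold (an assumed value)."""
    return usage_percent(allocated, maximum) >= threshold


# Take a single sample when run on a Linux host that exposes the file.
if __name__ == "__main__" and FILE_NR.exists():
    allocated, unused, maximum = parse_file_nr(FILE_NR.read_text())
    print(f"allocated={allocated} max={maximum} "
          f"usage={usage_percent(allocated, maximum):.2f}%")
    if near_limit(allocated, maximum):
        print("WARNING: file handle usage is close to fs.file-max")
```

Running it periodically (for example from cron) and recording the first field over time shows whether usage climbs steadily toward the limit.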
Kernel crash dump (vmcore) analysis: run the sys command and filter its output for the RELEASE and PANIC strings.
crash> sys | grep -e RELEASE -e PANIC
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
       PANIC: "SysRq : Trigger a crash"
Use the bt command to check the backtrace of the panicking task.
crash> set -p
    PID: 25539
COMMAND: "cssdmonitor"
   TASK: ffff98b30f929080  [THREAD_INFO: ffff98a6a1638000]
    CPU: 38
  STATE: TASK_RUNNING (SYSRQ)

crash> px ((struct file *)((struct task_struct *)0xffff98b30f929080)->mm->exe_file)->f_path.dentry
$1 = (struct dentry *) 0xffff98543109b740

crash> files -d 0xffff98543109b740
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff98543109b740 ffff98546c1b7cd0 ffff985e89a49000 REG  /grid19c/app/grid/product/19.3.0.0/bin/cssdmonitor

crash> bt
PID: 25539  TASK: ffff98b30f929080  CPU: 38  COMMAND: "cssdmonitor"
 #0 [ffff98a6a163bae0] machine_kexec at ffffffff83869514
 #1 [ffff98a6a163bb40] __crash_kexec at ffffffff83929e82
 #2 [ffff98a6a163bc10] crash_kexec at ffffffff83929f78
 #3 [ffff98a6a163bc28] oops_end at ffffffff83fbc818
 #4 [ffff98a6a163bc50] no_context at ffffffff8387974c
 #5 [ffff98a6a163bca0] __bad_area_nosemaphore at ffffffff83879a2a
 #6 [ffff98a6a163bcf0] bad_area_nosemaphore at ffffffff83879b54
 #7 [ffff98a6a163bd00] __do_page_fault at ffffffff83fbf8d0
 #8 [ffff98a6a163bd70] do_page_fault at ffffffff83fbfb05
 #9 [ffff98a6a163bda0] page_fault at ffffffff83fbb7b8
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff83c90f46  RSP: ffff98a6a163be58  RFLAGS: 00010246
    RAX: ffffffff83c90f30  RBX: ffffffff844e74a0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff98cdbff938d8  RDI: 0000000000000063
    RBP: ffff98a6a163be58   R8: ffffffff8483587c   R9: ffff98c0b59217c0
    R10: 0000000000000caa  R11: 0000000000000ca9  R12: 0000000000000063
    R13: 0000000000000000  R14: 0000000000000004  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff98a6a163be60] __handle_sysrq at ffffffff83c9182d
#11 [ffff98a6a163be90] write_sysrq_trigger at ffffffff83c91c98
#12 [ffff98a6a163bea8] proc_reg_write at ffffffff83ad7780
#13 [ffff98a6a163bec8] vfs_write at ffffffff83a5bcc0
#14 [ffff98a6a163bf08] sys_write at ffffffff83a5ca75
#15 [ffff98a6a163bf50] system_call_fastpath at ffffffff83fc539a
    RIP: 00007ffb804db6fd  RSP: 00007ffb7ccb1420  RFLAGS: 00000202
    RAX: 0000000000000001  RBX: 00007ffb7ccb1990  RCX: 000000000000000e
    RDX: 0000000000000001  RSI: 00007ffb86040dd4  RDI: 0000000000000036
    RBP: 00007ffb7ccb1930   R8: 0000000000000001   R9: 00007ffb6803d360
    R10: 00007ffb7ccb13a0  R11: 0000000000000293  R12: 0000559994745b14
    R13: 0000000000000002  R14: 0000000000000002  R15: 00005599968d2cb0
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

crash> files -R sysrq
PID: 25539  TASK: ffff98b30f929080  CPU: 38  COMMAND: "cssdmonitor"
ROOT: /    CWD: /grid19c/grid19c_base/crsdata/upi-db5-rac5/core
 FD       FILE            DENTRY           INODE       TYPE PATH
 54 ffff98b1beb61e00 ffff98519006a840 ffff98b16da3cf10 REG  /proc/sysrq-trigger

Check the kernel ring buffer (dmesg) messages with the log command:
crash> log
..
[178763.197155] VFS: file-max limit 6815744 reached  <<< - - -
[178764.964469] SysRq : Trigger a crash
[178764.964511] BUG: unable to handle kernel NULL pointer dereference at (null)
[178764.964536] IP: [] sysrq_handle_crash+0x16/0x30
[178764.964564] PGD 0
[178764.964575] Oops: 0002 [#1] SMP
Check the number of open files:
crash> pd files_stat.nr_files
$2 = 6815744
Check the maximum file handle limit (fs.file-max):
crash> pd files_stat.max_files
$3 = 6815744
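The number of open files (nr_files) has reached the maximum file handle limit (max_files). Next, determine the total number of open file handles summed over all tasks.
Example:
crash> foreach files | grep -E '^\s*+[0-9]+' -c
13198564
This per-task total is larger than nr_files, most likely because tasks that share a descriptor table (threads) are each counted separately.
Note: the fs.file-max limit is enforced only for non-root users; a process with the CAP_SYS_ADMIN capability can still allocate file handles after the limit is reached.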
Identify the application that uses the most file handles.
crash> foreach files
Example:
crash> foreach oracle files -R /dev/null | grep -E '^\s*+[0-9]+' | awk '{print $NF}' | uniq -c
6434024 /dev/null  <<< - - -

crash> ps -Gu oracle
     PID    PPID  CPU        TASK        ST  %MEM        VSZ     RSS  COMM
  169221  462110    1  ffff98a614719080  IN   0.0  234695984   68104  oracle
  244602   40719    7  ffff98537ffc4200  IN   0.0  237381808   74136  oracle
  257016  233337    1  ffff98a71b448000  IN   0.0  239870992   82936  oracle
  264211  488743    8  ffff983b41d58000  IN   0.0  255344060  112144  oracle
  363918  342321    1  ffff98a970710000  IN   0.0  235608988   68836  oracle
> 478013       1   62  ffff98ada5876300  RU   0.0    1562140   15356  oracle
> 478396  101573   11  ffff984d33ac3180  RU   0.0          0       0  oracle
> 478398  138596   54  ffff98a9b6062100  RU   0.0  248522888  120192  oracle
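When the files output is too large to inspect by eye, it can be saved to a text file and tallied offline. The sketch below assumes the output keeps the layout shown in this article (a PID/COMMAND header line followed by FD rows). The function name tally and the embedded sample are illustrative only. As with the awk step above, the last field of each FD row is taken as the path, so rows without a path are counted under their last field instead.

```python
import re
from collections import Counter

# An FD row starts with optional whitespace and a decimal descriptor number,
# the same test the grep pattern in the article applies.
FD_ROW = re.compile(r"^\s*[0-9]+\s")
COMMAND = re.compile(r'COMMAND:\s*"([^"]*)"')

# Sample copied from the "files -R sysrq" output in this article.
SAMPLE = '''\
PID: 25539  TASK: ffff98b30f929080  CPU: 38  COMMAND: "cssdmonitor"
ROOT: /    CWD: /grid19c/grid19c_base/crsdata/upi-db5-rac5/core
 FD       FILE            DENTRY           INODE       TYPE PATH
 54 ffff98b1beb61e00 ffff98519006a840 ffff98b16da3cf10 REG  /proc/sysrq-trigger
'''


def tally(lines):
    """Count FD rows per command name and per path (last field of the row)."""
    by_command = Counter()
    by_path = Counter()
    current = "(unknown)"
    for line in lines:
        if line.startswith("PID:"):
            match = COMMAND.search(line)
            current = match.group(1) if match else "(unknown)"
        elif FD_ROW.match(line):
            by_command[current] += 1
            by_path[line.split()[-1]] += 1
    return by_command, by_path


if __name__ == "__main__":
    by_command, by_path = tally(SAMPLE.splitlines())
    print("top commands:", by_command.most_common(5))
    print("top paths:", by_path.most_common(5))
```

For a real dump, pass an open file object to tally() instead of SAMPLE.splitlines(). Unlike uniq -c, which only merges adjacent duplicates, Counter groups identical paths wherever they appear.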
Solution
Increase the file descriptor limit by setting fs.file-max in /etc/sysctl.conf as shown below, then load the new value with sysctl -p.
fs.file-max = xxxxx
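After editing the file, it helps to confirm that the running kernel actually uses the configured value. The following is a rough sketch, assuming the plain key = value syntax of /etc/sysctl.conf with # or ; comment lines; the function name configured_file_max is hypothetical. When the key appears more than once, the last numeric value is returned, since settings are applied in file order.

```python
from pathlib import Path


def configured_file_max(conf_text):
    """Return the last numeric fs.file-max value in sysctl.conf text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "fs.file-max" and val.strip().isdigit():
            value = int(val.strip())
    return value


if __name__ == "__main__":
    conf = Path("/etc/sysctl.conf")
    live = Path("/proc/sys/fs/file-max")
    if conf.exists() and live.exists():
        wanted = configured_file_max(conf.read_text())
        running = int(live.read_text().strip())
        if wanted is None:
            print(f"fs.file-max is not set in {conf}; running value is {running}")
        elif wanted == running:
            print(f"fs.file-max = {running} (matches {conf})")
        else:
            print(f"mismatch: {conf} says {wanted}, kernel uses {running}")
```

A mismatch usually means the file was edited but not yet loaded, or that another configuration file sets the same key.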
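Root cause
The Oracle RAC node reboot is initiated by the cssdagent or cssdmonitor task. The system-wide file handle limit (fs.file-max) had been reached, so Oracle Clusterware could not obtain a file descriptor when it tried to open a file or network socket. As a safety measure, Oracle Clusterware reboots the node to prevent potential data corruption. This is a self-protection mechanism.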
Original article by 保哥. If you repost it, please credit the source: https://www.shizhanxia.com/1806.html