Oracle RAC node reboots automatically due to "VFS: file-max limit 65536 reached"

An Oracle RAC node rebooted, and the kernel ring buffer (dmesg) showed the following messages.

[178763.197155] VFS: file-max limit 6815744 reached
[178764.964469] SysRq : Trigger a crash
[178764.964511] BUG: unable to handle kernel NULL pointer dereference at (null)
[178764.964536] IP: [] sysrq_handle_crash+0x16/0x30
[178764.964564] PGD 0 
[178764.964575] Oops: 0002 [#1] SMP 

After the Oracle RAC node rebooted, the cluster alert.log file showed the following messages.

kernel: VFS: file-max limit 65536 reached
logger: Oracle clsomon failed with fatal status 13.
logger: Oracle CRS failure. Rebooting for cluster integrity.


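Diagnostic steps

Monitor file descriptor consumption while the system is running to see whether it keeps growing and approaches the value of fs.file-max. The following command shows the current usage: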

[root@shizhanxia.com ]# watch cat /proc/sys/fs/file-nr
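
To make the trend easier to read, the three fields of /proc/sys/fs/file-nr (allocated handles, allocated-but-unused handles, and the fs.file-max limit) can be reduced to one percentage. The following is a minimal sketch, not part of the original diagnosis; the function name file_nr_usage is hypothetical, and its optional path argument exists only so the function can be tried against a sample file.

```shell
#!/usr/bin/env bash
# Print system-wide file handle usage as an integer percentage of fs.file-max.
# /proc/sys/fs/file-nr contains: <allocated> <allocated-but-unused> <max>
file_nr_usage() {
    local src="${1:-/proc/sys/fs/file-nr}"   # optional path (hypothetical helper argument)
    local allocated="" unused="" max=""
    read -r allocated unused max < "$src" || true
    # Bail out if the file could not be read or parsed.
    [ -n "$max" ] || return 1
    echo $(( (allocated - unused) * 100 / max ))
}
```

Running `watch -n 5 'cat /proc/sys/fs/file-nr'` next to periodic calls to file_nr_usage shows both the raw counters and how close the node is to the limit.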

Kernel crash dump (vmcore) analysis: filter the output of the sys command for the PANIC string.

crash> sys | grep -e RELEASE -e PANIC
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
       PANIC: "SysRq : Trigger a crash"

Switch to the panicking task with set -p, identify its executable, and examine its backtrace with the bt command.

crash> set -p
    PID: 25539
COMMAND: "cssdmonitor"
   TASK: ffff98b30f929080  [THREAD_INFO: ffff98a6a1638000]
    CPU: 38
  STATE: TASK_RUNNING (SYSRQ)

crash> px ((struct file *)((struct task_struct *)0xffff98b30f929080)->mm->exe_file)->f_path.dentry
$1 = (struct dentry *) 0xffff98543109b740

crash> files -d 0xffff98543109b740
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff98543109b740 ffff98546c1b7cd0 ffff985e89a49000 REG  /grid19c/app/grid/product/19.3.0.0/bin/cssdmonitor

crash> bt
PID: 25539    TASK: ffff98b30f929080  CPU: 38   COMMAND: "cssdmonitor"
 #0 [ffff98a6a163bae0] machine_kexec at ffffffff83869514
 #1 [ffff98a6a163bb40] __crash_kexec at ffffffff83929e82
 #2 [ffff98a6a163bc10] crash_kexec at ffffffff83929f78
 #3 [ffff98a6a163bc28] oops_end at ffffffff83fbc818
 #4 [ffff98a6a163bc50] no_context at ffffffff8387974c
 #5 [ffff98a6a163bca0] __bad_area_nosemaphore at ffffffff83879a2a
 #6 [ffff98a6a163bcf0] bad_area_nosemaphore at ffffffff83879b54
 #7 [ffff98a6a163bd00] __do_page_fault at ffffffff83fbf8d0
 #8 [ffff98a6a163bd70] do_page_fault at ffffffff83fbfb05
 #9 [ffff98a6a163bda0] page_fault at ffffffff83fbb7b8
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff83c90f46  RSP: ffff98a6a163be58  RFLAGS: 00010246
    RAX: ffffffff83c90f30  RBX: ffffffff844e74a0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff98cdbff938d8  RDI: 0000000000000063
    RBP: ffff98a6a163be58   R8: ffffffff8483587c   R9: ffff98c0b59217c0
    R10: 0000000000000caa  R11: 0000000000000ca9  R12: 0000000000000063
    R13: 0000000000000000  R14: 0000000000000004  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff98a6a163be60] __handle_sysrq at ffffffff83c9182d
#11 [ffff98a6a163be90] write_sysrq_trigger at ffffffff83c91c98
#12 [ffff98a6a163bea8] proc_reg_write at ffffffff83ad7780
#13 [ffff98a6a163bec8] vfs_write at ffffffff83a5bcc0
#14 [ffff98a6a163bf08] sys_write at ffffffff83a5ca75
#15 [ffff98a6a163bf50] system_call_fastpath at ffffffff83fc539a
    RIP: 00007ffb804db6fd  RSP: 00007ffb7ccb1420  RFLAGS: 00000202
    RAX: 0000000000000001  RBX: 00007ffb7ccb1990  RCX: 000000000000000e
    RDX: 0000000000000001  RSI: 00007ffb86040dd4  RDI: 0000000000000036
    RBP: 00007ffb7ccb1930   R8: 0000000000000001   R9: 00007ffb6803d360
    R10: 00007ffb7ccb13a0  R11: 0000000000000293  R12: 0000559994745b14
    R13: 0000000000000002  R14: 0000000000000002  R15: 00005599968d2cb0
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

crash> files -R sysrq
PID: 25539    TASK: ffff98b30f929080  CPU: 38   COMMAND: "cssdmonitor"
ROOT: /    CWD: /grid19c/grid19c_base/crsdata/upi-db5-rac5/core 
 FD       FILE            DENTRY           INODE       TYPE PATH
 54 ffff98b1beb61e00 ffff98519006a840 ffff98b16da3cf10 REG  /proc/sysrq-trigger

Check the kernel ring buffer (dmesg) messages with the log command:

crash> log
..
[178763.197155] VFS: file-max limit 6815744 reached  <<< - - -
[178764.964469] SysRq : Trigger a crash
[178764.964511] BUG: unable to handle kernel NULL pointer dereference at (null)
[178764.964536] IP: [] sysrq_handle_crash+0x16/0x30
[178764.964564] PGD 0 
[178764.964575] Oops: 0002 [#1] SMP 

Check the number of open files:

crash> pd files_stat.nr_files
$2 = 6815744

Check the maximum file handle limit (fs.file-max).

crash> pd files_stat.max_files
$3 = 6815744

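The number of open files (nr_files) has reached the maximum file handle limit (max_files). Next, determine the total number of file handles open across all tasks.

Example: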

crash> foreach files | grep -E '^\s*+[0-9]+' -c
13198564
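
Note: the fs.file-max limit applies only to non-root users; the kernel does not enforce it for processes that hold the CAP_SYS_ADMIN capability.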

Identify the application that is using the most file handles.

crash> foreach files 

Example:

crash> foreach oracle files -R /dev/null | grep -E '^\s*+[0-9]+' | awk '{print $NF}' | uniq -c
6434024 /dev/null  <<< - - -

crash> ps -Gu oracle
      PID    PPID  CPU       TASK        ST  %MEM      VSZ      RSS  COMM
   169221  462110   1  ffff98a614719080  IN   0.0 234695984    68104  oracle
   244602   40719   7  ffff98537ffc4200  IN   0.0 237381808    74136  oracle
   257016  233337   1  ffff98a71b448000  IN   0.0 239870992    82936  oracle
   264211  488743   8  ffff983b41d58000  IN   0.0 255344060   112144  oracle
   363918  342321   1  ffff98a970710000  IN   0.0 235608988    68836  oracle
>  478013       1  62  ffff98ada5876300  RU   0.0  1562140     15356  oracle
>  478396  101573  11  ffff984d33ac3180  RU   0.0        0         0  oracle
>  478398  138596  54  ffff98a9b6062100  RU   0.0 248522888   120192  oracle
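
On a live node, before a crash dump exists, a similar ranking can be approximated from /proc. The sketch below is an illustration rather than the method used in the analysis above: it counts the entries under /proc/<pid>/fd for every process and prints the largest consumers. The function names fd_count and top_fd_users are hypothetical, and the script must run as root to see descriptors owned by other users.

```shell
#!/usr/bin/env bash
# Count the open file descriptors of one PID by listing /proc/<pid>/fd.
# Prints 0 if the process has exited or its fd directory is not readable.
fd_count() {
    local pid="$1"
    ls "/proc/$pid/fd" 2>/dev/null | wc -l
}

# Print "<fd count> <pid> <command name>" for the N processes holding the
# most descriptors (N defaults to 10), largest first.
top_fd_users() {
    local n="${1:-10}" dir pid
    for dir in /proc/[0-9]*; do
        pid="${dir#/proc/}"
        printf '%s %s %s\n' "$(fd_count "$pid")" "$pid" "$(cat "$dir/comm" 2>/dev/null)"
    done | sort -rn | head -n "$n"
}
```

For example, `top_fd_users 5` lists the five heaviest consumers; `ls -l /proc/<pid>/fd` on one of them then shows which paths (such as /dev/null in the dump above) dominate.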

Solution

Increase the file descriptor limit by editing the /etc/sysctl.conf configuration file, as shown below.

fs.file-max = xxxxx
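
One possible way to script this change is sketched below. The helper set_sysctl_value and the variable NEW_MAX are hypothetical names introduced for illustration; the right value depends on the workload and is not specified here. The helper replaces an existing assignment or appends a new one, and the commented commands then load and verify the setting with sysctl.

```shell
#!/usr/bin/env bash
# Set "<key> = <value>" in a sysctl configuration file, replacing an
# existing assignment for the key or appending one if none exists.
set_sysctl_value() {
    local conf="$1" key="$2" value="$3" key_re
    # Escape dots so the key is matched literally in the regular expressions.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$conf" 2>/dev/null; then
        sed -i -E "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" "$conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Usage (as root); NEW_MAX is a placeholder for the chosen limit:
#   set_sysctl_value /etc/sysctl.conf fs.file-max "$NEW_MAX"
#   sysctl -p              # reload /etc/sysctl.conf without rebooting
#   sysctl fs.file-max     # confirm the running value
```

Raising the limit only buys headroom; if one application keeps accumulating descriptors, as the oracle processes holding /dev/null do in the dump above, the new limit will eventually be reached as well.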

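Root cause

The Oracle RAC node reboot is initiated by the cssdagent or cssdmonitor task. The Oracle clusterware ran out of file descriptors while trying to open a file or network socket, because the system-wide maximum (file-max) had been reached. As a safety measure, Oracle clusterware reboots the node to prevent potential data corruption. This is a self-protection mechanism.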

Original article by 保哥. If you repost it, please credit the source: https://www.shizhanxia.com/1806.html
