Questions about filesystems on production Linux servers

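Our system disks use xfs (the OS is Red Hat 7) and the data disks use ext4. Yesterday every process on one machine suddenly hung: CPU utilization dropped from 40-something percent to single digits, nearly all processes were in state D, and the load average climbed from a few dozen to over 20,000. The system log shows jbd2 blocked for more than 120 s, and other processes (Flink on K8s) were blocked for 120 s as well. The machine recovered after a reboot. We now suspect a problem in the kernel's ext4 filesystem, but we can't be 100% sure. From my point of view, switching from ext4 to xfs would sidestep the problem. Any advice would be appreciated. For now it looks like we will be stuck on Red Hat 7 for a long time, and the kernel tops out at 3.10.0-1160.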
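For anyone who wants to check the "processes in state D" symptom on a live machine, here is a minimal Python sketch that scans `/proc` for tasks in uninterruptible sleep. The function names are made up for illustration, and it only looks at process entries under `/proc`, not at individual threads.

```python
from pathlib import Path


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, comm, state = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `uninterruptible_tasks()` during a hang should return a long list; on a healthy machine it is usually empty or nearly so.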

Supplement 1 · Aug 16, 2024

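We currently suspect this is related to https://access.redhat.com/solutions/4054181, i.e. a kernel bug. The difficulty is that it is hard to reproduce, so we can't pin it down with 100% certainty. Also, our kernel isn't as new as Red Hat's official ones; it only goes up to the Red Hat 7.9 kernel (1160).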

ext4

jbd2

diagnosis

25 replies • 2024-08-19 18:36:55 +08:00

iyiluo

Aug 16, 2024

You should first suspect that the disk is about to fail. That is far more likely than hitting a filesystem bug.

barrysj

Aug 16, 2024

Agreed, a disk problem is the more likely explanation.
Do you have monitoring data such as CPU iowait and disk read/write latency?
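If no monitoring is in place, iowait can be sampled by hand. A rough Python sketch, assuming the usual `/proc/stat` layout where the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq and steal in that order (the function names are illustrative only):

```python
def cpu_times(stat_text):
    """Extract the first eight counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(value) for value in parts[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Usage on a live system (not executed here):
#   import time
#   first = open("/proc/stat").read()
#   time.sleep(5)
#   print(iowait_percent(first, open("/proc/stat").read()))
```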

zzlyzq

Aug 16, 2024

The disk may be faulty; this may not be a filesystem problem.

zhoudaiyu

PRO

Aug 16, 2024

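@iyiluo #1 The out-of-band disk information looked normal at the time, and the machine recovered after a reboot. It has been running fine for a day since.
@barrysj #2 Yes. The monitoring data shows I/O on one of the disks dropping off at that moment: reads and writes suddenly stopped and the processes all hung. There was no iowait at the time.
The disks are three SAS SSDs in a RAID array.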

hefish

Aug 16, 2024

OP already has the answer in mind.

liuchao719

Aug 16, 2024

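Is there some requirement that forces you onto such an old kernel? Developers generally pay less attention to old versions, so I'd use a newer one whenever possible. Many of the problems I ran into on past projects were caused by outdated versions and went away after an upgrade.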

zhoudaiyu

PRO

Aug 16, 2024

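@hefish #5 It's only a suspicion, since a reboot fixed it.
@liuchao719 #6 Nobody dares to run a new system and nobody wants to take the blame, so everyone sticks with the old one.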

Hormazed

PRO

Aug 16, 2024

We also run Red Hat Enterprise Linux Server release 7.9 (Maipo). To avoid kernel problems we have recently been upgrading the kernel to 6.6.8, about 20 machines so far.
Linux version 6.6.8 (root@VTW12NET) (gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9), GNU ld version 2.27-44.base.el7_9.1) #1 SMP PREEMPT_DYNAMIC Thu Dec 20 12:01:06 CST 2023
In our usage we haven't hit any problems. I/O averages about 300 GB per week.

Hormazed

PRO

Aug 16, 2024

Resolution
Red Hat Enterprise Linux 7

This issue has been resolved with the errata RHSA-2020:1016 for the package(s) kernel-3.10.0-1127.el7 or later.
Red Hat Enterprise Linux 8

This issue has been resolved with the errata RHSA-2019:3517 for the package(s) kernel-4.18.0-147.el8 or later.
Root Cause
Ext4/jbd2 deadlock involving the jbd2 checkpoint thread and the jbd2 commit thread.
Each thread is waiting for the other to move forward:
The checkpoint thread acquires the j_checkpoint_mutex and waits for the commit thread to finish (j_wait_done_commit waitqueue), but the commit thread cannot progress since it's trying to acquire the same mutex (j_checkpoint_mutex) owned by the checkpoint thread, leading to the deadlock.
The deadlock is resolved by upstream commit 53cf978457325d8fb2cdecd7981b31a8229e446e (jbd2: fix deadlock while checkpoint thread waits commit thread to finish).
On Red Hat Enterprise Linux 7, the patch resolving this issue has been backported as part of a more general update in (private) BZ 1747387 - ext4 jbd2: stable update for 7.8
On Red Hat Enterprise Linux 8, the patch resolving this issue has been backported as part of a more general update in (private) BZ 1698815 - [ext4][jbd2] Stable update for rhel8.1
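To compare a running kernel against the errata versions quoted above, a small Python sketch like the following could be used. It assumes Red Hat style release strings of the form `3.10.0-1160.el7.x86_64` (as printed by `uname -r`), compares only the first build number after the dash, and the function names are invented here. A matching version number only shows the release is at or past the errata; it says nothing about whether a rebuilt or third-party kernel carries the same backport.

```python
import re

# major.minor.patch-build, e.g. "3.10.0-1160"; anything after that is ignored.
_RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def release_key(release):
    """Turn '3.10.0-1160.el7.x86_64' into the tuple (3, 10, 0, 1160)."""
    match = _RELEASE.match(release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(group) for group in match.groups())


def at_or_past(release, fixed="3.10.0-1127.el7"):
    """True if `release` is the same as or newer than the `fixed` release."""
    return release_key(release) >= release_key(fixed)
```

For a RHEL 8 machine the caller would pass `fixed="4.18.0-147.el8"` instead of the default.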

Kumo31

Aug 16, 2024

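There is too little information here to tell where the problem lies. That said, don't put blind faith in the kernel: a kernel this old has plenty of bugs. Working on storage, we have run into a whole basketful of kernel bugs.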

JackSlowFcck

Aug 16, 2024

How about trying a different disk?

zhoudaiyu

PRO

Aug 16, 2024

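@Hormazed #8 We're not really comfortable running the elrepo kernels in production, much as I'd like to use a newer one.
@Hormazed #9 Our kernel is already newer than that; it's 1160.
@JackSlowFcck #11 That would take too long. The machine is hosted in another data center, and some of the services on it are single points of failure, so we have to keep using it for now.
@Kumo31 #10 Most likely we will stay on 3.10 indefinitely. We also have Xinchuang (domestic IT stack) machines on a 4.19 kernel; we reported bugs there too, and the vendor didn't act on them.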

lrvy

Aug 16, 2024

If it's RHEL, just ask their support to look into it 😁

ruidoBlanco

Aug 16, 2024

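You're on Red Hat with a kernel problem: first, upgrade to the latest kernel for your release and see whether the problem persists; second, contact Red Hat. How much can asking the community help?

And even if you do ask the community, you post no logs at all and leave people to guess?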

zhoudaiyu

PRO

Aug 16, 2024 via iPhone

@ruidoBlanco I'll add them shortly.
@lrvy It's a long story. In short, we can use Red Hat's packages, but they won't deal with the problems we run into.

julyclyde

Aug 16, 2024

Run a -t long diagnostic.
It's unlikely that nothing is wrong.

msg7086

Aug 16, 2024

If you don't dare to use the elrepo kernel, how about looking at the UEK kernel?

zhoudaiyu

PRO

Aug 16, 2024

Here is an excerpt of the kernel log from when the machine hung:
Aug 15 09:33:53 node16 kernel: INFO: task jbd2/dm-2-8:1839 blocked for more than 120 seconds.
Aug 15 09:33:53 node16 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:33:53 node16 kernel: jbd2/dm-2-8 D ffff8e7efea1acc0 0 1839 2 0x00000000
Aug 15 09:33:53 node16 kernel: Call Trace:
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8d2ba0>] ? task_rq_unlock+0x20/0x20
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef87169>] schedule+0x29/0x70
Aug 15 09:33:53 node16 kernel: [<ffffffffc118433c>] jbd2_journal_commit_transaction+0x23c/0x19c0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8e111e>] ? account_entity_dequeue+0xae/0xd0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8e48bc>] ? dequeue_entity+0x11c/0x5c0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8e5ec1>] ? put_prev_entity+0x31/0x400
Aug 15 09:33:53 node16 kernel: [<ffffffff8e82b59e>] ? __switch_to+0xce/0x580
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c6d10>] ? wake_up_atomic_t+0x30/0x30
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef86c8f>] ? __schedule+0x3af/0x860
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8adf0e>] ? try_to_del_timer_sync+0x5e/0x90
Aug 15 09:33:53 node16 kernel: [<ffffffffc118af89>] kjournald2+0xc9/0x260 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c6d10>] ? wake_up_atomic_t+0x30/0x30
Aug 15 09:33:53 node16 kernel: [<ffffffffc118aec0>] ? commit_timeout+0x10/0x10 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c5c21>] kthread+0xd1/0xe0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c5b50>] ? insert_kthread_work+0x40/0x40
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef93ddd>] ret_from_fork_nospec_begin+0x7/0x21
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c5b50>] ? insert_kthread_work+0x40/0x40
Aug 15 09:33:53 node16 kernel: INFO: task containerd:225811 blocked for more than 120 seconds.
Aug 15 09:33:53 node16 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:33:53 node16 kernel: containerd D ffff8e7efef1acc0 0 225811 1 0x00000080
Aug 15 09:33:53 node16 kernel: Call Trace:
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef87169>] schedule+0x29/0x70
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c6d10>] ? wake_up_atomic_t+0x30/0x30
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181378>] add_transaction_credits+0x278/0x310 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea2830b>] ? __kmalloc+0x1eb/0x230
Aug 15 09:33:53 node16 kernel: [<ffffffffc11dd8c4>] ? ext4_htree_store_dirent+0x34/0x120 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181601>] start_this_handle+0x1a1/0x430 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea287c2>] ? kmem_cache_alloc+0x1c2/0x1f0
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181ab3>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ? ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc1217759>] __ext4_journal_start_sb+0x69/0xe0 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea7f00d>] __mark_inode_dirty+0x15d/0x270
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6b8e9>] update_time+0x89/0xd0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6bdfa>] touch_atime+0x10a/0x220
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea63694>] iterate_dir+0xe4/0x130
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea63c8c>] SyS_getdents64+0x9c/0x120
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea63900>] ? fillonedir+0x110/0x110
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef93f92>] system_call_fastpath+0x25/0x2a
Aug 15 09:33:53 node16 kernel: INFO: task containerd:2700571 blocked for more than 120 seconds.
Aug 15 09:33:53 node16 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:33:53 node16 kernel: containerd D ffff8e3eff79acc0 0 2700571 1 0x00000080
Aug 15 09:33:53 node16 kernel: Call Trace:
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef87169>] schedule+0x29/0x70
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c6d10>] ? wake_up_atomic_t+0x30/0x30
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181378>] add_transaction_credits+0x278/0x310 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef86c8f>] ? __schedule+0x3af/0x860
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181601>] start_this_handle+0x1a1/0x430 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef87169>] ? schedule+0x29/0x70
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef84c51>] ? schedule_timeout+0x221/0x2d0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea287c2>] ? kmem_cache_alloc+0x1c2/0x1f0
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181ab3>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ? ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc1217759>] __ext4_journal_start_sb+0x69/0xe0 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea7f00d>] __mark_inode_dirty+0x15d/0x270
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6b8e9>] update_time+0x89/0xd0
Aug 15 09:33:53 node16 kernel: [<ffffffff8eb8cfe4>] ? __radix_tree_lookup+0x84/0xf0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6b9d0>] file_update_time+0xa0/0xf0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9c03d8>] __generic_file_aio_write+0x198/0x400
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9c0699>] generic_file_aio_write+0x59/0xa0
Aug 15 09:33:53 node16 kernel: [<ffffffffc11de5c8>] ext4_file_write+0x348/0x600 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea010bc>] ? page_add_file_rmap+0x8c/0xc0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9f339e>] ? do_numa_page+0x1be/0x250
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4d063>] do_sync_write+0x93/0xe0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4db50>] vfs_write+0xc0/0x1f0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4eaf2>] SyS_pwrite64+0x92/0xc0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef93f92>] system_call_fastpath+0x25/0x2a
Aug 15 09:33:53 node16 kernel: INFO: task dcgm-exporter:68381 blocked for more than 120 seconds.
Aug 15 09:33:53 node16 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 15 09:33:53 node16 kernel: dcgm-exporter D ffff8e3eff81acc0 0 68381 57193 0x00000080
Aug 15 09:33:53 node16 kernel: Call Trace:
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef87169>] schedule+0x29/0x70
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181085>] wait_transaction_locked+0x85/0xd0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8c6d10>] ? wake_up_atomic_t+0x30/0x30
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181378>] add_transaction_credits+0x278/0x310 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181601>] start_this_handle+0x1a1/0x430 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e8e8a43>] ? load_balance+0x1a3/0xa10
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea287c2>] ? kmem_cache_alloc+0x1c2/0x1f0
Aug 15 09:33:53 node16 kernel: [<ffffffffc1181ab3>] jbd2__journal_start+0xf3/0x1f0 [jbd2]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ? ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc1217759>] __ext4_journal_start_sb+0x69/0xe0 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffffc11eb0ba>] ext4_dirty_inode+0x2a/0x60 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea7f00d>] __mark_inode_dirty+0x15d/0x270
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6b8e9>] update_time+0x89/0xd0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea6b9d0>] file_update_time+0xa0/0xf0
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9c03d8>] __generic_file_aio_write+0x198/0x400
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9c0699>] generic_file_aio_write+0x59/0xa0
Aug 15 09:33:53 node16 kernel: [<ffffffffc11de5c8>] ext4_file_write+0x348/0x600 [ext4]
Aug 15 09:33:53 node16 kernel: [<ffffffff8e9f339e>] ? do_numa_page+0x1be/0x250
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4d063>] do_sync_write+0x93/0xe0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4db50>] vfs_write+0xc0/0x1f0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ea4e92f>] SyS_write+0x7f/0xf0
Aug 15 09:33:53 node16 kernel: [<ffffffff8ef93f92>] system_call_fastpath+0x25/0x2a
Aug 15 09:33:53 node16 kernel: INFO: task gcs_server:21032 blocked for more than 120 seconds.
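A log like this is easier to read once it is reduced to "which task is stuck in which function". Below is a rough Python sketch that does so, assuming the trace format shown above (frames prefixed with `?` are unreliable stack entries and are skipped, as are the scheduler entry points). The regular expressions and the function name are written for this example and may need adjusting for other kernel versions.

```python
import re

TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)
# Frames that only say "the task went to sleep", not why.
SCHEDULER = {"schedule", "__schedule", "schedule_timeout", "io_schedule"}


def blocked_in(log_text):
    """Map 'comm:pid' to the first reliable non-scheduler frame of its trace."""
    result = {}
    current = None
    for line in log_text.splitlines():
        task = TASK.search(line)
        if task:
            current = f"{task.group(1)}:{task.group(2)}"
            result[current] = None  # stays None if the trace is cut off
            continue
        frame = FRAME.search(line)
        if frame is None or current is None or result[current] is not None:
            continue
        unreliable, symbol, module = frame.groups()
        if unreliable or symbol in SCHEDULER:
            continue
        result[current] = f"{symbol} [{module}]" if module else symbol
    return result
```

On the excerpt above this should report the jbd2 thread in `jbd2_journal_commit_transaction` and the containerd and dcgm-exporter tasks in `wait_transaction_locked`, i.e. everything queued up behind the journal commit.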

ruidoBlanco

Aug 16, 2024

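@zhoudaiyu Upgrade the kernel. This doesn't look like something with a simple fix. Even if you manage to pin down the bug, you will still end up having to upgrade the kernel.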

zhoudaiyu

PRO

Aug 16, 2024

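@ruidoBlanco #19 We're already on 1160, the newest Red Hat 7 kernel. Going any higher means elrepo or the UEK kernel mentioned above, and management probably won't back an upgrade.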

zhoudaiyu

PRO

Aug 16, 2024

@msg7086 #17 That's Oracle's; it looks like it may be more stable. I'll look into it.

blackeeper

Aug 16, 2024

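It could be a faulty memory module. Try cp-ing a large file, one larger than your RAM, and see whether the system hangs.
I once had a machine with two memory modules that showed the same symptoms as yours, also with xfs as the filesystem. After running for a while, all processes suddenly hung; some shell commands still worked, but the disk couldn't be written. I assumed it was a filesystem or disk fault, but in the end it turned out that one of the memory modules had failed.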

mdeche101644

Aug 17, 2024

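Marking this to see whether an expert explains it later. It should have produced a coredump, right? You may need to analyze that to see which process or action caused the dirty inode seen in the log. Is the log saying that dcgm deleted some node and then something went wrong in an operation after the deletion? Corrections welcome.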

zhoudaiyu

PRO

Aug 17, 2024 via iPhone

@blackeeper I tried that and it was fine.
@mdeche101644 None was produced. As far as I remember, you only get one when the kernel crashes.

Emiya1208

Aug 19, 2024

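0.1 A failed disk would show up in dmesg.
0.2 For memory, you may need to add mcelog monitoring; the memory could be bad.

The above are replies to the guesses in the earlier posts.

In reality this is most likely still a kernel bug.

1. Red Hat's answer is already very clear and includes diagnostic steps. You can use those steps to confirm.

2. JBD2's locking and synchronization can lead to deadlock in some situations. xfs does not use JBD2, so switching to XFS will very likely resolve the problem.

3. The jbd2 journal thread is waiting to acquire the j_checkpoint_mutex lock, which is currently held by some task (for example a java process).
The task holding the lock is waiting to be woken on the j_wait_done_commit queue, but that wakeup cannot happen until the journal thread gets j_checkpoint_mutex, so the two form a deadlock.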
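The cycle in point 3 can be shown with a toy Python model. This is only an illustration of the wait-for cycle and not kernel code: a `threading.Lock` stands in for j_checkpoint_mutex, a `threading.Event` stands in for the j_wait_done_commit wakeup, and timeouts (the made-up "patience" parameters) are added so that the demo terminates instead of hanging the way the real machine did.

```python
import threading


def simulate(commit_patience=0.1, checkpoint_patience=0.5):
    checkpoint_mutex = threading.Lock()  # stand-in for j_checkpoint_mutex
    commit_done = threading.Event()      # stand-in for j_wait_done_commit
    mutex_taken = threading.Event()      # test scaffolding: fixes the ordering
    outcome = {}

    def checkpoint_thread():
        # Takes the mutex, then waits for the commit to finish while holding it.
        with checkpoint_mutex:
            mutex_taken.set()
            outcome["checkpoint_saw_commit"] = commit_done.wait(checkpoint_patience)

    def commit_thread():
        # Needs the same mutex before it can finish and signal completion.
        mutex_taken.wait()
        got = checkpoint_mutex.acquire(timeout=commit_patience)
        outcome["commit_got_mutex"] = got
        if got:
            commit_done.set()
            checkpoint_mutex.release()

    threads = [
        threading.Thread(target=checkpoint_thread),
        threading.Thread(target=commit_thread),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcome
```

With the default timeouts neither side makes progress: the commit side never gets the mutex, so the checkpoint side never sees the commit complete. Without the timeouts both threads would wait forever, which is what the hung-task messages are reporting.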