pve 7.4.3
Linux pve 5.15.102-1-pve #1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z) x86_64 GNU/Linux
IOMMU is enabled and the ACS override patch is applied:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt initcall_blacklist=sysfb_init video=simplefb:off pcie_acs_override=downstream,multifunction split_lock_detect=off"
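For anyone comparing against this setup, it can help to look at how devices are grouped, since `pcie_acs_override` changes the IOMMU grouping the kernel reports. A minimal sketch, assuming the standard sysfs layout under `/sys/kernel/iommu_groups`; the function name `list_iommu_groups` is made up for illustration:

```shell
#!/bin/bash
# Print every PCI device, grouped by IOMMU group.
# The optional first argument overrides the sysfs root (useful for testing).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local group dev
    for group in "$root"/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -d "$group/devices" ] || continue
        echo "IOMMU group ${group##*/}:"
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            echo "  ${dev##*/}"
        done
    done
}
```

Calling `list_iommu_groups` with no argument reads the live system. Groups are printed in glob order, not numeric order.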
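The Windows desktop VM has a 1660S GPU and a USB keyboard and mouse passed through; the GPU drives the monitor directly over DP.
The Linux desktop VM has a Tesla P4 GPU passed through, used for Emby decoding.
Both the Windows desktop VM and the Linux desktop VM freeze, the keyboard and mouse stop responding, and the PVE host can no longer be reached over SSH.
Every crash produces a similar trace. The traces from the two most recent lockups are below:
August 2: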
Aug 2 14:37:09 pve kernel: [1260278.267429] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
Aug 2 14:37:09 pve kernel: [1260278.267433] Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptabl
e_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal
intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi kvm snd_hda_intel ast crct10dif_pclmul snd_intel_dspcfg ghash_clmulni_intel
snd_usb_audio drm_vram_helper snd_intel_sdw_acpi aesni_intel snd_hda_codec drm_ttm_helper snd_usbmidi_lib ttm crypto_simd snd_rawmidi snd_hda_core cryptd snd_seq_device drm_kms_helper snd_hw
dep mc cec snd_pcm rc_core rapl rndis_host snd_timer fb_sys_fops syscopyarea cdc_ether mei_me snd sysfillrect usbnet isst_if_mbox_pci isst_if_mmio sysimgblt intel_cstate mii soundcore pcspkr
joydev efi_pstore input_leds acpi_ipmi isst_if_common intel_pch_thermal mei ioatdma
Aug 2 14:37:09 pve kernel: [1260278.267474] ipmi_si ipmi_devintf zfs(PO) ipmi_msghandler acpi_power_meter acpi_pad zunicode(PO) zzstd(O) mac_hid zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpai
r(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_
type1 vfio drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio usbkbd libcrc32c usbmouse hi
d_generic usbhid hid crc32_pclmul nvme xhci_pci igb i2c_i801 xhci_pci_renesas nvme_core i2c_algo_bit i2c_smbus ahci dca libahci xhci_hcd intel_pmt wmi
Aug 2 14:37:09 pve kernel: [1260278.267507] CPU: 3 PID: 4137035 Comm: CPU 11/KVM Tainted: P W O 5.15.102-1-pve #1
Aug 2 14:37:09 pve kernel: [1260278.267510] Hardware name: Supermicro X12DAi-N6/X12DAi-N6, BIOS 1.1b 09/10/2021
Aug 2 14:37:09 pve kernel: [1260278.267511] RIP: 0010:_raw_spin_lock+0x0/0x30
Aug 2 14:37:09 pve kernel: [1260278.267516] Code: 00 f0 0f b1 17 75 05 c3 cc cc cc cc 55 89 c6 48 89 e5 e8 43 5b 39 ff 66 90 5d c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00
00 31 c0 ba 01 00 00 00 f0 0f b1 17 75 05 c3 cc cc cc
Aug 2 14:37:09 pve kernel: [1260278.267518] RSP: 0018:ff724a9600540d18 EFLAGS: 00000046
Aug 2 14:37:09 pve kernel: [1260278.267520] RAX: ff724a96000c5000 RBX: 0000000000000004 RCX: ff22181d4004b400
Aug 2 14:37:09 pve kernel: [1260278.267521] RDX: ff22181d4004b400 RSI: 0000000000000000 RDI: ff22181d4020dcc0
Aug 2 14:37:09 pve kernel: [1260278.267522] RBP: ff724a9600540dc8 R08: 00000000000003ac R09: ff22181d4020dcc0
Aug 2 14:37:09 pve kernel: [1260278.267523] R10: 0000000000000010 R11: 0000000000000004 R12: 00000000000003ac
Aug 2 14:37:09 pve kernel: [1260278.267524] R13: 0000000000000000 R14: ff22181d401d4e00 R15: ff22181d4020dcc0
Aug 2 14:37:09 pve kernel: [1260278.267525] FS: 00007fa3dbdff700(0000) GS:ff22185bbf2c0000(0000) knlGS:ffffd4815cb15000
Aug 2 14:37:09 pve kernel: [1260278.267527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 2 14:37:09 pve kernel: [1260278.267528] CR2: 00001bbd00881000 CR3: 00000039b81ec004 CR4: 0000000000773ee0
Aug 2 14:37:09 pve kernel: [1260278.267529] PKRU: 55555554
Aug 2 14:37:09 pve kernel: [1260278.267530] Call Trace:
Aug 2 14:37:09 pve kernel: [1260278.267532] <IRQ>
Aug 2 14:37:09 pve kernel: [1260278.267532] ? qi_submit_sync+0x328/0x5c0
Aug 2 14:37:09 pve kernel: [1260278.267537] qi_flush_iotlb+0x84/0xa0
Aug 2 14:37:09 pve kernel: [1260278.267539] intel_flush_iotlb_all+0x59/0x160
Aug 2 14:37:09 pve kernel: [1260278.267541] iommu_dma_flush_iotlb_all+0x1a/0x30
Aug 2 14:37:09 pve kernel: [1260278.267544] iova_domain_flush+0x1b/0x30
Aug 2 14:37:09 pve kernel: [1260278.267546] fq_flush_timeout+0x39/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267547] ? fq_ring_free+0x170/0x170
Aug 2 14:37:09 pve kernel: [1260278.267549] call_timer_fn+0x29/0x120
Aug 2 14:37:09 pve kernel: [1260278.267554] __run_timers.part.0+0x1e1/0x270
Aug 2 14:37:09 pve kernel: [1260278.267555] ? ktime_get+0x43/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267557] ? lapic_next_deadline+0x2c/0x40
Aug 2 14:37:09 pve kernel: [1260278.267561] ? clockevents_program_event+0xa8/0x130
Aug 2 14:37:09 pve kernel: [1260278.267564] run_timer_softirq+0x2a/0x60
Aug 2 14:37:09 pve kernel: [1260278.267565] __do_softirq+0xd6/0x2ea
Aug 2 14:37:09 pve kernel: [1260278.267568] irq_exit_rcu+0x94/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267570] sysvec_apic_timer_interrupt+0x80/0x90
Aug 2 14:37:09 pve kernel: [1260278.267574] </IRQ>
Aug 2 14:37:09 pve kernel: [1260278.267575] <TASK>
Aug 2 14:37:09 pve kernel: [1260278.267575] asm_sysvec_apic_timer_interrupt+0x1b/0x20
Aug 2 14:37:09 pve kernel: [1260278.267577] RIP: 0010:vmx_do_interrupt_nmi_irqoff+0x10/0x20 [kvm_intel]
Aug 2 14:37:09 pve kernel: [1260278.267590] Code: 41 5b 41 5a 41 59 41 58 5e 5f 5a 59 58 5d e9 47 da c7 dc 0f 1f 80 00 00 00 00 55 48 89 e5 48 83 e4 f0 6a 18 55 9c 6a 10 ff d7 <0f> 1f 00 48
89 ec 5d e9 24 da c7 dc 0f 1f 40 00 0f 1f 44 00 00 55
Aug 2 14:37:09 pve kernel: [1260278.267591] RSP: 0018:ff724a9606cefcd8 EFLAGS: 00000082
Aug 2 14:37:09 pve kernel: [1260278.267593] RAX: 0000000000000e30 RBX: ff22181ef2ce8000 RCX: 0000000000000000
Aug 2 14:37:09 pve kernel: [1260278.267594] RDX: ffffffff00000000 RSI: 0001000000000000 RDI: ffffffff9e000e30
Aug 2 14:37:09 pve kernel: [1260278.267595] RBP: ff724a9606cefcd8 R08: 000006901fa2c781 R09: 0000000000000000
Aug 2 14:37:09 pve kernel: [1260278.267595] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000800000ec
Aug 2 14:37:09 pve kernel: [1260278.267596] R13: 0000000000000000 R14: ff724a960333fb48 R15: ff22181ef2ce8038
Aug 2 14:37:09 pve kernel: [1260278.267598] ? asm_sysvec_spurious_apic_interrupt+0x20/0x20
Aug 2 14:37:09 pve kernel: [1260278.267601] vmx_handle_exit_irqoff+0x175/0x2e0 [kvm_intel]
Aug 2 14:37:09 pve kernel: [1260278.267608] kvm_arch_vcpu_ioctl_run+0xd19/0x1730 [kvm]
Aug 2 14:37:09 pve kernel: [1260278.267658] ? kvm_arch_vcpu_ioctl_run+0x712/0x1730 [kvm]
Aug 2 14:37:09 pve kernel: [1260278.267695] ? __wake_up_locked_key+0x1b/0x30
Aug 2 14:37:09 pve kernel: [1260278.267698] kvm_vcpu_ioctl+0x252/0x6b0 [kvm]
Aug 2 14:37:09 pve kernel: [1260278.267725] ? kvm_vcpu_ioctl+0x2bb/0x6b0 [kvm]
Aug 2 14:37:09 pve kernel: [1260278.267752] ? vfs_write+0xc8/0x270
Aug 2 14:37:09 pve kernel: [1260278.267755] ? __fget_files+0x86/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267758] __x64_sys_ioctl+0x92/0xd0
Aug 2 14:37:09 pve kernel: [1260278.267761] do_syscall_64+0x59/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267763] ? do_syscall_64+0x69/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267765] ? do_syscall_64+0x69/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267766] ? syscall_exit_to_user_mode+0x27/0x50
Aug 2 14:37:09 pve kernel: [1260278.267768] ? do_syscall_64+0x69/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267770] ? do_syscall_64+0x69/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267771] ? do_syscall_64+0x69/0xc0
Aug 2 14:37:09 pve kernel: [1260278.267773] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Aug 2 14:37:09 pve kernel: [1260278.267775] RIP: 0033:0x7fb41c025237
Aug 2 14:37:09 pve kernel: [1260278.267777] Code: 00 00 00 48 8b 05 59 cc 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0
ff ff 73 01 c3 48 8b 0d 29 cc 0d 00 f7 d8 64 89 01 48
Aug 2 14:37:09 pve kernel: [1260278.267778] RSP: 002b:00007fa3dbdfa288 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 2 14:37:09 pve kernel: [1260278.267780] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb41c025237
Aug 2 14:37:09 pve kernel: [1260278.267781] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000024
Aug 2 14:37:09 pve kernel: [1260278.267782] RBP: 0000561590b8d2d0 R08: 000056158f065240 R09: 00000000ffffffff
Aug 2 14:37:09 pve kernel: [1260278.267783] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
Aug 2 14:37:09 pve kernel: [1260278.267783] R13: 000056158f770020 R14: 0000000000000000 R15: 0000000000000000
Aug 2 14:37:09 pve kernel: [1260278.267785] </TASK>
Aug 2 14:37:09 pve kernel: [1260295.454974] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! [kworker/11:2:4149114]
Aug 2 14:37:09 pve kernel: [1260295.455414] Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs veth ebtable_filter ebtables ip_set ip6table_raw iptabl
e_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal
intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi kvm snd_hda_intel ast crct10dif_pclmul snd_intel_dspcfg ghash_clmulni_intel
snd_usb_audio drm_vram_helper snd_intel_sdw_acpi aesni_intel snd_hda_codec drm_ttm_helper snd_usbmidi_lib ttm crypto_simd snd_rawmidi snd_hda_core cryptd snd_seq_device drm_kms_helper snd_hw
dep mc cec snd_pcm rc_core rapl rndis_host snd_timer fb_sys_fops syscopyarea cdc_ether mei_me snd sysfillrect usbnet isst_if_mbox_pci isst_if_mmio sysimgblt intel_cstate mii soundcore pcspkr
joydev efi_pstore input_leds acpi_ipmi isst_if_common intel_pch_thermal mei ioatdma
Aug 2 14:37:09 pve kernel: [1260295.455479] ipmi_si ipmi_devintf zfs(PO) ipmi_msghandler acpi_power_meter acpi_pad zunicode(PO) zzstd(O) mac_hid zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpai
r(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_
type1 vfio drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio usbkbd libcrc32c usbmouse hi
d_generic usbhid hid crc32_pclmul nvme xhci_pci igb i2c_i801 xhci_pci_renesas nvme_core i2c_algo_bit i2c_smbus ahci dca libahci xhci_hcd intel_pmt wmi
Aug 2 14:37:09 pve kernel: [1260295.455532] CPU: 11 PID: 4149114 Comm: kworker/11:2 Tainted: P W O 5.15.102-1-pve #1
Aug 2 14:37:09 pve kernel: [1260295.455536] Hardware name: Supermicro X12DAi-N6/X12DAi-N6, BIOS 1.1b 09/10/2021
Aug 2 14:37:09 pve kernel: [1260295.455537] Workqueue: rcu_par_gp sync_rcu_exp_select_node_cpus
Aug 2 14:37:09 pve kernel: [1260295.455543] RIP: 0010:smp_call_function_single+0x94/0x130
Aug 2 14:37:09 pve kernel: [1260295.455547] Code: 32 c9 62 a9 00 01 ff 00 0f 85 9e 00 00 00 85 c9 75 4c 48 c7 c6 80 1b 03 00 65 48 03 35 f5 cb c8 62 8b 46 08 a8 01 74 09 f3 90 <8b> 46 08 a8
The most recent one, August 19:
Aug 19 09:28:29 pve kernel: [777177.076338] NMI watchdog: Watchdog detected hard LOCKUP on cpu 32
Aug 19 09:28:29 pve kernel: [777177.076340] Modules linked in: tcp_diag inet_diag cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache net
fs veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bondin
g tls softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powercl
amp coretemp kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd snd_hda_codec_realtek snd_hda_codec_
generic ledtrig_audio snd_hda_codec_hdmi ast drm_vram_helper drm_ttm_helper snd_hda_intel snd_usb_audio ttm snd_intel_dspcfg snd_us
bmidi_lib snd_intel_sdw_acpi drm_kms_helper snd_rawmidi snd_hda_codec snd_seq_device snd_hda_core cec snd_hwdep mc rc_core zfs(PO)
snd_pcm rndis_host fb_sys_fops snd_timer rapl cdc_ether syscopyarea mei_me zunicode(PO) snd sysfillrect usbnet isst_if_mbox_pci iss
t_if_mmio sysimgblt intel_cstate isst_if_common mii soundcore efi_pstore pcspkr joydev ioatdma intel_pch_thermal mei
Aug 19 09:28:29 pve kernel: [777177.076381] input_leds zzstd(O) zlua(O) acpi_ipmi zavl(PO) ipmi_si icp(PO) ipmi_devintf ipmi_msgha
ndler acpi_power_meter acpi_pad zcommon(PO) mac_hid znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm
ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio drm
sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq simplefb hid_generic usbmouse usbkbd dm_thin_po
ol dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbhid hid crc32_pclmul nvme xhci_pci i2c_i801 xhci_pci_renesas igb nvme_cor
e i2c_smbus i2c_algo_bit ahci dca xhci_hcd libahci intel_pmt wmi
Aug 19 09:28:29 pve kernel: [777177.076415] CPU: 32 PID: 0 Comm: swapper/32 Tainted: P W O 5.15.102-1-pve #1
Aug 19 09:28:29 pve kernel: [777177.076417] Hardware name: Supermicro X12DAi-N6/X12DAi-N6, BIOS 1.1b 09/10/2021
Aug 19 09:28:29 pve kernel: [777177.076418] RIP: 0010:qi_submit_sync+0x2db/0x5c0
Aug 19 09:28:29 pve kernel: [777177.076424] Code: 4d 8b 8e 10 01 00 00 31 db 41 f6 46 25 08 0f 95 c3 49 8b 41 10 83 c3 04 42 83 3c
20 03 0f 84 a3 01 00 00 49 8b 06 44 8b 68 34 <41> f6 c5 70 0f 85 5c 01 00 00 41 f6 c5 10 74 18 49 8b 06 8b 80 80
Aug 19 09:28:29 pve kernel: [777177.076425] RSP: 0018:ff594f7a00b24d20 EFLAGS: 00000093
Aug 19 09:28:29 pve kernel: [777177.076427] RAX: ff594f7a000c5000 RBX: 0000000000000004 RCX: ff3e37f40004b400
Aug 19 09:28:29 pve kernel: [777177.076428] RDX: ff3e37f40004b400 RSI: 0000000000000000 RDI: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076429] RBP: ff594f7a00b24dc8 R08: 000000000000014c R09: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076430] R10: 0000000000000010 R11: 0000000000000004 R12: 000000000000014c
Aug 19 09:28:29 pve kernel: [777177.076431] R13: 0000000000000000 R14: ff3e37f4001d4000 R15: ff3e37f40020d1c0
Aug 19 09:28:29 pve kernel: [777177.076432] FS: 0000000000000000(0000) GS:ff3e38327fa00000(0000) knlGS:0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076433] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 19 09:28:29 pve kernel: [777177.076434] CR2: 000000001f1a5080 CR3: 0000000128eb8002 CR4: 0000000000773ee0
Aug 19 09:28:29 pve kernel: [777177.076436] PKRU: 55555554
Aug 19 09:28:29 pve kernel: [777177.076436] Call Trace:
Aug 19 09:28:29 pve kernel: [777177.076437] <IRQ>
Aug 19 09:28:29 pve kernel: [777177.076438] ? enqueue_entity+0x17d/0x760
Aug 19 09:28:29 pve kernel: [777177.076446] qi_flush_iotlb+0x84/0xa0
Aug 19 09:28:29 pve kernel: [777177.076447] intel_flush_iotlb_all+0x59/0x160
Aug 19 09:28:29 pve kernel: [777177.076450] iommu_dma_flush_iotlb_all+0x1a/0x30
Aug 19 09:28:29 pve kernel: [777177.076452] iova_domain_flush+0x1b/0x30
Aug 19 09:28:29 pve kernel: [777177.076454] fq_flush_timeout+0x39/0xc0
Aug 19 09:28:29 pve kernel: [777177.076456] ? fq_ring_free+0x170/0x170
Aug 19 09:28:29 pve kernel: [777177.076458] call_timer_fn+0x29/0x120
Aug 19 09:28:29 pve kernel: [777177.076462] __run_timers.part.0+0x1e1/0x270
Aug 19 09:28:29 pve kernel: [777177.076463] ? ktime_get+0x43/0xc0
Aug 19 09:28:29 pve kernel: [777177.076465] ? lapic_next_deadline+0x2c/0x40
Aug 19 09:28:29 pve kernel: [777177.076469] ? clockevents_program_event+0xa8/0x130
Aug 19 09:28:29 pve kernel: [777177.076473] run_timer_softirq+0x2a/0x60
Aug 19 09:28:29 pve kernel: [777177.076474] __do_softirq+0xd6/0x2ea
Aug 19 09:28:29 pve kernel: [777177.076478] irq_exit_rcu+0x94/0xc0
Aug 19 09:28:29 pve kernel: [777177.076480] sysvec_apic_timer_interrupt+0x80/0x90
Aug 19 09:28:29 pve kernel: [777177.076483] </IRQ>
Aug 19 09:28:29 pve kernel: [777177.076484] <TASK>
Aug 19 09:28:29 pve kernel: [777177.076484] asm_sysvec_apic_timer_interrupt+0x1b/0x20
Aug 19 09:28:29 pve kernel: [777177.076486] RIP: 0010:cpuidle_enter_state+0xd9/0x620
Aug 19 09:28:29 pve kernel: [777177.076491] Code: 3d 04 78 5e 7c e8 37 36 6d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 78 43 6d ff 80 7d
d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
Aug 19 09:28:29 pve kernel: [777177.076492] RSP: 0018:ff594f7a003a7e38 EFLAGS: 00000246
Aug 19 09:28:29 pve kernel: [777177.076493] RAX: ff3e38327fa30bc0 RBX: ff8b4f79fa637d00 RCX: 0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076494] RDX: 0000000000016176 RSI: 00000000471c676c RDI: 0000000000000000
Aug 19 09:28:29 pve kernel: [777177.076495] RBP: ff594f7a003a7e88 R08: 0002c2d2999fe5b0 R09: 00000000000927c0
Aug 19 09:28:29 pve kernel: [777177.076496] R10: 0000000000000004 R11: 071c71c71c71c71c R12: ffffffff84ed4ca0
Aug 19 09:28:29 pve kernel: [777177.076497] R13: 0000000000000002 R14: 0000000000000002 R15: 0002c2d2999fe5b0
Aug 19 09:28:29 pve kernel: [777177.076499] ? cpuidle_enter_state+0xc8/0x620
Aug 19 09:28:29 pve kernel: [777177.076502] cpuidle_enter+0x2e/0x50
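As you can see, both lockups involve qi_submit_sync, iommu_dma_flus, and fq_flush_timeout, all of which belong to the IOMMU code, so my guess is that the problem is related to having the IOMMU enabled.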
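One way to back up this observation is to count how often each suspect function shows up in a saved kernel log. A rough sketch; `count_symbols` is a hypothetical helper, and it only matches the `name+0x...` form that appears in the traces above:

```shell
#!/bin/bash
# Read a kernel log on stdin and, for each function name given as an
# argument, print how many log lines mention it as "name+0x...".
count_symbols() {
    local input sym
    input=$(cat)
    for sym in "$@"; do
        # Require a non-identifier character before the name so that
        # "submit_sync" does not also match inside "qi_submit_sync".
        printf '%s %s\n' "$sym" \
            "$(grep -c -E "(^|[^[:alnum:]_])${sym}\+0x" <<<"$input")"
    done
}
```

For example, `journalctl -k -b -1 | count_symbols qi_submit_sync fq_flush_timeout` would scan the previous boot, assuming the journal is persistent; piping in a saved syslog file works the same way.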
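https://lists.linuxfoundation.org/pipermail/iommu/2015-January/011506.html I found this old mailing-list post, which mentions the qi_submit_sync() function. It suggests disabling CONFIG_NET_DMA, but I don't know how to do that, and I couldn't follow the later part of the post.
https://bbs.archlinux.org/viewtopic.php?id=284548 suggests setting CPU affinity so that at least one core is never used for USB serial communication, but someone replied that it didn't work for them. adomino-engineer says the bug was most likely introduced between kernel versions 5.11 and 5.13. A final patch is posted at the end, but I didn't understand what it does either.
https://lore.kernel.org/lkml/[email protected]/T/ mentions fq_flush_timeout. That thread is about a soft lockup, but I have had soft lockups before as well. I couldn't follow the discussion there at all.
On a PVE-related forum, someone said it is a memory problem.
Below are the approaches I plan to try. Because the problem is hard to reproduce, the machine is in daily use, and I have no spare machine, I haven't done any of them yet; these are only the directions I intend to explore.
I feel this one is the most likely to reveal the problem.
256 GB of RAM, so the test takes quite a long time.
Reference: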
Kernel v5.4, v5.5 and v5.6 have no lockups, while kernel v5.7, v5.8, v5.9, v5.10, v5.11, v5.12, v5.13, and 5.14 result in an immediate freeze.
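I really want to solve this problem, and I'd gladly send a 200-yuan red envelope, or USDT if you prefer, to whoever solves it. But in the Linux spirit of Open, Free, Share, if you won't take it, there's nothing I can do. Joking aside, this problem has troubled me for a long time and is still unsolved. I hope someone who knows the kernel can help. I'm willing to pay, provided the cause really is what you say it is and the problem ends up fixed.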
1
feather12315 2023-08-20 14:48:36 +08:00 via Android
Turn on the lock-checking config options and you'll see who is holding this lock.
I've forgotten the exact option; it's in the kernel hacking config menu. |
2
hanssx OP @feather12315 I don't follow; I searched and couldn't find anything.
|
3
1423 2023-08-20 15:35:02 +08:00 1
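If it were me, I would first try:
Reset the BIOS to default settings; confirm the BIOS is the latest version; try removing or swapping some of the hardware; maybe try replacing the PVE kernel or compiling your own, for example https://github.com/fabianishere/pve-edge-kernel |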
4
feather12315 2023-08-20 17:24:57 +08:00 via Android 1
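@hanssx #2 It's lockdep; see this:
https://www.cnblogs.com/still-smile/p/12410692.html An NMI hard lockup means the deadlock happened in interrupt context (outside interrupt context it would be an ordinary watchdog lockup). Use lockdep for deadlock detection: enable the option, and the dmesg log will show who holds the lock that causes the lockup. Then read the code to work out where to add the release. |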
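Before rebuilding anything, it may be worth checking whether the running kernel already has lock debugging compiled in. A sketch, assuming the distribution ships its kernel config as `/boot/config-$(uname -r)`, as Debian-based systems normally do; `check_lock_debug` is a made-up name, and the option list is my own pick of lockdep-related options, not something stated in this thread:

```shell
#!/bin/bash
# Report whether common lock-debugging options are enabled in a kernel
# config file. The optional first argument overrides the config path.
check_lock_debug() {
    local cfg="${1:-/boot/config-$(uname -r)}"
    local opt
    for opt in CONFIG_PROVE_LOCKING CONFIG_LOCKDEP \
               CONFIG_DEBUG_SPINLOCK CONFIG_LOCK_STAT; do
        if grep -q "^${opt}=y" "$cfg"; then
            echo "$opt=y"
        else
            echo "$opt is not set"
        fi
    done
}
```

Distribution kernels usually ship with these options off because of their overhead, in which case a custom kernel build is needed before lockdep can report anything.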
5
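hanssx OP @1423 I can't upgrade the BIOS because I'm using a stepping-5 ES CPU. It's probably not a CPU problem, though there is some chance it's a BIOS problem; I don't dare change settings carelessly in a Supermicro BIOS.
@feather12315 That gets complicated; that means modifying kernel code, right? I'll first see whether I can enable lockdep. |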
6
ManjusakaL 2023-08-20 19:49:42 +08:00
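Judging from the stack, a deadlock is a plausible suspect. You can follow the lockdep suggestion above: enable the build option, recompile the kernel, and run it.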
|
7
Jirajine 2023-08-20 21:52:36 +08:00
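First test with plain qemu, or at least libvirt, to see whether you can reproduce it.
If it is confirmed to be a bug introduced by a newer kernel version, you can use git bisect to locate the commit that introduced it, then try to fix it yourself or report it upstream directly. With the commit id attached, it usually gets fixed quickly. |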
8
zzz22333 2023-08-21 09:30:10 +08:00
I recently ran into a hard lockup problem too.
|
9
zlylong 2023-08-21 10:56:01 +08:00
I had a problem with the same symptoms before; it went away after I updated the GPU driver.
|
10
hanssx OP |
14
hanssx OP Two more related discussions:
Bug 1453 - CPU freezes on KVM guests during high IO load on host: https://bugzilla.proxmox.com/show_bug.cgi?id=1453 CPU hard lockup related to xhci/dma: https://bugzilla.kernel.org/show_bug.cgi?id=217242 |
15
hanssx OP The last discussion above appears to have been fixed in Linux kernel 6.4.
|
16
hanssx OP https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF mentions enabling Message Signaled Interrupts. I think that may match my case, because my freezes always happen while I'm using the Windows VM.
Host lockup after virtual machine shutdown This issue seems to primarily affect users running a Windows 10 guest and usually after the virtual machine has been run for a prolonged period of time: the host will experience multiple CPU core lockups (see [8]). To fix this try enabling Message Signal Interrupts on the GPU passed through to the guest. A good guide for how to do this can be found in [9]. You can also download this application for windows here [10] that should make the process easier. I'll try this after making a backup, or else shut down Win10 once a week. Someone on the PVE forum also used MSI; see https://forum.proxmox.com/threads/nvidia-gpu-passthrough-no-longer-working-with-windows-10-guest-vm.70477/ which says: "The issue appears to be due to Nvidia/Windows defaulting the GPU to line-based interrupts." I don't know whether mine is still using line-based interrupts. If all else fails, switch to Win11? |
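To see which interrupt mode a device is using, the capability lines in `lspci -vv` output can be inspected. A small sketch; `msi_state` is a hypothetical helper that only parses text on stdin, and whether the host-side view of a vfio-bound GPU reflects what the Windows guest has programmed is an assumption to verify rather than something confirmed here:

```shell
#!/bin/bash
# Read `lspci -vv` output for ONE device on stdin and report whether
# MSI or MSI-X is enabled ("Enable+") or neither.
msi_state() {
    awk '
        /MSI-X: Enable[+]/ { print "MSI-X enabled"; found = 1 }
        /MSI: Enable[+]/   { print "MSI enabled";   found = 1 }
        END { if (!found) print "line-based (no MSI or MSI-X enabled)" }
    '
}
```

A call might look like `lspci -vv -s 0000:01:00.0 | msi_state`, where the PCI address is a placeholder; reading the capability list generally requires root.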
17
zlylong 2023-08-22 11:13:28 +08:00
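@hanssx Mine is probably different from yours. I pass the GPU through to a CT, and the problem appeared after a kernel update. In the end I updated the GPU driver both on the host and inside the CT, and that fixed it.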
|
18
zlylong 2023-08-22 11:25:37 +08:00
I think you could use a process of elimination: first remove all the GPUs and observe for a while.
|
19
TESTFLIGHT2021 2023-08-22 12:40:56 +08:00
Why mess around with ES chips... just buy a retail one.
|
20
hanssx OP I have now removed pcie_acs_override=downstream,multifunction and reboot PVE once a week; it has been running normally for half a month.
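After editing `GRUB_CMDLINE_LINUX_DEFAULT`, the change only takes effect once `update-grub` has been run and the host rebooted, so it is worth confirming what the running kernel actually booted with. A minimal sketch; `has_cmdline_param` is a made-up helper name:

```shell
#!/bin/bash
# Succeed if a kernel parameter (bare flag or name=value) is present on
# the kernel command line. The optional second argument overrides the
# file to read (defaults to /proc/cmdline).
has_cmdline_param() {
    local name="$1" file="${2:-/proc/cmdline}"
    local -a words
    local word
    read -r -a words < "$file"
    for word in "${words[@]}"; do
        case "$word" in
            "$name"|"$name"=*) return 0 ;;
        esac
    done
    return 1
}
```

Something like `has_cmdline_param pcie_acs_override && echo "ACS override still active"` then shows whether the parameter is really gone.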
|
21
hanssx OP It has now been powered on and in use for 21 consecutive days without problems.
|
22
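hanssx OP On 2023-11-17 the Windows VM froze twice more. I checked, and it was still a LOCKUP, but this error was reported just before the problem: `igb 0000:04:00.0: Detected Tx Unit Hang`. Most search results blame the NIC. The suggested fixes are, first, updating the NIC driver, and second, changing the virtual NIC to VMware's vmxnet3.
```
Nov 17 00:51:37 pve kernel: [2426731.260861] igb 0000:04:00.0: Detected Tx Unit Hang
Nov 17 00:51:43 pve kernel: [2426737.376495] NETDEV WATCHDOG: eno1 (igb): transmit queue 1 timed out
Nov 17 00:51:43 pve kernel: [2426737.376512] WARNING: CPU: 51 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
Nov 17 14:56:55 pve kernel: [24279.339911] igb 0000:04:00.0: Detected Tx Unit Hang
Nov 17 14:57:03 pve kernel: [24286.443613] NETDEV WATCHDOG: eno1 (igb): transmit queue 0 timed out
Nov 17 14:57:03 pve kernel: [24286.443626] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
Nov 17 14:57:29 pve kernel: [24296.144959] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
```
Virtio and E1000 are two different network virtualization techniques: Virtio is paravirtualized, while E1000 is fully virtualized. With full virtualization, the guest is completely unaware that it is running in a virtual environment. With paravirtualization, the guest knows it is running in a virtual environment, so its I/O drivers can be modified and optimized accordingly, reducing the overhead and performance loss of translation in the virtualization layer. It follows that paravirtualization offers weaker isolation than full virtualization, and a faulty guest driver can cause problems on the host as well. That is why, when Virtio is used, a network problem in OpenWRT can take down networking for the whole Proxmox host. Beyond these two approaches there are more advanced virtualization techniques, such as SR-IOV.

Online there are roughly two approaches. The first is to turn off TSO and GSO: `ethtool -K eno1 tso off gso off`, which most people add to `/etc/network/interfaces`; see https://blog.pakro.top/2020/solve_the_problem_of_eno1_Detected_Hardware_Unit_Hang_under_PVE/ The second is to turn off TCP checksum offload. In my test, turning off TCP checksum offload automatically turned off TSO as well, while GSO did not seem to be turned off; see https://ovear.info/post/356 , https://www.sjfn.com/post/pve-solves-sudden-network-disconnection.html , https://gaomf.cn/2019/07/28/PVE_OpenWRT_Network_Broken/

For now I'm using the second approach and will add the first if it doesn't help: `sudo vim /etc/network/if-up.d/01-custom-ethtool`
```shell
#!/bin/bash
# Check if the interface is eno1
if [ "$IFACE" = "eno1" ]; then
    # Run ethtool to disable TX and RX checksum offload
    /sbin/ethtool -K eno1 tx off rx off
fi
```
`chmod +x /etc/network/if-up.d/01-custom-ethtool`

Later I plan to **pass a dedicated NIC through to the VM so that the PVE host is not affected**, or to try a fully virtualized option such as E1000 or vmxnet. I have now passed an I210 NIC through to Windows. virtio is fine for Linux guests, but I don't recommend it for Windows. |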
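To confirm which offloads are really off after the hook runs, the output of `ethtool -k` (lowercase `-k` shows the current settings) can be filtered. A small sketch; `offloads_still_on` is a hypothetical helper, and the four feature names are the ones relevant to the two approaches above:

```shell
#!/bin/bash
# Read `ethtool -k <iface>` output on stdin and print the checksum and
# segmentation offloads that are still switched on.
offloads_still_on() {
    awk -F': ' '
        /^(rx-checksumming|tx-checksumming|tcp-segmentation-offload|generic-segmentation-offload):/ {
            # $2 may carry a suffix such as "[fixed]"; keep only the state.
            split($2, state, " ")
            if (state[1] == "on") print $1
        }
    '
}
```

Running `ethtool -k eno1 | offloads_still_on` after bringing the interface up should print nothing if everything was disabled; a line such as `generic-segmentation-offload` would mean GSO is still on.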
23
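hanssx OP Continuing the updates; I'm keeping this as a PVE troubleshooting thread for those who come later.
[2023-12] After passing the NIC through, my main Win10 VM still froze. Event Viewer showed an error related to vioscsi. After searching, I updated the I210 NIC driver (via automatic update) and installed virtio-win-0.1.240. I'm still observing: no freezes for 3 days so far, Event Viewer shows only a few DistributedCOM 10016 and DNS resolution issues, and PVE has been running stably for 20+ days. |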
25
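hanssx OP I tried downgrading to the 204 driver and it still failed; after going back to 240 the problem got worse, crashing shortly after boot.
I have now abandoned the virtio mode for the SCSI controller and switched to the default (LSI 53C895A). The choice no longer matters, though, because I detached the disk and re-attached it as a SATA disk. A small tip: you can move the disk to another storage device to rule out problems such as bad sectors on the current one (mine looks fine according to SMART). Later I plan to pass a whole disk through, once 2 TB M.2 drives get cheap, perhaps around Chinese New Year. Also note that qemu-server=7.4-2 cannot back up a VM with a passed-through disk; qemu-server=7.4-3 is required. |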
26
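hanssx OP So far I see no vioscsi errors (there shouldn't be any now that it's a SATA disk), but the passed-through USB keyboard and mouse suddenly disconnect: they stop working and I can't operate anything. Win10 is still running and has not frozen, because I can still use the mouse in the online VNC console in the PVE web UI. So I have now set Display to none in the Win10 VM's PVE configuration.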
|
27
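hanssx OP Setting it to none doesn't work. Display is back to default now, and the Windows VM has run for 3 days without freezing. Win10 is my main passthrough machine, so it has a lot of devices passed through.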
|
28
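hanssx OP The USB keyboard and mouse have started disconnecting again, yet input still works over RDP from another computer or in the VNC console of the PVE web UI, which is odd. I made the following changes:
1. Display stays at default; none versus default doesn't seem to make much difference? Setting the two screens (the PVE web VNC and the monitor) to duplicate, extend, or show only on 1 or 2 doesn't make much difference either.
2. For USB passthrough, select by USB vendor/device ID and untick USB3.
3. Lower the monitor refresh rate to 60 Hz or below; it was something over 160 before, no wonder it lagged!
4. After the machine boots, eject the virtio balloon driver the same way you would eject a USB stick; this one is still to be decided! |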
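Change 2 can also be made from the host shell instead of the web UI. A sketch that only builds the command rather than running it; `usb_passthrough_cmd` is a made-up helper, the VM ID and USB slot are placeholders, and the `host=<vendor>:<product>,usb3=0` form is the `qm set` USB option as I understand it, so check it against `man qm` first:

```shell
#!/bin/bash
# Given a VM ID, a USB slot number and one line of `lsusb` output, print
# the `qm set` command that passes the device through by vendor:product
# ID with USB3 turned off. Nothing is executed.
usb_passthrough_cmd() {
    local vmid="$1" slot="$2" line="$3" id
    # lsusb lines look like: "Bus 001 Device 004: ID 046d:c52b Logitech ..."
    id=$(sed -n 's/.*ID \([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\).*/\1/p' <<<"$line")
    [ -n "$id" ] || return 1
    echo "qm set $vmid -usb$slot host=$id,usb3=0"
}
```

Passing by vendor/device ID rather than by port means the mapping follows the device if it is plugged into a different port.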
29
hanssx OP The resolution and refresh rate can be changed in the NVIDIA Control Panel that ships with the NVIDIA driver.
|
30
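hanssx OP Also, if the PVE Win10 VM stops when it hibernates and then cannot resume, so that you have to power it on again, the cause is that Primary GPU was ticked for the passed-through GPU. Untick it and keep only ROM-BAR and PCI-E selected.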
|
31
kiah 7 days ago
So how was the Windows 10 problem finally solved?
|
34
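hanssx OP @kiah At the moment my freezes mainly happen while I'm using the VM that has the keyboard and mouse passed through and it is running many tasks: first the keyboard and mouse freeze, then the PVE console gradually reports hard lockups. So I now try to avoid heavy workloads on it, such as gaming.
Also, is your CPU an ES one? It is most likely a Linux kernel problem, though hardware is also a possibility. |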
35
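hanssx OP Later I plan to install ESXi. It has a closed-source kernel, and a friend says it is more stable than PVE, but I haven't had time to migrate; maybe later.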
|