Message ID | 20240130005923.983026-1-alan.brady@intel.com (mailing list archive)
---|---
Series | idpf: refactor virtchnl messages
From: Alan Brady <alan.brady@intel.com>
Date: Mon, 29 Jan 2024 16:59:16 -0800

> The motivation for this series has two primary goals. We want to enable
> support of multiple simultaneous messages and make the channel more
> robust. The way it works right now, the driver can only send and receive
> a single message at a time and if something goes really wrong, it can
> lead to data corruption and strange bugs.

Have you tested v3?
I have this on my system (net-next + your series), no other patches applied:

[alobakin@GK3153-KR1-CYP-38282-U39-ETH1 ~]$ sudo modprobe idpf
[   89.785966] idpf 0000:83:00.0: Device HW Reset initiated
[alobakin@GK3153-KR1-CYP-38282-U39-ETH1 ~]$
[   90.241658] BUG: unable to handle page fault for address: ff8e1df482000000
[   90.241704] #PF: supervisor write access in kernel mode
[   90.241728] #PF: error_code(0x0002) - not-present page
[   90.241751] PGD 107ffc8067 P4D 107ffc7067 PUD 207fdc8067 PMD 0
[   90.241782] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   90.241805] CPU: 32 PID: 847 Comm: kworker/32:1 Kdump: loaded Not tainted 6.8.0-rc1-libeth+ #1
[   90.241841] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0008.2305172341 05/17/2023
[   90.241879] Workqueue: idpf-0000:83:00.0-vc_ev idpf_vc_event_task [idpf]
[   90.241932] RIP: 0010:__free_pages_ok+0x338/0x4f0
[   90.241962] Code: e6 06 45 31 e4 41 bd 40 00 00 00 45 85 ff 74 13 4b 8d 34 28 4c 89 c7 e8 36 97 00 00 8b 74 24 04 41 01 c4 66 90 4c 8b 44 24 08 <43> 81 24 28 00 00 80 c0 49 83 c5 40 4d 39 ee 75 d0 e9 7c fd ff ff
[   90.242027] RSP: 0018:ff3f281b098d7b78 EFLAGS: 00010246
[   90.242053] RAX: 0000000000100000 RBX: ff8e1df400000000 RCX: 0000000000000034
[   90.242084] RDX: 0000000000000d80 RSI: 0000000000000034 RDI: ff8e1df481d980c0
[   90.242115] RBP: 0000000000000000 R08: ff8e1df481d980c0 R09: 0000000000000000
[   90.242145] R10: ff25062537f9fe00 R11: 0000000000000020 R12: 0000000000000000
[   90.242174] R13: 0000000000267f40 R14: 0000000004000000 R15: 0000000000000000
[   90.242206] FS:  0000000000000000(0000) GS:ff2506253ec00000(0000) knlGS:0000000000000000
[   90.242240] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.242266] CR2: ff8e1df482000000 CR3: 000000207d420006 CR4: 0000000000771ef0
[   90.242297] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   90.242327] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   90.242357] PKRU: 55555554
[   90.242378] Call Trace:
[   90.242393]  <TASK>
[   90.242410]  ? __die_body+0x68/0xb0
[   90.242437]  ? page_fault_oops+0x3a6/0x400
[   90.242467]  ? exc_page_fault+0xb2/0x1b0
[   90.242496]  ? asm_exc_page_fault+0x26/0x30
[   90.242527]  ? __free_pages_ok+0x338/0x4f0
[   90.242554]  idpf_mb_clean+0xc1/0x110 [idpf]
[   90.242600]  idpf_send_mb_msg+0x50/0x1b0 [idpf]
[   90.242643]  idpf_vc_xn_exec+0x189/0x350 [idpf]
[   90.242688]  idpf_vc_core_init+0x32c/0x6d0 [idpf]
[   90.242735]  idpf_vc_event_task+0x2da/0x390 [idpf]
[   90.242779]  process_scheduled_works+0x251/0x460
[   90.242807]  worker_thread+0x21c/0x2d0
[   90.242830]  ? __pfx_worker_thread+0x10/0x10
[   90.242855]  kthread+0xe8/0x110
[   90.242878]  ? __pfx_kthread+0x10/0x10
[   90.242902]  ret_from_fork+0x37/0x50
[   90.242925]  ? __pfx_kthread+0x10/0x10
[   90.243802]  ret_from_fork_asm+0x1b/0x30
[   90.244598]  </TASK>
[   90.245361] Modules linked in: idpf libeth rpcrdma rdma_cm ib_cm iw_cm ib_core qrtr rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc vfat fat kvm irqbypass rapl ipmi_ssif iTCO_wdt intel_cstate intel_pmc_bxt iTCO_vendor_support dax_hmem cxl_acpi nfsd ioatdma intel_uncore mei_me isst_if_mmio cxl_core i2c_i801 isst_if_mbox_pci mei acpi_ipmi pcspkr intel_vsec isst_if_common i2c_smbus joydev dca ipmi_si intel_pch_thermal ipmi_devintf auth_rpcgss ipmi_msghandler acpi_power_meter acpi_pad nfs_acl lockd sunrpc loop grace zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel nvme bnxt_en sha512_ssse3 ast nvme_core sha256_ssse3 sha1_ssse3 i2c_algo_bit wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[   90.252381] CR2: ff8e1df482000000
[   90.253263] ---[ end trace 0000000000000000 ]---
[   90.314201] pstore: backend (erst) writing error (-28)
[   90.314686] RIP: 0010:__free_pages_ok+0x338/0x4f0
[   90.314970] Code: e6 06 45 31 e4 41 bd 40 00 00 00 45 85 ff 74 13 4b 8d 34 28 4c 89 c7 e8 36 97 00 00 8b 74 24 04 41 01 c4 66 90 4c 8b 44 24 08 <43> 81 24 28 00 00 80 c0 49 83 c5 40 4d 39 ee 75 d0 e9 7c fd ff ff
[   90.315511] RSP: 0018:ff3f281b098d7b78 EFLAGS: 00010246
[   90.315778] RAX: 0000000000100000 RBX: ff8e1df400000000 RCX: 0000000000000034
[   90.316043] RDX: 0000000000000d80 RSI: 0000000000000034 RDI: ff8e1df481d980c0
[   90.316308] RBP: 0000000000000000 R08: ff8e1df481d980c0 R09: 0000000000000000
[   90.316573] R10: ff25062537f9fe00 R11: 0000000000000020 R12: 0000000000000000
[   90.316838] R13: 0000000000267f40 R14: 0000000004000000 R15: 0000000000000000
[   90.317104] FS:  0000000000000000(0000) GS:ff2506253ec00000(0000) knlGS:0000000000000000
[   90.317368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.317631] CR2: ff8e1df482000000 CR3: 000000207d420006 CR4: 0000000000771ef0
[   90.317897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   90.318163] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   90.318427] PKRU: 55555554
[   90.318687] note: kworker/32:1[847] exited with irqs disabled
[   90.319202] BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]

Thanks,
Olek
On 1/31/2024 10:33 AM, Alexander Lobakin wrote:
> From: Alan Brady <alan.brady@intel.com>
> Date: Mon, 29 Jan 2024 16:59:16 -0800
>
>> The motivation for this series has two primary goals. We want to enable
>> support of multiple simultaneous messages and make the channel more
>> robust. The way it works right now, the driver can only send and receive
>> a single message at a time and if something goes really wrong, it can
>> lead to data corruption and strange bugs.
>
> Have you tested v3?
> I have this on my system (net-next + your series), no other patches applied:

Mea culpa, I have made a grave error. We did test the vast majority of the changes and they were all fine. At the last minute I noticed it looked like we could clean up idpf_send_mb_msg as well. I was confident the other changes weren't a problem, so this one seemed innocent to me. I was very wrong.

There are some complications with how idpf_mb_clean works that I think need to be worked out separately before we can make this change to idpf_send_mb_msg. I would like to revert the change to idpf_send_mb_msg I made in v3 and instead investigate it further in a follow-up series that attempts to address all cases where automatic variables would be a good idea.

Apologies and thanks.

-Alan

> [...]
>
> Thanks,
> Olek