diff mbox series

[1/1] virtio_net: Add timeout handler to avoid kernel hang

Message ID 20240115012918.3081203-1-yanjun.zhu@intel.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [1/1] virtio_net: Add timeout handler to avoid kernel hang | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1092 this patch: 1092
netdev/cc_maintainers success CCed 0 of 0 maintainers
netdev/build_clang success Errors and warnings before: 1107 this patch: 1107
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1109 this patch: 1109
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest pending net-next-2024-01-16--12-00

Commit Message

Zhu Yanjun Jan. 15, 2024, 1:29 a.m. UTC
From: Zhu Yanjun <yanjun.zhu@linux.dev>

Some devices emulate virtio_net hardware. When the virtio_net
driver sends commands to such an emulated device, the device
normally needs time to respond, and sometimes that time is very
long. When it is, the soft lockup below is triggered and the
whole system hangs.
Similar problems also occur with Intel and Mellanox NICs, and a
similar solution is borrowed from their drivers: a timeout value
is added and set as large as possible, so that the driver waits
as long as it reasonably can for a response from the hardware.

"
[  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
[  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  213.796194] irq event stamp: 67740
[  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
[  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
[  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
[  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
[  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
[  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
[  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
[  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
[  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
[  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
[  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
[  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
[  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[  213.796247] PKRU: 55555554
[  213.796249] Call Trace:
[  213.796250]  <IRQ>
[  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
[  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
[  213.796269]  ? hrtimer_interrupt+0xf8/0x230
[  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
[  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[  213.796282]  </IRQ>
[  213.796284]  <TASK>
[  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
[  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
[  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
[  213.796328]  virtio_dev_probe+0x195/0x230
[  213.796333]  really_probe+0x19f/0x400
[  213.796338]  ? __pfx___driver_attach+0x10/0x10
[  213.796340]  __driver_probe_device+0x78/0x160
[  213.796343]  driver_probe_device+0x1f/0x90
[  213.796346]  __driver_attach+0xd6/0x1d0
[  213.796349]  bus_for_each_dev+0x8c/0xe0
[  213.796355]  bus_add_driver+0x119/0x220
[  213.796359]  driver_register+0x59/0x100
[  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
[  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
[  213.796375]  do_one_initcall+0x6f/0x380
[  213.796384]  do_init_module+0x60/0x240
[  213.796388]  init_module_from_file+0x86/0xc0
[  213.796396]  idempotent_init_module+0x129/0x2c0
[  213.796406]  __x64_sys_finit_module+0x5e/0xb0
[  213.796409]  do_syscall_64+0x60/0xe0
[  213.796415]  ? do_syscall_64+0x6f/0xe0
[  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
[  213.796424]  ? do_syscall_64+0x6f/0xe0
[  213.796427]  ? do_syscall_64+0x6f/0xe0
[  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  213.796435] RIP: 0033:0x7f7758f279cd
[  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
[  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
[  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
[  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
[  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
[  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
[  213.796482]  </TASK>
"

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
---
 drivers/net/virtio_net.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Jason Wang Jan. 15, 2024, 2:20 a.m. UTC | #1
On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun <yanjun.zhu@intel.com> wrote:
>
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> Some devices emulate virtio_net hardware. When the virtio_net
> driver sends commands to such an emulated device, the device
> normally needs time to respond, and sometimes that time is very
> long. When it is, the soft lockup below is triggered and the
> whole system hangs.
> Similar problems also occur with Intel and Mellanox NICs, and a
> similar solution is borrowed from their drivers: a timeout value
> is added and set as large as possible, so that the driver waits
> as long as it reasonably can for a response from the hardware.
>
> "
> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> [  213.796194] irq event stamp: 67740
> [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
> [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
> [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
> [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
> [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
> [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
> [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
> [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
> [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [  213.796247] PKRU: 55555554
> [  213.796249] Call Trace:
> [  213.796250]  <IRQ>
> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [  213.796282]  </IRQ>
> [  213.796284]  <TASK>
> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> [  213.796328]  virtio_dev_probe+0x195/0x230
> [  213.796333]  really_probe+0x19f/0x400
> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> [  213.796340]  __driver_probe_device+0x78/0x160
> [  213.796343]  driver_probe_device+0x1f/0x90
> [  213.796346]  __driver_attach+0xd6/0x1d0
> [  213.796349]  bus_for_each_dev+0x8c/0xe0
> [  213.796355]  bus_add_driver+0x119/0x220
> [  213.796359]  driver_register+0x59/0x100
> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> [  213.796375]  do_one_initcall+0x6f/0x380
> [  213.796384]  do_init_module+0x60/0x240
> [  213.796388]  init_module_from_file+0x86/0xc0
> [  213.796396]  idempotent_init_module+0x129/0x2c0
> [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
> [  213.796409]  do_syscall_64+0x60/0xe0
> [  213.796415]  ? do_syscall_64+0x6f/0xe0
> [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
> [  213.796424]  ? do_syscall_64+0x6f/0xe0
> [  213.796427]  ? do_syscall_64+0x6f/0xe0
> [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [  213.796435] RIP: 0033:0x7f7758f279cd
> [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
> [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
> [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
> [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
> [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
> [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
> [  213.796482]  </TASK>
> "
>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> ---
>  drivers/net/virtio_net.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 51b1868d2f22..28b7dd917a43 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  {
>         struct scatterlist *sgs[4], hdr, stat;
>         unsigned out_num = 0, tmp;
> -       int ret;
> +       int ret, timeout = 200;

Is there any reason we chose this value, and how can we know it works
for all types of devices?

An easier way is to use cond_resched(), but it may have side effects
as well [1]. Still, it seems less intrusive than the proposal here?

Thanks

[1] https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60297.html
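
For reference, the cond_resched() variant is a one-line change to the
same polling loop. A sketch only, not a tested patch; the loop runs in
process context, so sleeping there is permitted:

	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
	       !virtqueue_is_broken(vi->cvq))
		cond_resched();	/* yield the CPU instead of busy-spinning */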

>
>         /* Caller should know better */
>         BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>          * into the hypervisor, so the request should be handled immediately.
>          */
>         while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> -              !virtqueue_is_broken(vi->cvq))
> +              !virtqueue_is_broken(vi->cvq)) {
> +               if (timeout)
> +                       timeout--;
> +               else
> +                       return false; /* long time no response */
> +
>                 cpu_relax();
> +       }
>
>         return vi->ctrl->status == VIRTIO_NET_OK;
>  }
> --
> 2.42.0
>
Zhu Yanjun Jan. 15, 2024, 10:25 a.m. UTC | #2
On 2024/1/15 10:20, Jason Wang wrote:
> On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun <yanjun.zhu@intel.com> wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> Some devices emulate virtio_net hardware. When the virtio_net
>> driver sends commands to such an emulated device, the device
>> normally needs time to respond, and sometimes that time is very
>> long. When it is, the soft lockup below is triggered and the
>> whole system hangs.
>> Similar problems also occur with Intel and Mellanox NICs, and a
>> similar solution is borrowed from their drivers: a timeout value
>> is added and set as large as possible, so that the driver waits
>> as long as it reasonably can for a response from the hardware.
>>
>> "
>> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
>> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
>> [  213.796194] irq event stamp: 67740
>> [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
>> [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>> [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
>> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
>> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
>> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
>> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
>> [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
>> [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
>> [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
>> [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
>> [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
>> [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
>> [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [  213.796247] PKRU: 55555554
>> [  213.796249] Call Trace:
>> [  213.796250]  <IRQ>
>> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
>> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
>> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
>> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
>> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
>> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
>> [  213.796282]  </IRQ>
>> [  213.796284]  <TASK>
>> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
>> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
>> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
>> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
>> [  213.796328]  virtio_dev_probe+0x195/0x230
>> [  213.796333]  really_probe+0x19f/0x400
>> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
>> [  213.796340]  __driver_probe_device+0x78/0x160
>> [  213.796343]  driver_probe_device+0x1f/0x90
>> [  213.796346]  __driver_attach+0xd6/0x1d0
>> [  213.796349]  bus_for_each_dev+0x8c/0xe0
>> [  213.796355]  bus_add_driver+0x119/0x220
>> [  213.796359]  driver_register+0x59/0x100
>> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
>> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
>> [  213.796375]  do_one_initcall+0x6f/0x380
>> [  213.796384]  do_init_module+0x60/0x240
>> [  213.796388]  init_module_from_file+0x86/0xc0
>> [  213.796396]  idempotent_init_module+0x129/0x2c0
>> [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
>> [  213.796409]  do_syscall_64+0x60/0xe0
>> [  213.796415]  ? do_syscall_64+0x6f/0xe0
>> [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
>> [  213.796424]  ? do_syscall_64+0x6f/0xe0
>> [  213.796427]  ? do_syscall_64+0x6f/0xe0
>> [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [  213.796435] RIP: 0033:0x7f7758f279cd
>> [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
>> [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>> [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
>> [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
>> [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
>> [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
>> [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
>> [  213.796482]  </TASK>
>> "
>>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>>   drivers/net/virtio_net.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 51b1868d2f22..28b7dd917a43 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>>   {
>>          struct scatterlist *sgs[4], hdr, stat;
>>          unsigned out_num = 0, tmp;
>> -       int ret;
>> +       int ret, timeout = 200;
> Is there any reason we chose this value, and how can we know it works
> for all types of devices?

Sorry, resending this because the first copy was rejected by netdev.

As I mentioned in the commit log, a similar problem also occurs in the
Intel and Mellanox NIC drivers.

This commit borrows the solution used in the Intel NIC driver, so the
timeout value of 200 also comes from there.
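
For comparison, retry counters in such drivers are typically paired
with a fixed per-iteration delay, which is what gives the counter a
defined wall-clock bound. A generic sketch with assumed constants, not
code taken from any specific driver:

	int i;

	for (i = 0; i < 200; i++) {
		if (virtqueue_get_buf(vi->cvq, &tmp))
			break;			/* device responded */
		if (virtqueue_is_broken(vi->cvq))
			return false;
		usleep_range(1000, 2000);	/* caps the wait at ~200-400 ms */
	}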

>
> An easier way is to use cond_resched(), but it may have side effects
> as well [1]. Still, it seems less intrusive than the proposal here?
Thanks a lot for your suggestions. I have tested the commits from the
link
https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60297.html.

Because the virtio_net driver spins waiting for the hardware's
response, the module cannot be unloaded, and the kernel still hangs
when running "ip link" or when unloading the virtio_net module.

Zhu Yanjun
>
> Thanks
>
> [1] https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60297.html
>
>>          /* Caller should know better */
>>          BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>>           * into the hypervisor, so the request should be handled immediately.
>>           */
>>          while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>> -              !virtqueue_is_broken(vi->cvq))
>> +              !virtqueue_is_broken(vi->cvq)) {
>> +               if (timeout)
>> +                       timeout--;
>> +               else
>> +                       return false; /* long time no response */
>> +
>>                  cpu_relax();
>> +       }
>>
>>          return vi->ctrl->status == VIRTIO_NET_OK;
>>   }
>> --
>> 2.42.0
>>
Paolo Abeni Jan. 16, 2024, 12:04 p.m. UTC | #3
On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> Some devices emulate virtio_net hardware. When the virtio_net
> driver sends commands to such an emulated device, the device
> normally needs time to respond, and sometimes that time is very
> long. When it is, the soft lockup below is triggered and the
> whole system hangs.
> Similar problems also occur with Intel and Mellanox NICs, and a
> similar solution is borrowed from their drivers: a timeout value
> is added and set as large as possible, so that the driver waits
> as long as it reasonably can for a response from the hardware.
> 
> "
> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> [  213.796194] irq event stamp: 67740
> [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
> [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
> [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
> [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
> [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
> [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
> [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
> [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
> [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> [  213.796247] PKRU: 55555554
> [  213.796249] Call Trace:
> [  213.796250]  <IRQ>
> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> [  213.796282]  </IRQ>
> [  213.796284]  <TASK>
> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> [  213.796328]  virtio_dev_probe+0x195/0x230
> [  213.796333]  really_probe+0x19f/0x400
> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> [  213.796340]  __driver_probe_device+0x78/0x160
> [  213.796343]  driver_probe_device+0x1f/0x90
> [  213.796346]  __driver_attach+0xd6/0x1d0
> [  213.796349]  bus_for_each_dev+0x8c/0xe0
> [  213.796355]  bus_add_driver+0x119/0x220
> [  213.796359]  driver_register+0x59/0x100
> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> [  213.796375]  do_one_initcall+0x6f/0x380
> [  213.796384]  do_init_module+0x60/0x240
> [  213.796388]  init_module_from_file+0x86/0xc0
> [  213.796396]  idempotent_init_module+0x129/0x2c0
> [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
> [  213.796409]  do_syscall_64+0x60/0xe0
> [  213.796415]  ? do_syscall_64+0x6f/0xe0
> [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
> [  213.796424]  ? do_syscall_64+0x6f/0xe0
> [  213.796427]  ? do_syscall_64+0x6f/0xe0
> [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> [  213.796435] RIP: 0033:0x7f7758f279cd
> [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
> [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
> [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
> [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
> [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
> [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
> [  213.796482]  </TASK>
> "
> 
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> ---
>  drivers/net/virtio_net.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 51b1868d2f22..28b7dd917a43 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  {
>  	struct scatterlist *sgs[4], hdr, stat;
>  	unsigned out_num = 0, tmp;
> -	int ret;
> +	int ret, timeout = 200;
>  
>  	/* Caller should know better */
>  	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  	 * into the hypervisor, so the request should be handled immediately.
>  	 */
>  	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> -	       !virtqueue_is_broken(vi->cvq))
> +	       !virtqueue_is_broken(vi->cvq)) {
> +		if (timeout)
> +			timeout--;

This is not really a timeout, just a loop counter. 200 iterations could
be a very short time on reasonable H/W. I guess this avoids the soft
lockup, but it possibly (likely?) breaks the functionality when we need
to loop for some non-negligible time.

I fear we need a more complex solution, as mentioned by Michael in the
thread you quoted.

Cheers,

Paolo
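
A wall-clock bound of this kind could be built on jiffies instead of a
bare counter; a minimal sketch combining it with cond_resched() so the
watchdog stays quiet, where the 30-second budget is an assumption, not
a value agreed in this thread:

	unsigned long end = jiffies + 30 * HZ;

	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
	       !virtqueue_is_broken(vi->cvq)) {
		if (time_after(jiffies, end))
			return false;	/* device never responded */
		cond_resched();		/* yield so the watchdog is not tripped */
	}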
Zhu Yanjun Jan. 18, 2024, 12:01 p.m. UTC | #4
On 2024/1/16 20:04, Paolo Abeni wrote:
> On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> Some devices emulate virtio_net hardware. When the virtio_net
>> driver sends commands to such an emulated device, the device
>> normally needs time to respond, and sometimes that time is very
>> long. When it is, the soft lockup below is triggered and the
>> whole system hangs.
>> Similar problems also occur with Intel and Mellanox NICs, and a
>> similar solution is borrowed from their drivers: a timeout value
>> is added and set as large as possible, so that the driver waits
>> as long as it reasonably can for a response from the hardware.
>>
>> "
>> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
>> [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
>> [  213.796194] irq event stamp: 67740
>> [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
>> [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>> [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
>> [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
>> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
>> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
>> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
>> [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
>> [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
>> [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>> [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
>> [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
>> [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
>> [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
>> [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>> [  213.796247] PKRU: 55555554
>> [  213.796249] Call Trace:
>> [  213.796250]  <IRQ>
>> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
>> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
>> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
>> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
>> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
>> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
>> [  213.796282]  </IRQ>
>> [  213.796284]  <TASK>
>> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
>> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
>> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
>> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
>> [  213.796328]  virtio_dev_probe+0x195/0x230
>> [  213.796333]  really_probe+0x19f/0x400
>> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
>> [  213.796340]  __driver_probe_device+0x78/0x160
>> [  213.796343]  driver_probe_device+0x1f/0x90
>> [  213.796346]  __driver_attach+0xd6/0x1d0
>> [  213.796349]  bus_for_each_dev+0x8c/0xe0
>> [  213.796355]  bus_add_driver+0x119/0x220
>> [  213.796359]  driver_register+0x59/0x100
>> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
>> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
>> [  213.796375]  do_one_initcall+0x6f/0x380
>> [  213.796384]  do_init_module+0x60/0x240
>> [  213.796388]  init_module_from_file+0x86/0xc0
>> [  213.796396]  idempotent_init_module+0x129/0x2c0
>> [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
>> [  213.796409]  do_syscall_64+0x60/0xe0
>> [  213.796415]  ? do_syscall_64+0x6f/0xe0
>> [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
>> [  213.796424]  ? do_syscall_64+0x6f/0xe0
>> [  213.796427]  ? do_syscall_64+0x6f/0xe0
>> [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [  213.796435] RIP: 0033:0x7f7758f279cd
>> [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
>> [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>> [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
>> [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
>> [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
>> [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
>> [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
>> [  213.796482]  </TASK>
>> "
>>
>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>> ---
>>   drivers/net/virtio_net.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 51b1868d2f22..28b7dd917a43 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>>   {
>>   	struct scatterlist *sgs[4], hdr, stat;
>>   	unsigned out_num = 0, tmp;
>> -	int ret;
>> +	int ret, timeout = 200;
>>   
>>   	/* Caller should know better */
>>   	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>>   	 * into the hypervisor, so the request should be handled immediately.
>>   	 */
>>   	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>> -	       !virtqueue_is_broken(vi->cvq))
>> +	       !virtqueue_is_broken(vi->cvq)) {
>> +		if (timeout)
>> +			timeout--;
> This is not really a timeout, just a loop counter. 200 iterations could
> be a very short time on reasonable H/W. I guess this avoids the soft
> lockup, but it possibly (likely?) breaks the functionality when we need
> to loop for some non-negligible time.
>
> I fear we need a more complex solution, as mentioned by Michael in the
> thread you quoted.

Got it. I also look forward to a more complex solution to this problem.

Zhu Yanjun

Michael S. Tsirkin Jan. 18, 2024, 1:14 p.m. UTC | #5
On Tue, Jan 16, 2024 at 01:04:49PM +0100, Paolo Abeni wrote:
> On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
> > From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > 
> > Some devices emulate virtio_net hardware. When the virtio_net
> > driver sends commands to such an emulated device, the device
> > normally needs time to respond, and sometimes that time is very
> > long. When it is, the soft lockup below is triggered and the
> > whole system hangs.
> > Similar problems also occur with Intel and Mellanox NICs, and a
> > similar solution is borrowed from their drivers: a timeout value
> > is added and set as large as possible, so that the driver waits
> > as long as it reasonably can for a response from the hardware.
> > 
> > "
> > [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
> > [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> > [  213.796194] irq event stamp: 67740
> > [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
> > [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> > [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> > [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
> > [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> > [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> > [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> > [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
> > [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
> > [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
> > [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
> > [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
> > [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
> > [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
> > [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> > [  213.796247] PKRU: 55555554
> > [  213.796249] Call Trace:
> > [  213.796250]  <IRQ>
> > [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> > [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> > [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> > [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> > [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> > [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> > [  213.796282]  </IRQ>
> > [  213.796284]  <TASK>
> > [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> > [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> > [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> > [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> > [  213.796328]  virtio_dev_probe+0x195/0x230
> > [  213.796333]  really_probe+0x19f/0x400
> > [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> > [  213.796340]  __driver_probe_device+0x78/0x160
> > [  213.796343]  driver_probe_device+0x1f/0x90
> > [  213.796346]  __driver_attach+0xd6/0x1d0
> > [  213.796349]  bus_for_each_dev+0x8c/0xe0
> > [  213.796355]  bus_add_driver+0x119/0x220
> > [  213.796359]  driver_register+0x59/0x100
> > [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> > [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> > [  213.796375]  do_one_initcall+0x6f/0x380
> > [  213.796384]  do_init_module+0x60/0x240
> > [  213.796388]  init_module_from_file+0x86/0xc0
> > [  213.796396]  idempotent_init_module+0x129/0x2c0
> > [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
> > [  213.796409]  do_syscall_64+0x60/0xe0
> > [  213.796415]  ? do_syscall_64+0x6f/0xe0
> > [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
> > [  213.796424]  ? do_syscall_64+0x6f/0xe0
> > [  213.796427]  ? do_syscall_64+0x6f/0xe0
> > [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > [  213.796435] RIP: 0033:0x7f7758f279cd
> > [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
> > [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
> > [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
> > [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
> > [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
> > [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
> > [  213.796482]  </TASK>
> > "
> > 
> > Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> > ---
> >  drivers/net/virtio_net.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 51b1868d2f22..28b7dd917a43 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> >  {
> >  	struct scatterlist *sgs[4], hdr, stat;
> >  	unsigned out_num = 0, tmp;
> > -	int ret;
> > +	int ret, timeout = 200;
> >  
> >  	/* Caller should know better */
> >  	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> > @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> >  	 * into the hypervisor, so the request should be handled immediately.
> >  	 */
> >  	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > -	       !virtqueue_is_broken(vi->cvq))
> > +	       !virtqueue_is_broken(vi->cvq)) {
> > +		if (timeout)
> > +			timeout--;
> 
> This is not really a timeout, just a loop counter. 200 iterations could
> be a very short time on reasonable H/W. I guess this avoids the soft
> lockup, but it possibly (likely?) breaks the functionality when we need
> to loop for some non-negligible time.
>
> I fear we need a more complex solution, as mentioned by Michael in the
> thread you quoted.
> 
> Cheers,
> 
> Paolo

So a bunch of drivers just queue the request on a workqueue. The
problem is, errors are not reported then. I really feel this is
something the netdev core should support better: return a "deferred"
status and have the netdev core wait outside the lock on a waitqueue.
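
The workqueue pattern referred to here has roughly the following
shape; the names are hypothetical, and virtnet_send_command() keeps
its existing signature:

	struct virtnet_cvq_cmd {
		struct work_struct work;
		struct virtnet_info *vi;
		u8 class, cmd;
		struct scatterlist *out;
	};

	static void virtnet_cvq_work(struct work_struct *work)
	{
		struct virtnet_cvq_cmd *c =
			container_of(work, struct virtnet_cvq_cmd, work);

		/* Process context, so sleeping while polling is fine,
		 * but a failure can only be logged here, not returned
		 * to whoever scheduled the work -- the drawback noted
		 * above.
		 */
		if (!virtnet_send_command(c->vi, c->class, c->cmd, c->out))
			dev_warn(&c->vi->vdev->dev,
				 "ctrl cmd %u/%u failed\n", c->class, c->cmd);
		kfree(c);
	}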
Xuan Zhuo Jan. 19, 2024, 1:42 a.m. UTC | #6
On Thu, 18 Jan 2024 08:14:21 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Jan 16, 2024 at 01:04:49PM +0100, Paolo Abeni wrote:
> > On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
> > > From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > >
> > > Some devices emulate virtio_net hardware. When the virtio_net
> > > driver sends commands to such an emulated device, the device
> > > normally needs time to respond, and sometimes that time is very
> > > long. When it is, the soft lockup below is triggered and the
> > > whole system hangs.
> > > Similar problems also occur with Intel and Mellanox NICs, and a
> > > similar solution is borrowed from their drivers: a timeout value
> > > is added and set as large as possible, so that the driver waits
> > > as long as it reasonably can for a response from the hardware.
> > >
> > > "
> > > [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! [(udev-worker):3157]
> > > [  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
> > > [  213.796194] irq event stamp: 67740
> > > [  213.796195] hardirqs last  enabled at (67739): [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > > [  213.796203] hardirqs last disabled at (67740): [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
> > > [  213.796208] softirqs last  enabled at (67686): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> > > [  213.796214] softirqs last disabled at (67681): [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
> > > [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 6.7.0+ #9
> > > [  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 11/14/2022
> > > [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
> > > [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
> > > [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
> > > [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 0000000000000001
> > > [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
> > > [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > > [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 0000000000000002
> > > [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: ff2f1509534f3000
> > > [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) knlGS:0000000000000000
> > > [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 0000000000f71ef0
> > > [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> > > [  213.796247] PKRU: 55555554
> > > [  213.796249] Call Trace:
> > > [  213.796250]  <IRQ>
> > > [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
> > > [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
> > > [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
> > > [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
> > > [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
> > > [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
> > > [  213.796282]  </IRQ>
> > > [  213.796284]  <TASK>
> > > [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > > [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
> > > [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
> > > [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
> > > [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
> > > [  213.796328]  virtio_dev_probe+0x195/0x230
> > > [  213.796333]  really_probe+0x19f/0x400
> > > [  213.796338]  ? __pfx___driver_attach+0x10/0x10
> > > [  213.796340]  __driver_probe_device+0x78/0x160
> > > [  213.796343]  driver_probe_device+0x1f/0x90
> > > [  213.796346]  __driver_attach+0xd6/0x1d0
> > > [  213.796349]  bus_for_each_dev+0x8c/0xe0
> > > [  213.796355]  bus_add_driver+0x119/0x220
> > > [  213.796359]  driver_register+0x59/0x100
> > > [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
> > > [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
> > > [  213.796375]  do_one_initcall+0x6f/0x380
> > > [  213.796384]  do_init_module+0x60/0x240
> > > [  213.796388]  init_module_from_file+0x86/0xc0
> > > [  213.796396]  idempotent_init_module+0x129/0x2c0
> > > [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
> > > [  213.796409]  do_syscall_64+0x60/0xe0
> > > [  213.796415]  ? do_syscall_64+0x6f/0xe0
> > > [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
> > > [  213.796424]  ? do_syscall_64+0x6f/0xe0
> > > [  213.796427]  ? do_syscall_64+0x6f/0xe0
> > > [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> > > [  213.796435] RIP: 0033:0x7f7758f279cd
> > > [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 89 01 48
> > > [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> > > [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 00007f7758f279cd
> > > [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 000000000000000f
> > > [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 00007ffe2cad8860
> > > [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
> > > [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 0000557f98783aa0
> > > [  213.796482]  </TASK>
> > > "
> > >
> > > Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> > > ---
> > >  drivers/net/virtio_net.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 51b1868d2f22..28b7dd917a43 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> > >  {
> > >  	struct scatterlist *sgs[4], hdr, stat;
> > >  	unsigned out_num = 0, tmp;
> > > -	int ret;
> > > +	int ret, timeout = 200;
> > >
> > >  	/* Caller should know better */
> > >  	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> > > @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> > >  	 * into the hypervisor, so the request should be handled immediately.
> > >  	 */
> > >  	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > -	       !virtqueue_is_broken(vi->cvq))
> > > +	       !virtqueue_is_broken(vi->cvq)) {
> > > +		if (timeout)
> > > +			timeout--;
> >
> > This is not really a timeout, just a loop counter. 200 iterations could
> > be a very short time on reasonable H/W. I guess this avoids the soft
> > lockup, but it possibly (likely?) breaks the functionality when we need
> > to loop for some non-negligible time.
> >
> > I fear we need a more complex solution, as mentioned by Michael in the
> > thread you quoted.
> >
> > Cheers,
> >
> > Paolo
>
> So a bunch of drivers just queue the request on a workqueue. The
> problem is, errors are not reported then. I really feel this is
> something the netdev core should support better: return a "deferred"
> status and have the netdev core wait outside the lock on a waitqueue.

And on the other side, more and more requests will be posted to the
cvq, so we should refactor this to work in async mode.

Netdim (dynamic interrupt moderation) now has this requirement.

@Heng Qi

Thanks.
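
One shape the async mode could take is to complete from the control
virtqueue callback instead of busy-polling. A sketch with assumed
names: vi->cvq_done is a hypothetical struct completion, and the
30-second bound is an assumption:

	static void virtnet_cvq_done(struct virtqueue *cvq)
	{
		struct virtnet_info *vi = cvq->vdev->priv;

		complete(&vi->cvq_done);
	}

	/* Submission side, after the command buffers are queued: */
	virtqueue_kick(vi->cvq);
	if (!wait_for_completion_timeout(&vi->cvq_done, 30 * HZ))
		return false;	/* no response within the bound */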


Heng Qi Jan. 19, 2024, 2:27 p.m. UTC | #7
On 2024/1/18 8:01 PM, Zhu Yanjun wrote:
>
> On 2024/1/16 20:04, Paolo Abeni wrote:
>> On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:
>>> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>>>
>>> Some devices emulate virtio_net hardware. When the virtio_net
>>> driver sends commands to such an emulated device, the device
>>> normally needs time to respond, and sometimes that time is very
>>> long. When it is, the soft lockup below is triggered and the
>>> whole system hangs.
>>> Similar problems also occur with Intel and Mellanox NICs, and a
>>> similar solution is borrowed from their drivers: a timeout value
>>> is added and set as large as possible, so that the driver waits
>>> as long as it reasonably can for a response from the hardware.
>>>
>>> "
>>> [  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
>>> [(udev-worker):3157]
>>> [  213.796114] Modules linked in: virtio_net(+) net_failover 
>>> failover qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common 
>>> intel_uncore_frequency intel_uncore_frequency_common intel_ifs 
>>> i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp 
>>> coretemp iTCO_wdt rapl intel_pmc_bxt dax_hmem iTCO_vendor_support 
>>> vfat cxl_acpi intel_cstate pmt_telemetry pmt_class intel_sdsi joydev 
>>> intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci 
>>> isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus 
>>> i2c_smbus i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf 
>>> ipmi_msghandler acpi_pad acpi_power_meter pfr_telemetry pfr_update 
>>> fuse loop zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel 
>>> polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 
>>> bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit wmi 
>>> pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
>>> [  213.796194] irq event stamp: 67740
>>> [  213.796195] hardirqs last  enabled at (67739): 
>>> [<ffffffff8c2015ca>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>> [  213.796203] hardirqs last disabled at (67740): 
>>> [<ffffffff8c14108e>] sysvec_apic_timer_interrupt+0xe/0x90
>>> [  213.796208] softirqs last  enabled at (67686): 
>>> [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>>> [  213.796214] softirqs last disabled at (67681): 
>>> [<ffffffff8b12115e>] __irq_exit_rcu+0xbe/0xe0
>>> [  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded 
>>> Not tainted 6.7.0+ #9
>>> [  213.796220] Hardware name: Intel Corporation 
>>> M50FCP2SBSTD/M50FCP2SBSTD, BIOS SE5C741.86B.01.01.0001.2211140926 
>>> 11/14/2022
>>> [  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
>>> [  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 
>>> 43 50 f6 43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 
>>> 44 51 04 <48> 89 e8 5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 
>>> 89 e8 5b
>>> [  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 00000246
>>> [  213.796233] RAX: 0000000000000000 RBX: ff2f15095896f000 RCX: 
>>> 0000000000000001
>>> [  213.796235] RDX: 0000000000000000 RSI: ff4bbb362306f9cc RDI: 
>>> ff2f15095896f000
>>> [  213.796236] RBP: 0000000000000000 R08: 0000000000000000 R09: 
>>> 0000000000000000
>>> [  213.796237] R10: 0000000000000003 R11: ff2f15095893cc40 R12: 
>>> 0000000000000002
>>> [  213.796239] R13: 0000000000000004 R14: 0000000000000000 R15: 
>>> ff2f1509534f3000
>>> [  213.796240] FS:  00007f775847d0c0(0000) GS:ff2f1528bac00000(0000) 
>>> knlGS:0000000000000000
>>> [  213.796242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  213.796243] CR2: 0000557f987b6e70 CR3: 0000002098602006 CR4: 
>>> 0000000000f71ef0
>>> [  213.796245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [  213.796246] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 
>>> 0000000000000400
>>> [  213.796247] PKRU: 55555554
>>> [  213.796249] Call Trace:
>>> [  213.796250]  <IRQ>
>>> [  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
>>> [  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
>>> [  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
>>> [  213.796269]  ? hrtimer_interrupt+0xf8/0x230
>>> [  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
>>> [  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
>>> [  213.796282]  </IRQ>
>>> [  213.796284]  <TASK>
>>> [  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>> [  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
>>> [  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
>>> [  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
>>> [  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
>>> [  213.796328]  virtio_dev_probe+0x195/0x230
>>> [  213.796333]  really_probe+0x19f/0x400
>>> [  213.796338]  ? __pfx___driver_attach+0x10/0x10
>>> [  213.796340]  __driver_probe_device+0x78/0x160
>>> [  213.796343]  driver_probe_device+0x1f/0x90
>>> [  213.796346]  __driver_attach+0xd6/0x1d0
>>> [  213.796349]  bus_for_each_dev+0x8c/0xe0
>>> [  213.796355]  bus_add_driver+0x119/0x220
>>> [  213.796359]  driver_register+0x59/0x100
>>> [  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
>>> [  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
>>> [  213.796375]  do_one_initcall+0x6f/0x380
>>> [  213.796384]  do_init_module+0x60/0x240
>>> [  213.796388]  init_module_from_file+0x86/0xc0
>>> [  213.796396]  idempotent_init_module+0x129/0x2c0
>>> [  213.796406]  __x64_sys_finit_module+0x5e/0xb0
>>> [  213.796409]  do_syscall_64+0x60/0xe0
>>> [  213.796415]  ? do_syscall_64+0x6f/0xe0
>>> [  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
>>> [  213.796424]  ? do_syscall_64+0x6f/0xe0
>>> [  213.796427]  ? do_syscall_64+0x6f/0xe0
>>> [  213.796431]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>> [  213.796435] RIP: 0033:0x7f7758f279cd
>>> [  213.796465] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e 
>>> fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 
>>> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 e4 0c 00 f7 d8 64 
>>> 89 01 48
>>> [  213.796467] RSP: 002b:00007ffe2cad8738 EFLAGS: 00000246 ORIG_RAX: 
>>> 0000000000000139
>>> [  213.796469] RAX: ffffffffffffffda RBX: 0000557f987a8180 RCX: 
>>> 00007f7758f279cd
>>> [  213.796471] RDX: 0000000000000000 RSI: 00007f77593e5453 RDI: 
>>> 000000000000000f
>>> [  213.796472] RBP: 00007f77593e5453 R08: 0000000000000000 R09: 
>>> 00007ffe2cad8860
>>> [  213.796473] R10: 000000000000000f R11: 0000000000000246 R12: 
>>> 0000000000020000
>>> [  213.796475] R13: 0000557f9879f8e0 R14: 0000000000000000 R15: 
>>> 0000557f98783aa0
>>> [  213.796482]  </TASK>
>>> "
>>>
>>> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>> ---
>>>   drivers/net/virtio_net.c | 10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 51b1868d2f22..28b7dd917a43 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct 
>>> virtnet_info *vi, u8 class, u8 cmd,
>>>   {
>>>       struct scatterlist *sgs[4], hdr, stat;
>>>       unsigned out_num = 0, tmp;
>>> -    int ret;
>>> +    int ret, timeout = 200;
>>>         /* Caller should know better */
>>>       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>>> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct 
>>> virtnet_info *vi, u8 class, u8 cmd,
>>>        * into the hypervisor, so the request should be handled 
>>> immediately.
>>>        */
>>>       while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>>> -           !virtqueue_is_broken(vi->cvq))
>>> +           !virtqueue_is_broken(vi->cvq)) {
>>> +        if (timeout)
>>> +            timeout--;
>> This is not really a timeout, just a loop counter. 200 iterations could
>> be a very short time on reasonable H/W. I guess this avoids the soft
>> lockup, but possibly (likely?) breaks the functionality when we need to
>> loop for some non-negligible time.
>>
>> I fear we need a more complex solution, as mentioned by Michael in the
>> thread you quoted.
>
> Got it. I also look forward to the more complex solution to this problem.

Can we add a device capability (new feature bit) such as 
ctrq_wait_timeout to get a reasonable timeout?

Thanks,
Heng

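A rough sketch of how such a ctrq_wait_timeout capability might look on
the driver side. VIRTIO_NET_F_CTRL_TIMEOUT and the ctrl_timeout_ms config
field are invented names; nothing like them exists in the virtio spec
today, so treat this as a sketch of the proposal, not an API:

	/* Hypothetical: read a device-advertised command timeout,
	 * falling back to a driver default if the feature is absent.
	 */
	u32 ctrl_timeout_ms = 200;

	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_TIMEOUT))
		virtio_cread(vi->vdev, struct virtio_net_config,
			     ctrl_timeout_ms, &ctrl_timeout_ms);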
Andrew Lunn Jan. 19, 2024, 5:29 p.m. UTC | #8
> > > >       while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > -           !virtqueue_is_broken(vi->cvq))
> > > > +           !virtqueue_is_broken(vi->cvq)) {
> > > > +        if (timeout)
> > > > +            timeout--;
> > > This is not really a timeout, just a loop counter. 200 iterations could
> > > be a very short time on reasonable H/W. I guess this avoids the soft
> > > lockup, but possibly (likely?) breaks the functionality when we need to
> > > loop for some non-negligible time.
> > >
> > > I fear we need a more complex solution, as mentioned by Michael in the
> > > thread you quoted.
> > 
> > Got it. I also look forward to the more complex solution to this problem.
> 
> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> to get a reasonable timeout?

The usual solution to this is include/linux/iopoll.h. If you can sleep,
use read_poll_timeout(); otherwise read_poll_timeout_atomic().

	Andrew
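A minimal sketch of how the suggested read_poll_timeout() could replace
the cvq busy-wait, assuming the caller can sleep. The helper name, the
50 us poll interval, and the 100 ms budget are illustrative only, not a
tested patch:

	#include <linux/iopoll.h>

	/* Sketch: wait for the device's ack on the control vq with a
	 * bounded poll instead of spinning forever.
	 */
	static bool virtnet_cvq_wait_ack(struct virtnet_info *vi)
	{
		unsigned int tmp;
		void *buf;

		/* buf = virtqueue_get_buf(vi->cvq, &tmp); stop once a
		 * buffer arrives or the vq breaks, -ETIMEDOUT otherwise.
		 */
		if (read_poll_timeout(virtqueue_get_buf, buf,
				      buf || virtqueue_is_broken(vi->cvq),
				      50, 100 * USEC_PER_MSEC, false,
				      vi->cvq, &tmp))
			return false;	/* no response in time */

		return !virtqueue_is_broken(vi->cvq) &&
		       vi->ctrl->status == VIRTIO_NET_OK;
	}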
Zhu Yanjun Jan. 20, 2024, 4:21 a.m. UTC | #9
On 2024/1/20 1:29, Andrew Lunn wrote:
>> [...]
> The usual solution to this is include/linux/iopoll.h. If you can sleep,
> use read_poll_timeout(); otherwise read_poll_timeout_atomic().

Thanks. The 2 functions read_poll_timeout() and 
read_poll_timeout_atomic() are interesting.

Zhu Yanjun

Zhu Yanjun Jan. 22, 2024, 2:12 a.m. UTC | #10
On 2024/1/20 1:29, Andrew Lunn wrote:
>> [...]
> The usual solution to this is include/linux/iopoll.h. If you can sleep,
> use read_poll_timeout(); otherwise read_poll_timeout_atomic().

I read the functions read_poll_timeout() and
read_poll_timeout_atomic() carefully. The timeout is set by the caller
of the two functions.

As such, can we add a module parameter so that the user can customize
this timeout value?

Or should this timeout value be stored in a device register that the
virtio_net driver reads at initialization?

Zhu Yanjun

Jason Wang Jan. 22, 2024, 3:06 a.m. UTC | #11
On Mon, Jan 15, 2024 at 6:22 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
>
> On 2024/1/15 10:20, Jason Wang wrote:
>
> On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun <yanjun.zhu@intel.com> wrote:
>
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> [... commit message, soft-lockup trace, and diff headers trimmed ...]
> @@ -2468,7 +2468,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>  {
>         struct scatterlist *sgs[4], hdr, stat;
>         unsigned out_num = 0, tmp;
> -       int ret;
> +       int ret, timeout = 200;
>
> Any reason we choose this value or how can we know it works for all
> types of devices?
>
> As I mentioned in the commit log, a similar problem also occurs in the Intel and Mellanox NIC drivers.
>
> This commit borrows the solution from the Intel NIC driver, so the value also comes from there.

Right, so basically I meant we need a solution that works for all vendors.

>
> An easier way is to use cond_resched(), but it may have side effects
> as well [1]. Still, it seems less intrusive than the proposal here?
>
> Thanks a lot for your suggestions. I have run tests with the commits from the link https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60297.html.
>
> Because the virtio_net driver spins waiting for the hardware's response, it cannot be unloaded, and the kernel hang still occurs when running "ip link" or unloading the virtio_net module.

Well, yes. It aims to solve the lockups as described in this commit
log. It doesn't solve the issue of infinite waiting etc.

Thanks

>
> Zhu Yanjun
>
> Thanks
>
> [1] https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60297.html
>
>         /* Caller should know better */
>         BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> @@ -2502,8 +2502,14 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
>          * into the hypervisor, so the request should be handled immediately.
>          */
>         while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> -              !virtqueue_is_broken(vi->cvq))
> +              !virtqueue_is_broken(vi->cvq)) {
> +               if (timeout)
> +                       timeout--;
> +               else
> +                       return false; /* long time no response */
> +
>                 cpu_relax();
> +       }
>
>         return vi->ctrl->status == VIRTIO_NET_OK;
>  }
> --
> 2.42.0
>
Jason Wang Jan. 22, 2024, 3:08 a.m. UTC | #12
On Fri, Jan 19, 2024 at 10:27 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> On 2024/1/18 8:01 PM, Zhu Yanjun wrote:
> >
> > On 2024/1/16 20:04, Paolo Abeni wrote:
> >> [... commit message, soft-lockup trace, patch, and Paolo's reply trimmed ...]
> >
> > Got it. I also look forward to the more complex solution to this problem.
>
> Can we add a device capability (new feature bit) such as
> ctrq_wait_timeout to get a reasonable timeout?

This adds another kind of complexity for migration compatibility.

And we need to make it more general, e.g.:

1) it should not be cvq specific
2) or we can have a timeout that works for all queues

?

Thanks

Jason Wang Jan. 22, 2024, 3:14 a.m. UTC | #13
On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
>
> On 2024/1/20 1:29, Andrew Lunn wrote:
> >> [...]
> > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
>
> I read the functions read_poll_timeout() and
> read_poll_timeout_atomic() carefully. The timeout is set by the caller
> of the two functions.

FYI, in order to avoid a switch of atomic or not, we need to convert rx
mode setting to workqueue first:

https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html

>
> As such, can we add a module parameter so that the user can customize
> this timeout value?

Who is the "user" here, or how can the "user" know the value?

>
> Or should this timeout value be stored in a device register that the
> virtio_net driver reads at initialization?

See another thread. The design needs to be general, or you can post an RFC.

On another thought, we already have a tx watchdog; maybe we can have
something similar for the cvq and use timeout + reset in that case.

Thanks

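In rough outline, the conversion linked above: ndo_set_rx_mode() is
called in atomic context, so the cvq commands move into a work item
that runs where sleeping (and hence a sleeping poll) is allowed. The
rx_mode_work field and the helper name below are placeholders, loosely
following the linked series rather than reproducing it:

	/* Sketch: defer rx mode programming to process context.
	 * Requires INIT_WORK(&vi->rx_mode_work, virtnet_rx_mode_work)
	 * at probe time.
	 */
	static void virtnet_rx_mode_work(struct work_struct *work)
	{
		struct virtnet_info *vi =
			container_of(work, struct virtnet_info, rx_mode_work);

		rtnl_lock();
		virtnet_set_rx_mode_sync(vi);	/* hypothetical: sends the cvq commands */
		rtnl_unlock();
	}

	static void virtnet_set_rx_mode(struct net_device *dev)
	{
		struct virtnet_info *vi = netdev_priv(dev);

		/* Runs under netif_addr_lock_bh(); must not sleep here. */
		schedule_work(&vi->rx_mode_work);
	}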
Xuan Zhuo Jan. 22, 2024, 3:58 a.m. UTC | #14
On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> [...]
> On another thought, we already have a tx watchdog; maybe we can have
> something similar for the cvq and use timeout + reset in that case.

But we may block by the reset ^_^ if the device is broken?

Thanks.


Jason Wang Jan. 22, 2024, 4:16 a.m. UTC | #15
On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > [...]
> > On another thought, we already have a tx watchdog; maybe we can have
> > something similar for the cvq and use timeout + reset in that case.
>
> But we may block by the reset ^_^ if the device is broken?

I mean vq reset here.

It looks like we have multiple goals here

1) avoid lockups, using workqueue + cond_resched() seems to be
sufficient, it has issues but nothing new
2) recover from the unresponsive device, the issue for timeout is that
it needs to deal with false positives

Thanks

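Goal 1) in its simplest form, for reference. This keeps the soft-lockup
watchdog quiet but does nothing for goal 2): a dead device still blocks
forever. It also assumes the command path has already been moved to a
context where sleeping is allowed:

	/* Sketch: same wait loop, yielding the CPU between polls. */
	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
	       !virtqueue_is_broken(vi->cvq)) {
		cond_resched();
		cpu_relax();
	}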
Heng Qi Jan. 22, 2024, 4:42 a.m. UTC | #16
On 2024/1/22 11:08 AM, Jason Wang wrote:
> On Fri, Jan 19, 2024 at 10:27 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>
>>
>> On 2024/1/18 8:01 PM, Zhu Yanjun wrote:
>>> On 2024/1/16 20:04, Paolo Abeni wrote:
>>>> [... commit message, soft-lockup trace, patch, and Paolo's reply trimmed ...]
>>> Got it. I also look forward to the more complex solution to this problem.
>> Can we add a device capability (new feature bit) such as
>> ctrq_wait_timeout to get a reasonable timeout?
> This adds another kind of complexity for migration compatibility.

Yes, this requires device adaptation. But the value will be more
reasonable, since different devices have different timeout requirements.

>
> And we need to make it more general, e.g.:
>
> 1) it should not be cvq specific

Agree. Especially considering avq also has this problem.

> 2) or we can have a timeout that works for all queues

I remember Parav mentioned that they were also making some bulk
improvements to ctrlq/avq; perhaps we should look at a combined solution.

An rxq timeout can be used for our aliveness-detection needs, especially
after a live migration.
A txq timeout can be used with the watchdog + vq reset.

Thanks,
Heng

Xuan Zhuo Jan. 22, 2024, 6:16 a.m. UTC | #17
On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > [...]
> > > On another thought, we already have a tx watchdog; maybe we can have
> > > something similar for the cvq and use timeout + reset in that case.
> >
> > But we may block by the reset ^_^ if the device is broken?
>
> I mean vq reset here.

I see.

I mean when the device is broken, the vq reset may also be blocked.

	void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
	{
		struct virtio_pci_modern_common_cfg __iomem *cfg;

		cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

		vp_iowrite16(index, &cfg->cfg.queue_select);
		vp_iowrite16(1, &cfg->queue_reset);

		while (vp_ioread16(&cfg->queue_reset))
			msleep(1);

		while (vp_ioread16(&cfg->cfg.queue_enable))
			msleep(1);
	}
	EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);

In this function, we cannot expect anything from a broken device.
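For illustration only, that wait could at least be bounded with
read_poll_timeout() from include/linux/iopoll.h; the one-second budget
is arbitrary and this is not what the driver does today:

	/* Sketch: give up if queue_reset never clears on a broken
	 * device (the queue_enable wait would need the same bound).
	 */
	u16 val;

	if (read_poll_timeout(vp_ioread16, val, !val,
			      1000, USEC_PER_SEC, false,
			      &cfg->queue_reset))
		return;	/* unresponsive device, stop waiting */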


>
> It looks like we have multiple goals here
>
> 1) avoid lockups, using workqueue + cond_resched() seems to be
> sufficient, it has issues but nothing new
> 2) recover from the unresponsive device, the issue for timeout is that
> it needs to deal with false positives


I agree.

But I want to add a new goal: cvq async. With netdim, we will
send many requests via the cvq, so an async cvq will be nice.

Thanks.


Jason Wang Jan. 22, 2024, 6:55 a.m. UTC | #18
On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > [...]
> > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > something similar for the cvq and use timeout + reset in that case.
> > >
> > > But we may block by the reset ^_^ if the device is broken?
> >
> > I mean vq reset here.
>
> I see.
>
> I mean when the device is broken, the vq reset may also be blocked.
>
>         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
>         {
>                 struct virtio_pci_modern_common_cfg __iomem *cfg;
>
>                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>
>                 vp_iowrite16(index, &cfg->cfg.queue_select);
>                 vp_iowrite16(1, &cfg->queue_reset);
>
>                 while (vp_ioread16(&cfg->queue_reset))
>                         msleep(1);
>
>                 while (vp_ioread16(&cfg->cfg.queue_enable))
>                         msleep(1);
>         }
>         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
>
> In this function, we cannot expect anything from a broken device.

Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.

Thanks

Jason Wang Jan. 22, 2024, 6:58 a.m. UTC | #19
On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > [... earlier context and the vp_modern_set_queue_reset() listing trimmed ...]
> >
> > In this function, for the broken device, we can not expect something.
>
> Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
>
> Thanks
>
> >
> >
> > >
> > > It looks like we have multiple goals here
> > >
> > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > sufficient, it has issues but nothing new
> > > 2) recover from the unresponsive device, the issue for timeout is that
> > > it needs to deal with false positives
> >
> >
> > I agree.
> >
> > But I want to add a new goal, cvq async. In the netdim, we will
> > send many requests via the cvq, so the cvq async will be nice.

Then you need an interrupt for cvq.

FYI, I've posted a series that uses an interrupt for the cvq in the past:

https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/

I haven't found time to work on this anymore; maybe we can start from
this or not.
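
Roughly, the interrupt-driven idea looks like the sketch below. This is
illustration only: virtnet_cvq_done(), the ctrl_done completion and the
status helper are hypothetical names, not code from that series.

        /* Hypothetical cvq callback, registered through the callbacks[]
         * array passed to virtio_find_vqs(): the device notifies us when
         * the command buffer is used, instead of the driver busy-polling
         * virtqueue_get_buf().
         */
        static void virtnet_cvq_done(struct virtqueue *cvq)
        {
                struct virtnet_info *vi = cvq->vdev->priv;

                complete(&vi->ctrl_done); /* assumed new field in virtnet_info */
        }

        /* The sender then sleeps instead of spinning: */
        static bool virtnet_send_command_irq(struct virtnet_info *vi)
        {
                reinit_completion(&vi->ctrl_done);
                virtqueue_kick(vi->cvq);
                wait_for_completion(&vi->ctrl_done);
                return virtnet_ctrl_status_ok(vi); /* hypothetical status check */
        }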

Thanks

> >
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Zhu Yanjun
> > > > > >
> > > > > > >
> > > > > > >       Andrew
> > > > > >
> > > > >
> > > >
> > >
> >
Xuan Zhuo Jan. 22, 2024, 7:01 a.m. UTC | #20
On Mon, 22 Jan 2024 14:55:46 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > >
> > > > > >
> > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > >>>>> +        if (timeout)
> > > > > > >>>>> +            timeout--;
> > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > >>>> loop for some non-negligible time.
> > > > > > >>>>
> > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > >>>> thread you quoted.
> > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > >> to get a reasonable timeout?
> > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > >
> > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > functions.
> > > > >
> > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > need to convert the rx mode setting to a workqueue first:
> > > > >
> > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > >
> > > > > >
> > > > > > As such, can we add a module parameter so that the user can
> > > > > > customize this timeout value?
> > > > >
> > > > > Who is the "user" here, or how can the "user" know the value?
> > > > >
> > > > > >
> > > > > > Or is this timeout value stored in a device register, which the
> > > > > > virtio_net driver will read at initialization?
> > > > >
> > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > >
> > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > something similar for the cvq and use timeout + reset in that case.
> > > >
> > > > But we may be blocked by the reset ^_^ if the device is broken?
> > >
> > > I mean vq reset here.
> >
> > I see.
> >
> > I mean when the device is broken, the vq reset may also be blocked.
> >
> >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> >         {
> >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> >
> >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> >
> >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> >                 vp_iowrite16(1, &cfg->queue_reset);
> >
> >                 while (vp_ioread16(&cfg->queue_reset))
> >                         msleep(1);
> >
> >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> >                         msleep(1);
> >         }
> >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> >
> > In this function, for the broken device, we can not expect something.
>
> Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
>

YES. I agree.

Thanks.


> Thanks
>
> >
> >
> > >
> > > It looks like we have multiple goals here
> > >
> > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > sufficient, it has issues but nothing new
> > > 2) recover from the unresponsive device, the issue for timeout is that
> > > it needs to deal with false positives
> >
> >
> > I agree.
> >
> > But I want to add a new goal, cvq async. In the netdim, we will
> > send many requests via the cvq, so the cvq async will be nice.
> >
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Zhu Yanjun
> > > > > >
> > > > > > >
> > > > > > >       Andrew
> > > > > >
> > > > >
> > > >
> > >
> >
>
Xuan Zhuo Jan. 22, 2024, 7:02 a.m. UTC | #21
On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > >
> > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > >>>>> +        if (timeout)
> > > > > > > >>>>> +            timeout--;
> > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > >>>> loop for some non-negligible time.
> > > > > > > >>>>
> > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > >>>> thread you quoted.
> > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > >> to get a reasonable timeout?
> > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > >
> > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > functions.
> > > > > >
> > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > >
> > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > >
> > > > > > >
> > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > customize this timeout value?
> > > > > >
> > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > >
> > > > > > >
> > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > virtio_net driver will read at initialization?
> > > > > >
> > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > >
> > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > >
> > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > >
> > > > I mean vq reset here.
> > >
> > > I see.
> > >
> > > I mean when the device is broken, the vq reset may also be blocked.
> > >
> > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > >         {
> > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > >
> > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > >
> > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > >                 vp_iowrite16(1, &cfg->queue_reset);
> > >
> > >                 while (vp_ioread16(&cfg->queue_reset))
> > >                         msleep(1);
> > >
> > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > >                         msleep(1);
> > >         }
> > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > >
> > > In this function, for the broken device, we can not expect something.
> >
> > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> >
> > Thanks
> >
> > >
> > >
> > > >
> > > > It looks like we have multiple goals here
> > > >
> > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > sufficient, it has issues but nothing new
> > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > it needs to deal with false positives
> > >
> > >
> > > I agree.
> > >
> > > But I want to add a new goal, cvq async. In the netdim, we will
> > > send many requests via the cvq, so the cvq async will be nice.
>
> Then you need an interrupt for cvq.
>
> FYI, I've posted a series that uses an interrupt for the cvq in the past:
>
> https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/

I know this. But the interrupt may not be a good solution without new space.

>
> I haven't found time to work on this anymore; maybe we can start from
> this or not.


I said async, but my aim is to put many requests into the cvq before getting the
responses.

Heng Qi posted this https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/
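
To make the aim concrete, batched submission could look roughly like
the sketch below; virtqueue_add_sgs(), virtqueue_kick() and
virtqueue_get_buf() are the existing APIs, while struct cvq_cmd and its
fields are assumed here, not existing virtio_net code.

        /* Illustrative only: submit a batch of prepared commands, kick
         * once, then reap the used buffers afterwards.
         */
        static void cvq_submit_batch(struct virtnet_info *vi,
                                     struct cvq_cmd *cmds, int n)
        {
                struct cvq_cmd *cmd;
                unsigned int len;
                int i;

                for (i = 0; i < n; i++)
                        virtqueue_add_sgs(vi->cvq, cmds[i].sgs, cmds[i].out_num,
                                          cmds[i].in_num, &cmds[i], GFP_ATOMIC);

                virtqueue_kick(vi->cvq); /* one notification for the whole batch */

                /* Later (e.g. from a callback): collect completed commands. */
                while ((cmd = virtqueue_get_buf(vi->cvq, &len)))
                        cmd->done = true;
        }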

Thanks.


>
> Thanks
>
> > >
> > > Thanks.
> > >
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > Zhu Yanjun
> > > > > > >
> > > > > > > >
> > > > > > > >       Andrew
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
Jason Wang Jan. 22, 2024, 7:19 a.m. UTC | #22
On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > >
> > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > >>>>> +        if (timeout)
> > > > > > > > >>>>> +            timeout--;
> > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > >>>>
> > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > >>>> thread you quoted.
> > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > >
> > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > functions.
> > > > > > >
> > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > >
> > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > >
> > > > > > > >
> > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > customize this timeout value?
> > > > > > >
> > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > >
> > > > > > > >
> > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > virtio_net driver will read at initialization?
> > > > > > >
> > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > >
> > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > >
> > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > >
> > > > > I mean vq reset here.
> > > >
> > > > I see.
> > > >
> > > > I mean when the device is broken, the vq reset may also be blocked.
> > > >
> > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > >         {
> > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > >
> > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > >
> > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > >
> > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > >                         msleep(1);
> > > >
> > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > >                         msleep(1);
> > > >         }
> > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > >
> > > > In this function, for the broken device, we can not expect something.
> > >
> > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > >
> > > Thanks
> > >
> > > >
> > > >
> > > > >
> > > > > It looks like we have multiple goals here
> > > > >
> > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > sufficient, it has issues but nothing new
> > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > it needs to deal with false positives
> > > >
> > > >
> > > > I agree.
> > > >
> > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > send many requests via the cvq, so the cvq async will be nice.
> >
> > Then you need an interrupt for cvq.
> >
> > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> >
> > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
>
> I know this. But the interrupt may not be a good solution without new space.

What do you mean by "new space"?

We can introduce something like enable_cb_delayed(); then you will
only get notified after several requests.
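
As a sketch of that (virtqueue_disable_cb() and
virtqueue_enable_cb_delayed() are existing virtio_ring APIs; the two
helpers around them are hypothetical names):

        /* Coalesce cvq notifications: re-enable the callback only after
         * a batch has been submitted, so it fires once for several
         * requests rather than once per request.
         */
        virtqueue_disable_cb(vi->cvq);
        submit_cvq_requests(vi);                /* hypothetical batch submit */
        if (!virtqueue_enable_cb_delayed(vi->cvq))
                reap_cvq_responses(vi);         /* responses already pending */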

>
> >
> > I haven't found time to work on this anymore; maybe we can start from
> > this or not.
>
>
> I said async, but my aim is to put many requests into the cvq before getting the
> responses.

It doesn't differ from TX/RX in this case.

>
> Heng Qi posted this https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/
>

This seems like a hack; if an interrupt is used, you can simply do that
in the callback.

Thanks

> Thanks.
>
>
> >
> > Thanks
> >
> > > >
> > > > Thanks.
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > > Zhu Yanjun
> > > > > > > >
> > > > > > > > >
> > > > > > > > >       Andrew
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
>
Xuan Zhuo Jan. 22, 2024, 7:25 a.m. UTC | #23
On Mon, 22 Jan 2024 15:19:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > >
> > > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > >
> > > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > >>>>> +        if (timeout)
> > > > > > > > > >>>>> +            timeout--;
> > > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > > >>>> thread you quoted.
> > > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > > >
> > > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > > functions.
> > > > > > > >
> > > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > > >
> > > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > > >
> > > > > > > > >
> > > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > > customize this timeout value?
> > > > > > > >
> > > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > > virtio_net driver will read at initialization?
> > > > > > > >
> > > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > > >
> > > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > > >
> > > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > > >
> > > > > > I mean vq reset here.
> > > > >
> > > > > I see.
> > > > >
> > > > > I mean when the device is broken, the vq reset may also be blocked.
> > > > >
> > > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > > >         {
> > > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > > >
> > > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > > >
> > > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > > >
> > > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > > >                         msleep(1);
> > > > >
> > > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > > >                         msleep(1);
> > > > >         }
> > > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > > >
> > > > > In this function, for the broken device, we can not expect something.
> > > >
> > > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > It looks like we have multiple goals here
> > > > > >
> > > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > > sufficient, it has issues but nothing new
> > > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > > it needs to deal with false positives
> > > > >
> > > > >
> > > > > I agree.
> > > > >
> > > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > > send many requests via the cvq, so the cvq async will be nice.
> > >
> > > Then you need an interrupt for cvq.
> > >
> > > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> > >
> > > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
> >
> > I know this. But the interrupt may not be a good solution without new space.
>
> What do you mean by "new space"?

Yes, I know the cvq can work with an interrupt per the virtio spec.
But as far as I know, many hypervisors implement the cvq without supporting interrupts.
If we let the cvq work with an interrupt without negotiation, then
many hypervisors will hang on the new kernel.

>
> We can introduce something like enable_cb_delayed(); then you will
> only get notified after several requests.
>
> >
> > >
> > > I haven't found time to work on this anymore; maybe we can start from
> > > this or not.
> >
> >
> > I said async, but my aim is to put many requests into the cvq before getting the
> > responses.
>
> It doesn't differ from TX/RX in this case.
>
> >
> > Heng Qi posted this https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/
> >
>
> This seems like a hack; if an interrupt is used, you can simply do that
> in the callback.

YES.

I also want to change the code; I just want to say that async is a goal.

For the rx mode we have introduced a work queue, and I want to move the
job of sending commands to that work queue. The caller then just wakes
up the work queue.

If the caller wants to get the result synchronously, it can wait for it.
If not, the caller can register a callback with the work queue.

And I think it will be easy to implement the timeout inside the workqueue.
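
For what it's worth, a minimal sketch of that shape; every name here is
hypothetical, only the workqueue and completion APIs are from the
kernel:

        struct cvq_cmd {
                struct work_struct work;
                struct completion done;
                void (*cb)(struct cvq_cmd *cmd);  /* optional async callback */
                int status;
        };

        static void cvq_cmd_work(struct work_struct *work)
        {
                struct cvq_cmd *cmd = container_of(work, struct cvq_cmd, work);

                cmd->status = cvq_send(cmd);      /* hypothetical: add bufs + kick */
                if (cmd->cb)
                        cmd->cb(cmd);             /* async caller gets a callback */
                complete(&cmd->done);             /* sync caller is sleeping here */
        }

        /* A sync caller queues the work and waits with a timeout:
         *
         *      queue_work(cvq_wq, &cmd->work);
         *      if (!wait_for_completion_timeout(&cmd->done, 10 * HZ))
         *              ...treat the device as unresponsive...
         *
         * An async caller sets cmd->cb and returns immediately.
         */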

Thanks.


>
> Thanks
>
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Zhu Yanjun
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >       Andrew
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>
Jason Wang Jan. 22, 2024, 7:57 a.m. UTC | #24
On Mon, Jan 22, 2024 at 3:36 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 22 Jan 2024 15:19:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > >
> > > > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > > >>>>> +        if (timeout)
> > > > > > > > > > >>>>> +            timeout--;
> > > > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > > > >>>> thread you quoted.
> > > > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > > > >
> > > > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > > > functions.
> > > > > > > > >
> > > > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > > > >
> > > > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > > > customize this timeout value?
> > > > > > > > >
> > > > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > > > virtio_net driver will read at initialization?
> > > > > > > > >
> > > > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > > > >
> > > > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > > > >
> > > > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > > > >
> > > > > > > I mean vq reset here.
> > > > > >
> > > > > > I see.
> > > > > >
> > > > > > I mean when the device is broken, the vq reset may also be blocked.
> > > > > >
> > > > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > > > >         {
> > > > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > > > >
> > > > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > > > >
> > > > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > > > >
> > > > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > > > >                         msleep(1);
> > > > > >
> > > > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > > > >                         msleep(1);
> > > > > >         }
> > > > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > > > >
> > > > > > In this function, for the broken device, we can not expect something.
> > > > >
> > > > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > It looks like we have multiple goals here
> > > > > > >
> > > > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > > > sufficient, it has issues but nothing new
> > > > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > > > it needs to deal with false positives
> > > > > >
> > > > > >
> > > > > > I agree.
> > > > > >
> > > > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > > > send many requests via the cvq, so the cvq async will be nice.
> > > >
> > > > Then you need an interrupt for cvq.
> > > >
> > > > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> > > >
> > > > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
> > >
> > > I know this. But the interrupt may not be a good solution without new space.
> >
> > What do you mean by "new space"?
>
> Yes, I know the cvq can work with an interrupt per the virtio spec.
> But as far as I know, many hypervisors implement the cvq without supporting interrupts.

It's a bug of the hypervisor that needs to be fixed. The interrupt is
provided by the transport, not by virtio itself.

Otherwise it can only support Linux but not other OSes.

> If we let the cvq work with an interrupt without negotiation, then
> many hypervisors will hang on the new kernel.
>
> >
> > We can introduce something like enable_cb_delayed(); then you will
> > only get notified after several requests.
> >
> > >
> > > >
> > > > I haven't found time to work on this anymore; maybe we can start from
> > > > this or not.
> > >
> > >
> > > I said async, but my aim is to put many requests into the cvq before getting the
> > > responses.
> >
> > It doesn't differ from TX/RX in this case.
> >
> > >
> > > Heng Qi posted this https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/
> > >
> >
> > This seems like a hack; if an interrupt is used, you can simply do that
> > in the callback.
>
> YES.
>
> I also want to change the code; I just want to say that async is a goal.
>
> For the rx mode we have introduced a work queue, and I want to move the
> job of sending commands to that work queue. The caller then just wakes
> up the work queue.
>
> If the caller wants to get the result synchronously, it can wait for it.
> If not, the caller can register a callback with the work queue.
>
> And I think it will be easy to implement the timeout inside the workqueue.

Looks much more complicated than a simple interrupt + timer/watchdog etc.
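
For comparison, the interrupt + timeout shape is indeed short;
wait_for_completion_timeout() and virtio_break_device() exist in the
kernel, while the ctrl_done completion is an assumed new field:

        /* Illustrative: with a cvq interrupt completing ctrl_done, the
         * watchdog reduces to a bounded wait plus a recovery decision.
         */
        virtqueue_kick(vi->cvq);
        if (!wait_for_completion_timeout(&vi->ctrl_done, 10 * HZ)) {
                dev_err(&vi->vdev->dev, "cvq command timed out\n");
                virtio_break_device(vi->vdev);  /* or schedule a device reset */
                return false;
        }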

Thanks

>
> Thanks.
>
>
> >
> > Thanks
> >
> > > Thanks.
> > >
> > >
> > > >
> > > > Thanks
> > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Zhu Yanjun
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >       Andrew
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
Xuan Zhuo Jan. 22, 2024, 8:01 a.m. UTC | #25
On Mon, 22 Jan 2024 15:57:08 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 3:36 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 15:19:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > >
> > > > > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > > > >>>>> +        if (timeout)
> > > > > > > > > > > >>>>> +            timeout--;
> > > > > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > > > > >>>> thread you quoted.
> > > > > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > > > > >
> > > > > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > > > > functions.
> > > > > > > > > >
> > > > > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > > > > >
> > > > > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > > > > customize this timeout value?
> > > > > > > > > >
> > > > > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > > > > virtio_net driver will read at initialization?
> > > > > > > > > >
> > > > > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > > > > >
> > > > > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > > > > >
> > > > > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > > > > >
> > > > > > > > I mean vq reset here.
> > > > > > >
> > > > > > > I see.
> > > > > > >
> > > > > > > I mean when the device is broken, the vq reset may also be blocked.
> > > > > > >
> > > > > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > > > > >         {
> > > > > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > > > > >
> > > > > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > > > > >
> > > > > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > > > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > > > > >
> > > > > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > > > > >                         msleep(1);
> > > > > > >
> > > > > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > > > > >                         msleep(1);
> > > > > > >         }
> > > > > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > > > > >
> > > > > > > In this function, for the broken device, we can not expect something.
> > > > > >
> > > > > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > It looks like we have multiple goals here
> > > > > > > >
> > > > > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > > > > sufficient, it has issues but nothing new
> > > > > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > > > > it needs to deal with false positives
> > > > > > >
> > > > > > >
> > > > > > > I agree.
> > > > > > >
> > > > > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > > > > send many requests via the cvq, so the cvq async will be nice.
> > > > >
> > > > > Then you need an interrupt for cvq.
> > > > >
> > > > > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> > > > >
> > > > > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
> > > >
> > > > I know this. But the interrupt may not be a good solution without new space.
> > >
> > > What do you mean by "new space"?
> >
> > Yes, I know the cvq can work with an interrupt per the virtio spec.
> > But as far as I know, many hypervisors implement the cvq without supporting interrupts.
>
> It's a bug of the hypervisor that needs to be fixed. The interrupt is
> provided by the transport, not by virtio itself.

YES. I agree.

But I still think we should not work with the interrupt directly, without any
negotiation. I would prefer to introduce a new feature bit to enable this.
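
Something along these lines, purely as a sketch; the
VIRTIO_NET_F_CVQ_INTR name and its bit number are made up here and are
not part of the spec:

        /* Hypothetical feature bit; number chosen only for illustration. */
        #define VIRTIO_NET_F_CVQ_INTR 62

        /* At probe time, only request a cvq callback when the device
         * negotiated the feature; otherwise keep today's polled mode so
         * existing hypervisors see no behaviour change.
         */
        if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CVQ_INTR))
                callbacks[total_vqs - 1] = virtnet_cvq_done; /* cvq is last */
        else
                callbacks[total_vqs - 1] = NULL;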

Thanks.


>
> Otherwise it can only support Linux but not other OSes.
>
> > If we let the cvq work with an interrupt without negotiation, then
> > many hypervisors will hang on the new kernel.
> >
> > >
> > > We can introduce something like enable_cb_delayed(); then you will
> > > only get notified after several requests.
> > >
> > > >
> > > > >
> > > > > I haven't found time to work on this anymore; maybe we can start from
> > > > > this or not.
> > > >
> > > >
> > > > I said async, but my aim is to put many requests into the cvq before getting the
> > > > responses.
> > >
> > > It doesn't differ from TX/RX in this case.
> > >
> > > >
> > > > Heng Qi posted this https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/
> > > >
> > >
> > > This seems like a hack; if an interrupt is used, you can simply do that
> > > in the callback.
> >
> > YES.
> >
> > I also want to change the code; I just want to say that async is a goal.
> >
> > For the rx mode we have introduced a work queue, and I want to move the
> > job of sending commands to that work queue. The caller then just wakes
> > up the work queue.
> >
> > If the caller wants to get the result synchronously, it can wait for it.
> > If not, the caller can register a callback with the work queue.
> >
> > And I think it will be easy to implement the timeout inside the workqueue.
>
> Looks much more complicated than a simple interrupt + timer/watchdog etc.



>
> Thanks
>
> >
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> > > > Thanks.
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Zhu Yanjun
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >       Andrew
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>
Jason Wang Jan. 22, 2024, 8:32 a.m. UTC | #26
On Mon, Jan 22, 2024 at 4:04 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Mon, 22 Jan 2024 15:57:08 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > On Mon, Jan 22, 2024 at 3:36 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Mon, 22 Jan 2024 15:19:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > >
> > > > > On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > > > > >>>>> +        if (timeout)
> > > > > > > > > > > > >>>>> +            timeout--;
> > > > > > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > > > > > >>>> thread you quoted.
> > > > > > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > > > > > >
> > > > > > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > > > > > functions.
> > > > > > > > > > >
> > > > > > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > > > > > >
> > > > > > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > > > > > customize this timeout value?
> > > > > > > > > > >
> > > > > > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > > > > > virtio_net driver will read at initialization?
> > > > > > > > > > >
> > > > > > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > > > > > >
> > > > > > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > > > > > >
> > > > > > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > > > > > >
> > > > > > > > > I mean vq reset here.
> > > > > > > >
> > > > > > > > I see.
> > > > > > > >
> > > > > > > > I mean when the device is broken, the vq reset may also be blocked.
> > > > > > > >
> > > > > > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > > > > > >         {
> > > > > > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > > > > > >
> > > > > > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > > > > > >
> > > > > > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > > > > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > > > > > >
> > > > > > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > > > > > >                         msleep(1);
> > > > > > > >
> > > > > > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > > > > > >                         msleep(1);
> > > > > > > >         }
> > > > > > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > > > > > >
> > > > > > > > In this function, for the broken device, we can not expect something.
> > > > > > >
> > > > > > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > It looks like we have multiple goals here
> > > > > > > > >
> > > > > > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > > > > > sufficient, it has issues but nothing new
> > > > > > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > > > > > it needs to deal with false positives
> > > > > > > >
> > > > > > > >
> > > > > > > > I agree.
> > > > > > > >
> > > > > > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > > > > > send many requests via the cvq, so the cvq async will be nice.
> > > > > >
> > > > > > Then you need an interrupt for cvq.
> > > > > >
> > > > > > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> > > > > >
> > > > > > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
> > > > >
> > > > > I know this. But the interrupt may not be a good solution without new space.
> > > >
> > > > What do you mean by "new space"?
> > >
> > > Yes, I know the cvq can work with an interrupt per the virtio spec.
> > > But as far as I know, many hypervisors implement the cvq without supporting interrupts.
> >
> > It's a bug of the hypervisor that needs to be fixed. The interrupt is
> > provided by the transport, not by virtio itself.
>
> YES. I agree.
>
> But I still think we should not work with the interrupt directly, without any
> negotiation. I would prefer to introduce a new feature bit to enable this.

I can hardly believe we need to work around the issue of specific
hypervisors like this...

Thanks

>
> Thanks.
>
>
Xuan Zhuo Jan. 22, 2024, 9:11 a.m. UTC | #27
On Mon, 22 Jan 2024 16:32:46 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Mon, Jan 22, 2024 at 4:04 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 22 Jan 2024 15:57:08 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > On Mon, Jan 22, 2024 at 3:36 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Mon, 22 Jan 2024 15:19:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > On Mon, Jan 22, 2024 at 3:07 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > >
> > > > > > On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > On Mon, Jan 22, 2024 at 2:55 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > > > > > On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2024/1/20 1:29, Andrew Lunn wrote:
> > > > > > > > > > > > > >>>>>        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > > > > > >>>>> -           !virtqueue_is_broken(vi->cvq))
> > > > > > > > > > > > > >>>>> +           !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > > > > > >>>>> +        if (timeout)
> > > > > > > > > > > > > >>>>> +            timeout--;
> > > > > > > > > > > > > >>>> This is not really a timeout, just a loop counter. 200 iterations could
> > > > > > > > > > > > > >>>> be a very short time on reasonable H/W. I guess this avoids the soft
> > > > > > > > > > > > > >>>> lockup, but possibly (likely?) breaks the functionality when we need to
> > > > > > > > > > > > > >>>> loop for some non-negligible time.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> I fear we need a more complex solution, as mentioned by Michael in the
> > > > > > > > > > > > > >>>> thread you quoted.
> > > > > > > > > > > > > >>> Got it. I also look forward to the more complex solution to this problem.
> > > > > > > > > > > > > >> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
> > > > > > > > > > > > > >> to get a reasonable timeout?
> > > > > > > > > > > > > > The usual solution to this is include/linux/iopoll.h. If you can sleep,
> > > > > > > > > > > > > > use read_poll_timeout(); otherwise read_poll_timeout_atomic().
> > > > > > > > > > > > >
> > > > > > > > > > > > > I carefully read the functions read_poll_timeout() and
> > > > > > > > > > > > > read_poll_timeout_atomic(). The timeout is set by the caller of the two
> > > > > > > > > > > > > functions.
> > > > > > > > > > > >
> > > > > > > > > > > > FYI, in order to avoid a switch between atomic and non-atomic contexts, we
> > > > > > > > > > > > need to convert the rx mode setting to a workqueue first:
> > > > > > > > > > > >
> > > > > > > > > > > > https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > As such, can we add a module parameter so that the user can
> > > > > > > > > > > > > customize this timeout value?
> > > > > > > > > > > >
> > > > > > > > > > > > Who is the "user" here, or how can the "user" know the value?
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Or is this timeout value stored in a device register, which the
> > > > > > > > > > > > > virtio_net driver will read at initialization?
> > > > > > > > > > > >
> > > > > > > > > > > > See another thread. The design needs to be general, or you can post an RFC.
> > > > > > > > > > > >
> > > > > > > > > > > > On another thought, we already have a tx watchdog; maybe we can have
> > > > > > > > > > > > something similar for the cvq and use timeout + reset in that case.
> > > > > > > > > > >
> > > > > > > > > > > But we may be blocked by the reset ^_^ if the device is broken?
> > > > > > > > > >
> > > > > > > > > > I mean vq reset here.
> > > > > > > > >
> > > > > > > > > I see.
> > > > > > > > >
> > > > > > > > > I mean when the device is broken, the vq reset may also be blocked.
> > > > > > > > >
> > > > > > > > >         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
> > > > > > > > >         {
> > > > > > > > >                 struct virtio_pci_modern_common_cfg __iomem *cfg;
> > > > > > > > >
> > > > > > > > >                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
> > > > > > > > >
> > > > > > > > >                 vp_iowrite16(index, &cfg->cfg.queue_select);
> > > > > > > > >                 vp_iowrite16(1, &cfg->queue_reset);
> > > > > > > > >
> > > > > > > > >                 while (vp_ioread16(&cfg->queue_reset))
> > > > > > > > >                         msleep(1);
> > > > > > > > >
> > > > > > > > >                 while (vp_ioread16(&cfg->cfg.queue_enable))
> > > > > > > > >                         msleep(1);
> > > > > > > > >         }
> > > > > > > > >         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
> > > > > > > > >
> > > > > > > > > In this function, for the broken device, we can not expect something.
> > > > > > > >
> > > > > > > > Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It looks like we have multiple goals here
> > > > > > > > > >
> > > > > > > > > > 1) avoid lockups, using workqueue + cond_resched() seems to be
> > > > > > > > > > sufficient, it has issues but nothing new
> > > > > > > > > > 2) recover from the unresponsive device, the issue for timeout is that
> > > > > > > > > > it needs to deal with false positives
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I agree.
> > > > > > > > >
> > > > > > > > > But I want to add a new goal, cvq async. In the netdim, we will
> > > > > > > > > send many requests via the cvq, so the cvq async will be nice.
> > > > > > >
> > > > > > > Then you need an interrupt for cvq.
> > > > > > >
> > > > > > > FYI, I've posted a series that uses an interrupt for the cvq in the past:
> > > > > > >
> > > > > > > https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
> > > > > >
> > > > > > I know this. But the interrupt may not be a good solution without new space.
> > > > >
> > > > > What do you mean by "new space"?
> > > >
> > > > Yes, I know the cvq can work with an interrupt per the virtio spec.
> > > > But as far as I know, many hypervisors implement the cvq without supporting interrupts.
> > >
> > > It's a bug of the hypervisor that needs to be fixed. The interrupt is
> > > provided by the transport, not by virtio itself.
> >
> > YES. I agree.
> >
> > But I still think we should not work with the interrupt directly, without any
> > negotiation. I would prefer to introduce a new feature bit to enable this.
>
> I can hardly believe we need to work around the issue of specific
> hypervisors like this...

Maybe we should hear from others.

We are OK with this; I am just thinking of other hypervisors.

Thanks.



>
> Thanks
>
> >
> > Thanks.
> >
> >
>
Zhu Yanjun Jan. 26, 2024, 3:13 a.m. UTC | #28
On 2024/1/26 11:11, Zhu Yanjun wrote:
>
>
> On 2024/1/22 15:02, Xuan Zhuo wrote:
>> On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang<jasowang@redhat.com>  wrote:
>>> On Mon, Jan 22, 2024 at 2:55 PM Jason Wang<jasowang@redhat.com>  wrote:
>>>> On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo<xuanzhuo@linux.alibaba.com>  wrote:
>>>>> On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang<jasowang@redhat.com>  wrote:
>>>>>> On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo<xuanzhuo@linux.alibaba.com>  wrote:
>>>>>>> On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang<jasowang@redhat.com>  wrote:
>>>>>>>> On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun<yanjun.zhu@linux.dev>  wrote:
>>>>>>>>> On 2024/1/20 1:29, Andrew Lunn wrote:
>>>>>>>>>>>>>>         while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>>>>>>>>>>>>>> -           !virtqueue_is_broken(vi->cvq))
>>>>>>>>>>>>>> +           !virtqueue_is_broken(vi->cvq)) {
>>>>>>>>>>>>>> +        if (timeout)
>>>>>>>>>>>>>> +            timeout--;
>>>>>>>>>>>>> This is not really a timeout, just a loop counter. 200 iterations could
>>>>>>>>>>>>> be a very short time on reasonable H/W. I guess this avoids the soft
>>>>>>>>>>>>> lockup, but possibly (likely?) breaks the functionality when we need to
>>>>>>>>>>>>> loop for some non-negligible time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I fear we need a more complex solution, as mentioned by Michael in the
>>>>>>>>>>>>> thread you quoted.
>>>>>>>>>>>> Got it. I also look forward to a more complex solution to this problem.
>>>>>>>>>>> Can we add a device capability (new feature bit) such as ctrq_wait_timeout
>>>>>>>>>>> to get a reasonable timeout?
>>>>>>>>>> The usual solution to this is include/linux/iopoll.h. If you can sleep,
>>>>>>>>>> use read_poll_timeout(); otherwise, read_poll_timeout_atomic().
>>>>>>>>> I read the functions read_poll_timeout() and
>>>>>>>>> read_poll_timeout_atomic() carefully. The timeout is set by the
>>>>>>>>> caller of the two functions.
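Applied to the cvq wait in virtnet_send_command(), that would look roughly
like the sketch below. virtnet_cvq_ready() is a made-up helper, and the
10 us poll interval / 200 ms budget are illustrative values, not taken from
any spec or device:

	#include <linux/iopoll.h>

	/* hypothetical helper wrapping the existing completion test */
	static bool virtnet_cvq_ready(struct virtnet_info *vi)
	{
		unsigned int tmp;

		return virtqueue_get_buf(vi->cvq, &tmp) ||
		       virtqueue_is_broken(vi->cvq);
	}

	/* in virtnet_send_command(), replacing the cpu_relax() loop:
	 * poll every 10 us, give up after 200 ms
	 */
	bool ready;

	if (read_poll_timeout_atomic(virtnet_cvq_ready, ready, ready,
				     10, 200 * USEC_PER_MSEC, false, vi))
		return false;	/* device did not respond in time */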
>>>>>>>> FYI, in order to avoid switching between atomic and sleeping contexts,
>>>>>>>> we need to convert the rx mode setting to a workqueue first:
>>>>>>>>
>>>>>>>> https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html
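The core of that conversion is deferring ndo_set_rx_mode, which runs under
netif_addr_lock_bh() and therefore cannot sleep, to process context. A rough
sketch, assuming an rx_mode_work item added to struct virtnet_info and
initialized with INIT_WORK() at probe time (the field name is invented here,
not copied from the linked series):

	/* runs in process context, so the cvq wait may sleep */
	static void virtnet_rx_mode_work(struct work_struct *work)
	{
		struct virtnet_info *vi =
			container_of(work, struct virtnet_info, rx_mode_work);

		rtnl_lock();
		/* issue the VIRTIO_NET_CTRL_RX / _MAC commands here */
		rtnl_unlock();
	}

	static void virtnet_set_rx_mode(struct net_device *dev)
	{
		struct virtnet_info *vi = netdev_priv(dev);

		schedule_work(&vi->rx_mode_work);	/* defer, don't sleep here */
	}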
>>>>>>>>
>>>>>>>>> As such, can we add a module parameter so that the user can customize
>>>>>>>>> this timeout value?
>>>>>>>> Who is the "user" here, or how can the "user" know the value?
>>>>>>>>
>>>>>>>>> Or could this timeout value be stored in a device register, which the
>>>>>>>>> virtio_net driver reads at initialization?
>>>>>>>> See another thread. The design needs to be general, or you can post an RFC.
>>>>>>>>
>>>>>>>> On another thought, we already have a tx watchdog; maybe we can have
>>>>>>>> something similar for the cvq and use timeout + reset in that case.
>>>>>>> But we may be blocked by the reset ^_^ if the device is broken?
>>>>>> I mean vq reset here.
>>>>> I see.
>>>>>
>>>>> I mean when the device is broken, the vq reset may also be blocked.
>>>>>
>>>>>          void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
>>>>>          {
>>>>>                  struct virtio_pci_modern_common_cfg __iomem *cfg;
>>>>>
>>>>>                  cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;
>>>>>
>>>>>                  vp_iowrite16(index, &cfg->cfg.queue_select);
>>>>>                  vp_iowrite16(1, &cfg->queue_reset);
>>>>>
>>>>>                  while (vp_ioread16(&cfg->queue_reset))
>>>>>                          msleep(1);
>>>>>
>>>>>                  while (vp_ioread16(&cfg->cfg.queue_enable))
>>>>>                          msleep(1);
>>>>>          }
>>>>>          EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);
>>>>>
>>>>> In this function, we cannot expect anything from a broken device.
>>>> Yes, it's best effort; there's no guarantee then. But it doesn't hurt to try.
>>>>
>>>> Thanks
>>>>
>>>>>> It looks like we have multiple goals here
>>>>>>
>>>>>> 1) avoid lockups: using a workqueue + cond_resched() seems to be
>>>>>> sufficient; it has issues, but nothing new
>>>>>> 2) recover from an unresponsive device: the issue with a timeout is that
>>>>>> it needs to deal with false positives
>>>>> I agree.
>>>>>
>>>>> But I want to add a new goal: an async cvq. With netdim, we will
>>>>> send many requests via the cvq, so an async cvq would be nice.
>>> Then you need an interrupt for cvq.
>>>
>>> FYI, I've posted a series that uses an interrupt for the cvq in the past:
>>>
>>> https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@redhat.com/t/
>> I know this. But the interrupt may not be a good solution without new space.
>>
>>> I haven't found time to work on this anymore; maybe we can start from
>>> this or not.
>> I said async, but my aim is to put many requests on the cvq before
>> getting the responses.
>>
>> Heng Qi posted this: https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/

Sorry, this mail was rejected by the netdev mailing list, so I have to resend it.


Thanks a lot. I read Heng Qi's commits carefully. This patch series is
similar to the NIC xmit_more feature.

But if a cvq command is urgent, can we let this urgent command reach the
device ASAP?

I mean, can we set a flag similar to xmit_more? If a cvq command is not
urgent, it can be queued; if it is urgent, it is passed to the device
immediately. A rough sketch of the idea is below.
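A minimal sketch of that idea, modeled on the skb xmit_more semantics. The
function name, the "more" parameter, and the virtnet_queue_command() helper
are all invented for illustration and are not part of Heng Qi's series:

	/* hypothetical batched variant of virtnet_send_command() */
	static bool virtnet_send_command_more(struct virtnet_info *vi,
					      u8 class, u8 cmd,
					      struct scatterlist *out,
					      bool more)
	{
		/* made-up helper: add the command descriptors to the cvq */
		if (!virtnet_queue_command(vi, class, cmd, out))
			return false;

		/* Mirror xmit_more: only notify the device for the last
		 * command of a batch, or right away for an urgent command
		 * (more == false).
		 */
		if (!more)
			return virtqueue_kick(vi->cvq);

		return true;
	}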

Zhu Yanjun

> Zhu Yanjun
>
>> Thanks.
>>
>>
>>> Thanks
>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>> Zhu Yanjun
>>>>>>>>>
>>>>>>>>>>        Andrew
diff mbox series

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 51b1868d2f22..28b7dd917a43 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2468,7 +2468,7 @@  static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 {
 	struct scatterlist *sgs[4], hdr, stat;
 	unsigned out_num = 0, tmp;
-	int ret;
+	int ret, timeout = 200;
 
 	/* Caller should know better */
 	BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
@@ -2502,8 +2502,14 @@  static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	 * into the hypervisor, so the request should be handled immediately.
 	 */
 	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
-	       !virtqueue_is_broken(vi->cvq))
+	       !virtqueue_is_broken(vi->cvq)) {
+		if (timeout)
+			timeout--;
+		else
+			return false; /* long time no response */
+
 		cpu_relax();
+	}
 
 	return vi->ctrl->status == VIRTIO_NET_OK;
 }