
[v3,0/6] virtio,vhost: Add VIRTIO_F_IN_ORDER support

Message ID 20240620175612.2381019-1-jonah.palmer@oracle.com

Message

Jonah Palmer June 20, 2024, 5:56 p.m. UTC
The goal of these patches is to add support to a variety of virtio and
vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
indicates that all buffers are used by the device in the same order in
which they were made available by the driver.

These patches attempt to implement a generalized, non-device-specific
solution to support this feature.

The core of this solution is a buffering mechanism in the form of a
VirtQueue's used_elems VirtQueueElement array. Devices that always use
buffers in order incur only minimal overhead from it. Devices that may
complete buffers out of order will likely see a performance hit, and the
size of that hit depends on how frequently elements are completed out of
order.

A VirtQueue whose device uses this feature holds used VirtQueueElements
in its used_elems array. Each used element is stored in used_elems at
the same index it would occupy on the used/descriptor ring to satisfy
the in-order requirement. In other words, used elements are placed in
their in-order locations in used_elems and are only written to the
used/descriptor ring once every element ahead of them in the expected
order has also been used.

To differentiate between a "used" and an "unused" element in the
used_elems array (a "used" element has returned from processing, an
"unused" element has not yet been processed), we added a boolean
'in_order_filled' member to the VirtQueueElement struct. The flag is set
to true when the element comes back from processing
(virtqueue_ordered_fill) and cleared once the element has been written
to the used/descriptor ring (virtqueue_ordered_flush).
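
Below is a minimal, self-contained sketch of that bookkeeping, assuming
simplified stand-in types (VirtQueueSketch, VirtQueueElementSketch) and
helpers (ordered_fill, ordered_flush) rather than the actual QEMU
structures and virtqueue_ordered_fill/virtqueue_ordered_flush code:

#include <stdbool.h>
#include <stdio.h>

#define QUEUE_SIZE 8

typedef struct {
    unsigned int index;      /* head descriptor index of the element */
    unsigned int len;        /* bytes written by the device */
    bool in_order_filled;    /* completed but not yet flushed */
} VirtQueueElementSketch;

typedef struct {
    VirtQueueElementSketch used_elems[QUEUE_SIZE];
    unsigned int used_idx;   /* next in-order slot to flush */
} VirtQueueSketch;

/* Device finished with an element: record it at its in-order slot. */
static void ordered_fill(VirtQueueSketch *vq, unsigned int slot,
                         unsigned int head, unsigned int len)
{
    vq->used_elems[slot].index = head;
    vq->used_elems[slot].len = len;
    vq->used_elems[slot].in_order_filled = true;
}

/* Write out every element that can now be exposed in order. */
static void ordered_flush(VirtQueueSketch *vq)
{
    while (vq->used_elems[vq->used_idx % QUEUE_SIZE].in_order_filled) {
        VirtQueueElementSketch *e =
            &vq->used_elems[vq->used_idx % QUEUE_SIZE];
        printf("flush: head %u, len %u\n", e->index, e->len);
        e->in_order_filled = false;
        vq->used_idx++;
    }
}

int main(void)
{
    VirtQueueSketch vq = {0};

    /* Buffers 0..2 were made available in order but complete out of
     * order: 1 first, then 0, then 2. */
    ordered_fill(&vq, 1, 101, 64);
    ordered_flush(&vq);    /* nothing flushed: slot 0 still pending */
    ordered_fill(&vq, 0, 100, 64);
    ordered_flush(&vq);    /* flushes slots 0 and 1, in order */
    ordered_fill(&vq, 2, 102, 64);
    ordered_flush(&vq);    /* flushes slot 2 */
    return 0;
}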

Testing:
========
Testing was done with the dpdk-testpmd application on both the host and
the guest using the configurations below. Traffic was generated between
the host and guest after running 'start tx_first' in both the host and
guest dpdk-testpmd applications. The results below were collected after
several seconds of traffic.

Relevant QEMU args:
-------------------
-chardev socket,id=char1,path=/tmp/vhost-user1,server=off
-chardev socket,id=char2,path=/tmp/vhost-user2,server=off
-netdev type=vhost-user,id=net1,chardev=char1,vhostforce=on,queues=1
-netdev type=vhost-user,id=net2,chardev=char2,vhostforce=on,queues=1
-device virtio-net-pci,in_order=true,packed=true,netdev=net1,
        mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256
-device virtio-net-pci,in_order=true,packed=true,netdev=net2,
        mac=56:48:4f:53:54:01,mq=on,vectors=4,rx_queue_size=256

Host dpdk-testpmd command:
--------------------------
dpdk-testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4
    --vdev 'net_vhost0,iface=/tmp/vhost-user1'
    --vdev 'net_vhost1,iface=/tmp/vhost-user2' --
    --portmask=f -i --rxq=1 --txq=1 --nb-cores=4 --forward-mode=io

Guest dpdk-testpmd command:
---------------------------
dpdk-testpmd -l 0,1 -a 0000:00:02.0 -a 0000:00:03.0 -- --portmask=3
    --rxq=1 --txq=1 --nb-cores=1 --forward-mode=io -i

Results:
--------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 79067488       RX-dropped: 0             RX-total: 79067488
TX-packets: 79067552       TX-dropped: 0             TX-total: 79067552
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

---
v3: Drop Tested-by tags until patches are re-tested.
    Replace 'prev_avail_idx' with 'vq->last_avail_idx - 1' in
    virtqueue_split_pop.
    Remove redundant '+vq->vring.num' in 'max_steps' calculation in
    virtqueue_ordered_fill.
    Add test results to cover letter.

v2: Make 'in_order_filled' more descriptive.
    Change 'j' to more descriptive var name in virtqueue_split_pop.
    Use more definitive search conditional in virtqueue_ordered_fill.
    Avoid code duplication in virtqueue_ordered_flush.

v1: Move series from RFC to PATCH for submission.

Jonah Palmer (6):
  virtio: Add bool to VirtQueueElement
  virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support
  virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support
  virtio: virtqueue_ordered_flush - VIRTIO_F_IN_ORDER support
  vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
  virtio: Add VIRTIO_F_IN_ORDER property definition

 hw/block/vhost-user-blk.c    |   1 +
 hw/net/vhost_net.c           |   2 +
 hw/scsi/vhost-scsi.c         |   1 +
 hw/scsi/vhost-user-scsi.c    |   1 +
 hw/virtio/vhost-user-fs.c    |   1 +
 hw/virtio/vhost-user-vsock.c |   1 +
 hw/virtio/virtio.c           | 123 ++++++++++++++++++++++++++++++++++-
 include/hw/virtio/virtio.h   |   6 +-
 net/vhost-vdpa.c             |   1 +
 9 files changed, 134 insertions(+), 3 deletions(-)

Comments

Eugenio Perez Martin June 20, 2024, 6:40 p.m. UTC | #1
On Thu, Jun 20, 2024 at 7:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
>
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
>
> The core feature behind this solution is a buffer mechanism in the form
> of a VirtQueue's used_elems VirtQueueElement array. This allows devices
> who always use buffers in-order by default to have a minimal overhead
> impact. Devices that may not always use buffers in-order likely will
> experience a performance hit. How large that performance hit is will
> depend on how frequently elements are completed out-of-order.
>
> A VirtQueue whose device uses this feature will use its used_elems
> VirtQueueElement array to hold used VirtQueueElements. The index that
> used elements are placed in used_elems is the same index on the
> used/descriptor ring that would satisfy the in-order requirement. In
> other words, used elements are placed in their in-order locations on
> used_elems and are only written to the used/descriptor ring once the
> elements on used_elems are able to continue their expected order.
>
> To differentiate between a "used" and "unused" element on the used_elems
> array (a "used" element being an element that has returned from
> processing and an "unused" element being an element that has not yet
> been processed), we added a boolean 'in_order_filled' member to the
> VirtQueueElement struct. This flag is set to true when the element comes
> back from processing (virtqueue_ordered_fill) and then set back to false
> once it's been written to the used/descriptor ring
> (virtqueue_ordered_flush).
>
> Testing:
> ========
> Testing was done using the dpdk-testpmd application on both the host and
> guest using the following configurations. Traffic was generated between
> the host and guest after running 'start tx_first' on both the host and
> guest dpdk-testpmd applications. Results are below after traffic was
> generated for several seconds.
>
> Relevant Qemu args:
> -------------------
> -chardev socket,id=char1,path=/tmp/vhost-user1,server=off
> -chardev socket,id=char2,path=/tmp/vhost-user2,server=off
> -netdev type=vhost-user,id=net1,chardev=char1,vhostforce=on,queues=1
> -netdev type=vhost-user,id=net2,chardev=char2,vhostforce=on,queues=1
> -device virtio-net-pci,in_order=true,packed=true,netdev=net1,
>         mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256
> -device virtio-net-pci,in_order=true,packed=true,netdev=net2,
>         mac=56:48:4f:53:54:01,mq=on,vectors=4,rx_queue_size=256
>

Hi Jonah,

These tests are great, but others should also be performed. In
particular, QEMU should run ok with "tap" netdev with vhost=off
instead of vhost-user:

-netdev type=tap,id=net1,vhost=off
-netdev type=tap,id=net2,vhost=off

This way, packets go through the modified code. With this
configuration, QEMU is the one forwarding the packets, so testpmd is
not needed on the host. It is still needed in the guest, as the Linux
guest driver does not support in_order. The guest kernel cmdline and
testpmd cmdline should require no changes from the configuration you
describe here.

And then try with in_order=true,packed=false and
in_order=true,packed=off on the corresponding virtio-net-pci devices.
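
(For illustration only, a split-ring variant of one of the -device
lines above might look like the following, reusing the same args but
toggling packed:)

-device virtio-net-pci,in_order=true,packed=false,netdev=net1,
        mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256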

A performance comparison between in_order=true and in_order=false is
also interesting, but we're not batching, so I don't think we will see
an extreme improvement.

Does the plan work for you?

Thanks!

> Host dpdk-testpmd command:
> --------------------------
> dpdk-testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4
>     --vdev 'net_vhost0,iface=/tmp/vhost-user1'
>     --vdev 'net_vhost1,iface=/tmp/vhost-user2' --
>     --portmask=f -i --rxq=1 --txq=1 --nb-cores=4 --forward-mode=io
>
> Guest dpdk-testpmd command:
> ---------------------------
> dpdk-testpmd -l 0,1 -a 0000:00:02.0 -a 0000:00:03.0 -- --portmask=3
>     --rxq=1 --txq=1 --nb-cores=1 --forward-mode=io -i
>
> Results:
> --------
> +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
> RX-packets: 79067488       RX-dropped: 0             RX-total: 79067488
> TX-packets: 79067552       TX-dropped: 0             TX-total: 79067552
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ---
> v3: Drop Tested-by tags until patches are re-tested.
>     Replace 'prev_avail_idx' with 'vq->last_avail_idx - 1' in
>     virtqueue_split_pop.
>     Remove redundant '+vq->vring.num' in 'max_steps' calculation in
>     virtqueue_ordered_fill.
>     Add test results to CV.
>
> v2: Make 'in_order_filled' more descriptive.
>     Change 'j' to more descriptive var name in virtqueue_split_pop.
>     Use more definitive search conditional in virtqueue_ordered_fill.
>     Avoid code duplication in virtqueue_ordered_flush.
>
> v1: Move series from RFC to PATCH for submission.
>
> Jonah Palmer (6):
>   virtio: Add bool to VirtQueueElement
>   virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support
>   virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support
>   virtio: virtqueue_ordered_flush - VIRTIO_F_IN_ORDER support
>   vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
>   virtio: Add VIRTIO_F_IN_ORDER property definition
>
>  hw/block/vhost-user-blk.c    |   1 +
>  hw/net/vhost_net.c           |   2 +
>  hw/scsi/vhost-scsi.c         |   1 +
>  hw/scsi/vhost-user-scsi.c    |   1 +
>  hw/virtio/vhost-user-fs.c    |   1 +
>  hw/virtio/vhost-user-vsock.c |   1 +
>  hw/virtio/virtio.c           | 123 ++++++++++++++++++++++++++++++++++-
>  include/hw/virtio/virtio.h   |   6 +-
>  net/vhost-vdpa.c             |   1 +
>  9 files changed, 134 insertions(+), 3 deletions(-)
>
> --
> 2.43.0
>
Jonah Palmer June 24, 2024, 2:57 p.m. UTC | #2
On 6/20/24 2:40 PM, Eugenio Perez Martin wrote:
> On Thu, Jun 20, 2024 at 7:56 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>> The goal of these patches is to add support to a variety of virtio and
>> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
>> indicates that all buffers are used by the device in the same order in
>> which they were made available by the driver.
>>
>> These patches attempt to implement a generalized, non-device-specific
>> solution to support this feature.
>>
>> The core feature behind this solution is a buffer mechanism in the form
>> of a VirtQueue's used_elems VirtQueueElement array. This allows devices
>> who always use buffers in-order by default to have a minimal overhead
>> impact. Devices that may not always use buffers in-order likely will
>> experience a performance hit. How large that performance hit is will
>> depend on how frequently elements are completed out-of-order.
>>
>> A VirtQueue whose device uses this feature will use its used_elems
>> VirtQueueElement array to hold used VirtQueueElements. The index that
>> used elements are placed in used_elems is the same index on the
>> used/descriptor ring that would satisfy the in-order requirement. In
>> other words, used elements are placed in their in-order locations on
>> used_elems and are only written to the used/descriptor ring once the
>> elements on used_elems are able to continue their expected order.
>>
>> To differentiate between a "used" and "unused" element on the used_elems
>> array (a "used" element being an element that has returned from
>> processing and an "unused" element being an element that has not yet
>> been processed), we added a boolean 'in_order_filled' member to the
>> VirtQueueElement struct. This flag is set to true when the element comes
>> back from processing (virtqueue_ordered_fill) and then set back to false
>> once it's been written to the used/descriptor ring
>> (virtqueue_ordered_flush).
>>
>> Testing:
>> ========
>> Testing was done using the dpdk-testpmd application on both the host and
>> guest using the following configurations. Traffic was generated between
>> the host and guest after running 'start tx_first' on both the host and
>> guest dpdk-testpmd applications. Results are below after traffic was
>> generated for several seconds.
>>
>> Relevant Qemu args:
>> -------------------
>> -chardev socket,id=char1,path=/tmp/vhost-user1,server=off
>> -chardev socket,id=char2,path=/tmp/vhost-user2,server=off
>> -netdev type=vhost-user,id=net1,chardev=char1,vhostforce=on,queues=1
>> -netdev type=vhost-user,id=net2,chardev=char2,vhostforce=on,queues=1
>> -device virtio-net-pci,in_order=true,packed=true,netdev=net1,
>>          mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256
>> -device virtio-net-pci,in_order=true,packed=true,netdev=net2,
>>          mac=56:48:4f:53:54:01,mq=on,vectors=4,rx_queue_size=256
>>
> 
> Hi Jonah,
> 
> These tests are great, but others should also be performed. In
> particular, QEMU should run ok with "tap" netdev with vhost=off
> instead of vhost-user:
> 
> -netdev type=tap,id=net1,vhost=off
> -netdev type=tap,id=net2,vhost=off
> 
> This way, packets are going through the modified code. With this
> configuration, QEMU is the one forwarding the packets so testpmd is
> not needed in the host. It's still needed in the guest as linux guest
> driver does not support in_order. The guest kernel cmdline and testpmd
> cmdline should require no changes from the configuration you describe
> here.
> 

Oof... I didn't even realize that my tests weren't actually testing all 
of the modified code. I was so focused on getting DPDK to work that I 
didn't think to check that. My apologies.

I am running into some trouble, though, trying to get packets flowing
in the guest. My QEMU args look like this:

# Regular virtio-net device
-device virtio-net-pci,netdev=net0,disable-legacy=off,disable-modern=off
-netdev tap,id=net0,vhost=off,ifname=tap0,script=${QEMU_IFUP},
  downscript=no
# Virtio-net devices for testing
-netdev type=tap,id=net1,vhost=off,ifname=tap1,script=no,downscript=no
-netdev type=tap,id=net2,vhost=off,ifname=tap2,script=no,downscript=no
-device virtio-net-pci,in_order=true,packed=true,netdev=net1,
  mac=56:48:4f:53:54:00,mq=on,vectors=4,rx_queue_size=256
-device virtio-net-pci,in_order=true,packed=true,netdev=net2,
  mac=56:48:4f:53:54:01,mq=on,vectors=4,rx_queue_size=256

In the guest I have the virtio-net devices I'm using for testing bound 
to the uio_pci_generic driver:

dpdk-devbind.py --status net

Network devices using DPDK-compatible driver
============================================
0000:00:02.0 'Virtio network device 1000' drv=uio_pci_generic 
unused=virtio_pci,vfio-pci
0000:00:03.0 'Virtio network device 1000' drv=uio_pci_generic 
unused=virtio_pci,vfio-pci

Network devices using kernel driver
===================================
0000:00:04.0 'Virtio network device 1000' if=enp0s4
drv=virtio-pci unused=virtio_pci,vfio-pci,uio_pci_generic *Active*

But then after running the dpdk-testpmd command in the guest:

dpdk-testpmd -l 0,1 -a 0000:00:02.0 -a 0000:00:03.0 --
   --portmask=3 --rxq=1 --txq=1 --nb-cores=1 --forward-mode=io -i

EAL: Detected CPU lcores: 6
EAL: Detected NUMA nodes: 1
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:00:02.0 
(socket -1)
EAL: Probe PCI driver: net_virtio (1af4:1000) device: 0000:00:03.0 
(socket -1)
TELEMETRY: No legacy callbacks, legacy socket not created
Set io packet forwarding mode
Interactive-mode selected
Warning: NUMA should be configured manually by using --port-numa-config 
and --ring-numa-config parameters along with --numa.
testpmd: create a new mbuf pool <mb_pool_0>: n=155456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Configuring Port 0 (socket 0)
Port 0: 56:48:4F:53:54:00
Configuring Port 1 (socket 0)
Port 1: 56:48:4F:53:54:01
Checking link statuses...
Done

I'm not able to see any packets flowing after starting packet forwarding 
and running 'show port stats all':

testpmd> start
io packet forwarding - ports=2 - cores=1 - streams=2 - NUMA support 
enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 2 streams:
   RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
   RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00

   io packet forwarding packets/burst=32
   nb forwarding cores=1 - nb forwarding ports=2
   port 0: RX queue number: 1 Tx queue number: 1
     Rx offloads=0x0 Tx offloads=0x0
     RX queue: 0
       RX desc=0 - RX free threshold=0
       RX threshold registers: pthresh=0 hthresh=0  wthresh=0
       RX Offloads=0x0
     TX queue: 0
       TX desc=0 - TX free threshold=0
       TX threshold registers: pthresh=0 hthresh=0  wthresh=0
       TX offloads=0x0 - TX RS bit threshold=0
   port 1: RX queue number: 1 Tx queue number: 1
     Rx offloads=0x0 Tx offloads=0x0
     RX queue: 0
       RX desc=0 - RX free threshold=0
       RX threshold registers: pthresh=0 hthresh=0  wthresh=0
       RX Offloads=0x0
     TX queue: 0
       TX desc=0 - TX free threshold=0
       TX threshold registers: pthresh=0 hthresh=0  wthresh=0
       TX offloads=0x0 - TX RS bit threshold=0
testpmd>
testpmd> show port stats all

   ###### NIC statistics for port 0  ######
   RX-packets: 0          RX-missed: 0          RX-bytes:  0
   RX-errors: 0
   RX-nombuf:  0
   TX-packets: 0          TX-errors: 0          TX-bytes:  0

   Throughput (since last show)
   Rx-pps:            0          Rx-bps:            0
   Tx-pps:            0          Tx-bps:            0
   #########################################################

   ###### NIC statistics for port 1  ######
   RX-packets: 0          RX-missed: 0          RX-bytes:  0
   RX-errors: 0
   RX-nombuf:  0
   TX-packets: 0          TX-errors: 0          TX-bytes:  0

   Throughput (since last show)
   Rx-pps:            0          Rx-bps:            0
   Tx-pps:            0          Tx-bps:            0
   #########################################################

I'm still working on improving my networking skills, so I'm going to
try to figure out what's going on here. I'll let you know if I figure
it out and check in with you to see whether the test results are
satisfactory before sending out a v4.

> And then try with in_order=true,packed=false and
> in_order=true,packed=off in corresponding virtio-net-pci.
> 
> Performance comparison between in_order=true and in_order=false is
> also interesting but we're not batching so I don't think we will get
> an extreme improvement.
> 
> Does the plan work for you?
> 
> Thanks!
> 
>> Host dpdk-testpmd command:
>> --------------------------
>> dpdk-testpmd -l 0,2,3,4,5 --socket-mem=1024 -n 4
>>      --vdev 'net_vhost0,iface=/tmp/vhost-user1'
>>      --vdev 'net_vhost1,iface=/tmp/vhost-user2' --
>>      --portmask=f -i --rxq=1 --txq=1 --nb-cores=4 --forward-mode=io
>>
>> Guest dpdk-testpmd command:
>> ---------------------------
>> dpdk-testpmd -l 0,1 -a 0000:00:02.0 -a 0000:00:03.0 -- --portmask=3
>>      --rxq=1 --txq=1 --nb-cores=1 --forward-mode=io -i
>>
>> Results:
>> --------
>> +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
>> RX-packets: 79067488       RX-dropped: 0             RX-total: 79067488
>> TX-packets: 79067552       TX-dropped: 0             TX-total: 79067552
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> ---
>> v3: Drop Tested-by tags until patches are re-tested.
>>      Replace 'prev_avail_idx' with 'vq->last_avail_idx - 1' in
>>      virtqueue_split_pop.
>>      Remove redundant '+vq->vring.num' in 'max_steps' calculation in
>>      virtqueue_ordered_fill.
>>      Add test results to CV.
>>
>> v2: Make 'in_order_filled' more descriptive.
>>      Change 'j' to more descriptive var name in virtqueue_split_pop.
>>      Use more definitive search conditional in virtqueue_ordered_fill.
>>      Avoid code duplication in virtqueue_ordered_flush.
>>
>> v1: Move series from RFC to PATCH for submission.
>>
>> Jonah Palmer (6):
>>    virtio: Add bool to VirtQueueElement
>>    virtio: virtqueue_pop - VIRTIO_F_IN_ORDER support
>>    virtio: virtqueue_ordered_fill - VIRTIO_F_IN_ORDER support
>>    virtio: virtqueue_ordered_flush - VIRTIO_F_IN_ORDER support
>>    vhost,vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
>>    virtio: Add VIRTIO_F_IN_ORDER property definition
>>
>>   hw/block/vhost-user-blk.c    |   1 +
>>   hw/net/vhost_net.c           |   2 +
>>   hw/scsi/vhost-scsi.c         |   1 +
>>   hw/scsi/vhost-user-scsi.c    |   1 +
>>   hw/virtio/vhost-user-fs.c    |   1 +
>>   hw/virtio/vhost-user-vsock.c |   1 +
>>   hw/virtio/virtio.c           | 123 ++++++++++++++++++++++++++++++++++-
>>   include/hw/virtio/virtio.h   |   6 +-
>>   net/vhost-vdpa.c             |   1 +
>>   9 files changed, 134 insertions(+), 3 deletions(-)
>>
>> --
>> 2.43.0
>>
>