
[0/3,RFC] virtio-blk: add io_uring passthrough support for virtio-blk

Message ID 20241203121424.19887-1-mengferry@linux.alibaba.com (mailing list archive)
Series virtio-blk: add io_uring passthrough support for virtio-blk

Message

Ferry Meng Dec. 3, 2024, 12:14 p.m. UTC
We seek to develop a more flexible way to use virtio-blk and bypass the block
layer logic in order to accomplish certain performance optimizations. As a
result, we referred to the implementation of io_uring passthrough in NVMe
and implemented it in the virtio-blk driver. This patch series adds io_uring
passthrough support for virtio-blk devices, resulting in lower submit latency
and increased flexibility when utilizing virtio-blk.

To test this patch series, I changed fio's code: 
1. Added virtio-blk support to engines/io_uring.c.
2. Added virtio-blk support to the t/io_uring.c testing tool.
Link: https://github.com/jdmfr/fio 

Using t/io_uring-vblk, virtio-blk based on uring_cmd scales better than
access through the block device (results below: virtio-blk with QEMU,
1-depth fio):
(passthru) read: IOPS=17.2k, BW=67.4MiB/s (70.6MB/s) 
slat (nsec): min=2907, max=43592, avg=3981.87, stdev=595.10 
clat (usec): min=38, max=285, avg=53.47, stdev= 8.28 
lat (usec): min=44, max=288, avg=57.45, stdev= 8.28
(block) read: IOPS=15.3k, BW=59.8MiB/s (62.7MB/s) 
slat (nsec): min=3408, max=35366, avg=5102.17, stdev=790.79 
clat (usec): min=35, max=343, avg=59.63, stdev=10.26 
lat (usec): min=43, max=349, avg=64.73, stdev=10.21

Testing the virtio-blk device with fio using 'ioengine=io_uring_cmd'
and 'ioengine=io_uring' also demonstrates improvements in submit latency.
(passthru) taskset -c 0 t/io_uring-vblk -b4096 -d8 -c4 -s4 -p0 -F1 -B0 -O0 -n1 -u1 /dev/vdcc0 
IOPS=189.80K, BW=741MiB/s, IOS/call=4/3
IOPS=187.68K, BW=733MiB/s, IOS/call=4/3 
(block) taskset -c 0 t/io_uring-vblk -b4096 -d8 -c4 -s4 -p0 -F1 -B0 -O0 -n1 -u0 /dev/vdc 
IOPS=101.51K, BW=396MiB/s, IOS/call=4/3
IOPS=100.01K, BW=390MiB/s, IOS/call=4/4

The overhead of submitting I/O can be decreased by 25% overall with this
patch series. The implementation primarily references the NVMe io_uring
passthrough, supporting io_uring_cmd through a separate character device
interface (temporarily named /dev/vdXc0). Since this is an early version,
many details still need to be considered and redesigned, for example:
● Currently it only covers READ/WRITE scenarios; more complex operations
such as discard or zone ops are not yet included. (A normal sqe64 is
sufficient, in my opinion; even with later extensions, sqe128 and cqe32
might not be needed.)
● ......

I would appreciate any useful recommendations.

Ferry Meng (3):
  virtio-blk: add virtio-blk chardev support.
  virtio-blk: add uring_cmd support for I/O passthru on chardev.
  virtio-blk: add uring_cmd iopoll support.

 drivers/block/virtio_blk.c      | 325 +++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_blk.h |  16 ++
 2 files changed, 336 insertions(+), 5 deletions(-)
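
For readers unfamiliar with the uring_cmd flow, below is a minimal userspace
sketch of how such a passthrough read might be submitted with liburing. The
cmd_op value (VBLK_URING_CMD_IO), the struct vblk_cmd layout placed in
sqe->cmd, and the /dev/vdcc0 path are assumptions for illustration only; the
actual UAPI is defined by this series, not here.

/*
 * Minimal sketch (not the series' actual UAPI): submit one 4 KiB read
 * through a virtio-blk uring_cmd chardev.
 */
#include <liburing.h>
#include <linux/virtio_blk.h>	/* VIRTIO_BLK_T_IN */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VBLK_URING_CMD_IO	0x80	/* hypothetical cmd_op value */

struct vblk_cmd {		/* hypothetical; fits the 16-byte sqe->cmd area */
	__u32 type;		/* VIRTIO_BLK_T_IN / VIRTIO_BLK_T_OUT */
	__u32 ioprio;
	__u64 sector;		/* 512-byte sector offset */
};

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct vblk_cmd *cmd;
	void *buf;
	int fd;

	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	fd = open("/dev/vdcc0", O_RDONLY);	/* chardev exposed by the series */
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;	/* passthrough opcode */
	sqe->fd = fd;
	sqe->cmd_op = VBLK_URING_CMD_IO;
	sqe->addr = (unsigned long)buf;		/* data buffer */
	sqe->len = 4096;

	cmd = (struct vblk_cmd *)sqe->cmd;	/* command payload in the sqe */
	cmd->type = VIRTIO_BLK_T_IN;		/* read */
	cmd->sector = 0;

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("cqe res %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}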

Comments

Stefan Hajnoczi Dec. 4, 2024, 9:47 p.m. UTC | #1
On Tue, 3 Dec 2024 at 07:17, Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
> We seek to develop a more flexible way to use virtio-blk and bypass the block
> layer logic in order to accomplish certain performance optimizations. As a
> result, we referred to the implementation of io_uring passthrough in NVMe
> and implemented it in the virtio-blk driver. This patch series adds io_uring
> passthrough support for virtio-blk devices, resulting in lower submit latency
> and increased flexibility when utilizing virtio-blk.

First I thought this was similar to Stefano Garzarella's previous
virtio-blk io_uring passthrough work where a host io_uring was passed
through into the guest:
https://static.sched.com/hosted_files/kvmforum2020/9c/KVMForum_2020_io_uring_passthrough_Stefano_Garzarella.pdf

But now I see this is a uring_cmd interface for sending virtio_blk
commands from userspace like the one offered by the NVMe driver.

Unlike NVMe, the virtio-blk command set is minimal and does not offer
a rich set of features. Is the motivation really virtio-blk command
passthrough or is the goal just to create a fast path for I/O?

If the goal is just a fast path for I/O, then maybe Jens would
consider a generic command set that is not device-specific? That way
any driver (NVMe, virtio-blk, etc) can implement this uring_cmd
interface and any application can use it without worrying about the
underlying command set. I think a generic fast path would be much more
useful to applications than driver-specific interfaces.
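
To make that idea concrete, a device-agnostic fast path could look roughly
like the sketch below: one command structure and one cmd_op that any block
driver's uring_cmd handler could accept and translate into its native command
set. This is purely illustrative of the suggestion above; no such generic
interface exists today, and all names are hypothetical.

/* Hypothetical, device-agnostic uring_cmd I/O command -- illustration only. */
#include <linux/types.h>

enum blk_generic_cmd_op {
	BLK_URING_CMD_READ	= 1,
	BLK_URING_CMD_WRITE	= 2,
};

struct blk_uring_cmd {
	__u64 offset;		/* byte offset on the device */
	__u64 addr;		/* user buffer address */
	__u32 len;		/* buffer length in bytes */
	__u32 flags;		/* e.g. FUA or hint bits */
};
/*
 * Any driver (NVMe, virtio-blk, ...) could translate this into its native
 * commands, so applications would not depend on device-specific UAPI.
 */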

> [...]
Ferry Meng Dec. 5, 2024, 9:51 a.m. UTC | #2
Resending after changing this to plain text.

On 12/5/24 5:47 AM, Stefan Hajnoczi wrote:
> On Tue, 3 Dec 2024 at 07:17, Ferry Meng <mengferry@linux.alibaba.com> wrote:
>> We seek to develop a more flexible way to use virtio-blk and bypass the block
>> layer logic in order to accomplish certain performance optimizations. As a
>> result, we referred to the implementation of io_uring passthrough in NVMe
>> and implemented it in the virtio-blk driver. This patch series adds io_uring
>> passthrough support for virtio-blk devices, resulting in lower submit latency
>> and increased flexibility when utilizing virtio-blk.
> First I thought this was similar to Stefano Garzarella's previous
> virtio-blk io_uring passthrough work where a host io_uring was passed
> through into the guest:
> https://static.sched.com/hosted_files/kvmforum2020/9c/KVMForum_2020_io_uring_passthrough_Stefano_Garzarella.pdf
>
> But now I see this is a uring_cmd interface for sending virtio_blk
> commands from userspace like the one offered by the NVMe driver.
>
> Unlike NVMe, the virtio-blk command set is minimal and does not offer
> a rich set of features. Is the motivation really virtio-blk command
> passthrough or is the goal just to create a fast path for I/O?
Right, this series only works within the guest OS, not between host and
guest. io_uring passthrough gives a 'fast path for I/O', which is ideal
for our use case: issuing virtio-blk commands directly from userspace
scales better.
> If the goal is just a fast path for I/O, then maybe Jens would
> consider a generic command set that is not device-specific? That way
> any driver (NVMe, virtio-blk, etc) can implement this uring_cmd
> interface and any application can use it without worrying about the
> underlying command set. I think a generic fast path would be much more
> useful to applications than driver-specific interfaces.
If I understand correctly, io_uring passthrough is already a complete,
abstract framework for I/O request dispatch. Its aim is to let each driver
handle commands on its own while bypassing unused logic, which is why I
chose this approach. I agree that the virtio-blk command set is minimal
and sufficient, but I also believe it makes adapting all command types
easier, with fewer modifications, so that virtio-blk can use io_uring
passthrough.

Of course, we can wait for Jens' view on the preceding discussion.
>> [...]
Ferry Meng Dec. 16, 2024, 2:01 a.m. UTC | #3
On 12/3/24 8:14 PM, Ferry Meng wrote:
> [...]

Hi Michael & Jason,

What is your opinion, as the virtio-blk maintainers? Looking forward to
your reply.

Thanks
Jason Wang Dec. 16, 2024, 7:38 a.m. UTC | #4
On Mon, Dec 16, 2024 at 10:01 AM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
>
> On 12/3/24 8:14 PM, Ferry Meng wrote:
> > [...]
>
> Hi Michael & Jason,
>
> What is your opinion, as the virtio-blk maintainers? Looking forward to
> your reply.
>
> Thanks

If I understand this correctly, this proposal wants to make io_uring a
transport for virtio-blk commands, so the application doesn't need
to worry about compatibility etc. This seems to be fine.

But I wonder what the security considerations are, for example, do we
allow all virtio-blk commands to be passed through, and why?

Thanks

>
Ferry Meng Dec. 16, 2024, 12:07 p.m. UTC | #5
On 12/16/24 3:38 PM, Jason Wang wrote:
> On Mon, Dec 16, 2024 at 10:01 AM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>> [...]
> If I understand this correctly, this proposal wants to make io_uring a
> transport for virtio-blk commands, so the application doesn't need
> to worry about compatibility etc. This seems to be fine.
>
> But I wonder what the security considerations are, for example, do we
> allow all virtio-blk commands to be passed through, and why?

Regarding the 'security consideration': the generic char-dev belongs to root, so
only root can use this passthrough path.

On the other hand, as far as I know, virtio-blk commands are all related
to 'I/O operations', so we could support all of those opcodes while bypassing
the VFS and block layers (if we want). I have only implemented the most basic
read/write in this RFC patch series; the others will be considered later.

> Thanks
>
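
For context, the existing virtio-blk command set really is small and entirely
I/O-oriented; the excerpt below is lightly abridged from the mainline
include/uapi/linux/virtio_blk.h and shows the request header plus the main
request types a full passthrough interface would eventually have to cover.

/* Abridged from include/uapi/linux/virtio_blk.h (mainline UAPI). */

/* Request type codes carried in virtio_blk_outhdr.type */
#define VIRTIO_BLK_T_IN			0	/* read */
#define VIRTIO_BLK_T_OUT		1	/* write */
#define VIRTIO_BLK_T_FLUSH		4	/* cache flush */
#define VIRTIO_BLK_T_GET_ID		8	/* get device ID string */
#define VIRTIO_BLK_T_DISCARD		11	/* discard */
#define VIRTIO_BLK_T_WRITE_ZEROES	13	/* write zeroes */
#define VIRTIO_BLK_T_SECURE_ERASE	14	/* secure erase */
/* ... plus the VIRTIO_BLK_T_ZONE_* opcodes for zoned devices ... */

/* Header that starts every virtio-blk request. */
struct virtio_blk_outhdr {
	__virtio32 type;	/* VIRTIO_BLK_T_* */
	__virtio32 ioprio;	/* I/O priority */
	__virtio64 sector;	/* 512-byte sector offset */
};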
Christoph Hellwig Dec. 16, 2024, 3:54 p.m. UTC | #6
Hacking passthrough into virtio_blk seems like not very good layering.
If you have a use case where you want to use the core kernel virtio code
but not the protocol drivers we'll probably need a virtqueue passthrough
option of some kind.
Stefan Hajnoczi Dec. 16, 2024, 4:13 p.m. UTC | #7
On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
>
> Hacking passthrough into virtio_blk seems like not very good layering.
> If you have a use case where you want to use the core kernel virtio code
> but not the protocol drivers we'll probably need a virtqueue passthrough
> option of some kind.

I think people are finding that submitting I/O via uring_cmd is faster
than traditional io_uring. The use case isn't really passthrough, it's
bypass :).

That's why I asked Jens to weigh in on whether there is a generic
block layer solution here. If uring_cmd is faster then maybe a generic
uring_cmd I/O interface can be defined without tying applications to
device-specific commands. Or maybe the traditional io_uring code path
can be optimized so that bypass is no longer attractive.

The virtio-level virtqueue passthrough idea is interesting for use
cases that mix passthrough applications with non-passthrough
applications. VFIO isn't enough because it prevents sharing and
excludes non-passthrough applications. Something similar to vDPA
might be able to pass through just a subset of virtqueues that
userspace could access via the vhost_vdpa driver. This approach
doesn't scale if many applications are running at the same time
because the number of virtqueues is finite and often the same as the
number of CPUs.

Stefan
Jason Wang Dec. 17, 2024, 2:08 a.m. UTC | #8
On Mon, Dec 16, 2024 at 8:07 PM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
>
> On 12/16/24 3:38 PM, Jason Wang wrote:
> >> [...]
> > If I understand this correctly, this proposal wants to make io_uring a
> > transport for virtio-blk commands, so the application doesn't need
> > to worry about compatibility etc. This seems to be fine.
> >
> > But I wonder what the security considerations are, for example, do we
> > allow all virtio-blk commands to be passed through, and why?
>
> Regarding the 'security consideration': the generic char-dev belongs to root, so
> only root can use this passthrough path.

This seems like a restriction. A lot of applications want to be run
without privilege to be safe.

>
> On the other hand, to what I know, virtio-blk commands are all related
> to 'I/O operations', so we can support all those opcodes with bypassing
> vfs&block layer (if we want). I just realized the most  basic read/write
> in this RFC patch series, others will be considered later.
>
> > Thanks
> >
>

Thanks
Jason Wang Dec. 17, 2024, 2:12 a.m. UTC | #9
On Tue, Dec 17, 2024 at 12:14 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Hacking passthrough into virtio_blk seems like not very good layering.
> > If you have a use case where you want to use the core kernel virtio code
> > but not the protocol drivers we'll probably need a virtqueue passthrough
> > option of some kind.
>
> I think people are finding that submitting I/O via uring_cmd is faster
> than traditional io_uring. The use case isn't really passthrough, it's
> bypass :).
>
> That's why I asked Jens to weigh in on whether there is a generic
> block layer solution here. If uring_cmd is faster then maybe a generic
> uring_cmd I/O interface can be defined without tying applications to
> device-specific commands. Or maybe the traditional io_uring code path
> can be optimized so that bypass is no longer attractive.
>
> The virtio-level virtqueue passthrough idea is interesting for use
> cases that mix passthrough applications with non-passthrough
> applications. VFIO isn't enough because it prevents sharing and
> excludes non-passthrough applications. Something similar to vDPA
> might be able to pass through just a subset of virtqueues that
> userspace could access via the vhost_vdpa driver.

I thought it could be reused as a mixed approach like this: the vDPA
driver might just do a shadow virtqueue, so in fact we would just replace
io_uring here with the virtqueue. Or, if we think vDPA is too heavyweight,
vhost-blk could be another way.

> This approach
> doesn't scale if many applications are running at the same time
> because the number of virtqueues is finite and often the same as the
> number of CPUs.
>
> Stefan
>

Thanks
Ferry Meng Dec. 17, 2024, 6:04 a.m. UTC | #10
On 12/17/24 10:08 AM, Jason Wang wrote:
> On Mon, Dec 16, 2024 at 8:07 PM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>>
>> On 12/16/24 3:38 PM, Jason Wang wrote:
>>> On Mon, Dec 16, 2024 at 10:01 AM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>>>> On 12/3/24 8:14 PM, Ferry Meng wrote:
>>>>> [...]
>>> If I understand this correctly, this proposal wants to make io_uring a
>>> transport for virtio-blk commands, so the application doesn't need
>>> to worry about compatibility etc. This seems to be fine.
>>>
>>> But I wonder what the security considerations are, for example, do we
>>> allow all virtio-blk commands to be passed through, and why?
>> Regarding the 'security consideration': the generic char-dev belongs to root, so
>> only root can use this passthrough path.
> This seems like a restriction. A lot of applications want to be run
> without privilege to be safe.
>
I'm sorry, there may have been some misunderstanding in my previous
explanation. The generic cdev file's default group is 'root', but we can
simply use 'chgrp' to change it to whatever group we want.

After that, applications can use it just like a standard file.

Jingbo Xu Dec. 17, 2024, 6:08 a.m. UTC | #11
Hi Stefan & Christoph,

On 12/17/24 12:13 AM, Stefan Hajnoczi wrote:
> On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> Hacking passthrough into virtio_blk seems like not very good layering.
>> If you have a use case where you want to use the core kernel virtio code
>> but not the protocol drivers we'll probably need a virtqueue passthrough
>> option of some kind.
> 
> I think people are finding that submitting I/O via uring_cmd is faster
> than traditional io_uring. The use case isn't really passthrough, it's
> bypass :).

Right, the initial purpose is bypassing the block layer (in the guest)
to achieve better latency when the user process is operating on a raw
virtio-blk device directly.


> 
> That's why I asked Jens to weigh in on whether there is a generic
> block layer solution here. If uring_cmd is faster then maybe a generic
> uring_cmd I/O interface can be defined without tying applications to
> device-specific commands. Or maybe the traditional io_uring code path
> can be optimized so that bypass is no longer attractive.

We are fine with that if it looks good to Jens.
Jens Axboe Dec. 17, 2024, 5:54 p.m. UTC | #12
On 12/16/24 11:08 PM, Jingbo Xu wrote:
>> That's why I asked Jens to weigh in on whether there is a generic
>> block layer solution here. If uring_cmd is faster then maybe a generic
>> uring_cmd I/O interface can be defined without tying applications to
>> device-specific commands. Or maybe the traditional io_uring code path
>> can be optimized so that bypass is no longer attractive.

It's not that the traditional io_uring code path is slower, it's in fact
basically the same thing. It's that all the other jazz that happens
below io_uring slows things down, which is why passthrough ends up being
faster.
Stefan Hajnoczi Dec. 17, 2024, 9 p.m. UTC | #13
On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 12/16/24 11:08 PM, Jingbo Xu wrote:
> >> That's why I asked Jens to weigh in on whether there is a generic
> >> block layer solution here. If uring_cmd is faster then maybe a generic
> >> uring_cmd I/O interface can be defined without tying applications to
> >> device-specific commands. Or maybe the traditional io_uring code path
> >> can be optimized so that bypass is no longer attractive.
>
> It's not that the traditional io_uring code path is slower, it's in fact
> basically the same thing. It's that all the other jazz that happens
> below io_uring slows things down, which is why passthrough ends up being
> faster.

Are you happy with virtio_blk passthrough or do you want a different approach?

Stefan
Jens Axboe Dec. 17, 2024, 9:07 p.m. UTC | #14
On 12/17/24 2:00 PM, Stefan Hajnoczi wrote:
> On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 12/16/24 11:08 PM, Jingbo Xu wrote:
>>>> That's why I asked Jens to weigh in on whether there is a generic
>>>> block layer solution here. If uring_cmd is faster then maybe a generic
>>>> uring_cmd I/O interface can be defined without tying applications to
>>>> device-specific commands. Or maybe the traditional io_uring code path
>>>> can be optimized so that bypass is no longer attractive.
>>
>> It's not that the traditional io_uring code path is slower, it's in fact
>> basically the same thing. It's that all the other jazz that happens
>> below io_uring slows things down, which is why passthrough ends up being
>> faster.
> 
> Are you happy with virtio_blk passthrough or do you want a different approach?

I think it looks fine.
Ferry Meng Dec. 18, 2024, 3:35 a.m. UTC | #15
On 12/18/24 5:07 AM, Jens Axboe wrote:
> On 12/17/24 2:00 PM, Stefan Hajnoczi wrote:
>> On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>>> On 12/16/24 11:08 PM, Jingbo Xu wrote:
>>>>> That's why I asked Jens to weigh in on whether there is a generic
>>>>> block layer solution here. If uring_cmd is faster then maybe a generic
>>>>> uring_cmd I/O interface can be defined without tying applications to
>>>>> device-specific commands. Or maybe the traditional io_uring code path
>>>>> can be optimized so that bypass is no longer attractive.
>>> It's not that the traditional io_uring code path is slower, it's in fact
>>> basically the same thing. It's that all the other jazz that happens
>>> below io_uring slows things down, which is why passthrough ends up being
>>> faster.
>> Are you happy with virtio_blk passthrough or do you want a different approach?
> I think it looks fine.
>
OK, thanks. I will submit the official patch for review soon, after
resolving the test bot warning.