Message ID: 20241203121424.19887-1-mengferry@linux.alibaba.com (mailing list archive)
Series: virtio-blk: add io_uring passthrough support for virtio-blk
On Tue, 3 Dec 2024 at 07:17, Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
> We seek to develop a more flexible way to use virtio-blk and bypass the block
> layer logic in order to accomplish certain performance optimizations. As a
> result, we referred to the implementation of io_uring passthrough in NVMe
> and implemented it in the virtio-blk driver. This patch series adds io_uring
> passthrough support for virtio-blk devices, resulting in lower submit latency
> and increased flexibility when utilizing virtio-blk.

First I thought this was similar to Stefano Garzarella's previous
virtio-blk io_uring passthrough work where a host io_uring was passed
through into the guest:
https://static.sched.com/hosted_files/kvmforum2020/9c/KVMForum_2020_io_uring_passthrough_Stefano_Garzarella.pdf

But now I see this is a uring_cmd interface for sending virtio_blk
commands from userspace like the one offered by the NVMe driver.

Unlike NVMe, the virtio-blk command set is minimal and does not offer
a rich set of features. Is the motivation really virtio-blk command
passthrough or is the goal just to create a fast path for I/O?

If the goal is just a fast path for I/O, then maybe Jens would
consider a generic command set that is not device-specific? That way
any driver (NVMe, virtio-blk, etc) can implement this uring_cmd
interface and any application can use it without worrying about the
underlying command set. I think a generic fast path would be much more
useful to applications than driver-specific interfaces.

> To test this patch series, I changed fio's code:
> 1. Added virtio-blk support to engines/io_uring.c.
> 2. Added virtio-blk support to the t/io_uring.c testing tool.
> Link: https://github.com/jdmfr/fio
>
> Using t/io_uring-vblk, the performance of virtio-blk based on uring_cmd
> scales better than block device access (results below; virtio-blk with QEMU,
> 1-depth fio):
> (passthru) read: IOPS=17.2k, BW=67.4MiB/s (70.6MB/s)
>   slat (nsec): min=2907, max=43592, avg=3981.87, stdev=595.10
>   clat (usec): min=38, max=285, avg=53.47, stdev=8.28
>    lat (usec): min=44, max=288, avg=57.45, stdev=8.28
> (block) read: IOPS=15.3k, BW=59.8MiB/s (62.7MB/s)
>   slat (nsec): min=3408, max=35366, avg=5102.17, stdev=790.79
>   clat (usec): min=35, max=343, avg=59.63, stdev=10.26
>    lat (usec): min=43, max=349, avg=64.73, stdev=10.21
>
> Testing the virtio-blk device with fio using 'engines=io_uring_cmd'
> and 'engines=io_uring' also demonstrates improvements in submit latency:
> (passthru) taskset -c 0 t/io_uring-vblk -b4096 -d8 -c4 -s4 -p0 -F1 -B0 -O0 -n1 -u1 /dev/vdcc0
>   IOPS=189.80K, BW=741MiB/s, IOS/call=4/3
>   IOPS=187.68K, BW=733MiB/s, IOS/call=4/3
> (block) taskset -c 0 t/io_uring-vblk -b4096 -d8 -c4 -s4 -p0 -F1 -B0 -O0 -n1 -u0 /dev/vdc
>   IOPS=101.51K, BW=396MiB/s, IOS/call=4/3
>   IOPS=100.01K, BW=390MiB/s, IOS/call=4/4
>
> The performance overhead of submitting I/O can be decreased by 25% overall
> with this patch series. The implementation primarily references 'nvme io_uring
> passthrough', supporting io_uring_cmd through a separate character interface
> (temporarily named /dev/vdXc0). Since this is an early version, many
> details still need to be considered and redesigned, for example:
> ● Currently it only covers READ/WRITE scenarios; more complex operations such
>   as discard or zone ops are not included. (A normal sqe64 is sufficient, in
>   my opinion; after future upgrades, sqe128 and cqe32 might not be needed.)
> ● ......
>
> I would appreciate any useful recommendations.
>
> Ferry Meng (3):
>   virtio-blk: add virtio-blk chardev support.
>   virtio-blk: add uring_cmd support for I/O passthru on chardev.
>   virtio-blk: add uring_cmd iopoll support.
>
>  drivers/block/virtio_blk.c      | 325 +++++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_blk.h |  16 ++
>  2 files changed, 336 insertions(+), 5 deletions(-)
>
> --
> 2.43.5
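For concreteness, the uring_cmd flow being discussed looks roughly like the
NVMe passthrough flow: the application opens the per-device character node and
submits IORING_OP_URING_CMD SQEs whose 16-byte command area carries the
driver-specific request. The sketch below only illustrates that shape. The
io_uring pieces (IORING_OP_URING_CMD, sqe->cmd_op, the sqe->cmd payload area)
are existing uapi, but VIRTBLK_URING_CMD_IO, struct virtblk_uring_cmd and the
way the data buffer is encoded are hypothetical placeholders, since the
series' actual uapi additions are not reproduced in this thread.

    /*
     * Rough sketch of a virtio-blk uring_cmd read, assuming an NVMe-style
     * interface. struct virtblk_uring_cmd and VIRTBLK_URING_CMD_IO are
     * hypothetical; only the io_uring plumbing is existing uapi.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <liburing.h>
    #include <linux/virtio_blk.h>

    struct virtblk_uring_cmd {      /* hypothetical: fits the 16-byte sqe->cmd area */
            __u32 type;             /* VIRTIO_BLK_T_IN (read) / VIRTIO_BLK_T_OUT (write) */
            __u32 ioprio;
            __u64 sector;           /* 512-byte sector offset */
    };

    static int submit_passthru_read(struct io_uring *ring, int fd,
                                    void *buf, unsigned int len, __u64 sector)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct virtblk_uring_cmd *cmd;

            if (!sqe)
                    return -1;

            /* Data buffer in the regular addr/len fields (an assumption). */
            io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fd, buf, len, 0);
            sqe->cmd_op = 0x1;      /* VIRTBLK_URING_CMD_IO, hypothetical opcode */

            cmd = (struct virtblk_uring_cmd *)sqe->cmd;
            memset(cmd, 0, sizeof(*cmd));
            cmd->type = VIRTIO_BLK_T_IN;
            cmd->sector = sector;

            return io_uring_submit(ring);
    }

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_cqe *cqe;
            char buf[4096] __attribute__((aligned(4096)));
            /* /dev/vdcc0 follows the temporary vdXc0 naming used in the tests above. */
            int fd = open("/dev/vdcc0", O_RDWR);

            if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            submit_passthru_read(&ring, fd, buf, sizeof(buf), 0);
            io_uring_wait_cqe(&ring, &cqe);     /* cqe->res carries the command status */
            io_uring_cqe_seen(&ring, cqe);
            return 0;
    }

With the third patch ("uring_cmd iopoll support"), the ring could presumably
also be created with IORING_SETUP_IOPOLL so that completions are polled
rather than interrupt-driven.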
Resent after changing this into plain text.

On 12/5/24 5:47 AM, Stefan Hajnoczi wrote:
> On Tue, 3 Dec 2024 at 07:17, Ferry Meng <mengferry@linux.alibaba.com> wrote:
>> [...]
>
> First I thought this was similar to Stefano Garzarella's previous
> virtio-blk io_uring passthrough work where a host io_uring was passed
> through into the guest:
> https://static.sched.com/hosted_files/kvmforum2020/9c/KVMForum_2020_io_uring_passthrough_Stefano_Garzarella.pdf
>
> But now I see this is a uring_cmd interface for sending virtio_blk
> commands from userspace like the one offered by the NVMe driver.
>
> Unlike NVMe, the virtio-blk command set is minimal and does not offer
> a rich set of features. Is the motivation really virtio-blk command
> passthrough or is the goal just to create a fast path for I/O?

Right, this series only works within the guest OS; it is not a host-to-guest
passthrough. io_uring passthrough gives a fast path for I/O, which is ideal
for our use case: issuing virtio-blk commands directly from userspace scales
better.

> If the goal is just a fast path for I/O, then maybe Jens would
> consider a generic command set that is not device-specific? That way
> any driver (NVMe, virtio-blk, etc) can implement this uring_cmd
> interface and any application can use it without worrying about the
> underlying command set. I think a generic fast path would be much more
> useful to applications than driver-specific interfaces.

If I understand correctly, io_uring passthrough is already a complete,
abstract framework for I/O request dispatch: its aim is to let a driver
handle commands on its own while bypassing unused logic. That is why I chose
this implementation. I agree that the virtio-blk command set is minimal, but
I also believe that this minimalism makes it easier to adapt all command
types with few modifications, so that virtio-blk can use io_uring
passthrough. Of course, we can wait for Jens' view on the preceding
discussion.

>> [...]
On 12/3/24 8:14 PM, Ferry Meng wrote:
> [...]
>
> Ferry Meng (3):
>   virtio-blk: add virtio-blk chardev support.
>   virtio-blk: add uring_cmd support for I/O passthru on chardev.
>   virtio-blk: add uring_cmd iopoll support.
>
>  drivers/block/virtio_blk.c      | 325 +++++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_blk.h |  16 ++
>  2 files changed, 336 insertions(+), 5 deletions(-)

Hi Michael & Jason:

What is your opinion, as the virtio-blk maintainers? Looking forward to your
reply.

Thanks
On Mon, Dec 16, 2024 at 10:01 AM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
> On 12/3/24 8:14 PM, Ferry Meng wrote:
>> [...]
>
> Hi Michael & Jason:
>
> What is your opinion, as the virtio-blk maintainers? Looking forward to
> your reply.
>
> Thanks

If I understand this correctly, this proposal wants to make io_uring a
transport for the virtio-blk command, so the application doesn't need to
worry about compatibility etc. This seems to be fine.

But I wonder what the security considerations are; for example, do we
allow all virtio-blk commands to be passed through, and why?

Thanks
On 12/16/24 3:38 PM, Jason Wang wrote:
> On Mon, Dec 16, 2024 at 10:01 AM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>> [...]
>
> If I understand this correctly, this proposal wants to make io_uring a
> transport for the virtio-blk command, so the application doesn't need to
> worry about compatibility etc. This seems to be fine.
>
> But I wonder what the security considerations are; for example, do we
> allow all virtio-blk commands to be passed through, and why?

About the security considerations: the generic char-dev belongs to root, so
only root can use this passthrough path. On the other hand, as far as I know,
virtio-blk commands are all I/O operations, so we could support all of those
opcodes while bypassing the VFS and block layers (if we want to). This RFC
series implements only the most basic read/write; the others will be
considered later.

> Thanks
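To illustrate why the command set is described as minimal: a virtio-blk
request is just a small fixed header, an optional data buffer, and a one-byte
status. The following is an abridged excerpt of the definitions that already
exist in include/uapi/linux/virtio_blk.h (nothing here is added by this
series):

    /* Abridged from the existing include/uapi/linux/virtio_blk.h. */
    #include <linux/types.h>
    #include <linux/virtio_types.h>

    #define VIRTIO_BLK_T_IN            0   /* read */
    #define VIRTIO_BLK_T_OUT           1   /* write */
    #define VIRTIO_BLK_T_FLUSH         4
    #define VIRTIO_BLK_T_GET_ID        8
    #define VIRTIO_BLK_T_DISCARD      11
    #define VIRTIO_BLK_T_WRITE_ZEROES 13

    struct virtio_blk_outhdr {
            __virtio32 type;    /* VIRTIO_BLK_T_* */
            __virtio32 ioprio;  /* I/O priority */
            __virtio64 sector;  /* 512-byte sector offset */
    };

    /* One-byte request status returned by the device. */
    #define VIRTIO_BLK_S_OK     0
    #define VIRTIO_BLK_S_IOERR  1
    #define VIRTIO_BLK_S_UNSUPP 2

Compared with NVMe's admin and I/O command sets, there is little here beyond
data movement, which is why the discussion keeps returning to whether the
real goal is a generic fast path rather than command passthrough.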
Hacking passthrough into virtio_blk seems like not very good layering. If you have a use case where you want to use the core kernel virtio code but not the protocol drivers we'll probably need a virtqueue passthrough option of some kind.
On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
>
> Hacking passthrough into virtio_blk seems like not very good layering.
> If you have a use case where you want to use the core kernel virtio code
> but not the protocol drivers we'll probably need a virtqueue passthrough
> option of some kind.

I think people are finding that submitting I/O via uring_cmd is faster
than traditional io_uring. The use case isn't really passthrough, it's
bypass :).

That's why I asked Jens to weigh in on whether there is a generic block
layer solution here. If uring_cmd is faster then maybe a generic
uring_cmd I/O interface can be defined without tying applications to
device-specific commands. Or maybe the traditional io_uring code path
can be optimized so that bypass is no longer attractive.

The virtio-level virtqueue passthrough idea is interesting for use
cases that mix passthrough applications with non-passthrough
applications. VFIO isn't enough because it prevents sharing and
excludes non-passthrough applications. Something similar to VDPA
might be able to pass through just a subset of virtqueues that
userspace could access via the vhost_vdpa driver. This approach
doesn't scale if many applications are running at the same time
because the number of virtqueues is finite and often the same as the
number of CPUs.

Stefan
On Mon, Dec 16, 2024 at 8:07 PM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>
> On 12/16/24 3:38 PM, Jason Wang wrote:
>> [...]
>>
>> But I wonder what the security considerations are; for example, do we
>> allow all virtio-blk commands to be passed through, and why?
>
> About the security considerations: the generic char-dev belongs to root,
> so only root can use this passthrough path.

This seems like a restriction. A lot of applications want to run without
privileges to be safe.

> On the other hand, as far as I know, virtio-blk commands are all I/O
> operations, so we could support all of those opcodes while bypassing the
> VFS and block layers (if we want to). This RFC series implements only the
> most basic read/write; the others will be considered later.
>
> Thanks

Thanks
On Tue, Dec 17, 2024 at 12:14 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
>> [...]
>
> [...]
>
> The virtio-level virtqueue passthrough idea is interesting for use
> cases that mix passthrough applications with non-passthrough
> applications. VFIO isn't enough because it prevents sharing and
> excludes non-passthrough applications. Something similar to VDPA
> might be able to pass through just a subset of virtqueues that
> userspace could access via the vhost_vdpa driver.

I think it could be reused for a mixed approach like this. The vDPA driver
might just use a shadow virtqueue, so in effect we would replace io_uring
here with the virtqueue. Or, if we think vDPA is too heavyweight, vhost-blk
could be another way.

> This approach
> doesn't scale if many applications are running at the same time
> because the number of virtqueues is finite and often the same as the
> number of CPUs.
>
> Stefan

Thanks
On 12/17/24 10:08 AM, Jason Wang wrote:
> On Mon, Dec 16, 2024 at 8:07 PM Ferry Meng <mengferry@linux.alibaba.com> wrote:
>> [...]
>>
>> About the security considerations: the generic char-dev belongs to root,
>> so only root can use this passthrough path.
>
> This seems like a restriction. A lot of applications want to run without
> privileges to be safe.

I'm sorry, there may have been some misunderstanding in my previous
explanation. The generic cdev file's default group is 'root', but we can
simply use 'chgrp' to change it to whatever we want, after which applications
can use it just like a regular file.

> Thanks
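For example, granting a non-root group access could look like the sketch
below; the device name follows the temporary vdXc0 naming from the cover
letter, and the 'storage' group is purely illustrative:

    # one-off, as root
    chgrp storage /dev/vdcc0
    chmod 0660 /dev/vdcc0

    # or persistently via a udev rule (file name and match are illustrative)
    # /etc/udev/rules.d/99-virtblk-passthru.rules:
    #   KERNEL=="vdcc0", GROUP="storage", MODE="0660"

After that, any process in the chosen group can open the chardev and issue
passthrough commands without running as root.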
Hi Stefan & Christoph,

On 12/17/24 12:13 AM, Stefan Hajnoczi wrote:
> On Mon, 16 Dec 2024 at 10:54, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> Hacking passthrough into virtio_blk seems like not very good layering.
>> If you have a use case where you want to use the core kernel virtio code
>> but not the protocol drivers we'll probably need a virtqueue passthrough
>> option of some kind.
>
> I think people are finding that submitting I/O via uring_cmd is faster
> than traditional io_uring. The use case isn't really passthrough, it's
> bypass :).

Right, the initial purpose is bypassing the block layer (in the guest) to
achieve better latency when the user process is operating on a raw
virtio-blk device directly.

> That's why I asked Jens to weigh in on whether there is a generic block
> layer solution here. If uring_cmd is faster then maybe a generic
> uring_cmd I/O interface can be defined without tying applications to
> device-specific commands. Or maybe the traditional io_uring code path
> can be optimized so that bypass is no longer attractive.

We are fine with that if it looks good to Jens.
On 12/16/24 11:08 PM, Jingbo Xu wrote:
>> That's why I asked Jens to weigh in on whether there is a generic block
>> layer solution here. If uring_cmd is faster then maybe a generic
>> uring_cmd I/O interface can be defined without tying applications to
>> device-specific commands. Or maybe the traditional io_uring code path
>> can be optimized so that bypass is no longer attractive.

It's not that the traditional io_uring code path is slower; it's in fact
basically the same thing. It's that all the other jazz that happens below
io_uring slows things down, which is why passthrough ends up being faster.
On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 12/16/24 11:08 PM, Jingbo Xu wrote:
>>> That's why I asked Jens to weigh in on whether there is a generic block
>>> layer solution here. If uring_cmd is faster then maybe a generic
>>> uring_cmd I/O interface can be defined without tying applications to
>>> device-specific commands. Or maybe the traditional io_uring code path
>>> can be optimized so that bypass is no longer attractive.
>
> It's not that the traditional io_uring code path is slower; it's in fact
> basically the same thing. It's that all the other jazz that happens below
> io_uring slows things down, which is why passthrough ends up being faster.

Are you happy with virtio_blk passthrough or do you want a different approach?

Stefan
On 12/17/24 2:00 PM, Stefan Hajnoczi wrote:
> On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>> On 12/16/24 11:08 PM, Jingbo Xu wrote:
>>>> [...]
>>
>> It's not that the traditional io_uring code path is slower; it's in fact
>> basically the same thing. It's that all the other jazz that happens below
>> io_uring slows things down, which is why passthrough ends up being faster.
>
> Are you happy with virtio_blk passthrough or do you want a different approach?

I think it looks fine.
On 12/18/24 5:07 AM, Jens Axboe wrote:
> On 12/17/24 2:00 PM, Stefan Hajnoczi wrote:
>> On Tue, 17 Dec 2024 at 12:54, Jens Axboe <axboe@kernel.dk> wrote:
>>> [...]
>>
>> Are you happy with virtio_blk passthrough or do you want a different approach?
>
> I think it looks fine.

OK, thanks. I will submit the official patch for review soon after resolving
the test bot warning.