mbox series

[RFC,0/5] implement asynchronous BLKDISCARD via io_uring

Message ID cover.1723601133.git.asml.silence@gmail.com (mailing list archive)
Headers show
Series implement asynchronous BLKDISCARD via io_uring | expand

Message

Pavel Begunkov Aug. 14, 2024, 10:45 a.m. UTC
There is an interest in having asynchronous block operations like
discard. The patch set implements that as io_uring commands, which is
an io_uring request type allowing to implement custom file specific
operations.

First 4 patches are simple preps, and the main part is in Patch 5.
Not tested with a real drive yet, hence sending as an RFC.

I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
reuse structures and helpers from Patch 5.

liburing tests for reference:

https://github.com/isilence/liburing.git discard-cmd-test

Pavel Begunkov (5):
  io_uring/cmd: expose iowq to cmds
  io_uring/cmd: give inline space in request to cmds
  filemap: introduce filemap_invalidate_pages
  block: introduce blk_validate_discard()
  block: implement io_uring discard cmd

 block/blk.h                  |   1 +
 block/fops.c                 |   2 +
 block/ioctl.c                | 139 ++++++++++++++++++++++++++++++-----
 include/linux/io_uring/cmd.h |  15 ++++
 include/linux/pagemap.h      |   2 +
 include/uapi/linux/fs.h      |   2 +
 io_uring/io_uring.c          |  11 +++
 io_uring/io_uring.h          |   1 +
 io_uring/uring_cmd.c         |   7 ++
 mm/filemap.c                 |  18 +++--
 10 files changed, 176 insertions(+), 22 deletions(-)

Comments

Jens Axboe Aug. 15, 2024, 3:50 p.m. UTC | #1
On 8/14/24 4:45 AM, Pavel Begunkov wrote:
> There is an interest in having asynchronous block operations like
> discard. The patch set implements that as io_uring commands, which is
> an io_uring request type allowing to implement custom file specific
> operations.
> 
> First 4 patches are simple preps, and the main part is in Patch 5.
> Not tested with a real drive yet, hence sending as an RFC.
> 
> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
> reuse structures and helpers from Patch 5.
> 
> liburing tests for reference:
> 
> https://github.com/isilence/liburing.git discard-cmd-test

FWIW, did a quick patch to wire it up for fio. Using this
job file:

[trim]
filename=/dev/nvme31n1
ioengine=io_uring
iodepth=64
rw=randtrim
norandommap
bs=4k

the stock kernel gets:

  trim: IOPS=21.6k, BW=84.4MiB/s (88.5MB/s)(847MiB/10036msec); 0 zone resets

using ~5% CPU, and with the process predictably stuck in D state all of
the time, waiting on a sync trim.

With the patches:

  trim: IOPS=75.8k, BW=296MiB/s (310MB/s)(2653MiB/8961msec); 0 zone resets

using ~11% CPU.

Didn't verify actual functionality, but did check trims are going up and
down. Drive used is:

Dell NVMe PM1743 RI E3.S 7.68TB

Outside of async trim being useful for actual workloads rather than the
sync trim we have now, it'll also help characterizing real world mixed
workloads that have trims with reads/writes.
Martin K. Petersen Aug. 15, 2024, 4:15 p.m. UTC | #2
Hi Pavel!

> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
> reuse structures and helpers from Patch 5.

Adding these is going to be very useful.

Just a nit: Please use either ZEROOUT (or WRITE_ZEROES) and SECURE_ERASE
terminology for the additional operations when you add them.

DISCARDZEROES is obsolete and secure discard has been replaced by secure
erase. We are stuck with the legacy ioctl names but we should avoid
perpetuating unfortunate naming choices from the past.
Pavel Begunkov Aug. 15, 2024, 5:12 p.m. UTC | #3
On 8/15/24 17:15, Martin K. Petersen wrote:
> 
> Hi Pavel!
> 
>> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
>> reuse structures and helpers from Patch 5.
> 
> Adding these is going to be very useful.
> 
> Just a nit: Please use either ZEROOUT (or WRITE_ZEROES) and SECURE_ERASE
> terminology for the additional operations when you add them.
> 
> DISCARDZEROES is obsolete and secure discard has been replaced by secure
> erase. We are stuck with the legacy ioctl names but we should avoid
> perpetuating unfortunate naming choices from the past.

Noted, thanks, the ioctl names are indeed confusing.
Pavel Begunkov Aug. 15, 2024, 5:26 p.m. UTC | #4
On 8/15/24 16:50, Jens Axboe wrote:
> On 8/14/24 4:45 AM, Pavel Begunkov wrote:
>> There is an interest in having asynchronous block operations like
>> discard. The patch set implements that as io_uring commands, which is
>> an io_uring request type allowing to implement custom file specific
>> operations.
>>
>> First 4 patches are simple preps, and the main part is in Patch 5.
>> Not tested with a real drive yet, hence sending as an RFC.
>>
>> I'm also going to add BLKDISCARDZEROES and BLKSECDISCARD, which should
>> reuse structures and helpers from Patch 5.
>>
>> liburing tests for reference:
>>
>> https://github.com/isilence/liburing.git discard-cmd-test
> 
> FWIW, did a quick patch to wire it up for fio. Using this
> job file:
> 
> [trim]
> filename=/dev/nvme31n1
> ioengine=io_uring
> iodepth=64
> rw=randtrim
> norandommap
> bs=4k
> 
> the stock kernel gets:
> 
>    trim: IOPS=21.6k, BW=84.4MiB/s (88.5MB/s)(847MiB/10036msec); 0 zone resets
> 
> using ~5% CPU, and with the process predictably stuck in D state all of
> the time, waiting on a sync trim.
> 
> With the patches:
> 
>    trim: IOPS=75.8k, BW=296MiB/s (310MB/s)(2653MiB/8961msec); 0 zone resets
> 
> using ~11% CPU.

Thanks for giving it a run. Can be further improved for particular use cases,
e.g. by adding a vectored version as an additional feature, i.e. multiple
discard ranges per request, and something like retry of short IO, but for
that we'd want getting -EAGAIN directly from submit_bio unlike failing via
the callback.


> Didn't verify actual functionality, but did check trims are going up and
> down. Drive used is:
> 
> Dell NVMe PM1743 RI E3.S 7.68TB
> 
> Outside of async trim being useful for actual workloads rather than the
> sync trim we have now, it'll also help characterizing real world mixed
> workloads that have trims with reads/writes.