
[PATCHSET,v2,0/12] Add support for async buffered reads

Message ID 20200523185755.8494-1-axboe@kernel.dk

Message

Jens Axboe May 23, 2020, 6:57 p.m. UTC
We technically support this already through io_uring, but it's
implemented with a thread backend to support cases where we would
block. This isn't ideal.

After a few prep patches, the core of this patchset is adding support
for async callbacks on page unlock. With this primitive, we can simply
retry the IO operation. With io_uring, this works a lot like poll-based
retry for files that support it. If a page is currently locked and
needed, -EIOCBQUEUED is returned with a callback armed. The caller's
callback is then responsible for restarting the operation.
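
To make the flow concrete, here is a rough sketch of the caller side
(pseudocode in kernel style; arm_unlock_callback(), queue_retry(), and
the my_* names are made up for illustration, while
generic_file_read_iter() and -EIOCBQUEUED are real kernel names):

	/*
	 * Sketch only: issue a buffered read that must not sleep on a
	 * locked page. If the page is locked, the read path arms our
	 * callback and returns -EIOCBQUEUED; the callback then fires
	 * on page unlock and re-submits the same request.
	 */
	static void my_retry_cb(struct kiocb *kiocb)
	{
		queue_retry(kiocb);	/* made-up helper: punt back to submitter */
	}

	static ssize_t my_buffered_read(struct kiocb *kiocb, struct iov_iter *to)
	{
		ssize_t ret;

		arm_unlock_callback(kiocb, my_retry_cb);	/* made-up helper */
		ret = generic_file_read_iter(kiocb, to);
		/* -EIOCBQUEUED means: callback armed, retry happens later */
		return ret;
	}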

With this callback primitive, we can add support for
generic_file_buffered_read(), which is what most file systems end up
using for buffered reads. XFS/ext4/btrfs/bdev are wired up, and it's
probably trivial to add more.

A file signals support for this by setting FMODE_BUF_RASYNC, similar
to what we do for FMODE_NOWAIT; see the sketch below for how a file
system opts in. Open to suggestions here on whether this is the
preferred method or not.
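
For a file system, opting in looks like a one-line change in its
->open() handler, which matches the 2-line per-filesystem hunks in the
diffstat below (myfs_* is a hypothetical example; generic_file_open()
is the real helper):

	static int myfs_file_open(struct inode *inode, struct file *filp)
	{
		/* buffered reads on this file may be retried asynchronously */
		filp->f_mode |= FMODE_BUF_RASYNC;
		return generic_file_open(inode, filp);
	}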

In terms of results, I wrote a small test app that randomly reads 4G
of data in 4K chunks from a file hosted on ext4. The app uses a queue
depth of 32. If you want to test this yourself, you can just use
buffered=1 with ioengine=io_uring in fio, as shown below. No
application changes are needed to use the more optimized buffered
async read.
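
For example, something like this should reproduce the setup described
above (the filename is a placeholder; the rest are standard fio
options):

	$ fio --name=bufread --filename=/mnt/ext4/testfile \
	      --ioengine=io_uring --buffered=1 --rw=randread \
	      --bs=4k --size=4g --iodepth=32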

preadv for comparison:
	real    1m13.821s
	user    0m0.558s
	sys     0m11.125s
	CPU	~13%

Mainline:
	real    0m12.054s
	user    0m0.111s
	sys     0m5.659s
	CPU	~32% + ~50% == ~82%

This patchset:
	real    0m9.283s
	user    0m0.147s
	sys     0m4.619s
	CPU	~52%

The CPU numbers are just a rough estimate. For the mainline io_uring
run, this includes the app itself and all the threads doing IO on its
behalf (32% for the app, ~1.6% per worker and 32 of them). Context
switch rate is much smaller with the patchset, since we only have the
one task performing IO.

The goal here is efficiency. Async thread offload adds latency, and
it also adds noticeable overhead on items such as adding pages to the
page cache. By allowing proper async buffered read support, we don't
have X threads hammering on the same inode's page cache; we have just
the single app actually doing IO.

Been beating on this and it's solid for me, and I'm now pretty happy
with how it all turned out. Not aware of any missing bits/pieces or
code cleanups that need doing.

Series can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.3

or pull from:

git://git.kernel.dk/linux-block async-buffered.3

 fs/block_dev.c            |   2 +-
 fs/btrfs/file.c           |   2 +-
 fs/ext4/file.c            |   2 +-
 fs/io_uring.c             |  99 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_file.c         |   2 +-
 include/linux/blk_types.h |   3 +-
 include/linux/fs.h        |   5 ++
 include/linux/pagemap.h   |  64 ++++++++++++++++++++++
 mm/filemap.c              | 111 ++++++++++++++++++++++++--------------
 9 files changed, 245 insertions(+), 45 deletions(-)

Changes since v2:
- Get rid of unnecessary wait_page_async struct, just use wait_page_queue
- Add another prep handler, adding wake_page_match()
- Use wake_page_match() in both callers
Changes since v1:
- Fix an issue with inline page locking
- Fix a potential race with __wait_on_page_locked_async()
- Fix a hang related to not setting page_match, thus missing a wakeup

Comments

Jens Axboe May 23, 2020, 7:20 p.m. UTC | #1
And this one is v3, obviously, not v2...


On 5/23/20 12:57 PM, Jens Axboe wrote:
> [...]
Chris Panayis May 24, 2020, 9:46 a.m. UTC | #2
Yes! Jens & Team! Yes!

My code has never looked so beautiful, been so efficient and run so well 
since switching to io_uring/async awesome-ness.. Really, really is a 
game-changer in terms of software design, control, performance, 
expressiveness... so many levels. Really, really great work! Thank you!

Chris


On 23/05/2020 20:20, Jens Axboe wrote:
> [...]
Jens Axboe May 24, 2020, 7:24 p.m. UTC | #3
On 5/24/20 3:46 AM, Chris Panayis wrote:
> Yes! Jens & Team! Yes!
> 
> My code has never looked so beautiful, been so efficient and run so well 
> since switching to io_uring/async awesome-ness.. Really, really is a 
> game-changer in terms of software design, control, performance, 
> expressiveness... so many levels. Really, really great work! Thank you!

Thanks! Glad you like it.