
[PATCHSET 0/5] Support for RWF_UNCACHED

Message ID: 20191210162454.8608-1-axboe@kernel.dk

Message

Jens Axboe Dec. 10, 2019, 4:24 p.m. UTC
Recently someone asked me how io_uring buffered IO compares to mmapped
IO in terms of performance. So I ran some tests with buffered IO, and
found the experience to be somewhat painful. The test case is pretty
basic: random reads over a dataset that's 10x the size of RAM.
Performance starts out fine, and then the page cache fills up and we
hit a throughput cliff. CPU usage of the IO threads goes up, and we have
kswapd spending 100% of a core trying to keep up. Seeing that, I was
reminded of the many complaints I hear about buffered IO, and the fact
that most of the folks complaining will ultimately bite the bullet and
move to O_DIRECT to just get the kernel out of the way.

But I don't think it needs to be like that. Switching to O_DIRECT isn't
always easily doable. The buffers have different lifetimes, sizes, and
alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
can be painful.

Seems to me that we have an opportunity to provide something that sits
somewhere in between buffered and O_DIRECT, and this is where
RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
following behavior:

- If the data is in cache, it remains in cache and the copy (in or out)
  is served to/from that.

- If the data is NOT in cache, we add it while performing the IO. When
  the IO is done, we remove it again.

With this, I can do 100% smooth buffered reads or writes without pushing
the kernel to the state where kswapd is sweating bullets. In fact it
doesn't even register.
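
As a concrete example, here is a minimal sketch of what an uncached
read would look like from userspace. RWF_UNCACHED is introduced by this
series, so the value below is just a placeholder for illustration; the
real definition comes from the series' include/uapi/linux/fs.h, and a
kernel without the series will reject the flag:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* placeholder; see the series' uapi header */
#endif

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("datafile", O_RDONLY);
	ssize_t ret;

	if (fd < 0)
		return 1;

	/* Served from the page cache if the data is already there;
	 * otherwise pages are added for the IO and dropped when done. */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	if (ret < 0)
		perror("preadv2");

	close(fd);
	return 0;
}

The same flag applies to writes via pwritev2(), and io_uring can carry
it per request, since read/write requests take RWF_* flags.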

Comments appreciated! Patches are against current git (ish), and can
also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached

 fs/ceph/file.c          |   2 +-
 fs/dax.c                |   2 +-
 fs/ext4/file.c          |   2 +-
 fs/iomap/apply.c        |   2 +-
 fs/iomap/buffered-io.c  |  75 +++++++++++++++++------
 fs/iomap/direct-io.c    |   3 +-
 fs/iomap/fiemap.c       |   5 +-
 fs/iomap/seek.c         |   6 +-
 fs/iomap/swapfile.c     |   2 +-
 fs/nfs/file.c           |   2 +-
 include/linux/fs.h      |   9 ++-
 include/linux/iomap.h   |   6 +-
 include/uapi/linux/fs.h |   5 +-
 mm/filemap.c            | 132 ++++++++++++++++++++++++++++++++++++----
 14 files changed, 208 insertions(+), 45 deletions(-)

Comments

Andreas Dilger Dec. 10, 2019, 9:17 p.m. UTC | #1
On Dec 10, 2019, at 9:24 AM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> Recently someone asked me how io_uring buffered IO compares to mmapped
> IO in terms of performance. So I ran some tests with buffered IO, and
> found the experience to be somewhat painful. The test case is pretty
> basic: random reads over a dataset that's 10x the size of RAM.
> Performance starts out fine, and then the page cache fills up and we
> hit a throughput cliff. CPU usage of the IO threads goes up, and we have
> kswapd spending 100% of a core trying to keep up. Seeing that, I was
> reminded of the many complaints I hear about buffered IO, and the fact
> that most of the folks complaining will ultimately bite the bullet and
> move to O_DIRECT to just get the kernel out of the way.
> 
> But I don't think it needs to be like that. Switching to O_DIRECT isn't
> always easily doable. The buffers have different lifetimes, sizes, and
> alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
> can be painful.
> 
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
> following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or out)
>  is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
>  the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without pushing
> the kernel to the state where kswapd is sweating bullets. In fact it
> doesn't even register.
> 
> Comments appreciated!

I think this is a definite win for e.g. NVMe/Optane devices where the
underlying storage is fast enough to avoid the need for the page cache.

In our testing of Lustre on NVMe, it was faster to avoid the page cache
entirely - just inserting and removing the pages from cache took a
considerable amount of CPU for workloads where we knew it was not
beneficial (e.g. IO that was large enough that the storage was as fast
as the network).

This also makes it easier to keep other data in cache (e.g. filesystem
metadata, small IOs, etc.).

Cheers, Andreas

> Patches are against current git (ish), and can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
> 
> fs/ceph/file.c          |   2 +-
> fs/dax.c                |   2 +-
> fs/ext4/file.c          |   2 +-
> fs/iomap/apply.c        |   2 +-
> fs/iomap/buffered-io.c  |  75 +++++++++++++++++------
> fs/iomap/direct-io.c    |   3 +-
> fs/iomap/fiemap.c       |   5 +-
> fs/iomap/seek.c         |   6 +-
> fs/iomap/swapfile.c     |   2 +-
> fs/nfs/file.c           |   2 +-
> include/linux/fs.h      |   9 ++-
> include/linux/iomap.h   |   6 +-
> include/uapi/linux/fs.h |   5 +-
> mm/filemap.c            | 132 ++++++++++++++++++++++++++++++++++++----
> 14 files changed, 208 insertions(+), 45 deletions(-)
> 
> --
> Jens Axboe
> 
> 


Christoph Hellwig Dec. 12, 2019, 3:47 p.m. UTC | #2
On Tue, Dec 10, 2019 at 09:24:49AM -0700, Jens Axboe wrote:
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
> following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or out)
>   is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
>   the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without pushing
> the kernel to the state where kswapd is sweating bullets. In fact it
> doesn't even register.
> 
> Comments appreciated! Patches are against current git (ish), and can
> also be found here:

I can't say I particularly like the model, as it still has all the
page cache overhead.  Direct I/O with bounce buffers for unaligned I/O
sounds simpler and faster to me.
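
For illustration, a minimal sketch of that alternative: serve an
unaligned read through O_DIRECT by reading an aligned superset into an
aligned bounce buffer and copying out the requested range. The 512-byte
alignment here is an assumption for the sketch; real code would query
the device (e.g. the BLKSSZGET ioctl on a block device) instead:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	512UL	/* assumed alignment; query the device in real code */

/* fd must be opened with O_DIRECT for this to bypass the page cache */
static ssize_t bounce_read(int fd, void *dst, size_t len, off_t off)
{
	off_t aoff = off & ~((off_t)BLKSZ - 1);			/* round offset down */
	size_t pad = off - aoff;
	size_t alen = (pad + len + BLKSZ - 1) & ~(BLKSZ - 1);	/* round length up */
	void *bounce;
	ssize_t ret;

	if (posix_memalign(&bounce, BLKSZ, alen))
		return -1;

	ret = pread(fd, bounce, alen, aoff);
	if (ret > (ssize_t)pad) {
		size_t got = (size_t)ret - pad;

		if (got > len)
			got = len;
		memcpy(dst, (char *)bounce + pad, got);
		ret = got;
	} else if (ret > 0) {
		ret = 0;	/* short read ended inside the leading pad */
	}
	free(bounce);
	return ret;
}

The trade-off is the one the thread is circling: this avoids the page
cache insert/remove cycle entirely, at the cost of an extra copy (and
an aligned allocation) for unaligned IO.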
Jens Axboe Dec. 12, 2019, 3:52 p.m. UTC | #3
On 12/12/19 8:47 AM, Christoph Hellwig wrote:
> On Tue, Dec 10, 2019 at 09:24:49AM -0700, Jens Axboe wrote:
>> Seems to me that we have an opportunity to provide something that sits
>> somewhere in between buffered and O_DIRECT, and this is where
>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
>> following behavior:
>>
>> - If the data is in cache, it remains in cache and the copy (in or out)
>>   is served to/from that.
>>
>> - If the data is NOT in cache, we add it while performing the IO. When
>>   the IO is done, we remove it again.
>>
>> With this, I can do 100% smooth buffered reads or writes without pushing
>> the kernel to the state where kswapd is sweating bullets. In fact it
>> doesn't even register.
>>
>> Comments appreciated! Patches are against current git (ish), and can
>> also be found here:
> 
> I can't say I particularly like the model, as it still has all the
> page cache overhead.  Direct I/O with bounce buffers for unaligned I/O
> sounds simpler and faster to me.

The read side of the current patchset does not: it no longer uses the
page cache at all, because inserting and removing pages did indeed turn
out to have too much overhead. Hopefully the same can be done on the
write side.