[PATCHSET,v3,0/5] Support for RWF_UNCACHED

Message ID 20191211152943.2933-1-axboe@kernel.dk (mailing list archive)
Series Support for RWF_UNCACHED

Message

Jens Axboe Dec. 11, 2019, 3:29 p.m. UTC
Recently someone asked me how io_uring buffered IO compares to mmaped
IO in terms of performance. So I ran some tests with buffered IO, and
found the experience to be somewhat painful. The test case is pretty
basic, random reads over a dataset that's 10x the size of RAM.
Performance starts out fine, and then the page cache fills up and we
hit a throughput cliff. CPU usage of the IO threads goes up, and we have
kswapd spending 100% of a core trying to keep up. Seeing that, I was
reminded of the many complaints I hear about buffered IO, and the fact
that most of the folks complaining will ultimately bite the bullet and
move to O_DIRECT to just get the kernel out of the way.

But I don't think it needs to be like that. Switching to O_DIRECT isn't
always easily doable. The buffers have different lifetimes, sizes and
alignment constraints, etc. On top of that, mixing buffered and O_DIRECT
can be painful.

Seems to me that we have an opportunity to provide something that sits
somewhere in between buffered and O_DIRECT, and this is where
RWF_UNCACHED enters the picture. If this flag is set on IO, we get the
following behavior:

- If the data is in cache, it remains in cache and the copy (in or out)
  is served to/from that.

- If the data is NOT in cache, we add it while performing the IO. When
  the IO is done, we remove it again.

With this, I can do 100% smooth buffered reads or writes without pushing
the kernel to the state where kswapd is sweating bullets. In fact it
doesn't even register.
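
As a rough illustration, this is what asking for the behavior from
userspace could look like through pwritev2(2)/preadv2(2). This is just a
minimal sketch; the RWF_UNCACHED value below is the one proposed by this
series, it is not an upstream flag:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* value from this patchset */
#endif

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 0xa5, sizeof(buf));
	/* buffered write, but the page is dropped from cache afterwards */
	if (pwritev2(fd, &iov, 1, 0, RWF_UNCACHED) < 0)
		return 1;
	close(fd);
	return 0;
}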

Comments appreciated! This should work on any standard file system,
using either the generic helpers or iomap. I have tested ext4 and xfs
for the right read/write behavior, but no further validation has been
done yet. Patches are against current git, and can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached

 fs/ceph/file.c          |  2 +-
 fs/dax.c                |  2 +-
 fs/ext4/file.c          |  2 +-
 fs/iomap/apply.c        | 26 ++++++++++-
 fs/iomap/buffered-io.c  | 54 ++++++++++++++++-------
 fs/iomap/direct-io.c    |  3 +-
 fs/iomap/fiemap.c       |  5 ++-
 fs/iomap/seek.c         |  6 ++-
 fs/iomap/swapfile.c     |  2 +-
 fs/nfs/file.c           |  2 +-
 include/linux/fs.h      |  7 ++-
 include/linux/iomap.h   | 10 ++++-
 include/uapi/linux/fs.h |  5 ++-
 mm/filemap.c            | 95 ++++++++++++++++++++++++++++++++++++-----
 14 files changed, 181 insertions(+), 40 deletions(-)

Changes since v2:
- Rework the write side according to Chinner's suggestions. Much cleaner
  this way. It does mean that we invalidate the full write region if just
  ONE page (or more) had to be created, where before it was more granular.
  I don't think that's a concern, and on the plus side, we now no longer
  have to chunk invalidations into 15/16 pages at a time.
- Cleanups

Changes since v1:
- Switch to pagevecs for write_drop_cached_pages()
- Use page_offset() instead of manual shift
- Ensure we hold a reference on the page between calling ->write_end()
  and checking the mapping on the locked page
- Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag after
  the first page

Comments

Linus Torvalds Dec. 11, 2019, 5:37 p.m. UTC | #1
On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. Patches are against current git, and can also be found here:

I don't think this is conceptually wrong, but the implementation smells a bit.

I commented on the trivial part (the horrendous argument list to
iomap_actor), but I wonder how much of the explicit invalidation is
actually needed?

Because active invalidation really is a horrible horrible thing to do.
It immediately means that you can't use this interface for normal
everyday things that may actually cache perfectly fine.

What happens if you simply never _activate_ the page? Then they should
get invalidated on their own, without impacting any other load - but
only when there is _some_ memory pressure. They'll just stay on the
inactive lru list, and get re-used quickly.

Note that there are two ways to activate a page: the "accessed while
on the inactive list" will activate it, but these days we also have a
"pre-activate" path in the workingset code (see workingset_refault()).

Even if you might also want an explicit invalidate path, I would like
to hear what it looks like if you instead of - or in addition to -
invalidating, have a "don't activate" flag.

We don't have all _that_ many places where we activate pages, and they
should be easy to find (just grep for "SetPageActive()"), although the
call chain may make it a bit painful to add a "don't do it for this
access" kind of things.

But I think most of the regular IO call chains come through
"mark_page_accessed()". So _that_ is the part you want to avoid (and
maybe the workingset code). And that should be fairly straightforward,
I think.

In fact, that you say that just a pure random read case causes lots of
kswapd activity makes me think that maybe we've screwed up page
activation in general, and never noticed (because if you have enough
memory, you don't really see it that often)? So this might not be an
io_uring issue, but an issue in general.

                     Linus
Jens Axboe Dec. 11, 2019, 5:56 p.m. UTC | #2
On 12/11/19 10:37 AM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 7:29 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Comments appreciated! This should work on any standard file system,
>> using either the generic helpers or iomap. I have tested ext4 and xfs
>> for the right read/write behavior, but no further validation has been
>> done yet. Patches are against current git, and can also be found here:
> 
> I don't think this is conceptually wrong, but the implementation
> smells a bit.
> 
> I commented on the trivial part (the horrendous argument list to
> iomap_actor), but I wonder how much of the explicit invalidation is
> actually needed?

Agree on the other email on that part, if we continue on this path, then
I'll clean that up and shove the arguments in an actor struct.

> Because active invalidation really is a horrible horrible thing to do.
> It immediately means that you can't use this interface for normal
> everyday things that may actually cache perfectly fine.
> 
> What happens if you simply never _activate_ the page? Then they should
> get invalidated on their own, without impacting any other load - but
> only when there is _some_ memory pressure. They'll just stay on the
> inactive lru list, and get re-used quickly.
> 
> Note that there are two ways to activate a page: the "accessed while
> on the inactive list" will activate it, but these days we also have a
> "pre-activate" path in the workingset code (see workingset_refault()).
> 
> Even if you might also want an explicit invalidate path, I would like
> to hear what it looks like if you instead of - or in addition to -
> invalidating, have a "don't activate" flag.
> 
> We don't have all _that_ many places where we activate pages, and they
> should be easy to find (just grep for "SetPageActive()"), although the
> call chain may make it a bit painful to add a "don't do it for this
> access" kind of things.
> 
> But I think most of the regular IO call chains come through
> "mark_page_accessed()". So _that_ is the part you want to avoid (and
> maybe the workingset code). And that should be fairly straightforward,
> I think.

Sure, I can give that a go and see how that behaves.

> In fact, that you say that just a pure random read case causes lots of
> kswapd activity makes me think that maybe we've screwed up page
> activation in general, and never noticed (because if you have enough
> memory, you don't really see it that often)? So this might not be an
> io_uring issue, but an issue in general.

This is very much not an io_uring issue, you can see exactly the same
kind of behavior with normal buffered reads or mmap'ed IO. I do wonder
if streamed reads are as bad in terms of making kswapd go crazy, I
forget if I tested that explicitly as well.

I'll run some streamed and random read testing on both and see how they
behave, then report back.
Linus Torvalds Dec. 11, 2019, 7:14 p.m. UTC | #3
[ Adding Johannes Weiner to the cc, I think he's looked at the working
set and the inactive/active LRU lists the most ]

On Wed, Dec 11, 2019 at 9:56 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> > In fact, that you say that just a pure random read case causes lots of
> > kswapd activity makes me think that maybe we've screwed up page
> > activation in general, and never noticed (because if you have enough
> > memory, you don't really see it that often)? So this might not be an
> > io_uring issue, but an issue in general.
>
> This is very much not an io_uring issue, you can see exactly the same
> kind of behavior with normal buffered reads or mmap'ed IO. I do wonder
> if streamed reads are as bad in terms of making kswapd go crazy, I
> forget if I tested that explicitly as well.

We definitely used to have people test things like "read the same
much-bigger-than-memory file over and over", and it wasn't supposed to
be all _that_ painful, because the pages never activated, and they got
moved out of the cache quickly and didn't disturb other activities
(other than the constant IO, of course, which can be a big deal in
itself).

But maybe that was just the streaming case. With read-around and
random accesses, maybe we end up activating too much (and maybe we
always did).

But I wouldn't be surprised if we've lost that as people went from
having 16-32MB to having that many GB instead - simply because a lot
of loads are basically entirely cached, and the few things that are
not tend to be explicitly uncached (ie O_DIRECT etc).

I think the workingset changes actually were maybe kind of related to
this - the inactive list can become too small to ever give people time
to do a good job of picking the _right_ thing to activate.

So this might either be the reverse situation - maybe we let the
inactive list grow too large, and then even a big random load will
activate pages that really shouldn't be activated? Or it might be
related to the workingset issue in that we've activated pages too
eagerly and not ever moved things back to the inactive list (which
then in some situations causes the inactive list to be very small).

Who knows. But this is definitely an area that I suspect hasn't gotten
all that much attention simply because memory has become much more
plentiful, and a lot of regular loads basically have enough memory
that almost all IO is cached anyway, and the old "you needed to be
more clever about balancing swap/inactive/active even under normal
loads" thing may have gone away a bit.

These days, even if you do somewhat badly in that balancing act, a lot
of loads probably won't notice that much. Either there is still so
much memory for caching that the added IO isn't really ever dominant,
or you had such a big load to begin with that it was long since
rewritten to use O_DIRECT.

            Linus
Jens Axboe Dec. 11, 2019, 7:34 p.m. UTC | #4
On 12/11/19 10:56 AM, Jens Axboe wrote:
>> But I think most of the regular IO call chains come through
>> "mark_page_accessed()". So _that_ is the part you want to avoid (and
>> maybe the workingset code). And that should be fairly straightforward,
>> I think.
> 
> Sure, I can give that a go and see how that behaves.

Before doing that, I ran a streamed read test instead of just random
reads, and the behavior is roughly the same. kswapd consumes a bit less
CPU, but it's still very active once the page cache has been filled. For
specifics on the setup, I deliberately boot the box with 32G of RAM, and
the dataset is 320G. My initial tests were with 1 320G file, but
Johannes complained about that so I went to 32 10G files instead. That's
what I'm currently using.

For the random test case, top of profile for kswapd is:

+   33.49%  kswapd0  [kernel.vmlinux]  [k] xas_create
+    7.93%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
+    7.18%  kswapd0  [kernel.vmlinux]  [k] unlock_page
+    5.90%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
+    5.64%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+    5.57%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
+    3.48%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
+    3.35%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
+    3.14%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache

Next I ran with NOT calling mark_page_accessed() to see if that makes a
difference. See patch below, I just applied this on top of this patchset
and added a new RWF_NOACCESS flag for it for ease of testing. I verified
that we are indeed skipping the mark_page_accessed() call in
generic_file_buffered_read().

I can't tell a difference in the results, there's no discernable
difference between NOT calling mark_page_accessed() or calling it.
Behavior seems about the same, in terms of pre and post page cache full,
and kswapd still churns a lot once the page cache is filled up.
Linus Torvalds Dec. 11, 2019, 8:03 p.m. UTC | #5
On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I can't tell a difference in the results, there's no discernable
> difference between NOT calling mark_page_accessed() or calling it.
> Behavior seems about the same, in terms of pre and post page cache full,
> and kswapd still churns a lot once the page cache is filled up.

Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it
when you do the IO that way, but it seems to be a bug regardless.

Does /proc/meminfo have everything inactive for file data (ie the
"Active(file)" line is basically zero?).

Maybe pages got activated other ways (eg a problem with the workingset
code)? You said "See patch below", but there wasn't any.

That said, it's also entirely possible that even with everything in
the inactive list, we might try to shrink other things first for
whatever odd reason..

The fact that you see that xas_create() so prominently would imply
perhaps add_to_swap_cache(), which certainly implies that the page
shrinking isn't hitting the file pages...

               Linus
Jens Axboe Dec. 11, 2019, 8:04 p.m. UTC | #6
On 12/11/19 12:34 PM, Jens Axboe wrote:
> On 12/11/19 10:56 AM, Jens Axboe wrote:
>>> But I think most of the regular IO call chains come through
>>> "mark_page_accessed()". So _that_ is the part you want to avoid (and
>>> maybe the workingset code). And that should be fairly straightforward,
>>> I think.
>>
>> Sure, I can give that a go and see how that behaves.
> 
> Before doing that, I ran a streamed read test instead of just random
> reads, and the behavior is roughly the same. kswapd consumes a bit less
> CPU, but it's still very active once the page cache has been filled. For
> specifics on the setup, I deliberately boot the box with 32G of RAM, and
> the dataset is 320G. My initial tests were with 1 320G file, but
> Johannes complained about that so I went to 32 10G files instead. That's
> what I'm currently using.
> 
> For the random test case, top of profile for kswapd is:
> 
> +   33.49%  kswapd0  [kernel.vmlinux]  [k] xas_create
> +    7.93%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
> +    7.18%  kswapd0  [kernel.vmlinux]  [k] unlock_page
> +    5.90%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
> +    5.64%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
> +    5.57%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
> +    3.48%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
> +    3.35%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
> +    3.14%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache

Here's the profile for the !mark_page_accessed() run, looks very much
the same:

+   32.84%  kswapd0  [kernel.vmlinux]  [k] xas_create
+    8.05%  kswapd0  [kernel.vmlinux]  [k] unlock_page
+    7.68%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
+    6.08%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
+    5.96%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+    5.56%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
+    4.02%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping
+    3.70%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
+    3.55%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
Jens Axboe Dec. 11, 2019, 8:08 p.m. UTC | #7
On 12/11/19 1:03 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 11:34 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> I can't tell a difference in the results, there's no discernable
>> difference between NOT calling mark_page_accessed() or calling it.
>> Behavior seems about the same, in terms of pre and post page cache full,
>> and kswapd still churns a lot once the page cache is filled up.
> 
> Yeah, that sounds like a bug. I'm sure the RWF_UNCACHED flag fixes it
> when you do the IO that way, but it seems to be a bug regardless.

Hard to disagree with that.

> Does /proc/meminfo have everything inactive for file data (ie the
> "Active(file)" line is basically zero?).

$ cat /proc/meminfo | grep -i active
Active:           134136 kB
Inactive:       28683916 kB
Active(anon):      97064 kB
Inactive(anon):        4 kB
Active(file):      37072 kB
Inactive(file): 28683912 kB

This is after a run with RWF_NOACCESS.

> Maybe pages got activated other ways (eg a problem with the workingset
> code)? You said "See patch below", but there wasn't any.

Oops, now below.

> 
> That said, it's also entirely possible that even with everything in
> the inactive list, we might try to shrink other things first for
> whatever odd reason..
> 
> The fact that you see that xas_create() so prominently would imply
> perhaps add_to_swap_cache(), which certainly implies that the page
> shrinking isn't hitting the file pages...

That's presumably misleading, as it's just lookups. But yes,
confusing...

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5ea5fc167524..b2ecc66f5bd5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -316,6 +316,7 @@ enum rw_hint {
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
 #define IOCB_UNCACHED		(1 << 8)
+#define IOCB_NOACCESS		(1 << 9)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3423,6 +3424,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= IOCB_APPEND;
 	if (flags & RWF_UNCACHED)
 		ki->ki_flags |= IOCB_UNCACHED;
+	if (flags & RWF_NOACCESS)
+		ki->ki_flags |= IOCB_NOACCESS;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 357ebb0e0c5d..f20f0048d5c5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -302,8 +302,10 @@ typedef int __bitwise __kernel_rwf_t;
 /* drop cache after reading or writing data */
 #define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
 
+#define RWF_NOACCESS	((__force __kernel_rwf_t)0x00000080)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND | RWF_UNCACHED)
+			 RWF_APPEND | RWF_UNCACHED | RWF_NOACCESS)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dadd1a4ca7c..c37b0e221a8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2058,7 +2058,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
 			/* UNCACHED implies no read-ahead */
-			if (iocb->ki_flags & IOCB_UNCACHED)
+			if (iocb->ki_flags & (IOCB_UNCACHED|IOCB_NOACCESS))
 				goto no_cached_page;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
@@ -2144,7 +2144,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
+		if ((prev_index != index || offset != prev_offset) &&
+		    !(iocb->ki_flags & IOCB_NOACCESS))
 			mark_page_accessed(page);
 		prev_index = index;
Linus Torvalds Dec. 11, 2019, 8:18 p.m. UTC | #8
On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> $ cat /proc/meminfo | grep -i active
> Active:           134136 kB
> Inactive:       28683916 kB
> Active(anon):      97064 kB
> Inactive(anon):        4 kB
> Active(file):      37072 kB
> Inactive(file): 28683912 kB

Yeah, that should not put pressure on some swap activity. We have 28
GB of basically free inactive file data, and the VM is doing something
very very bad if it then doesn't just quickly free it with no real
drama.

In fact, I don't think it should even trigger kswapd at all, it should
all be direct reclaim. Of course, some of the mm people hate that with
a passion, but this does look like a prime example of why it should
just be done.

MM people - mind giving this a look?  Jens, if you have that NOACCESS
flag in a git tree too and a trivial way to recreate your load, that
would be good for people to be able to just try things out.

                     Linus
Matthew Wilcox Dec. 11, 2019, 8:43 p.m. UTC | #9
On Wed, Dec 11, 2019 at 01:08:32PM -0700, Jens Axboe wrote:
> > The fact that you see that xas_create() so prominently would imply
> > perhaps add_to_swap_cache(), which certainly implies that the page
> > shrinking isn't hitting the file pages...
> 
> That's presumably misleading, as it's just lookups. But yes,
> confusing...

While xas_create() could be called directly, it's more often called
through xas_store() ... which would be called if we're storing a shadow
entry to replace a page, which this workload would presumably be doing.
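
For reference, roughly what that replacement looks like (a simplified
sketch of the cache tree deletion path, not the exact mm/filemap.c
code; the real function also handles compound pages):

static void page_cache_delete_sketch(struct address_space *mapping,
				     struct page *page, void *shadow)
{
	XA_STATE(xas, &mapping->i_pages, page->index);

	/* overwrite the page's slot with the shadow (workingset) entry */
	xas_store(&xas, shadow);
	/* clear the dirty/writeback marks for this index */
	xas_init_marks(&xas);

	page->mapping = NULL;
	mapping->nrpages--;
}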
Johannes Weiner Dec. 11, 2019, 9:04 p.m. UTC | #10
On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > $ cat /proc/meminfo | grep -i active
> > Active:           134136 kB
> > Inactive:       28683916 kB
> > Active(anon):      97064 kB
> > Inactive(anon):        4 kB
> > Active(file):      37072 kB
> > Inactive(file): 28683912 kB
> 
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.

I was looking at this with Jens offline last week. One thing to note
is the rate of IO that Jens is working with: combined with the low
cache hit rate, it was pushing upwards of half a million pages through
the page cache each second.

There isn't anything obvious sticking out in the kswapd profile: it's
dominated by cache tree deletions (or rather replacing pages with
shadow entries, hence the misleading xas_store()), tree lock
contention, etc. - all work that a direct reclaimer would have to do
as well, with one exception: RWF_UNCACHED doesn't need to go through
the LRU list, and 8-9% of kswapd cycles alone are going into
physically getting pages off the list. (And I suspect part of that is
also contention over the LRU lock as kswapd gets overwhelmed and
direct reclaim kicks in).

Jens, how much throughput difference does kswapd vs RWF_UNCACHED make?
Jens Axboe Dec. 11, 2019, 11:41 p.m. UTC | #11
On 12/11/19 1:18 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> $ cat /proc/meminfo | grep -i active
>> Active:           134136 kB
>> Inactive:       28683916 kB
>> Active(anon):      97064 kB
>> Inactive(anon):        4 kB
>> Active(file):      37072 kB
>> Inactive(file): 28683912 kB
> 
> Yeah, that should not put pressure on some swap activity. We have 28
> GB of basically free inactive file data, and the VM is doing something
> very very bad if it then doesn't just quickly free it with no real
> drama.
> 
> In fact, I don't think it should even trigger kswapd at all, it should
> all be direct reclaim. Of course, some of the mm people hate that with
> a passion, but this does look like a prime example of why it should
> just be done.

For giggles, I ran just a single thread on the file set. We're only
doing about 100K IOPS at that point, yet when the page cache fills,
kswapd still eats 10% cpu. That seems like a lot for something that
slow.

> MM people - mind giving this a look?  Jens, if you have that NOACCESS
> flag in a git tree too and a trivial way to recreate your load, that
> would be good for people to be able to just try things out.

I've pushed the NOACCESS thing to my buffered-uncached branch as well,
and fio has a 'noaccess' branch that enables it for pvsync2 (which is
preadv2/pwritev2) and the io_uring engine.

Here's what I did to reproduce:

- Boot the box with 32G of memory.
- On a fast device, create 10x RAM size of files. I used 32 files, each
  10G. Mine are in /data, and they are named file[1-32].
- Run a buffered read workload on those files.

For pvsync2, something ala:

$ cat job.fio
[test]
ioengine=pvsync2
#uncached=1
#noaccess=1
iodepth=4
bs=4k
group_reporting=1
rw=randread
norandommap
buffered=1
directory=/data
filename=file1:file2:file3:file4:file5:file6:file7:file8:file9:file10:file11:file12:file13:file14:file15:file16:file17:file18:file19:file20:file21:file22:file23:file24:file25:file26:file27:file28:file29:file30:file31:file32

If you want to use more than one thread, add:

numjobs=4

for 4 threads. Uncomment the 'uncached=1' and/or 'noaccess=1' to enable
either RWF_UNCACHED or RWF_NOACCESS.
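
If fio isn't handy, roughly the same random read load can be generated
with a small preadv2(2) loop like the one below. This is an untested
sketch; the flag values are the ones from this series and the NOACCESS
test patch above, not upstream flags:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040	/* from this patchset */
#endif
#ifndef RWF_NOACCESS
#define RWF_NOACCESS	0x00000080	/* from the NOACCESS test patch */
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int flags = 0;		/* or RWF_UNCACHED / RWF_NOACCESS */
	off_t size, off;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	size = lseek(fd, 0, SEEK_END);
	if (size < 4096)
		return 1;
	for (;;) {
		/* pick a random 4k-aligned offset in the file */
		off = (random() % (size / 4096)) * 4096;
		if (preadv2(fd, &iov, 1, off, flags) < 0)
			return 1;
	}
	return 0;
}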
Linus Torvalds Dec. 12, 2019, 1:08 a.m. UTC | #12
On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Here's what I did to reproduce:

Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970
Pro - I'm assuming yours is a different class), but I encrypt my disk,
so I only get about 15k iops and I see kcryptd in my profiles, not
kswapd.

I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I
wanted to see the problem. But I think that with my setup, I can't
really see it just due to my IO being slow ;(

I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the
very first signs of this problem, but it's certainly not all that
noticeable.

               Linus
Jens Axboe Dec. 12, 2019, 1:09 a.m. UTC | #13
On 12/11/19 4:41 PM, Jens Axboe wrote:
> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> $ cat /proc/meminfo | grep -i active
>>> Active:           134136 kB
>>> Inactive:       28683916 kB
>>> Active(anon):      97064 kB
>>> Inactive(anon):        4 kB
>>> Active(file):      37072 kB
>>> Inactive(file): 28683912 kB
>>
>> Yeah, that should not put pressure on some swap activity. We have 28
>> GB of basically free inactive file data, and the VM is doing something
>> very very bad if it then doesn't just quickly free it with no real
>> drama.
>>
>> In fact, I don't think it should even trigger kswapd at all, it should
>> all be direct reclaim. Of course, some of the mm people hate that with
>> a passion, but this does look like a prime example of why it should
>> just be done.
> 
> For giggles, I ran just a single thread on the file set. We're only
> doing about 100K IOPS at that point, yet when the page cache fills,
> kswapd still eats 10% cpu. That seems like a lot for something that
> slow.

Warning, the below is from the really crazy department...

Anyway, I took a closer look at the profiles for the uncached case.
We're spending a lot of time doing memsets (this is the xa_node init,
from the radix tree constructor), and call_rcu for the node free later
on. All wasted time, and something that meant we weren't as close to the
performance of O_DIRECT as we could be.

So Chris and I started talking about this, and pondered "what would
happen if we simply bypassed the page cache completely?". Case in point,
see below incremental patch. We still do the page cache lookup, and use
that page to copy from if it's there. If the page isn't there, allocate
one and do IO to it, but DON'T add it to the page cache. With that,
we're almost at O_DIRECT levels of performance for the 4k read case,
without 1-2%. I think 512b would look awesome, but we're reading full
pages, so that won't really help us much. Compared to the previous
uncached method, this is 30% faster on this device. That's substantial.

Obviously this has issues with truncate that would need to be resolved,
and it's definitely dirtier. But the performance is very enticing...


diff --git a/mm/filemap.c b/mm/filemap.c
index c37b0e221a8a..9834c394fffd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -933,8 +933,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 }
 EXPORT_SYMBOL(add_to_page_cache_locked);
 
-static int __add_to_page_cache(struct page *page, struct address_space *mapping,
-			       pgoff_t offset, gfp_t gfp_mask, bool lru)
+int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
+				pgoff_t offset, gfp_t gfp_mask)
 {
 	void *shadow = NULL;
 	int ret;
@@ -956,18 +956,11 @@ static int __add_to_page_cache(struct page *page, struct address_space *mapping,
 		WARN_ON_ONCE(PageActive(page));
 		if (!(gfp_mask & __GFP_WRITE) && shadow)
 			workingset_refault(page, shadow);
-		if (lru)
-			lru_cache_add(page);
+		lru_cache_add(page);
 	}
 	return ret;
 
 }
-
-int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-				pgoff_t offset, gfp_t gfp_mask)
-{
-	return __add_to_page_cache(page, mapping, offset, gfp_mask, true);
-}
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
@@ -1998,6 +1991,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2040,7 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
-		bool drop_page = false;
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2118,7 +2118,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2127,7 +2127,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2160,27 +2160,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		/*
-		 * If we're dropping this page due to drop-behind, then
-		 * lock it first. Ignore errors here, we can just leave it
-		 * in the page cache. Note that we didn't add this page to
-		 * the LRU when we added it to the page cache. So if we
-		 * fail removing it, or lock it, add to the LRU.
-		 */
-		if (drop_page) {
-			bool addlru = true;
-
-			if (!lock_page_killable(page)) {
-				if (page->mapping == mapping)
-					addlru = !remove_mapping(mapping, page);
-				else
-					addlru = false;
-				unlock_page(page);
-			}
-			if (addlru)
-				lru_cache_add(page);
-		}
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2222,7 +2202,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2239,7 +2219,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					 * invalidate_mapping_pages got it
 					 */
 					unlock_page(page);
-					put_page(page);
+					buffered_put_page(page, clear_mapping);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -2254,7 +2234,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2267,12 +2247,16 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		if (iocb->ki_flags & IOCB_UNCACHED)
-			drop_page = true;
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
 
-		error = __add_to_page_cache(page, mapping, index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL),
-				!drop_page);
+		error = add_to_page_cache(page, mapping, index,
+				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			put_page(page);
 			if (error == -EEXIST) {
Jens Axboe Dec. 12, 2019, 1:11 a.m. UTC | #14
On 12/11/19 6:08 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 3:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Here's what I did to reproduce:
> 
> Gaah. I have a fairly fast ssd (but it's "consumer fast" - Samsung 970
> Pro - I'm assuming yours is a different class), but I encrypt my disk,
> so I only get about 15k iops and I see kcryptd in my profiles, not
> kswapd.
> 
> I didn't even try to worry about RWF_UNCACHED or RWF_NOACCESS, since I
> wanted to see the problem. But I think that with my setup, I can't
> really see it just due to my IO being slow ;(
> 
> I do see xas_create and kswapd0, but it's 0.42%. I'm assuming it's the
> very first signs of this problem, but it's certainly not all that
> noticeable.

15K is likely too slow to really show an issue, I'm afraid. The 970
is no slouch, but your crypt setup will likely hamper it a lot. You
don't have a non-encrypted partition on it? I always set aside a
decently sized empty partition on my laptop to be able to run these
kinds of things without having to resort to a test box.

Latency is where it's down the most, you'll likely do better with
more threads to get the IOPS up.
Linus Torvalds Dec. 12, 2019, 1:22 a.m. UTC | #15
On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> 15K is likely too slow to really show an issue, I'm afraid. The 970
> is no slouch, but your crypt setup will likely hamper it a lot. You
> don't have a non-encrypted partition on it?

No. I normally don't need all that much disk, so I've never upgraded
my ssd from the 512G size.

Which means that it's actually half full or so, and I never felt like
"I should keep an unencrypted partition for IO testing", since I don't
generally _do_ any IO testing.

I can get my load up with "numjobs=8" and get my iops up to the 100k
range, though.

But kswapd doesn't much seem to care, the CPU percentage actually goes
_down_ to 0.39% when I try that. Probably simply because now my CPUs
are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
idle" state ...

I guess I should be happy. It does mean that the situation you see
isn't exactly the normal case. I understand why you want to do the
non-cached case, but the case I think is the worrisome one is the
regular buffered one, so that's what I'm testing (not even trying the
noaccess patches).

So from your report I went "uhhuh, that sounds like a bug". And it
appears that it largely isn't - you're seeing it because of pushing
the IO subsystem by another order of magnitude (and then I agree that
"under those kinds of IO loads, caching just won't help")

                   Linus
Jens Axboe Dec. 12, 2019, 1:29 a.m. UTC | #16
On 12/11/19 6:22 PM, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> 15K is likely too slow to really show an issue, I'm afraid. The 970
>> is no slouch, but your crypt setup will likely hamper it a lot. You
>> don't have a non-encrypted partition on it?
> 
> No. I normally don't need all that much disk, so I've never upgraded
> my ssd from the 512G size.
> 
> Which means that it's actually half full or so, and I never felt like
> "I should keep an unencrypted partition for IO testing", since I don't
> generally _do_ any IO testing.
> 
> I can get my load up with "numjobs=8" and get my iops up to the 100k
> range, though.
> 
> But kswapd doesn't much seem to care, the CPU percentage actually goes
> _down_ to 0.39% when I try that. Probably simply because now my CPUs
> are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
> idle" state ...
> 
> I guess I should be happy. It does mean that the situation you see
> isn't exactly the normal case. I understand why you want to do the
> non-cached case, but the case I think is the worrisome one is the
> regular buffered one, so that's what I'm testing (not even trying the
> noaccess patches).
> 
> So from your report I went "uhhuh, that sounds like a bug". And it
> appears that it largely isn't - you're seeing it because of pushing
> the IO subsystem by another order of magnitude (and then I agree that
> "under those kinds of IO loads, caching just won't help")

I'd very much argue that it IS a bug, maybe just doesn't show on your
system. My test box is a pretty standard 2 socket system, 24 cores / 48
threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing
crazy, and granted that's only 10% kswapd cpu time, but that still seems
very high for those kinds of rates. I'm surprised you see essentially no
kswapd time for the same data rate.

We'll keep poking here, I know Johannes is spending some time looking
into the reclaim side.
Jens Axboe Dec. 12, 2019, 1:30 a.m. UTC | #17
On 12/11/19 2:04 PM, Johannes Weiner wrote:
> On Wed, Dec 11, 2019 at 12:18:38PM -0800, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> $ cat /proc/meminfo | grep -i active
>>> Active:           134136 kB
>>> Inactive:       28683916 kB
>>> Active(anon):      97064 kB
>>> Inactive(anon):        4 kB
>>> Active(file):      37072 kB
>>> Inactive(file): 28683912 kB
>>
>> Yeah, that should not put pressure on some swap activity. We have 28
>> GB of basically free inactive file data, and the VM is doing something
>> very very bad if it then doesn't just quickly free it with no real
>> drama.
> 
> I was looking at this with Jens offline last week. One thing to note
> is the rate of IO that Jens is working with: combined with the low
> cache hit rate, it was pushing upwards of half a million pages through
> the page cache each second.
> 
> There isn't anything obvious sticking out in the kswapd profile: it's
> dominated by cache tree deletions (or rather replacing pages with
> shadow entries, hence the misleading xas_store()), tree lock
> contention, etc. - all work that a direct reclaimer would have to do
> as well, with one exception: RWF_UNCACHED doesn't need to go through
> the LRU list, and 8-9% of kswapd cycles alone are going into
> physically getting pages off the list. (And I suspect part of that is
> also contention over the LRU lock as kswapd gets overwhelmed and
> direct reclaim kicks in).
> 
> Jens, how much throughput difference does kswapd vs RWF_UNCACHED make?

It's not huge, like 5-10%. The CPU usage is the most noticeable,
particularly at the higher IO rates.
Linus Torvalds Dec. 12, 2019, 1:41 a.m. UTC | #18
On Wed, Dec 11, 2019 at 5:29 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> I'd very much argue that it IS a bug, maybe just doesn't show on your
> system.

Oh, I agree. But I also understand why people hadn't noticed, and I
don't think it's all that critical - because if you do 1M iops cached,
you're doing something really really strange.

I too can see xas_create using 30% CPU time, but that's when I do a
perf record on just kswapd - and when I actually look at it on a
system level, it looks nowhere near that bad.

So I think people should look at this. Part of it might be for Willy:
does that xas_create() need to be that expensive? I hate how "perf"
callchains work, but it does look like it is probably
page_cache_delete -> xas_store -> xas_create that is the bulk of the
cost there.

Replacing the real page with the shadow entry shouldn't be that big of
a deal, I would really hope.

Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace
thing, is there perhaps some simple optimization that could be done on
the XArray case here?

                Linus
Jens Axboe Dec. 12, 2019, 1:41 a.m. UTC | #19
On 12/11/19 6:29 PM, Jens Axboe wrote:
> On 12/11/19 6:22 PM, Linus Torvalds wrote:
>> On Wed, Dec 11, 2019 at 5:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> 15K is likely too slow to really show an issue, I'm afraid. The 970
>>> is no slouch, but your crypt setup will likely hamper it a lot. You
>>> don't have a non-encrypted partition on it?
>>
>> No. I normally don't need all that much disk, so I've never upgraded
>> my ssd from the 512G size.
>>
>> Which means that it's actually half full or so, and I never felt like
>> "I should keep an unencrypted partition for IO testing", since I don't
>> generally _do_ any IO testing.
>>
>> I can get my load up with "numjobs=8" and get my iops up to the 100k
>> range, though.
>>
>> But kswapd doesn't much seem to care, the CPU percentage actually goes
>> _down_ to 0.39% when I try that. Probably simply because now my CPUs
>> are busy, so they are running at 4.7GHz instead of the 800MHz "mostly
>> idle" state ...
>>
>> I guess I should be happy. It does mean that the situation you see
>> isn't exactly the normal case. I understand why you want to do the
>> non-cached case, but the case I think is the worrisome one is the
>> regular buffered one, so that's what I'm testing (not even trying the
>> noaccess patches).
>>
>> So from your report I went "uhhuh, that sounds like a bug". And it
>> appears that it largely isn't - you're seeing it because of pushing
>> the IO subsystem by another order of magnitude (and then I agree that
>> "under those kinds of IO loads, caching just won't help")
> 
> I'd very much argue that it IS a bug, maybe just doesn't show on your
> system. My test box is a pretty standard 2 socket system, 24 cores / 48
> threads, 2 nodes. The last numbers I sent were 100K IOPS, so nothing
> crazy, and granted that's only 10% kswapd cpu time, but that still seems
> very high for those kinds of rates. I'm surprised you see essentially no
> kswapd time for the same data rate.
> 
> We'll keep poking here, I know Johannes is spending some time looking
> into the reclaim side.

Out of curiosity, just tried it on my laptop, which also has some
samsung drive. Using 8 jobs, I get around 100K IOPS too, and this
is my top listing:

23308 axboe     20   0  623156   1304      8 D  10.3  0.0   0:03.81 fio
23309 axboe     20   0  623160   1304      8 D  10.3  0.0   0:03.81 fio
23311 axboe     20   0  623168   1304      8 D  10.3  0.0   0:03.82 fio
23313 axboe     20   0  623176   1304      8 D  10.3  0.0   0:03.82 fio
23314 axboe     20   0  623180   1304      8 D  10.3  0.0   0:03.81 fio
  162 root      20   0       0      0      0 S   9.9  0.0   0:12.97 kswapd0
23307 axboe     20   0  623152   1304      8 D   9.9  0.0   0:03.84 fio
23310 axboe     20   0  623164   1304      8 D   9.9  0.0   0:03.81 fio
23312 axboe     20   0  623172   1304      8 D   9.9  0.0   0:03.80 fio

kswapd is between 9-11% the whole time, and the profile looks very
similar to what I saw on my test box:

    35.79%  kswapd0  [kernel.vmlinux]  [k] xas_create
     9.97%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
     9.94%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
     7.78%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
     3.78%  kswapd0  [kernel.vmlinux]  [k] xas_clear_mark
     3.08%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
     2.48%  kswapd0  [kernel.vmlinux]  [k] __isolate_lru_page
     2.06%  kswapd0  [kernel.vmlinux]  [k] page_mapping
     1.95%  kswapd0  [kernel.vmlinux]  [k] __remove_mapping

So now I'm even more puzzled why your (desktop?) doesn't show it, it
must be more potent than my x1 laptop. But for me, the laptop and 2
socket test box show EXACTLY the same behavior, laptop is just too slow
to make it really pathological.
Linus Torvalds Dec. 12, 2019, 1:49 a.m. UTC | #20
On Wed, Dec 11, 2019 at 5:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> So now I'm even more puzzled why your (desktop?) doesn't show it, it
> must be more potent than my x1 laptop. But for me, the laptop and 2
> socket test box show EXACTLY the same behavior, laptop is just too slow
> to make it really pathological.

I see the exact same thing.

And now, go do "perf record -a sleep 100" while this is going on, to
see the big picture.

Suddenly that number is a lot smaller. Because the kswapd cost is less
than 10% of one cpu, and the xas_create is 30% of that. IOW, it's not
all that dominant until you zoom in on it. Considering the amount of
IO going on, it's kind of understandable.

But as mentioned, it does look like something that should be fixed.
Maybe it's even something where it's a bad XArray case? Willy?

              Linus
Matthew Wilcox Dec. 12, 2019, 1:56 a.m. UTC | #21
On Wed, Dec 11, 2019 at 05:41:16PM -0800, Linus Torvalds wrote:
> I too can see xas_create using 30% CPU time, but that's when I do a
> perf record on just kswapd - and when I actually look at it on a
> system level, it looks nowhere near that bad.
> 
> So I think people should look at this. Part of it might be for Willy:
> does that xas_create() need to be that expensive? I hate how "perf"
> callchains work, but it does look like it is probably
> page_cache_delete -> xas_store -> xas_create that is the bulk of the
> cost there.
> 
> Replacing the real page with the shadow entry shouldn't be that big of
> a deal, I would really hope.
> 
> Willy, that used to be a __radix_tree_lookup -> __radix_tree_replace
> thing, is there perhaps some simple optimization that could be done on
> the XArray case here?

It _should_ be the same order of complexity.  Since there's already
a page in the page cache, xas_create() just walks its way down to the
right node calling xas_descend() and then xas_store() does the equivalent
of __radix_tree_replace().  I don't see a bug that would make it more
expensive than the old code ... a 10GB file is going to have four levels
of radix tree node, so it shouldn't even spend that long looking for
the right node.
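
(For reference, the depth works out like this with 4K pages and the
XArray's 64-slot nodes: a 10GB file is 10 * 2^30 / 4096 = 2,621,440
pages, and 64^3 = 262,144 < 2,621,440 <= 64^4 = 16,777,216, so four
levels of nodes are needed to index it.)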
Jens Axboe Dec. 12, 2019, 2:03 a.m. UTC | #22
On 12/11/19 6:09 PM, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
>> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> $ cat /proc/meminfo | grep -i active
>>>> Active:           134136 kB
>>>> Inactive:       28683916 kB
>>>> Active(anon):      97064 kB
>>>> Inactive(anon):        4 kB
>>>> Active(file):      37072 kB
>>>> Inactive(file): 28683912 kB
>>>
>>> Yeah, that should not put pressure on some swap activity. We have 28
>>> GB of basically free inactive file data, and the VM is doing something
>>> very very bad if it then doesn't just quickly free it with no real
>>> drama.
>>>
>>> In fact, I don't think it should even trigger kswapd at all, it should
>>> all be direct reclaim. Of course, some of the mm people hate that with
>>> a passion, but this does look like a prime example of why it should
>>> just be done.
>>
>> For giggles, I ran just a single thread on the file set. We're only
>> doing about 100K IOPS at that point, yet when the page cache fills,
>> kswapd still eats 10% cpu. That seems like a lot for something that
>> slow.
> 
> Warning, the below is from the really crazy department...
> 
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to the
> performance of O_DIRECT as we could be.
> 
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in point,
> see below incremental patch. We still do the page cache lookup, and use
> that page to copy from if it's there. If the page isn't there, allocate
> one and do IO to it, but DON'T add it to the page cache. With that,
> we're almost at O_DIRECT levels of performance for the 4k read case,
> within 1-2%. I think 512b would look awesome, but we're reading full
> pages, so that won't really help us much. Compared to the previous
> uncached method, this is 30% faster on this device. That's substantial.
> 
> Obviously this has issues with truncate that would need to be resolved,
> and it's definitely dirtier. But the performance is very enticing...

Tested and cleaned a bit, and added truncate protection through
inode_dio_begin()/inode_dio_end().

https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87

This is much faster than the previous page cache dance, and I _think_
we're ok as long as we block truncate and hole punching.

Comments?
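
For anyone not familiar with the O_DIRECT convention being borrowed
here, the truncate interlock is roughly this (a simplified sketch, not
the exact call sites; the waiters are the filesystem setattr/truncate
paths):

	/* uncached read side, bracketing the private-page IO */
	inode_dio_begin(inode);		/* bumps inode->i_dio_count */
	/* ... allocate private page, ->readpage(), copy to userspace ... */
	inode_dio_end(inode);		/* drops the count, wakes waiters */

	/* truncate/hole punch side (e.g. filesystem setattr path) */
	inode_dio_wait(inode);		/* sleeps until i_dio_count == 0 */
	truncate_setsize(inode, new_size);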
Jens Axboe Dec. 12, 2019, 2:10 a.m. UTC | #23
On 12/11/19 7:03 PM, Jens Axboe wrote:
> On 12/11/19 6:09 PM, Jens Axboe wrote:
>> On 12/11/19 4:41 PM, Jens Axboe wrote:
>>> On 12/11/19 1:18 PM, Linus Torvalds wrote:
>>>> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> $ cat /proc/meminfo | grep -i active
>>>>> Active:           134136 kB
>>>>> Inactive:       28683916 kB
>>>>> Active(anon):      97064 kB
>>>>> Inactive(anon):        4 kB
>>>>> Active(file):      37072 kB
>>>>> Inactive(file): 28683912 kB
>>>>
>>>> Yeah, that should not put pressure on some swap activity. We have 28
>>>> GB of basically free inactive file data, and the VM is doing something
>>>> very very bad if it then doesn't just quickly free it with no real
>>>> drama.
>>>>
>>>> In fact, I don't think it should even trigger kswapd at all, it should
>>>> all be direct reclaim. Of course, some of the mm people hate that with
>>>> a passion, but this does look like a prime example of why it should
>>>> just be done.
>>>
>>> For giggles, I ran just a single thread on the file set. We're only
>>> doing about 100K IOPS at that point, yet when the page cache fills,
>>> kswapd still eats 10% cpu. That seems like a lot for something that
>>> slow.
>>
>> Warning, the below is from the really crazy department...
>>
>> Anyway, I took a closer look at the profiles for the uncached case.
>> We're spending a lot of time doing memsets (this is the xa_node init,
>> from the radix tree constructor), and call_rcu for the node free later
>> on. All wasted time, and something that meant we weren't as close to the
>> performance of O_DIRECT as we could be.
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in point,
>> see below incremental patch. We still do the page cache lookup, and use
>> that page to copy from if it's there. If the page isn't there, allocate
>> one and do IO to it, but DON'T add it to the page cache. With that,
>> we're almost at O_DIRECT levels of performance for the 4k read case,
>> within 1-2%. I think 512b would look awesome, but we're reading full
>> pages, so that won't really help us much. Compared to the previous
>> uncached method, this is 30% faster on this device. That's substantial.
>>
>> Obviously this has issues with truncate that would need to be resolved,
>> and it's definitely dirtier. But the performance is very enticing...
> 
> Tested and cleaned a bit, and added truncate protection through
> inode_dio_begin()/inode_dio_end().
> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
> 
> This is much faster than the previous page cache dance, and I _think_
> we're ok as long as we block truncate and hole punching.
> 
> Comments?

Incremental was unnecessarily hard to read, here's the original
RWF_UNCACHED with it folded in instead for easier reading:

commit d1c9eec7b958760ea4d4e75483b0f2e3cfec53d5
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Dec 5 17:45:00 2019 -0700

    fs: add read support for RWF_UNCACHED
    
    If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
    pages for the buffered reads. These pages will never be inserted into
    the page cache, and they are simply dropped when we have done the copy at
    the end of IO.
    
    If pages in the read range are already in the page cache, then use those
    for just copying the data instead of starting IO on private pages.
    
    A previous solution used the page cache even for non-cached ranges, but
    the cost of doing so was too high. Removing nodes at the end is
    expensive, even with LRU bypass. On top of that, repeatedly
    instantiating new xarray nodes is very costly, as it needs to memset 576
    bytes of data, and freeing said nodes involve an RCU call per node as
    well. All that adds up, making uncached somewhat slower than O_DIRECT.
    
    With the current solution, we're basically at O_DIRECT levels of
    performance for RWF_UNCACHED IO.
    
    Protect against truncate the same way O_DIRECT does, by calling
    inode_dio_begin() to elevate the inode->i_dio_count.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..092ea2a4319b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_UNCACHED		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -3418,6 +3419,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 		ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
 	if (flags & RWF_APPEND)
 		ki->ki_flags |= IOCB_APPEND;
+	if (flags & RWF_UNCACHED)
+		ki->ki_flags |= IOCB_UNCACHED;
 	return 0;
 }
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..357ebb0e0c5d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* drop cache after reading or writing data */
+#define RWF_UNCACHED	((__force __kernel_rwf_t)0x00000040)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_UNCACHED)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/filemap.c b/mm/filemap.c
index bf6aa30be58d..5d299d69b185 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1990,6 +1990,13 @@ static void shrink_readahead_size_eio(struct file *filp,
 	ra->ra_pages /= 4;
 }
 
+static void buffered_put_page(struct page *page, bool clear_mapping)
+{
+	if (clear_mapping)
+		page->mapping = NULL;
+	put_page(page);
+}
+
 /**
  * generic_file_buffered_read - generic file read routine
  * @iocb:	the iocb to read
@@ -2013,6 +2020,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	bool did_dio_begin = false;
 	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
@@ -2032,6 +2040,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	offset = *ppos & ~PAGE_MASK;
 
 	for (;;) {
+		bool clear_mapping = false;
 		struct page *page;
 		pgoff_t end_index;
 		loff_t isize;
@@ -2048,6 +2057,13 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (!page) {
 			if (iocb->ki_flags & IOCB_NOWAIT)
 				goto would_block;
+			/* UNCACHED implies no read-ahead */
+			if (iocb->ki_flags & IOCB_UNCACHED) {
+				did_dio_begin = true;
+				/* block truncate for UNCACHED reads */
+				inode_dio_begin(inode);
+				goto no_cached_page;
+			}
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -2106,7 +2122,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			put_page(page);
+			buffered_put_page(page, clear_mapping);
 			goto out;
 		}
 
@@ -2115,7 +2131,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_MASK) + 1;
 			if (nr <= offset) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				goto out;
 			}
 		}
@@ -2147,7 +2163,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		offset &= ~PAGE_MASK;
 		prev_offset = offset;
 
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		written += ret;
 		if (!iov_iter_count(iter))
 			goto out;
@@ -2189,7 +2205,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		if (unlikely(error)) {
 			if (error == AOP_TRUNCATED_PAGE) {
-				put_page(page);
+				buffered_put_page(page, clear_mapping);
 				error = 0;
 				goto find_page;
 			}
@@ -2206,7 +2222,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					 * invalidate_mapping_pages got it
 					 */
 					unlock_page(page);
-					put_page(page);
+					buffered_put_page(page, clear_mapping);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -2221,7 +2237,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
-		put_page(page);
+		buffered_put_page(page, clear_mapping);
 		goto out;
 
 no_cached_page:
@@ -2234,7 +2250,15 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			error = -ENOMEM;
 			goto out;
 		}
-		error = add_to_page_cache_lru(page, mapping, index,
+		if (iocb->ki_flags & IOCB_UNCACHED) {
+			__SetPageLocked(page);
+			page->mapping = mapping;
+			page->index = index;
+			clear_mapping = true;
+			goto readpage;
+		}
+
+		error = add_to_page_cache(page, mapping, index,
 				mapping_gfp_constraint(mapping, GFP_KERNEL));
 		if (error) {
 			put_page(page);
@@ -2250,6 +2274,8 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 would_block:
 	error = -EAGAIN;
 out:
+	if (did_dio_begin)
+		inode_dio_end(inode);
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_SHIFT;
 	ra->prev_pos |= prev_offset;
Matthew Wilcox Dec. 12, 2019, 2:21 a.m. UTC | #24
On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote:
> Tested and cleaned a bit, and added truncate protection through
> inode_dio_begin()/inode_dio_end().
> 
> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
> 
> This is much faster than the previous page cache dance, and I _think_
> we're ok as long as we block truncate and hole punching.

What about races between UNCACHED and regular buffered IO?  Seems like
we might end up sending overlapping writes to the device?
Jens Axboe Dec. 12, 2019, 2:38 a.m. UTC | #25
On 12/11/19 7:21 PM, Matthew Wilcox wrote:
> On Wed, Dec 11, 2019 at 07:03:40PM -0700, Jens Axboe wrote:
>> Tested and cleaned a bit, and added truncate protection through
>> inode_dio_begin()/inode_dio_end().
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=buffered-uncached&id=6dac80bc340dabdcbfb4230b9331e52510acca87
>>
>> This is much faster than the previous page cache dance, and I _think_
>> we're ok as long as we block truncate and hole punching.
> 
> What about races between UNCACHED and regular buffered IO?  Seems like
> we might end up sending overlapping writes to the device?

The UNCACHED out-of-page-cache IO is just for reads, not for writes. The
write part is still using the same approach as earlier, though now I bet
it looks kinda slow compared to the read side...
Linus Torvalds Dec. 12, 2019, 2:47 a.m. UTC | #26
On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> It _should_ be the same order of complexity.  Since there's already
> a page in the page cache, xas_create() just walks its way down to the
> right node calling xas_descend() and then xas_store() does the equivalent
> of __radix_tree_replace().  I don't see a bug that would make it more
> expensive than the old code ... a 10GB file is going to have four levels
> of radix tree node, so it shouldn't even spend that long looking for
> the right node.

The profile seems to say that 85% of the cost of xas_create() is two
instructions:

# lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
        movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13]
...
# ./include/linux/xarray.h:1145:        return rcu_dereference_check(node->slots[offset],
# ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
        movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>

where that first load is "node->shift", and the second load seems to
be just the node dereference.

I think both of them are basically just xas_descend(), but it's a bit
hard to read the several levels of inlining.

I suspect the real issue is that going through kswapd means we've lost
almost all locality, and with the random walking the LRU list is
basically entirely unordered so you don't get any advantage of the
xarray having (otherwise) possibly good cache patterns.

So we're just missing in the caches all the time, making it expensive.

              Linus
Martin Steigerwald Dec. 12, 2019, 10:44 a.m. UTC | #27
Hi Jens.

Jens Axboe - 11.12.19, 16:29:38 CET:
> Recently someone asked me how io_uring buffered IO compares to mmaped
> IO in terms of performance. So I ran some tests with buffered IO, and
> found the experience to be somewhat painful. The test case is pretty
> basic, random reads over a dataset that's 10x the size of RAM.
> Performance starts out fine, and then the page cache fills up and we
> hit a throughput cliff. CPU usage of the IO threads go up, and we have
> kswapd spending 100% of a core trying to keep up. Seeing that, I was
> reminded of the many complaints I here about buffered IO, and the
> fact that most of the folks complaining will ultimately bite the
> bullet and move to O_DIRECT to just get the kernel out of the way.
> 
> But I don't think it needs to be like that. Switching to O_DIRECT
> isn't always easily doable. The buffers have different life times,
> size and alignment constraints, etc. On top of that, mixing buffered
> and O_DIRECT can be painful.
> 
> Seems to me that we have an opportunity to provide something that sits
> somewhere in between buffered and O_DIRECT, and this is where
> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
> the following behavior:
> 
> - If the data is in cache, it remains in cache and the copy (in or
> out) is served to/from that.
> 
> - If the data is NOT in cache, we add it while performing the IO. When
> the IO is done, we remove it again.
> 
> With this, I can do 100% smooth buffered reads or writes without
> pushing the kernel to the state where kswapd is sweating bullets. In
> fact it doesn't even register.

A question from a user or Linux Performance trainer perspective:

How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED, which
for example the nocache¹ command uses? Excerpt from the posix_fadvise(2)
manpage:

       POSIX_FADV_DONTNEED
              The specified data will not be accessed  in  the  near
              future.

              POSIX_FADV_DONTNEED  attempts to free cached pages as‐
              sociated with the specified region.  This  is  useful,
              for  example,  while streaming large files.  A program
              may periodically request the  kernel  to  free  cached
              data  that  has already been used, so that more useful
              cached pages are not discarded instead.

[1] packaged in Debian as nocache, or available here: https://github.com/Feh/nocache

In any case, it would be nice to have some option for this in rsync… I still
have not changed my backup script to call rsync via nocache.

Thanks,
Martin

> Comments appreciated! This should work on any standard file system,
> using either the generic helpers or iomap. I have tested ext4 and xfs
> for the right read/write behavior, but no further validation has been
> done yet. Patches are against current git, and can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=buffered-uncached
> 
>  fs/ceph/file.c          |  2 +-
>  fs/dax.c                |  2 +-
>  fs/ext4/file.c          |  2 +-
>  fs/iomap/apply.c        | 26 ++++++++++-
>  fs/iomap/buffered-io.c  | 54 ++++++++++++++++-------
>  fs/iomap/direct-io.c    |  3 +-
>  fs/iomap/fiemap.c       |  5 ++-
>  fs/iomap/seek.c         |  6 ++-
>  fs/iomap/swapfile.c     |  2 +-
>  fs/nfs/file.c           |  2 +-
>  include/linux/fs.h      |  7 ++-
>  include/linux/iomap.h   | 10 ++++-
>  include/uapi/linux/fs.h |  5 ++-
>  mm/filemap.c            | 95
> ++++++++++++++++++++++++++++++++++++----- 14 files changed, 181
> insertions(+), 40 deletions(-)
> 
> Changes since v2:
> - Rework the write side according to Chinners suggestions. Much
> cleaner this way. It does mean that we invalidate the full write
> region if just ONE page (or more) had to be created, where before it
> was more granular. I don't think that's a concern, and on the plus
> side, we now no longer have to chunk invalidations into 15/16 pages
> at the time.
> - Cleanups
> 
> Changes since v1:
> - Switch to pagevecs for write_drop_cached_pages()
> - Use page_offset() instead of manual shift
> - Ensure we hold a reference on the page between calling ->write_end()
> and checking the mapping on the locked page
> - Fix XFS multi-page streamed writes, we'd drop the UNCACHED flag
> after the first page
Jens Axboe Dec. 12, 2019, 3:16 p.m. UTC | #28
On 12/12/19 3:44 AM, Martin Steigerwald wrote:
> Hi Jens.
> 
> Jens Axboe - 11.12.19, 16:29:38 CET:
>> Recently someone asked me how io_uring buffered IO compares to mmaped
>> IO in terms of performance. So I ran some tests with buffered IO, and
>> found the experience to be somewhat painful. The test case is pretty
>> basic, random reads over a dataset that's 10x the size of RAM.
>> Performance starts out fine, and then the page cache fills up and we
>> hit a throughput cliff. CPU usage of the IO threads go up, and we have
>> kswapd spending 100% of a core trying to keep up. Seeing that, I was
>> reminded of the many complaints I here about buffered IO, and the
>> fact that most of the folks complaining will ultimately bite the
>> bullet and move to O_DIRECT to just get the kernel out of the way.
>>
>> But I don't think it needs to be like that. Switching to O_DIRECT
>> isn't always easily doable. The buffers have different life times,
>> size and alignment constraints, etc. On top of that, mixing buffered
>> and O_DIRECT can be painful.
>>
>> Seems to me that we have an opportunity to provide something that sits
>> somewhere in between buffered and O_DIRECT, and this is where
>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
>> the following behavior:
>>
>> - If the data is in cache, it remains in cache and the copy (in or
>> out) is served to/from that.
>>
>> - If the data is NOT in cache, we add it while performing the IO. When
>> the IO is done, we remove it again.
>>
>> With this, I can do 100% smooth buffered reads or writes without
>> pushing the kernel to the state where kswapd is sweating bullets. In
>> fact it doesn't even register.
> 
> A question from a user or Linux Performance trainer perspective:
> 
> How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED that 
> for example the nocache¹ command is using? Excerpt from manpage 
> posix_fadvice(2):
> 
>        POSIX_FADV_DONTNEED
>               The specified data will not be accessed  in  the  near
>               future.
> 
>               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
>               sociated with the specified region.  This  is  useful,
>               for  example,  while streaming large files.  A program
>               may periodically request the  kernel  to  free  cached
>               data  that  has already been used, so that more useful
>               cached pages are not discarded instead.
> 
> [1] packaged in Debian as nocache or available herehttps://github.com/
> Feh/nocache
> 
> In any way, would be nice to have some option in rsync… I still did not 
> change my backup script to call rsync via nocache.

I don't know the nocache tool, but I'm guessing it just does the writes
(or reads) and then uses FADV_DONTNEED to drop behind those pages?
That's fine for slower use cases, but it won't work very well for fast IO.
The write side currently works pretty much like that internally, whereas
the read side doesn't use the page cache at all.
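
For reference, the drop-behind pattern that nocache-style tools approximate
looks roughly like the sketch below: a plain buffered write, kicking off
writeback with sync_file_range(), then dropping the now-clean pages with
POSIX_FADV_DONTNEED. The file name and 1MB chunk size are made up and error
handling is minimal; it's only meant to show the shape of the dance that
RWF_UNCACHED would replace.

/* Userspace drop-behind sketch; "data.bin" and the 1MB chunks are arbitrary */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
	char *buf = malloc(1 << 20);
	off_t off = 0;

	if (fd < 0 || !buf)
		return 1;
	memset(buf, 'a', 1 << 20);

	for (int i = 0; i < 64; i++) {
		ssize_t ret = pwrite(fd, buf, 1 << 20, off);

		if (ret <= 0)
			break;
		/* start writeback for this chunk and wait for it to finish... */
		sync_file_range(fd, off, ret, SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		/* ...then drop the now-clean pages from the page cache */
		posix_fadvise(fd, off, ret, POSIX_FADV_DONTNEED);
		off += ret;
	}
	free(buf);
	close(fd);
	return 0;
}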
Matthew Wilcox Dec. 12, 2019, 5:52 p.m. UTC | #29
On Wed, Dec 11, 2019 at 06:47:25PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> > It _should_ be the same order of complexity.  Since there's already
> > a page in the page cache, xas_create() just walks its way down to the
> > right node calling xas_descend() and then xas_store() does the equivalent
> > of __radix_tree_replace().  I don't see a bug that would make it more
> > expensive than the old code ... a 10GB file is going to have four levels
> > of radix tree node, so it shouldn't even spend that long looking for
> > the right node.
> 
> The profile seems to say that 85% of the cost of xas_create() is two
> instructions:
> 
> # lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
>         movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13],
> MEM[(unsigned char *)node_13]
> ...
> # ./include/linux/xarray.h:1145:        return
> rcu_dereference_check(node->slots[offset],
> # ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
>         movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>
> 
> where that first load is "node->shift", and the second load seems to
> be just the node dereference.
> 
> I think both of them are basically just xas_descend(), but it's a bit
> hard to read the several levels of inlining.

Yep, that's basically the two cache misses you get for each level of
the tree.  I'd expect the node->shift to be slightly more costly because
sometimes the node->slots[offset] is going to be in the same cacheline
or the next cacheline after node->shift.

> I suspect the real issue is that going through kswapd means we've lost
> almost all locality, and with the random walking the LRU list is
> basically entirely unordered so you don't get any advantage of the
> xarray having (otherwise) possibly good cache patterns.
> 
> So we're just missing in the caches all the time, making it expensive.

I have some bad ideas for improving it.

1. We could semi-sort the pages on the LRU list.  If we know we're going
to remove a bunch of pages, we could take a batch of them off the list,
sort them and remove them in-order.  This probably wouldn't be terribly
effective.

2. We could change struct page to point to the xa_node that holds them.
Looking up the page mapping would be page->xa_node->array and then
offsetof(i_pages) to get the mapping.

3. Once the maple tree is ready, a 10GB file can actually be held more
efficiently in a maple tree than in the radix tree.  Because each level
of the tree is 128 bytes and (Intel) CPUs fetch a pair of cachelines,
there's only one cache miss per level of the tree.  So despite the tree
being deeper, there are fewer cache misses.  With an average of 6 pointers
per level of the tree, terminating in a dense node, we'd see:

: 6 * 6 * 6 * 6 * 6 * 6 * 6 * 15
: 4199040
(a 10GB file contains 2621440 4kB pages)

which is still eight cache misses, so to see an improvement, we'd need to
implement another planned improvement, which is allocating a large, dense
leaf node of 4kB.  That would reduce the height of the tree by 2:

: 6 * 6 * 6 * 6 * 6 * 500
: 3888000

4. For consecutive accesses, I'm working on allocating larger pages
(16kB, 64kB, 256kB, ...).  That will help by reducing the number of
deletions from the page cache by a factor of 4/16/64/...
Linus Torvalds Dec. 12, 2019, 6:29 p.m. UTC | #30
On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> 1. We could semi-sort the pages on the LRU list.  If we know we're going
> to remove a bunch of pages, we could take a batch of them off the list,
> sort them and remove them in-order.  This probably wouldn't be terribly
> effective.

I don't think the sorting is relevant.

Once you batch things, you already would get most of the locality
advantage in the cache if it exists (and the batch isn't insanely
large so that one batch already causes cache overflows).

The problem - I suspect - is that we don't batch at all. Or rather,
the "batching" does exist at a high level, but it's so high that
there's just tons of stuff going on between single pages. It is at the
shrink_page_list() level, which is pretty high up and basically does
one page at a time with locking and a lot of tests for each page, and
then we do "__remove_mapping()" (which does some more work) one at a
time before we actually get to __delete_from_page_cache().

So it's "batched", but it's in a huge loop, and even at that huge loop
level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX,
which is just 32.

Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
circumstances, but not necessarily in the "shrink clean inactive
pages" thing. I wonder if we could just batch clean pages a _lot_ more
aggressively. Yes, our batching loop is still very big and it might
not help at an L1 level, but it might help in the L2, at least.

In kswapd, when we have 28 GB of pages on the inactive list, a batch
of 32 pages at a time is pretty small ;)

> 2. We could change struct page to point to the xa_node that holds them.
> Looking up the page mapping would be page->xa_node->array and then
> offsetof(i_pages) to get the mapping.

I don't think we have space in 'struct page', and I'm pretty sure we
don't want to grow it. That's one of the more common data structures
in the kernel.

                         Linus
Matthew Wilcox Dec. 12, 2019, 8:05 p.m. UTC | #31
On Thu, Dec 12, 2019 at 10:29:02AM -0800, Linus Torvalds wrote:
> On Thu, Dec 12, 2019 at 9:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> > 1. We could semi-sort the pages on the LRU list.  If we know we're going
> > to remove a bunch of pages, we could take a batch of them off the list,
> > sort them and remove them in-order.  This probably wouldn't be terribly
> > effective.
> 
> I don't think the sorting is relevant.
> 
> Once you batch things, you already would get most of the locality
> advantage in the cache if it exists (and the batch isn't insanely
> large so that one batch already causes cache overflows).
> 
> The problem - I suspect - is that we don't batch at all. Or rather,
> the "batching" does exist at a high level, but it's so high that
> there's just tons of stuff going on between single pages. It is at the
> shrink_page_list() level, which is pretty high up and basically does
> one page at a time with locking and a lot of tests for each page, and
> then we do "__remove_mapping()" (which does some more work) one at a
> time before we actually get to __delete_from_page_cache().
> 
> So it's "batched", but it's in a huge loop, and even at that huge loop
> level the batch size is fairly small. We limit it to SWAP_CLUSTER_MAX,
> which is just 32.
> 
> Thinking about it, that SWAP_CLUSTER_MAX may make sense in some other
> circumstances, but not necessarily in the "shrink clean inactive
> pages" thing. I wonder if we could just batch clean pages a _lot_ more
> aggressively. Yes, our batching loop is still very big and it might
> not help at an L1 level, but it might help in the L2, at least.
> 
> In kswapd, when we have 28 GB of pages on the inactive list, a batch
> of 32 pages at a time is pretty small ;)

Yeah, that's pretty poor.  I just read through it, and even if pages are
in order on the page list, they're not going to batch nicely.  It'd be
nice to accumulate them and call delete_from_page_cache_batch(), but we
need to put shadow entries in to replace them, so we'd need a variant
of that which took two pagevecs.

> > 2. We could change struct page to point to the xa_node that holds them.
> > Looking up the page mapping would be page->xa_node->array and then
> > offsetof(i_pages) to get the mapping.
> 
> I don't think we have space in 'struct page', and I'm pretty sure we
> don't want to grow it. That's one of the more common data structures
> in the kernel.

Oh, I wasn't clear.  I meant replace page->mapping with page->xa_node.
We could still get from page to mapping, but it would be an extra
dereference.  I did say it was a _bad_ idea.
Martin Steigerwald Dec. 12, 2019, 9:45 p.m. UTC | #32
Jens Axboe - 12.12.19, 16:16:31 CET:
> On 12/12/19 3:44 AM, Martin Steigerwald wrote:
> > Jens Axboe - 11.12.19, 16:29:38 CET:
> >> Recently someone asked me how io_uring buffered IO compares to
> >> mmaped
> >> IO in terms of performance. So I ran some tests with buffered IO,
> >> and
> >> found the experience to be somewhat painful. The test case is
> >> pretty
> >> basic, random reads over a dataset that's 10x the size of RAM.
> >> Performance starts out fine, and then the page cache fills up and
> >> we
> >> hit a throughput cliff. CPU usage of the IO threads go up, and we
> >> have kswapd spending 100% of a core trying to keep up. Seeing
> >> that, I was reminded of the many complaints I here about buffered
> >> IO, and the fact that most of the folks complaining will
> >> ultimately bite the bullet and move to O_DIRECT to just get the
> >> kernel out of the way.
> >> 
> >> But I don't think it needs to be like that. Switching to O_DIRECT
> >> isn't always easily doable. The buffers have different life times,
> >> size and alignment constraints, etc. On top of that, mixing
> >> buffered
> >> and O_DIRECT can be painful.
> >> 
> >> Seems to me that we have an opportunity to provide something that
> >> sits somewhere in between buffered and O_DIRECT, and this is where
> >> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
> >> the following behavior:
> >> 
> >> - If the data is in cache, it remains in cache and the copy (in or
> >> out) is served to/from that.
> >> 
> >> - If the data is NOT in cache, we add it while performing the IO.
> >> When the IO is done, we remove it again.
> >> 
> >> With this, I can do 100% smooth buffered reads or writes without
> >> pushing the kernel to the state where kswapd is sweating bullets.
> >> In
> >> fact it doesn't even register.
> > 
> > A question from a user or Linux Performance trainer perspective:
> > 
> > How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED
> > that for example the nocache¹ command is using? Excerpt from
> > manpage> 
> > posix_fadvice(2):
> >        POSIX_FADV_DONTNEED
> >        
> >               The specified data will not be accessed  in  the  near
> >               future.
> >               
> >               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
> >               sociated with the specified region.  This  is  useful,
> >               for  example,  while streaming large files.  A program
> >               may periodically request the  kernel  to  free  cached
> >               data  that  has already been used, so that more useful
> >               cached pages are not discarded instead.
> > 
> > [1] packaged in Debian as nocache or available
> > herehttps://github.com/ Feh/nocache
> > 
> > In any way, would be nice to have some option in rsync… I still did
> > not change my backup script to call rsync via nocache.
> 
> I don't know the nocache tool, but I'm guessing it just does the
> writes (or reads) and then uses FADV_DONTNEED to drop behind those
> pages? That's fine for slower use cases, it won't work very well for
> fast IO. The write side currently works pretty much like that
> internally, whereas the read side doesn't use the page cache at all.

Yes, it does that. And yeah, I saw you changed the read side to bypass
the cache entirely.

Also, as I understand it, this is primarily for asynchronous IO using
io_uring?
Jens Axboe Dec. 12, 2019, 10:15 p.m. UTC | #33
On 12/12/19 2:45 PM, Martin Steigerwald wrote:
> Jens Axboe - 12.12.19, 16:16:31 CET:
>> On 12/12/19 3:44 AM, Martin Steigerwald wrote:
>>> Jens Axboe - 11.12.19, 16:29:38 CET:
>>>> Recently someone asked me how io_uring buffered IO compares to
>>>> mmaped
>>>> IO in terms of performance. So I ran some tests with buffered IO,
>>>> and
>>>> found the experience to be somewhat painful. The test case is
>>>> pretty
>>>> basic, random reads over a dataset that's 10x the size of RAM.
>>>> Performance starts out fine, and then the page cache fills up and
>>>> we
>>>> hit a throughput cliff. CPU usage of the IO threads go up, and we
>>>> have kswapd spending 100% of a core trying to keep up. Seeing
>>>> that, I was reminded of the many complaints I here about buffered
>>>> IO, and the fact that most of the folks complaining will
>>>> ultimately bite the bullet and move to O_DIRECT to just get the
>>>> kernel out of the way.
>>>>
>>>> But I don't think it needs to be like that. Switching to O_DIRECT
>>>> isn't always easily doable. The buffers have different life times,
>>>> size and alignment constraints, etc. On top of that, mixing
>>>> buffered
>>>> and O_DIRECT can be painful.
>>>>
>>>> Seems to me that we have an opportunity to provide something that
>>>> sits somewhere in between buffered and O_DIRECT, and this is where
>>>> RWF_UNCACHED enters the picture. If this flag is set on IO, we get
>>>> the following behavior:
>>>>
>>>> - If the data is in cache, it remains in cache and the copy (in or
>>>> out) is served to/from that.
>>>>
>>>> - If the data is NOT in cache, we add it while performing the IO.
>>>> When the IO is done, we remove it again.
>>>>
>>>> With this, I can do 100% smooth buffered reads or writes without
>>>> pushing the kernel to the state where kswapd is sweating bullets.
>>>> In
>>>> fact it doesn't even register.
>>>
>>> A question from a user or Linux Performance trainer perspective:
>>>
>>> How does this compare with posix_fadvise() with POSIX_FADV_DONTNEED
>>> that for example the nocache¹ command is using? Excerpt from
>>> manpage> 
>>> posix_fadvice(2):
>>>        POSIX_FADV_DONTNEED
>>>        
>>>               The specified data will not be accessed  in  the  near
>>>               future.
>>>               
>>>               POSIX_FADV_DONTNEED  attempts to free cached pages as‐
>>>               sociated with the specified region.  This  is  useful,
>>>               for  example,  while streaming large files.  A program
>>>               may periodically request the  kernel  to  free  cached
>>>               data  that  has already been used, so that more useful
>>>               cached pages are not discarded instead.
>>>
>>> [1] packaged in Debian as nocache or available
>>> herehttps://github.com/ Feh/nocache
>>>
>>> In any way, would be nice to have some option in rsync… I still did
>>> not change my backup script to call rsync via nocache.
>>
>> I don't know the nocache tool, but I'm guessing it just does the
>> writes (or reads) and then uses FADV_DONTNEED to drop behind those
>> pages? That's fine for slower use cases, it won't work very well for
>> fast IO. The write side currently works pretty much like that
>> internally, whereas the read side doesn't use the page cache at all.
> 
> Yes, it does that. And yeah I saw you changed the read site to bypass 
> the cache entirely.
> 
> Also as I understand it this is for asynchronous using io uring 
> primarily?

Or preadv2/pwritev2, they also allow passing in RWF_* flags.
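
To make that concrete, a minimal preadv2() sketch could look like the below.
RWF_UNCACHED is defined locally with the value from the uapi hunk in this
series, since libc headers obviously don't know about it yet; the path and
4k buffer are just for illustration.

/* Sketch: one uncached 4k read via preadv2(); the flag value mirrors the
 * include/uapi/linux/fs.h change in this series. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_UNCACHED
#define RWF_UNCACHED	0x00000040
#endif

int main(int argc, char **argv)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	/* if the page is cached it's served from cache, otherwise it won't stick */
	ret = preadv2(fd, &iov, 1, 0, RWF_UNCACHED);
	printf("read %zd bytes\n", ret);
	close(fd);
	return 0;
}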
Linus Torvalds Dec. 12, 2019, 10:18 p.m. UTC | #34
On Thu, Dec 12, 2019 at 7:16 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> I don't know the nocache tool, but I'm guessing it just does the writes
> (or reads) and then uses FADV_DONTNEED to drop behind those pages?
> That's fine for slower use cases, it won't work very well for fast IO.
> The write side currently works pretty much like that internally, whereas
> the read side doesn't use the page cache at all.

Well, I think that if we have this RWF/IOC_UNCACHED flag, maybe we
should make FADV_NOREUSE file descriptors just use it even for regular
read/write..

Right now FADV_NOREUSE is a no-op for the kernel, I think, because we
historically didn't have the facility.

But FADV_DONTNEED is different. That's for dropping pages after use
(ie invalidate_mapping_pages())
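
If FADV_NOREUSE did grow that meaning, an application could opt a descriptor
in once and keep using plain read()/write(), roughly like the sketch below
(the call is accepted today, it just doesn't change anything yet):

/* Sketch: mark a descriptor "no reuse"; currently a no-op in the kernel,
 * but the idea above is that it could imply uncached IO for later reads
 * and writes on this fd. */
#include <fcntl.h>

int open_noreuse(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0)
		posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	return fd;
}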

               Linus
Dave Chinner Dec. 12, 2019, 10:18 p.m. UTC | #35
On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
> On 12/11/19 4:41 PM, Jens Axboe wrote:
> > On 12/11/19 1:18 PM, Linus Torvalds wrote:
> >> On Wed, Dec 11, 2019 at 12:08 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> $ cat /proc/meminfo | grep -i active
> >>> Active:           134136 kB
> >>> Inactive:       28683916 kB
> >>> Active(anon):      97064 kB
> >>> Inactive(anon):        4 kB
> >>> Active(file):      37072 kB
> >>> Inactive(file): 28683912 kB
> >>
> >> Yeah, that should not put pressure on some swap activity. We have 28
> >> GB of basically free inactive file data, and the VM is doing something
> >> very very bad if it then doesn't just quickly free it with no real
> >> drama.
> >>
> >> In fact, I don't think it should even trigger kswapd at all, it should
> >> all be direct reclaim. Of course, some of the mm people hate that with
> >> a passion, but this does look like a prime example of why it should
> >> just be done.
> > 
> > For giggles, I ran just a single thread on the file set. We're only
> > doing about 100K IOPS at that point, yet when the page cache fills,
> > kswapd still eats 10% cpu. That seems like a lot for something that
> > slow.
> 
> Warning, the below is from the really crazy department...
> 
> Anyway, I took a closer look at the profiles for the uncached case.
> We're spending a lot of time doing memsets (this is the xa_node init,
> from the radix tree constructor), and call_rcu for the node free later
> on. All wasted time, and something that meant we weren't as close to the
> performance of O_DIRECT as we could be.
> 
> So Chris and I started talking about this, and pondered "what would
> happen if we simply bypassed the page cache completely?". Case in point,
> see below incremental patch. We still do the page cache lookup, and use
> that page to copy from if it's there. If the page isn't there, allocate
> one and do IO to it, but DON'T add it to the page cache. With that,
> we're almost at O_DIRECT levels of performance for the 4k read case,
> without 1-2%. I think 512b would look awesome, but we're reading full
> pages, so that won't really help us much. Compared to the previous
> uncached method, this is 30% faster on this device. That's substantial.

Interesting idea, but this seems like it is just direct IO with
kernel pages and a memcpy() rather than just mapping user pages, and
it has none of the advantages of direct IO (i.e. being able to run
reads and writes concurrently) because it's going through the
buffered IO path.

It also needs all the special DIO truncate/hole punch serialisation
mechanisms to be propagated into the buffered IO path - the
requirement for inode_dio_wait() serialisation is something I'm
trying to remove from XFS, not have to add into more paths. And it
introduces the same issues with other buffered read/mmap access to
the same file ranges as direct IO has.

> Obviously this has issues with truncate that would need to be resolved,
> and it's definitely dirtier. But the performance is very enticing...

At which point I have to ask: why are we considering repeating the
mistakes that were made with direct IO?  Yes, it might be faster
than a coherent RWF_UNCACHED IO implementation, but I don't think
making it more like O_DIRECT is worth the price.

And, ultimately, RWF_UNCACHED will never be as fast as direct IO
because it *requires* the CPU to copy the data at least once. Direct
IO is zero-copy, and so it's always going to have lower overhead
than RWF_UNCACHED, and so when CPU or memory bandwidth is the
limiting factor, O_DIRECT will always be faster.

IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
fool's game and attempting to do so is taking a step in the wrong
direction architecturally.  I'd much prefer a sane IO model for
RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
than compromise these things in the chase for ultimate performance.

Speaking of IO path architecture, perhaps what we really need here
is an iomap_apply()->iomap_read_actor loop here similar to the write
side. This would allow us to bypass all the complex readahead
shenanigans that generic_file_buffered_read() has to deal with and
directly control page cache residency and build the exact IOs we
need when RWF_UNCACHED is set. This moves it much closer to the
direct IO path in terms IO setup overhead and physical IO patterns,
but still has all the benefits of being fully cache coherent....

And, really, when we are talking about high end nvme drives that can
do 5-10GB/s read each, and we can put 20+ of them in a single
machine, there's no real value in doing readahead. i.e. there's
little read IO latency to hide in the first place and we such
systems have little memory bandwidth to spare to waste on readahead
IO that we don't end up using...

Cheers,

Dave.
Chris Mason Dec. 13, 2019, 1:32 a.m. UTC | #36
On 12 Dec 2019, at 17:18, Dave Chinner wrote:

> On Wed, Dec 11, 2019 at 06:09:14PM -0700, Jens Axboe wrote:
>>
>> So Chris and I started talking about this, and pondered "what would
>> happen if we simply bypassed the page cache completely?". Case in 
>> point,
>> see below incremental patch. We still do the page cache lookup, and 
>> use
>> that page to copy from if it's there. If the page isn't there, 
>> allocate
>> one and do IO to it, but DON'T add it to the page cache. With that,
>> we're almost at O_DIRECT levels of performance for the 4k read case,
>> without 1-2%. I think 512b would look awesome, but we're reading full
>> pages, so that won't really help us much. Compared to the previous
>> uncached method, this is 30% faster on this device. That's 
>> substantial.
>
> Interesting idea, but this seems like it is just direct IO with
> kernel pages and a memcpy() rather than just mapping user pages, but
> has none of the advantages of direct IO in that we can run reads and
> writes concurrently because it's going through the buffered IO path.
>
> It also needs all the special DIO truncate/hole punch serialisation
> mechanisms to be propagated into the buffered IO path - the
> requirement for inode_dio_wait() serialisation is something I'm
> trying to remove from XFS, not have to add into more paths. And it
> introduces the same issues with other buffered read/mmap access to
> the same file ranges as direct IO has.
>
>> Obviously this has issues with truncate that would need to be 
>> resolved,
>> and it's definitely dirtier. But the performance is very enticing...
>
> At which point I have to ask: why are we considering repeating the
> mistakes that were made with direct IO?  Yes, it might be faster
> than a coherent RWF_UNCACHED IO implementation, but I don't think
> making it more like O_DIRECT is worth the price.
>
> And, ultimately, RWF_UNCACHED will never be as fast as direct IO
> because it *requires* the CPU to copy the data at least once.

They just have different tradeoffs.  O_DIRECT actively blows away caches 
and can also force writes during reads, making RWF_UNCACHED a more 
natural fit for some applications.  There are fewer surprises, and some 
services are willing to pay for flexibility with a memcpy.  In general, 
they still want to do some cache management because it reduces p90+ 
latencies across the board, and gives them more control over which pages 
stay in cache.

Most services using buffered IO here as part of their main workload are 
pairing it with sync_file_range() and sometimes fadvise(DONTNEED).
We've seen kswapd saturating cores with much slower flash than the fancy 
stuff Jens is using, and the solution is usually O_DIRECT or fadvise.

Grepping through the code shows a wonderful assortment of helpers to 
control the cache, and RWF_UNCACHED would be both cleaner and faster 
than what we have today.  I'm on the fence about asking for 
RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without 
pitching pages, but we can talk to some service owners to see how useful 
that would be.   They can always chain a sync_file_range() in io_uring, 
but RWF_ would be lower overhead if it were a common pattern.
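
For what it's worth, the chained form is pretty compact with liburing: a
linked writev followed by a sync_file_range() on the same range, roughly as
in the sketch below. This assumes liburing's io_uring_prep_writev() and
io_uring_prep_sync_file_range() helpers; fd, buffer and offset are whatever
the caller has.

/* Sketch: buffered write chained to sync_file_range() via IOSQE_IO_LINK,
 * so the flush doesn't start until the write completes. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <sys/uio.h>

static int write_then_flush(struct io_uring *ring, int fd, void *buf,
			    unsigned len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	io_uring_prep_writev(sqe, fd, &iov, 1, off);
	sqe->flags |= IOSQE_IO_LINK;	/* link the next sqe to this one */

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	io_uring_prep_sync_file_range(sqe, fd, len, off, SYNC_FILE_RANGE_WRITE);

	return io_uring_submit(ring);
}

Both operations complete on the CQ ring as usual, so the write and the flush
can be reaped together.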

With all of that said, I really agree that xfs+O_DIRECT wins on write 
concurrency.  Jens's current patches are a great first step, but I think 
that if he really loved us, Jens would carve up a concurrent pageless 
write patch series before Christmas.

> Direct
> IO is zero-copy, and so it's always going to have lower overhead
> than RWF_UNCACHED, and so when CPU or memory bandwidth is the
> limiting facter, O_DIRECT will always be faster.
>
> IOWs, I think trying to make RWF_UNCACHED as fast as O_DIRECT is a
> fool's game and attempting to do so is taking a step in the wrong
> direction architecturally.  I'd much prefer a sane IO model for
> RWF_UNCACHED that provides coherency w/ mmap and other buffered IO
> than compromise these things in the chase for ultimate performance.

No matter what I wrote in my letters to Santa this year, I agree that we 
shouldn't compromise on avoiding the warts from O_DIRECT.

-chris
Christoph Hellwig Jan. 7, 2020, 5:42 p.m. UTC | #37
On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote:
> They just have different tradeoffs.  O_DIRECT actively blows away caches 
> and can also force writes during reads, making RWF_UNCACHED a more 
> natural fit for some applications.  There are fewer surprises, and some 
> services are willing to pay for flexibility with a memcpy.  In general, 
> they still want to do some cache management because it reduces p90+ 
> latencies across the board, and gives them more control over which pages 
> stay in cache.

We can always have a variant of O_DIRECT that doesn't do that and
instead checks whether data is in the cache and then also copies to / from
it in that case.  I need some time to actually look through this series,
so it might be pretty similar to the implementation, but if defined
the right way it could be concurrent for at least the fast path of no
cached pages.
Chris Mason Jan. 8, 2020, 2:09 p.m. UTC | #38
On 7 Jan 2020, at 12:42, Christoph Hellwig wrote:

> On Fri, Dec 13, 2019 at 01:32:10AM +0000, Chris Mason wrote:
>> They just have different tradeoffs.  O_DIRECT actively blows away 
>> caches
>> and can also force writes during reads, making RWF_UNCACHED a more
>> natural fit for some applications.  There are fewer surprises, and 
>> some
>> services are willing to pay for flexibility with a memcpy.  In 
>> general,
>> they still want to do some cache management because it reduces p90+
>> latencies across the board, and gives them more control over which 
>> pages
>> stay in cache.
>
> We can always have a variant of O_DIRECT that doesn't do that and
> instead check if data was in the cache and then also copy / from to
> it in that case.  I need some time to actually look through this 
> series,
> so it might be pretty similar to the implementation, but if defined
> the right way it could be concurrent for at least the fast path of no
> cached pages.

Yeah, I really do think we can end up with a fairly unified solution 
through iomap:

* Allowing concurrent writes (xfs DIO does this now)
* Optionally doing zero copy if alignment is good (btrfs DIO does this 
now)
* Optionally tossing pages at the end (requires a separate syscall now)
* Supporting aio via io_uring

We could just call this O_DIRECT, but I like RWF_UNCACHED as a way to 
avoid surprises for people that know and love the existing O_DIRECT 
semantics.

-chris
Andres Freund Feb. 1, 2020, 10:33 a.m. UTC | #39
Hi,

On 2019-12-13 01:32:10 +0000, Chris Mason wrote:
> Grepping through the code shows a wonderful assortment of helpers to
> control the cache, and RWF_UNCACHED would be both cleaner and faster
> than what we have today.  I'm on the fence about asking for
> RWF_FILE_RANGE_WRITE (+/- naming) to force writes to start without
> pitching pages, but we can talk to some service owners to see how useful
> that would be.   They can always chain a sync_file_range() in io_uring,
> but RWF_ would be lower overhead if it were a common pattern.

FWIW, for postgres something that'd allow us to do writes that

a) Doesn't remove pages from the pagecache if they're already there.
b) Doesn't delay writeback to some unpredictable point later.

   The delayed writeback causes latency issues, and often under-utilizes
   write bandwidth for a while. For most cases where we write, we know
   that we're not likely to write the same page again soon.

c) Doesn't (except maybe temporarily) bring pages into the pagecache, if
   they weren't before.

   In the cases where the page previously wasn't in the page cache, and
   we wrote it out, it's very likely to have been resident for long
   enough in our cache, that the kernel caching it for the future isn't
   useful.

would be really helpful. Right now we simulate that to some degree by
doing normal buffered writes followed by sync_file_range(WRITE).

For most environments always using O_DIRECT isn't really an option for
us, as we can't rely on settings being tuned well enough (i.e. using a
large enough application cache), as well as continuing to want to
support setups where using a large enough postgres buffer cache isn't an
option because it'd prevent putting a number of variably used database
servers on one piece of hardware.

(There are also postgres-side issues preventing us from doing O_DIRECT
performantly, partially because so far we couldn't rely on AIO due to
also using buffered IO, but we're fixing that now.)


For us a per-request interface where we'd have to fulfill all the
requirements for O_DIRECT, but where neither reads nor writes would
cause a page to move in/out of the pagecache, would be optimal for a
good part of our IO. Especially when we still could get zero-copy IO for
the pretty common case that there's no pagecache presence for a file at
all.

That'd allow us to use zero copy writes for the common case of a file's
data fitting entirely in our cache, and us only occasionally writing the
deta out at checkpoints. And do zero copy reads for the the cases where
we know it's unnecessary for the kernel to cache (e.g. because we are
scanning a few TB of data on a machine with less memory, because we're
re-filling our cache after a restart, or because it's a maintenance
operation doing the reading). But still rely on the kernel page cache
for other reads where the kernel caching when memory is available is a
huge benefit.  Some well tuned workloads would turn that off, to only
use O_DIRECT, but everyone else would benefit with that being the
default.

We can concoct an approximation of that behaviour with a mix of
sync_file_range() (to force writeback), RWF_NOWAIT (to see if we should
read with O_DIRECT) and mmap()/mincore()/munmap() (to determine if
writes should use O_DIRECT). But that's quite a bit of overhead.
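
For the record, the mmap()/mincore()/munmap() probe looks roughly like the
sketch below (a made-up helper, not anything we actually ship). It's three
extra syscalls just to decide how to issue a single IO, which is a good
chunk of the overhead mentioned above.

/* Sketch: probe whether any page of a file range is resident in the page
 * cache. Returns 1 if something is cached, 0 if nothing is, -1 on error. */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static int range_in_page_cache(int fd, off_t off, size_t len)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	off_t aligned = off & ~((off_t)pgsz - 1);
	size_t maplen = len + (off - aligned);
	size_t pages = (maplen + pgsz - 1) / pgsz;
	unsigned char *vec;
	void *p;
	int resident = 0;

	p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, aligned);
	if (p == MAP_FAILED)
		return -1;
	vec = calloc(pages, 1);
	if (vec && mincore(p, maplen, vec) == 0) {
		for (size_t i = 0; i < pages; i++) {
			if (vec[i] & 1) {
				resident = 1;
				break;
			}
		}
	}
	free(vec);
	munmap(p, maplen);
	return resident;
}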

The reason that specifying this on a per-request basis would be useful
is mainly that it would allow us to avoid having to either have two
sets of FDs, or having to turn O_DIRECT on/off with fcntl.

Greetings,

Andres Freund