[1/3] fuse: remove reliance on bdi congestion

Message ID	164360183348.4233.761031466326833349.stgit@noble.brown (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-fsdevel-owner@kernel.org>
Subject: [PATCH 1/3] fuse: remove reliance on bdi congestion
From: NeilBrown <neilb@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>, Jeff Layton <jlayton@kernel.org>, Ilya Dryomov <idryomov@gmail.com>, Miklos Szeredi <miklos@szeredi.hu>, Trond Myklebust <trond.myklebust@hammerspace.com>, Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-mm@kvack.org, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Mon, 31 Jan 2022 15:03:53 +1100
Message-ID: <164360183348.4233.761031466326833349.stgit@noble.brown>
In-Reply-To: <164360127045.4233.2606812444285122570.stgit@noble.brown>
References: <164360127045.4233.2606812444285122570.stgit@noble.brown>
User-Agent: StGit/0.23
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Precedence: bulk
Series	remove dependence of inode_congested()
[0/3] remove dependence of inode_congested()
[1/3] fuse: remove reliance on bdi congestion
[2/3] nfs: remove reliance on bdi congestion
[3/3] ceph: remove reliance on bdi congestion

NeilBrown Jan. 31, 2022, 4:03 a.m. UTC

The bdi congestion tracking is not widely used and will be removed.

Fuse is one of a small number of filesystems that uses it, setting both
the sync (read) and async (write) congestion flags at what it determines
are appropriate times.

The only remaining effect of the sync flag is to cause read-ahead to be
skipped.
The only remaining effect of the async flag is to cause (some)
WB_SYNC_NONE writes to be skipped.

So instead of setting the flags, change:
 - .readahead to do nothing if the flag would be set
 - .writepages to do nothing if WB_SYNC_NONE and the flag would be set
 - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE
    and the flag would be set.

The writepage change causes a behavioural change in that pageout() can
now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will
be called on the page which (I think) will further delay the next attempt
at writeout.  This might be a good thing.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/fuse/control.c |   17 -----------------
 fs/fuse/dax.c     |    3 +++
 fs/fuse/dev.c     |    8 --------
 fs/fuse/file.c    |   11 +++++++++++
 4 files changed, 14 insertions(+), 25 deletions(-)
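
As a rough userspace model of the three changes listed in the commit message (not kernel code), the decisions could be sketched as follows. The `num_background >= congestion_threshold` comparison is taken from the hunks quoted later in this thread; the struct, enum, and function names are hypothetical stand-ins, not the real kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum sync_mode_model { MODEL_WB_SYNC_NONE, MODEL_WB_SYNC_ALL };
enum writepage_model { MODEL_WRITE_PROCEED, MODEL_WRITEPAGE_ACTIVATE };

struct conn_model {
	unsigned int num_background;       /* background requests in flight */
	unsigned int congestion_threshold; /* level at which the flags would be set */
};

/* True when the old code would have set the congestion flags. */
static bool would_be_congested(const struct conn_model *fc)
{
	assert(fc != NULL);
	return fc->num_background >= fc->congestion_threshold;
}

/* .readahead: do nothing if the flag would be set. */
static bool readahead_skips(const struct conn_model *fc)
{
	return would_be_congested(fc);
}

/* .writepages: do nothing only for WB_SYNC_NONE under congestion. */
static bool writepages_skips(const struct conn_model *fc,
			     enum sync_mode_model mode)
{
	return mode == MODEL_WB_SYNC_NONE && would_be_congested(fc);
}

/* .writepage: ask the caller to re-activate the page instead of writing. */
static enum writepage_model writepage_decision(const struct conn_model *fc,
					       enum sync_mode_model mode)
{
	if (mode == MODEL_WB_SYNC_NONE && would_be_congested(fc))
		return MODEL_WRITEPAGE_ACTIVATE;
	return MODEL_WRITE_PROCEED;
}
```

In this model a data-integrity write (the `MODEL_WB_SYNC_ALL` case) is never skipped, whatever the congestion level; only readahead and best-effort writeback back off.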

Matthew Wilcox Jan. 31, 2022, 4:28 a.m. UTC | #1

On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index 182b24a14804..5f74e2585f50 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
>  	struct inode *inode = mapping->host;
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  
> +	if (wbc->sync_mode == WB_SYNC_NONE &&
> +	    fc->num_background >= fc->congestion_threshold)
> +		return 0;
>  	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);

This makes no sense.  Doing writeback for DAX means flushing the
CPU cache (in a terribly inefficient way), but it's not going to
be doing anything in the background; it's a sync operation.

> +++ b/fs/fuse/file.c
> @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
>  
>  	if (fuse_is_bad(inode))
>  		return;
> +	if (fc->num_background >= fc->congestion_threshold)
> +		return;

This seems like a bad idea to me.  If we don't even start reads on
readahead pages, they'll get ->readpage called on them one at a time
and the reading thread will block.  It's going to lead to some nasty
performance problems, exactly when you don't want them.  Better to
queue the reads internally and wait for congestion to ease before
submitting the read.

NeilBrown Jan. 31, 2022, 4:47 a.m. UTC | #2

On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:03:53PM +1100, NeilBrown wrote:
> > diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> > index 182b24a14804..5f74e2585f50 100644
> > --- a/fs/fuse/dax.c
> > +++ b/fs/fuse/dax.c
> > @@ -781,6 +781,9 @@ static int fuse_dax_writepages(struct address_space *mapping,
> >  	struct inode *inode = mapping->host;
> >  	struct fuse_conn *fc = get_fuse_conn(inode);
> >  
> > +	if (wbc->sync_mode == WB_SYNC_NONE &&
> > +	    fc->num_background >= fc->congestion_threshold)
> > +		return 0;
> >  	return dax_writeback_mapping_range(mapping, fc->dax->dev, wbc);
> 
> This makes no sense.  Doing writeback for DAX means flushing the
> CPU cache (in a terribly inefficient way), but it's not going to
> be doing anything in the background; it's a sync operation.

Fair enough ...  I was just being consistent.  I did wonder if dax
might be a bit special, but figured the change couldn't hurt.


> 
> > +++ b/fs/fuse/file.c
> > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> >  
> >  	if (fuse_is_bad(inode))
> >  		return;
> > +	if (fc->num_background >= fc->congestion_threshold)
> > +		return;
> 
> This seems like a bad idea to me.  If we don't even start reads on
> readahead pages, they'll get ->readpage called on them one at a time
> and the reading thread will block.  It's going to lead to some nasty
> performance problems, exactly when you don't want them.  Better to
> queue the reads internally and wait for congestion to ease before
> submitting the read.
> 

Isn't that exactly what happens now? page_cache_async_ra() sees that
inode_read_congested() returns true, so it doesn't start readahead.
???

NeilBrown

Miklos Szeredi Jan. 31, 2022, 10:21 a.m. UTC | #3

On Mon, 31 Jan 2022 at 05:47, NeilBrown <neilb@suse.de> wrote:

> > > +++ b/fs/fuse/file.c
> > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > >
> > >     if (fuse_is_bad(inode))
> > >             return;
> > > +   if (fc->num_background >= fc->congestion_threshold)
> > > +           return;
> >
> > This seems like a bad idea to me.  If we don't even start reads on
> > readahead pages, they'll get ->readpage called on them one at a time
> > and the reading thread will block.  It's going to lead to some nasty
> > performance problems, exactly when you don't want them.  Better to
> > queue the reads internally and wait for congestion to ease before
> > submitting the read.
> >
>
> Isn't that exactly what happens now? page_cache_async_ra() sees that
> inode_read_congested() returns true, so it doesn't start readahead.
> ???

I agree.

Fuse throttles async requests even before allocating them, which
precludes placing them on any queue.  I guess it was done to limit the
amount of kernel memory pinned by a task (sync requests allow just one
request per task).

This has worked well, and I haven't heard complaints about performance
loss due to readahead throttling.

Thanks,
Miklos

Matthew Wilcox Jan. 31, 2022, 1:12 p.m. UTC | #4

On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > +++ b/fs/fuse/file.c
> > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > >  
> > >  	if (fuse_is_bad(inode))
> > >  		return;
> > > +	if (fc->num_background >= fc->congestion_threshold)
> > > +		return;
> > 
> > This seems like a bad idea to me.  If we don't even start reads on
> > readahead pages, they'll get ->readpage called on them one at a time
> > and the reading thread will block.  It's going to lead to some nasty
> > performance problems, exactly when you don't want them.  Better to
> > queue the reads internally and wait for congestion to ease before
> > submitting the read.
> > 
> 
> Isn't that exactly what happens now? page_cache_async_ra() sees that
> inode_read_congested() returns true, so it doesn't start readahead.
> ???

It's rather different.  Imagine the readahead window has expanded to
256kB (64 pages).  Today, we see congestion and don't do anything.
That means we miss the async readahead opportunity, find a missing
page and end up calling into page_cache_sync_ra(), by which time
we may or may not be congested.

If the inode_read_congested() in page_cache_async_ra() is removed and
the patch above is added to replace it, we'll allocate those 64 pages and
add them to the page cache.  But then we'll return without starting IO.
When we hit one of those !uptodate pages, we'll call ->readpage on it,
but we won't do anything to the other 63 pages.  So we'll go through a
protracted slow period of sending 64 reads, one at a time, whether or
not congestion has eased.  Then we'll hit a missing page and proceed
to the sync ra case as above.

(I'm assuming this is a workload which does a linear scan and so
readahead is actually effective)
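
The difference described above can be illustrated with a toy userspace model that counts how many separate reads a linear scan has to issue and wait for. This is only a sketch under stated assumptions: the page states, the function names, and the `sync_batch` parameter are hypothetical, and the rule that a batch stops at the first page already in the cache is a simplification of what the kernel does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for the model. */
enum page_state_model { PAGE_ABSENT, PAGE_NOT_UPTODATE, PAGE_UPTODATE };

/*
 * Scan pages[0..n) in order and count how many separate read requests
 * the reader has to issue and wait for.  sync_batch is a hypothetical
 * size for the synchronous readahead that a missing page triggers.
 */
static unsigned int count_blocking_reads(enum page_state_model *pages,
					 size_t n, size_t sync_batch)
{
	unsigned int requests = 0;

	assert(pages != NULL && sync_batch > 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i] == PAGE_UPTODATE)
			continue;
		requests++;
		if (pages[i] == PAGE_NOT_UPTODATE) {
			/* Cached but empty: ->readpage fills only this page. */
			pages[i] = PAGE_UPTODATE;
			continue;
		}
		/*
		 * Missing page: one synchronous readahead fills a batch.
		 * In this toy model the batch stops at the first page
		 * that is already in the cache.
		 */
		for (size_t j = i; j < n && j < i + sync_batch &&
				   pages[j] == PAGE_ABSENT; j++)
			pages[j] = PAGE_UPTODATE;
	}
	return requests;
}

/* Put every page of a window into the same state. */
static void fill_window(enum page_state_model *pages, size_t n,
			enum page_state_model state)
{
	for (size_t i = 0; i < n; i++)
		pages[i] = state;
}
```

With a 64-page window and an assumed batch of 16, a window that was never allocated costs 4 batched reads in this model, while a window that was allocated and left !uptodate costs 64 single-page reads.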

NeilBrown Jan. 31, 2022, 11 p.m. UTC | #5

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > +++ b/fs/fuse/file.c
> > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > >  
> > > >  	if (fuse_is_bad(inode))
> > > >  		return;
> > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > +		return;
> > > 
> > > This seems like a bad idea to me.  If we don't even start reads on
> > > readahead pages, they'll get ->readpage called on them one at a time
> > > and the reading thread will block.  It's going to lead to some nasty
> > > performance problems, exactly when you don't want them.  Better to
> > > queue the reads internally and wait for congestion to ease before
> > > submitting the read.
> > > 
> > 
> > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > inode_read_congested() returns true, so it doesn't start readahead.
> > ???
> 
> It's rather different.  Imagine the readahead window has expanded to
> 256kB (64 pages).  Today, we see congestion and don't do anything.
> That means we miss the async readahead opportunity, find a missing
> page and end up calling into page_cache_sync_ra(), by which time
> we may or may not be congested.
> 
> If the inode_read_congested() in page_cache_async_ra() is removed and
> the patch above is added to replace it, we'll allocate those 64 pages and
> add them to the page cache.  But then we'll return without starting IO.
> When we hit one of those !uptodate pages, we'll call ->readpage on it,
> but we won't do anything to the other 63 pages.  So we'll go through a
> protracted slow period of sending 64 reads, one at a time, whether or
> not congestion has eased.  Then we'll hit a missing page and proceed
> to the sync ra case as above.

Hmmm... where is all this documented?
The entry for readahead in vfs.rst says:

    If the filesystem decides to stop attempting I/O before reaching the
    end of the readahead window, it can simply return.

but you are saying that if it simply returns, it'll most likely just get
called again.  So maybe it shouldn't say that?

What do other filesystems do?
ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
requests. btrfs seems much the same.
xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
just does a normal read.

The effect of REQ_RAHEAD seems to be primarily to avoid retries on
failure.

So it seems that core read-ahead code is not set up to expect readahead
to fail, though it is (begrudgingly) permitted.

The current inode_read_congested() test in page_cache_async_ra() seems
to be just delaying the inevitable (and in fairness, the comment does
say "Defer....").  Maybe just blocking on the congestion is an equally
good way to delay it...

I note that ->readahead isn't told if the read-ahead is async or not, so
my patch will drop sync read-ahead on congestion, which the current code
doesn't do.

So maybe this congestion tracking really is useful, and we really want
to keep it.

I really would like to see that high-level documentation!!

Thanks,
NeilBrown

Matthew Wilcox Feb. 1, 2022, 2:01 a.m. UTC | #6

On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > >  
> > > > >  	if (fuse_is_bad(inode))
> > > > >  		return;
> > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > +		return;
> > > > 
> > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > and the reading thread will block.  It's going to lead to some nasty
> > > > performance problems, exactly when you don't want them.  Better to
> > > > queue the reads internally and wait for congestion to ease before
> > > > submitting the read.
> > > > 
> > > 
> > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > inode_read_congested() returns true, so it doesn't start readahead.
> > > ???
> > 
> > It's rather different.  Imagine the readahead window has expanded to
> > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > That means we miss the async readahead opportunity, find a missing
> > page and end up calling into page_cache_sync_ra(), by which time
> > we may or may not be congested.
> > 
> > If the inode_read_congested() in page_cache_async_ra() is removed and
> > the patch above is added to replace it, we'll allocate those 64 pages and
> > add them to the page cache.  But then we'll return without starting IO.
> > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > but we won't do anything to the other 63 pages.  So we'll go through a
> > protracted slow period of sending 64 reads, one at a time, whether or
> > not congestion has eased.  Then we'll hit a missing page and proceed
> > to the sync ra case as above.
> 
> Hmmm... where is all this documented?
> The entry for readahead in vfs.rst says:
> 
>     If the filesystem decides to stop attempting I/O before reaching the
>     end of the readahead window, it can simply return.
> 
> but you are saying that if it simply returns, it'll most likely just get
> called again.  So maybe it shouldn't say that?

That's not what I'm saying at all.  I'm saying that if ->readahead fails
to read the page, ->readpage will be called to read the page (if it's
actually accessed).

> What do other filesystems do?
> ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
> requests. btrfs seems much the same.
> xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
> just does a normal read.
> 
> The effect of REQ_RAHEAD seems to be primarily to avoid retries on
> failure.
> 
> So it seems that core read-ahead code is not set up to expect readahead
> to fail, though it is (begrudgingly) permitted.

Well, yes.  The vast majority of reads don't fail.

> The current inode_read_congested() test in page_cache_async_ra() seems
> to be just delaying the inevitable (and in fairness, the comment does
> say "Defer....").  Maybe just blocking on the congestion is an equally
> good way to delay it...

I don't think we should _block_ for an async read request.  We're in the
context of a process which has read a different page.  Maybe what we
need is a readahead_abandon() call that removes the just-added pages
from the page cache, so we fall back to doing a sync readahead?

> I note that ->readahead isn't told if the read-ahead is async or not, so
> my patch will drop sync read-ahead on congestion, which the current code
> doesn't do.

Now that we have a readahead_control, it's simple to add that
information to it.

> So maybe this congestion tracking really is useful, and we really want
> to keep it.
> 
> I really would like to see that high-level documentation!!

I've done my best to add documentation.  There's more than before
I started.

NeilBrown Feb. 1, 2022, 3:28 a.m. UTC | #7

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > > +++ b/fs/fuse/file.c
> > > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > > >  
> > > > > >  	if (fuse_is_bad(inode))
> > > > > >  		return;
> > > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > > +		return;
> > > > > 
> > > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > > and the reading thread will block.  It's going to lead to some nasty
> > > > > performance problems, exactly when you don't want them.  Better to
> > > > > queue the reads internally and wait for congestion to ease before
> > > > > submitting the read.
> > > > > 
> > > > 
> > > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > > inode_read_congested() returns true, so it doesn't start readahead.
> > > > ???
> > > 
> > > It's rather different.  Imagine the readahead window has expanded to
> > > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > > That means we miss the async readahead opportunity, find a missing
> > > page and end up calling into page_cache_sync_ra(), by which time
> > > we may or may not be congested.
> > > 
> > > If the inode_read_congested() in page_cache_async_ra() is removed and
> > > the patch above is added to replace it, we'll allocate those 64 pages and
> > > add them to the page cache.  But then we'll return without starting IO.
> > > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > > but we won't do anything to the other 63 pages.  So we'll go through a
> > > protracted slow period of sending 64 reads, one at a time, whether or
> > > not congestion has eased.  Then we'll hit a missing page and proceed
> > > to the sync ra case as above.
> > 
> > Hmmm... where is all this documented?
> > The entry for readahead in vfs.rst says:
> > 
> >     If the filesystem decides to stop attempting I/O before reaching the
> >     end of the readahead window, it can simply return.
> > 
> > but you are saying that if it simply returns, it'll most likely just get
> > called again.  So maybe it shouldn't say that?
> 
> That's not what I'm saying at all.  I'm saying that if ->readahead fails
> to read the page, ->readpage will be called to read the page (if it's
> actually accessed).

Yes, I see that now - thanks.

But looking at the first part of what you wrote - currently if
congestion means we skip page_cache_async_ra() (and it is the
WB_sync_congested (not async!) which causes us to skip that) then we end
up in page_cache_sync_ra() - which also calls ->readahead but without
the 'congested' protection.

Presumably the sync readahead asks for fewer pages or something?  What is
the logic there?

> 
> > What do other filesystems do?
> > ext4 sets REQ_RAHEAD, but otherwise just pushes ahead and submits all
> > requests. btrfs seems much the same.
> > xfs uses iomap_readahead ..  which again sets REQ_RAHEAD but otherwise
> > just does a normal read.
> > 
> > The effect of REQ_RAHEAD seems to be primarily to avoid retries on
> > failure.
> > 
> > So it seems that core read-ahead code is not set up to expect readahead
> > to fail, though it is (begrudgingly) permitted.
> 
> Well, yes.  The vast majority of reads don't fail.

Which makes one wonder why we have the special-case handling.  The code
that tests REQ_RAHEAD has probably never been tested.  Fortunately it is
quite simple code....

> 
> > The current inode_read_congested() test in page_cache_async_ra() seems
> > to be just delaying the inevitable (and in fairness, the comment does
> > say "Defer....").  Maybe just blocking on the congestion is an equally
> > good way to delay it...
> 
> I don't think we should _block_ for an async read request.  We're in the
> context of a process which has read a different page.  Maybe what we
> need is a readahead_abandon() call that removes the just-added pages
> from the page cache, so we fall back to doing a sync readahead?

Well, we do potentially block - when allocating a bio or other similar
structure, and when reading an index block to know where to read from.
But we don't block waiting for the read, and we don't block waiting to
allocate the page to read-ahead.  Just how much blocking is acceptable,
I wonder.  Maybe we should punt readahead to a workqueue and let it do
the small-time waiting.

Why does the presence of an unlocked non-uptodate page cause readahead
to be skipped?  Is this somehow related to the PG_readahead flag?  If we
set PG_readahead on the first page that we decided to skip in
->readahead, would that help?

> 
> > I note that ->readahead isn't told if the read-ahead is async or not, so
> > my patch will drop sync read-ahead on congestion, which the current code
> > doesn't do.
> 
> Now that we have a readahead_control, it's simple to add that
> information to it.

True.

> 
> > So maybe this congestion tracking really is useful, and we really want
> > to keep it.
> > 
> > I really would like to see that high-level documentation!!
> 
> I've done my best to add documentation.  There's more than before
> I started.

I guess it's my turn then - if I can manage to understand it.

Thanks,
NeilBrown

Matthew Wilcox Feb. 1, 2022, 4:06 a.m. UTC | #8

On Tue, Feb 01, 2022 at 02:28:32PM +1100, NeilBrown wrote:
> On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > > On Mon, Jan 31, 2022 at 03:47:41PM +1100, NeilBrown wrote:
> > > > > On Mon, 31 Jan 2022, Matthew Wilcox wrote:
> > > > > > > +++ b/fs/fuse/file.c
> > > > > > > @@ -958,6 +958,8 @@ static void fuse_readahead(struct readahead_control *rac)
> > > > > > >  
> > > > > > >  	if (fuse_is_bad(inode))
> > > > > > >  		return;
> > > > > > > +	if (fc->num_background >= fc->congestion_threshold)
> > > > > > > +		return;
> > > > > > 
> > > > > > This seems like a bad idea to me.  If we don't even start reads on
> > > > > > readahead pages, they'll get ->readpage called on them one at a time
> > > > > > and the reading thread will block.  It's going to lead to some nasty
> > > > > > performance problems, exactly when you don't want them.  Better to
> > > > > > queue the reads internally and wait for congestion to ease before
> > > > > > submitting the read.
> > > > > > 
> > > > > 
> > > > > Isn't that exactly what happens now? page_cache_async_ra() sees that
> > > > > inode_read_congested() returns true, so it doesn't start readahead.
> > > > > ???
> > > > 
> > > > It's rather different.  Imagine the readahead window has expanded to
> > > > 256kB (64 pages).  Today, we see congestion and don't do anything.
> > > > That means we miss the async readahead opportunity, find a missing
> > > > page and end up calling into page_cache_sync_ra(), by which time
> > > > we may or may not be congested.
> > > > 
> > > > If the inode_read_congested() in page_cache_async_ra() is removed and
> > > > the patch above is added to replace it, we'll allocate those 64 pages and
> > > > add them to the page cache.  But then we'll return without starting IO.
> > > > When we hit one of those !uptodate pages, we'll call ->readpage on it,
> > > > but we won't do anything to the other 63 pages.  So we'll go through a
> > > > protracted slow period of sending 64 reads, one at a time, whether or
> > > > not congestion has eased.  Then we'll hit a missing page and proceed
> > > > to the sync ra case as above.
> > > 
> > > Hmmm... where is all this documented?
> > > The entry for readahead in vfs.rst says:
> > > 
> > >     If the filesystem decides to stop attempting I/O before reaching the
> > >     end of the readahead window, it can simply return.
> > > 
> > > but you are saying that if it simply returns, it'll most likely just get
> > > called again.  So maybe it shouldn't say that?
> > 
> > That's not what I'm saying at all.  I'm saying that if ->readahead fails
> > to read the page, ->readpage will be called to read the page (if it's
> > actually accessed).
> 
> Yes, I see that now - thanks.
> 
> But looking at the first part of what you wrote - currently if
> congestion means we skip page_cache_async_ra() (and it is the
> WB_sync_congested (not async!) which causes us to skip that) then we end
> up in page_cache_sync_ra() - which also calls ->readahead but without
> the 'congested' protection.
> 
> Presumably the sync readahead asks for fewer pages or something?  What is
> the logic there?

Assuming you open() the file and read() one byte at a time sequentially,
a sufficiently large file will work like this:

 - No page at index 0
   - Sync readahead pages 0-15
   - Set the readahead marker on page 8
   - Wait for page 0 to come uptodate (assume readahead succeeds)
 - Read pages 1-7
 - Notice the readahead mark on page 8
   - Async readahead pages 16-47
   - Set the readahead marker on page 32
 - Read pages 8-15
 - Hopefully the async readahead for page 16 already finished; if not
   wait for it
 - Read pages 17-31
 - Notice the readahead mark on page 32
   - Async readahead pages 48-111
   - Set the readahead marker on page 80
...

The sync readahead is "We need to read this page now, we may as well
start the read for other pages at the same time".  Async readahead is
"We predict we'll need these pages in the future".  Readpage only
gets used if readahead has failed.
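
The page numbers in the walkthrough above can be reproduced with a small userspace model. This is a sketch only: the doubling of the window and the placement of the mark at the midpoint are assumptions read off the example (0-15 with mark 8, 16-47 with mark 32, 48-111 with mark 80), not a description of the kernel's actual ramp-up logic, and all names including `max_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one readahead window. */
struct ra_window_model {
	unsigned long start;  /* first page index in the window */
	unsigned long size;   /* number of pages in the window */
	unsigned long marker; /* page carrying the readahead mark */
};

/* First window: a synchronous readahead starting at the missing page. */
static struct ra_window_model first_window(unsigned long index,
					   unsigned long initial_size)
{
	struct ra_window_model w;

	assert(initial_size > 0);
	w.start = index;
	w.size = initial_size;
	w.marker = index + initial_size / 2;
	return w;
}

/*
 * Later windows: start where the previous one ended, double in size
 * (capped at max_size), and carry the mark at the midpoint.
 */
static struct ra_window_model next_window(struct ra_window_model prev,
					  unsigned long max_size)
{
	struct ra_window_model w;

	assert(max_size > 0);
	w.start = prev.start + prev.size;
	w.size = prev.size * 2 > max_size ? max_size : prev.size * 2;
	w.marker = w.start + w.size / 2;
	return w;
}
```

Calling `first_window(0, 16)` and then `next_window()` twice with a large cap walks through the three windows listed in the example.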

> > > So it seems that core read-ahead code is not set up to expect readahead
> > > to fail, though it is (begrudgingly) permitted.
> > 
> > Well, yes.  The vast majority of reads don't fail.
> 
> Which makes one wonder why we have the special-case handling.  The code
> that tests REQ_RAHEAD has probably never been tested.  Fortunately it is
> quite simple code....

I actually did a set of tests while developing folios that failed every
readahead or some proportion of readaheads.  Found some interesting bugs
that way.  Might be a good idea to upstream an error injection so that
people can keep testing it.

> > > The current inode_read_congested() test in page_cache_async_ra() seems
> > > to be just delaying the inevitable (and in fairness, the comment does
> > > say "Defer....").  Maybe just blocking on the congestion is an equally
> > > good way to delay it...
> > 
> > I don't think we should _block_ for an async read request.  We're in the
> > context of a process which has read a different page.  Maybe what we
> > need is a readahead_abandon() call that removes the just-added pages
> > from the page cache, so we fall back to doing a sync readahead?
> 
> Well, we do potentially block - when allocating a bio or other similar
> structure, and when reading an index block to know where to read from.
> But we don't block waiting for the read, and we don't block waiting to
> allocate the page to read-ahead.  Just how much blocking is acceptable,
> I wonder.  Maybe we should punt readahead to a workqueue and let it do
> the small-time waiting.

Right, I meant "block on I/O" rather than "block trying to free memory".
We are under GFP_NOFS during the readahead path, so while we can block
for a previously-submitted I/O to finish, we can't start a new I/O.

> Why does the presence of an unlocked non-uptodate page cause readahead
> to be skipped?  Is this somehow related to the PG_readahead flag?  If we
> set PG_readahead on the first page that we decided to skip in
> ->readahead, would that help?

To go back to the example above, let's suppose the first async read hits
congestion.  Before your patches, we don't even allocate pages 16-47.
So we see a missing page, and the ondemand algorithm will submit a sync
ra for pages 16-31.

After your patches, we allocate pages 16-47, add them to the page cache
and then leave them there !uptodate.  Now each time our read() hits
a !uptodate page, we call ->readpage on it.  We have no idea that the
remaining pages in that readahead batch were also abandoned and could
be profitably read.  I think we'll submit another async readahead
for 48-111, but I'd have to check on that.

> > > I really would like to see that high-level documentation!!
> > 
> > I've done my best to add documentation.  There's more than before
> > I started.
> 
> I guess it's my turn then - if I can manage to understand it.

It always works out better when two people are interested in the
documentation.

NeilBrown Feb. 7, 2022, 12:47 a.m. UTC | #9

On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> On Tue, Feb 01, 2022 at 02:28:32PM +1100, NeilBrown wrote:
> > On Tue, 01 Feb 2022, Matthew Wilcox wrote:
> > > On Tue, Feb 01, 2022 at 10:00:23AM +1100, NeilBrown wrote:
> > > > I really would like to see that high-level documentation!!
> > > 
> > > I've done my best to add documentation.  There's more than before
> > > I started.
> > 
> > I guess it's my turn then - if I can manage to understand it.
> 
> It always works out better when two people are interested in the
> documentation.
> 
> 

Please review...

From: NeilBrown <neilb@suse.de>
Subject: [PATCH] MM: document and polish read-ahead code.

Add some "big-picture" documentation for read-ahead and polish the code
to make it fit this documentation.

The meaning of ->async_size is clarified to match its name.
i.e. Any request to ->readahead() has a sync part and an async part.
The caller will wait for the sync pages to complete, but will not wait
for the async pages.  The first async page is still marked PG_readahead.

- When ->readahead does not consume all pages, any remaining async pages
  are now discarded with delete_from_page_cache().  This makes it
  possible for the filesystem to delay readahead due e.g. to congestion.
- in try_context_readahead(), the async_size is set correctly rather
  than being set to 1.  Prior to Commit 2cad40180197 ("readahead: make
  context readahead more conservative") it was set to ra->size which
  is not correct (that implies no sync component).  As this was too
  high and caused problems it was reduced to 1, again incorrect but less
  problematic.  The setting provided with this patch does not restore
  those problems, and is now not arbitrary.
- The calculation of ->async_size in the initial_readahead section of
  ondemand_readahead() now makes sense - it is zero if the chosen
  size does not exceed the requested size.  This means that we will not
  set the PG_readahead flag in this case, but as the requested size
  has not been satisfied we can expect a subsequent read ahead request
  anyway.
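
The sync/async split described in the last item can be sketched in userspace C as follows. This is a minimal model, assuming the async part is the chosen size minus the requested size (zero when the chosen size does not exceed the request) and that the first async page carries PG_readahead; the type and function names are hypothetical and the code is not taken from mm/readahead.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical split of one readahead request. */
struct ra_split_model {
	unsigned long sync_pages;  /* pages the caller will wait for */
	unsigned long async_pages; /* pages read ahead without waiting */
};

/* Split a request of `size` pages when `req_count` were explicitly asked for. */
static struct ra_split_model split_request(unsigned long size,
					   unsigned long req_count)
{
	struct ra_split_model s;

	assert(size > 0);
	s.async_pages = size > req_count ? size - req_count : 0;
	s.sync_pages = size - s.async_pages;
	return s;
}

/*
 * Store the index of the first async page (the one that would be
 * marked) and return true, or return false if there is no async part.
 */
static bool first_async_page(unsigned long start, unsigned long size,
			     struct ra_split_model s, unsigned long *index)
{
	assert(index != NULL);
	if (s.async_pages == 0)
		return false;
	*index = start + size - s.async_pages;
	return true;
}
```

For example, a 16-page request of which 4 pages were explicitly asked for has 12 async pages in this model, and a request whose size equals the requested count has no async part and therefore no marked page.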

Note that the current function names page_cache_sync_ra() and
page_cache_async_ra() are misleading.  All ra requests are partly sync
and partly async, so either part can be empty.
A page_cache_sync_ra() request will usually set ->async_size non-zero,
implying it is not all synchronous.
When a non-zero req_count is passed to page_cache_async_ra(), the
implication is that some prefix of the request is synchronous, though
the calculation made there is incorrect - I haven't tried to fix it.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/readahead.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 2 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index cf0dcf89eb69..5676f5c1aa39 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -8,6 +8,105 @@
  *		Initial version.
  */
 
+/**
+ * Readahead is used to read content into the page cache before it is
+ * explicitly requested by the application.  Readahead only ever
+ * attempts to read pages which are not yet in the page cache.  If a
+ * page is present but not up-to-date, readahead will not try to read
+ * it. In that case a simple ->readpage() will be requested.
+ *
+ * Readahead is triggered when an application read request (whether a
+ * system call or a page fault) finds that the requested page is not in
+ * the page cache, or that it is in the page cache and has the
+ * PG_readahead flag set.  This flag indicates that the page was loaded
+ * as part of a previous read-ahead request and now that it has been
+ * accessed, it is time for the next read-ahead.
+ *
+ * Each readahead request is partly synchronous read, and partly async
+ * read-ahead.  This is reflected in the struct file_ra_state which
+ * contains ->size being the total number of pages, and ->async_size
+ * which is the number of pages in the async section.  The first page in
+ * this async section will have PG_readahead set as a trigger for a
+ * subsequent read ahead.  Once a series of sequential reads has been
+ * established, there should be no need for a synchronous component and
+ * all read ahead requests will be fully asynchronous.
+ *
+ * When either of the triggers causes a readahead, three numbers need to
+ * be determined: the start of the region, the size of the region, and
+ * the size of the async tail.
+ *
+ * The start of the region is simply the first page address at or after
+ * the accessed address, which is not currently populated in the page
+ * cache.  This is found with a simple search in the page cache.
+ *
+ * The size of the async tail is determined by subtracting the size that
+ * was explicitly requested from the determined request size, unless
+ * this would be less than zero - then zero is used.  NOTE THIS
+ * CALCULATION IS WRONG WHEN THE START OF THE REGION IS NOT THE ACCESSED
+ * PAGE.
+ *
+ * The size of the region is normally determined from the size of the
+ * previous readahead which loaded the preceding pages.  This may be
+ * discovered from the struct file_ra_state for simple sequential reads,
+ * or from examining the state of the page cache when multiple
+ * sequential reads are interleaved.  Specifically: where the readahead
+ * was triggered by the PG_readahead flag, the size of the previous
+ * readahead is assumed to be the number of pages from the triggering
+ * page to the start of the new readahead.  In these cases, the size of
+ * the previous readahead is scaled, often doubled, for the new
+ * readahead, though see get_next_ra_size() for details.
+ *
+ * If the size of the previous read cannot be determined, the number of
+ * preceding pages in the page cache is used to estimate the size of
+ * a previous read.  This estimate could easily be misled by random
+ * reads being coincidentally adjacent, so it is ignored unless it is
+ * larger than the current request, and it is not scaled up, unless it
+ * is at the start of file.
+ *
+ * In general, read ahead is accelerated at the start of the file, as
+ * reads from there are often sequential.  There are other minor
+ * adjustments to the read ahead size in various special cases and these
+ * are best discovered by reading the code.
+ *
+ * The above calculations determine the readahead, to which any requested
+ * read size may be added.
+ *
+ * Readahead requests are sent to the filesystem using the ->readahead
+ * address space operation, for which mpage_readahead() is a canonical
+ * implementation.  ->readahead() should normally initiate reads on all
+ * pages, but may fail to read any or all pages without causing an IO
+ * error.  The page cache reading code will issue a ->readpage() request
+ * for any page which ->readahead() does not provide, and only an error
+ * from this will be final.
+ *
+ * ->readahead will generally call readahead_page() repeatedly to get
+ * each page from those prepared for read ahead.  It may fail to read a
+ * page by:
+ *  - not calling readahead_page() sufficiently many times, effectively
+ *    ignoring some pages, as might be appropriate if the path to
+ *    storage is congested.
+ *  - failing to actually submit a read request for a given page,
+ *    possibly due to insufficient resources, or
+ *  - getting an error during subsequent processing of a request.
+ * In the last two cases, the page should be unlocked to indicate that
+ * the read attempt has failed.  In the first case the page will be
+ * unlocked by the caller.
+ *
+ * Those pages not in the final ``async_size`` of the request should be
+ * considered to be important and ->readahead() should not fail them due
+ * to congestion or temporary resource unavailability, but should wait
+ * for necessary resources (e.g.  memory or indexing information) to
+ * become available.  Pages in the final ``async_size`` may be
+ * considered less urgent and failure to read them is more acceptable.
+ * In this case it is best to use delete_from_page_cache() to remove the
+ * pages from the page cache as is automatically done for pages that
+ * were not fetched with readahead_page().  This will allow a
+ * subsequent synchronous read ahead request to try them again.  If they
+ * are left in the page cache, then they will be read individually using
+ * ->readpage().
+ *
+ */
+
 #include <linux/kernel.h>
 #include <linux/dax.h>
 #include <linux/gfp.h>
@@ -129,6 +228,8 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 		aops->readahead(rac);
 		/* Clean up the remaining pages */
 		while ((page = readahead_page(rac))) {
+			if (rac->ra->async_size >= readahead_count(rac))
+				delete_from_page_cache(page);
 			unlock_page(page);
 			put_page(page);
 		}
@@ -426,7 +527,7 @@ static int try_context_readahead(struct address_space *mapping,
 
 	ra->start = index;
 	ra->size = min(size + req_size, max);
-	ra->async_size = 1;
+	ra->async_size = ra->size - req_size;
 
 	return 1;
 }
@@ -527,7 +628,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 initial_readahead:
 	ra->start = index;
 	ra->size = get_init_ra_size(req_size, max_pages);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	ra->async_size = ra->size > req_size ? ra->size - req_size : 0;
 
 readit:
 	/*
