[03/33] mm: Implement readahead_control pageset expansion

Message ID	161340389201.1303470.14353807284546854878.stgit@warthog.procyon.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nfs-owner@kernel.org> Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 3798903 Subject: [PATCH 03/33] mm: Implement readahead_control pageset expansion From: David Howells <dhowells@redhat.com> To: Trond Myklebust <trondmy@hammerspace.com>, Anna Schumaker <anna.schumaker@netapp.com>, Steve French <sfrench@samba.org>, Dominique Martinet <asmadeus@codewreck.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@lst.de>, linux-mm@kvack.org, linux-cachefs@redhat.com, linux-afs@lists.infradead.org, linux-nfs@vger.kernel.org, linux-cifs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs-developer@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, dhowells@redhat.com, Jeff Layton <jlayton@redhat.com>, David Wysochanski <dwysocha@redhat.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Alexander Viro <viro@zeniv.linux.org.uk>, linux-cachefs@redhat.com, linux-afs@lists.infradead.org, linux-nfs@vger.kernel.org, linux-cifs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs-developer@lists.sourceforge.net, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Date: Mon, 15 Feb 2021 15:44:52 +0000 Message-ID: <161340389201.1303470.14353807284546854878.stgit@warthog.procyon.org.uk> In-Reply-To: <161340385320.1303470.2392622971006879777.stgit@warthog.procyon.org.uk> References: <161340385320.1303470.2392622971006879777.stgit@warthog.procyon.org.uk> User-Agent: StGit/0.23 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk
Series	Network fs helper library & fscache kiocb API [ver #3] \| expand [00/33] Network fs helper library & fscache kiocb API [ver #3] [01/33] iov_iter: Add ITER_XARRAY [02/33] mm: Add an unlock function for PG_private_2/PG_fscache [03/33] mm: Implement readahead_control pageset expansion [04/33] vfs: Export rw_verify_area() for use by cachefiles [05/33] netfs: Make a netfs helper module [06/33] netfs, mm: Move PG_fscache helper funcs to linux/netfs.h [07/33] netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache() [08/33] netfs: Provide readahead and readpage netfs helpers [09/33] netfs: Add tracepoints [10/33] netfs: Gather stats [11/33] netfs: Add write_begin helper [12/33] netfs: Define an interface to talk to a cache [13/33] netfs: Hold a ref on a page when PG_private_2 is set [14/33] fscache, cachefiles: Add alternate API to use kiocb for read/write to cache [15/33] afs: Disable use of the fscache I/O routines [16/33] afs: Pass page into dirty region helpers to provide THP size [17/33] afs: Print the operation debug_id when logging an unexpected data version [18/33] afs: Move key to afs_read struct [19/33] afs: Don't truncate iter during data fetch [20/33] afs: Log remote unmarshalling errors [21/33] afs: Set up the iov_iter before calling afs_extract_data() [22/33] afs: Use ITER_XARRAY for writing [23/33] afs: Wait on PG_fscache before modifying/releasing a page [24/33] afs: Extract writeback extension into its own function [25/33] afs: Prepare for use of THPs [26/33] afs: Use the fs operation ops to handle FetchData completion [27/33] afs: Use new fscache read helper API [28/33] ceph: disable old fscache readpage handling [29/33] ceph: rework PageFsCache handling [30/33] ceph: fix fscache invalidation [31/33] ceph: convert readpage to fscache read helper [32/33] ceph: plug write_begin into read helper [33/33] ceph: convert ceph_readpages to ceph_readahead [34/33] netfs: Use in_interrupt() not in_softirq()

David Howells Feb. 15, 2021, 3:44 p.m. UTC

Provide a function, readahead_expand(), that expands the set of pages
specified by a readahead_control object to encompass a revised area with a
proposed size and length.

The proposed area must include all of the old area and may be expanded yet
more by this function so that the edges align on (transparent huge) page
boundaries as allocated.

The expansion will be cut short if a page already exists in either of the
areas being expanded into.  Note that any expansion made in such a case is
not rolled back.

This will be used by fscache so that reads can be expanded to cache granule
boundaries, thereby allowing whole granules to be stored in the cache, but
there are other potential users also.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/pagemap.h |    2 +
 mm/readahead.c          |   70 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

Christoph Hellwig Feb. 16, 2021, 10:32 a.m. UTC | #1

On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> Provide a function, readahead_expand(), that expands the set of pages
> specified by a readahead_control object to encompass a revised area with a
> proposed size and length.
> 
> The proposed area must include all of the old area and may be expanded yet
> more by this function so that the edges align on (transparent huge) page
> boundaries as allocated.
> 
> The expansion will be cut short if a page already exists in either of the
> areas being expanded into.  Note that any expansion made in such a case is
> not rolled back.
> 
> This will be used by fscache so that reads can be expanded to cache granule
> boundaries, thereby allowing whole granules to be stored in the cache, but
> there are other potential users also.

So looking at linux-next this seems to have a user, but that user is
dead wood given that nothing implements ->expand_readahead.

Looking at the code structure I think netfs_readahead and
netfs_rreq_expand is a complete mess and needs to be turned upside
down, that is instead of calling back from netfs_readahead to the
calling file system, split it into a few helpers called by the
caller.

But even after this can't we just expose the cache granule boundary
to the VM so that the read-ahead request gets setup correctly from
the very beginning?

David Howells Feb. 16, 2021, 11:48 a.m. UTC | #2

Christoph Hellwig <hch@lst.de> wrote:

> On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > Provide a function, readahead_expand(), that expands the set of pages
> > specified by a readahead_control object to encompass a revised area with a
> > proposed size and length.
> ...
> So looking at linux-next this seems to have a user, but that user is
> dead wood given that nothing implements ->expand_readahead.

Interesting question.  Code on my fscache-iter branch does implement this, but
I was asked to split the patchset up, so that's not in this subset.

> Looking at the code structure I think netfs_readahead and
> netfs_rreq_expand is a complete mess and needs to be turned upside
> down, that is instead of calling back from netfs_readahead to the
> calling file system, split it into a few helpers called by the
> caller.
> 
> But even after this can't we just expose the cache granule boundary
> to the VM so that the read-ahead request gets setup correctly from
> the very beginning?

You need to argue this one with Willy.  In my opinion, the VM should ask the
filesystem and the expansion be done before ->readahead() is called.  Willy
disagrees, however.

David

Matthew Wilcox Feb. 16, 2021, 1:22 p.m. UTC | #3

On Tue, Feb 16, 2021 at 11:32:15AM +0100, Christoph Hellwig wrote:
> On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > Provide a function, readahead_expand(), that expands the set of pages
> > specified by a readahead_control object to encompass a revised area with a
> > proposed size and length.
> > 
> > The proposed area must include all of the old area and may be expanded yet
> > more by this function so that the edges align on (transparent huge) page
> > boundaries as allocated.
> > 
> > The expansion will be cut short if a page already exists in either of the
> > areas being expanded into.  Note that any expansion made in such a case is
> > not rolled back.
> > 
> > This will be used by fscache so that reads can be expanded to cache granule
> > boundaries, thereby allowing whole granules to be stored in the cache, but
> > there are other potential users also.
> 
> So looking at linux-next this seems to have a user, but that user is
> dead wood given that nothing implements ->expand_readahead.
> 
> Looking at the code structure I think netfs_readahead and
> netfs_rreq_expand is a complete mess and needs to be turned upside
> down, that is instead of calling back from netfs_readahead to the
> calling file system, split it into a few helpers called by the
> caller.

That's funny, we modelled it after iomap.

> But even after this can't we just expose the cache granule boundary
> to the VM so that the read-ahead request gets setup correctly from
> the very beginning?

The intent is that this be usable by filesystems which want to (for
example) compress variable sized blocks.  So they won't know which pages
they want to readahead until they're in their iomap actor routine,
see that the extent they're in is compressed, and find out how large
the extent is.

Mike Marshall Feb. 17, 2021, 2:36 p.m. UTC | #4

I plan to try and use readahead_expand in Orangefs...

-Mike

On Tue, Feb 16, 2021 at 8:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 16, 2021 at 11:32:15AM +0100, Christoph Hellwig wrote:
> > On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > > Provide a function, readahead_expand(), that expands the set of pages
> > > specified by a readahead_control object to encompass a revised area with a
> > > proposed size and length.
> > >
> > > The proposed area must include all of the old area and may be expanded yet
> > > more by this function so that the edges align on (transparent huge) page
> > > boundaries as allocated.
> > >
> > > The expansion will be cut short if a page already exists in either of the
> > > areas being expanded into.  Note that any expansion made in such a case is
> > > not rolled back.
> > >
> > > This will be used by fscache so that reads can be expanded to cache granule
> > > boundaries, thereby allowing whole granules to be stored in the cache, but
> > > there are other potential users also.
> >
> > So looking at linux-next this seems to have a user, but that user is
> > dead wood given that nothing implements ->expand_readahead.
> >
> > Looking at the code structure I think netfs_readahead and
> > netfs_rreq_expand is a complete mess and needs to be turned upside
> > down, that is instead of calling back from netfs_readahead to the
> > calling file system, split it into a few helpers called by the
> > caller.
>
> That's funny, we modelled it after iomap.
>
> > But even after this can't we just expose the cache granule boundary
> > to the VM so that the read-ahead request gets setup correctly from
> > the very beginning?
>
> The intent is that this be usable by filesystems which want to (for
> example) compress variable sized blocks.  So they won't know which pages
> they want to readahead until they're in their iomap actor routine,
> see that the extent they're in is compressed, and find out how large
> the extent is.

David Howells Feb. 17, 2021, 3:42 p.m. UTC | #5

Mike Marshall <hubcap@omnibond.com> wrote:

> I plan to try and use readahead_expand in Orangefs...

Would it help if I shuffled the readahead_expand patch to the bottom of the
pack?

David

Matthew Wilcox Feb. 17, 2021, 4:13 p.m. UTC | #6

On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> +++ b/include/linux/pagemap.h
> @@ -761,6 +761,8 @@ extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>  				  struct pagevec *pvec);
> +void readahead_expand(struct readahead_control *ractl,
> +		      loff_t new_start, size_t new_len);

If we're revising this patchset, I'd rather this lived with the other
readahead declarations, ie after the definition of readahead_control.

> +	/* Expand the trailing edge upwards */
> +	while (ractl->_nr_pages < new_nr_pages) {
> +		unsigned long index = ractl->_index + ractl->_nr_pages;
> +		struct page *page = xa_load(&mapping->i_pages, index);
> +
> +		if (page && !xa_is_value(page))
> +			return; /* Page apparently present */
> +
> +		page = __page_cache_alloc(gfp_mask);
> +		if (!page)
> +			return;
> +		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
> +			put_page(page);
> +			return;
> +		}
> +		ractl->_nr_pages++;
> +	}

We're defeating the ondemand_readahead() algorithm here.  Let's suppose
userspace is doing 64kB reads, the filesystem is OrangeFS which only
wants to do 4MB reads, the page cache is initially empty and there's
only one thread doing a sequential read.  ondemand_readahead() calls
get_init_ra_size() which tells it to allocate 128kB and set the async
marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
remainder of the 4MB.  After the app has read the first 64kB, it comes
back to read the next 64kB, sees the readahead marker and tries to trigger
the next batch of readahead, but it's already present, so it does nothing
(see page_cache_ra_unbounded() for what happens with pages present).

Then it keeps going through the 4MB that's been read, not seeing any more
readahead markers, gets to 4MB and asks for ... 256kB?  Not quite sure.
Anyway, it then has to wait for the next 4MB because the readahead didn't
overlap with the application processing.

So readahead_expand() needs to adjust the file's f_ra so that when the
application gets to 64kB, it kicks off the readahead of 4MB-8MB chunk (and
then when we get to 4MB+256kB, it kicks off the readahead of 8MB-12MB,
and so on).

Unless someone sees a better way to do this?  I don't
want to inadvertently break POSIX_FADV_WILLNEED which calls
force_page_cache_readahead() and should not perturb the kernel's
ondemand algorithm.  Perhaps we need to add an 'ra' pointer to the
ractl to indicate whether the file_ra_state should be updated by
readahead_expand()?

Mike Marshall Feb. 17, 2021, 4:59 p.m. UTC | #7

Matthew has looked at how I'm fumbling about
trying to deal with Orangefs's need for much larger
than page-sized IO...

I think I need to implement orangefs_readahead
and from there fire off an asynchronous read
and while that's going I'll call readahead_page
with a rac that I've cranked up with readahead_expand
and when the read gets done I'll have plenty of pages
for the large IO I did.

Even if what I think I need to do is somewhere
near right, the async code in the Orangefs
kernel module didn't make it into the upstream
version, so I have to refurbish that. All that to
say: I don't need readahead_expand
"tomorrow", but it fits into my plan to
get Orangefs the extra pages it needs
without me having open-coded page cache
code in orangefs_readpage.

-Mike

On Wed, Feb 17, 2021 at 10:42 AM David Howells <dhowells@redhat.com> wrote:
>
> Mike Marshall <hubcap@omnibond.com> wrote:
>
> > I plan to try and use readahead_expand in Orangefs...
>
> Would it help if I shuffled the readahead_expand patch to the bottom of the
> pack?
>
> David
>

David Howells Feb. 17, 2021, 10:20 p.m. UTC | #8

Mike Marshall <hubcap@omnibond.com> wrote:

> Matthew has looked at how I'm fumbling about
> trying to deal with Orangefs's need for much larger
> than page-sized IO...
> 
> I think I need to implement orangefs_readahead
> and from there fire off an asynchronous read
> and while that's going I'll call readahead_page
> with a rac that I've cranked up with readahead_expand
> and when the read gets done I'll have plenty of pages
> for the large IO I did.

Would the netfs helper lib in patches 5-13 here be of use to orangefs?  Most
of the information about it is on patch 8.

David

David Howells Feb. 17, 2021, 10:34 p.m. UTC | #9

Matthew Wilcox <willy@infradead.org> wrote:

> We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> userspace is doing 64kB reads, the filesystem is OrangeFS which only
> wants to do 4MB reads, the page cache is initially empty and there's
> only one thread doing a sequential read.  ondemand_readahead() calls
> get_init_ra_size() which tells it to allocate 128kB and set the async
> marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> remainder of the 4MB.  After the app has read the first 64kB, it comes
> back to read the next 64kB, sees the readahead marker and tries to trigger
> the next batch of readahead, but it's already present, so it does nothing
> (see page_cache_ra_unbounded() for what happens with pages present).

It sounds like Christoph is right on the right track and the vm needs to ask
the filesystem (and by extension, the cache) before doing the allocation and
before setting the trigger flag.  Then we don't need to call back into the vm
to expand the readahead.

Also, there's Steve's request to try and keep at least two requests in flight
for CIFS/SMB at the same time to consider.

David

Matthew Wilcox Feb. 17, 2021, 10:49 p.m. UTC | #10

On Wed, Feb 17, 2021 at 10:34:39PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> > userspace is doing 64kB reads, the filesystem is OrangeFS which only
> > wants to do 4MB reads, the page cache is initially empty and there's
> > only one thread doing a sequential read.  ondemand_readahead() calls
> > get_init_ra_size() which tells it to allocate 128kB and set the async
> > marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> > remainder of the 4MB.  After the app has read the first 64kB, it comes
> > back to read the next 64kB, sees the readahead marker and tries to trigger
> > the next batch of readahead, but it's already present, so it does nothing
> > (see page_cache_ra_unbounded() for what happens with pages present).
> 
> It sounds like Christoph is right on the right track and the vm needs to ask
> the filesystem (and by extension, the cache) before doing the allocation and
> before setting the trigger flag.  Then we don't need to call back into the vm
> to expand the readahead.

Doesn't work.  You could read my reply to Christoph, or try to figure out
how to get rid of
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n742
for yourself.

> Also, there's Steve's request to try and keep at least two requests in flight
> for CIFS/SMB at the same time to consider.

That's not relevant to this problem.

David Howells Feb. 18, 2021, 5:47 p.m. UTC | #11

Matthew Wilcox <willy@infradead.org> wrote:

> So readahead_expand() needs to adjust the file's f_ra so that when the
> application gets to 64kB, it kicks off the readahead of 4MB-8MB chunk (and
> then when we get to 4MB+256kB, it kicks off the readahead of 8MB-12MB,
> and so on).

Ummm...  Two questions:

Firstly, how do I do that?  Set ->async_size?  And to what?  The expansion
could be 2MB from a ceph stripe, 256k from the cache.  Just to add to the fun,
the leading edge of the window might also be rounded downwards and the RA
trigger could be before where the app is going to start reading.

Secondly, what happens if, say, a 4MB read is covered by a single 4MB THP?

David

[03/33] mm: Implement readahead_control pageset expansion

Commit Message

Comments

Patch