diff mbox series

[RFC] cifs: Transition from ->readpages() to ->readahead()

Message ID 164303051132.2163193.10493291874899600548.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show
Series [RFC] cifs: Transition from ->readpages() to ->readahead() | expand

Commit Message

David Howells Jan. 24, 2022, 1:21 p.m. UTC
Transition the cifs filesystem from using the old ->readpages() method to
using the new ->readahead() method.

For the moment, this removes any invocation of fscache to read data from
the local cache, leaving that to another patch.

Questions for Willy:
 - Can we get a function to return the number of pages in a readahead
   batch?

 - Can we get a function to commit a readahead batch?  Currently, this is
   done when we call __readahead_batch(), but that means ractl->_nr_pages
   isn't up to date at the point we need it to be.  However, we want to
   check to see if the ractl is empty, then get server credits and only
   *then* call __readahead_batch() as we don't know how big a batch we're
   allowed till we have the credits.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com
---

 fs/cifs/file.c |  163 ++++++++++++--------------------------------------------
 1 file changed, 36 insertions(+), 127 deletions(-)

Comments

Matthew Wilcox Jan. 24, 2022, 2:19 p.m. UTC | #1
On Mon, Jan 24, 2022 at 01:21:51PM +0000, David Howells wrote:
> Transition the cifs filesystem from using the old ->readpages() method to
> using the new ->readahead() method.
> 
> For the moment, this removes any invocation of fscache to read data from
> the local cache, leaving that to another patch.
> 
> Questions for Willy:
>  - Can we get a function to return the number of pages in a readahead
>    batch?
> 
>  - Can we get a function to commit a readahead batch?  Currently, this is
>    done when we call __readahead_batch(), but that means ractl->_nr_pages
>    isn't up to date at the point we need it to be.  However, we want to
>    check to see if the ractl is empty, then get server credits and only
>    *then* call __readahead_batch() as we don't know how big a batch we're
>    allowed till we have the credits.

If you insist on using the primitives in a way that nobody else uses
them, you're going to find they don't work.  What's wrong with the
way that FUSE uses them in fuse_readahead()?
David Howells Jan. 24, 2022, 3:14 p.m. UTC | #2
Matthew Wilcox <willy@infradead.org> wrote:

> > Questions for Willy:
> >  - Can we get a function to return the number of pages in a readahead
> >    batch?
> > 
> >  - Can we get a function to commit a readahead batch?  Currently, this is
> >    done when we call __readahead_batch(), but that means ractl->_nr_pages
> >    isn't up to date at the point we need it to be.  However, we want to
> >    check to see if the ractl is empty, then get server credits and only
> >    *then* call __readahead_batch() as we don't know how big a batch we're
> >    allowed till we have the credits.
> 
> If you insist on using the primitives in a way that nobody else uses
> them, you're going to find they don't work.  What's wrong with the
> way that FUSE uses them in fuse_readahead()?

You mean doing this?

		nr_pages = readahead_count(rac) - nr_pages;

that would seem to indicate that the readahead interface is wrong.  Why should
readahead_count() need correction?  I think I can see *why* the batching stuff
is hidden, but it seems that the comment for readahead_count() needs to
mention this if you aren't going to fix it.

Would it be possible to make readahead_count() do:

	return rac->_nr_pages - rac->_batch_count;

maybe?

David
Matthew Wilcox Jan. 24, 2022, 3:33 p.m. UTC | #3
On Mon, Jan 24, 2022 at 03:14:41PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > Questions for Willy:
> > >  - Can we get a function to return the number of pages in a readahead
> > >    batch?
> > > 
> > >  - Can we get a function to commit a readahead batch?  Currently, this is
> > >    done when we call __readahead_batch(), but that means ractl->_nr_pages
> > >    isn't up to date at the point we need it to be.  However, we want to
> > >    check to see if the ractl is empty, then get server credits and only
> > >    *then* call __readahead_batch() as we don't know how big a batch we're
> > >    allowed till we have the credits.
> > 
> > If you insist on using the primitives in a way that nobody else uses
> > them, you're going to find they don't work.  What's wrong with the
> > way that FUSE uses them in fuse_readahead()?
> 
> You mean doing this?
> 
> 		nr_pages = readahead_count(rac) - nr_pages;
> 
> that would seem to indicate that the readahead interface is wrong.  Why should
> readahead_count() need correction?  I think I can see *why* the batching stuff
> is hidden, but it seems that the comment for readahead_count() needs to
> mention this if you aren't going to fix it.
> 
> Would it be possible to make readahead_count() do:
> 
> 	return rac->_nr_pages - rac->_batch_count;
> 
> maybe?

Yes, I think that would make sense.  Do we also need to change
readhead_length()?  It seems to me that it's only ever called once at
initialisation, so it should be possible to keep the two in sync.
Can you write & test such a patch?  I'll support it going upstream
(either by taking it myself or giving you a R-b to take it through your
tree).
David Howells Jan. 24, 2022, 3:46 p.m. UTC | #4
Matthew Wilcox <willy@infradead.org> wrote:

> > Would it be possible to make readahead_count() do:
> > 
> > 	return rac->_nr_pages - rac->_batch_count;
> > 
> > maybe?
> 
> Yes, I think that would make sense.  Do we also need to change
> readhead_length()?  It seems to me that it's only ever called once at
> initialisation, so it should be possible to keep the two in sync.
> Can you write & test such a patch?  I'll support it going upstream
> (either by taking it myself or giving you a R-b to take it through your
> tree).

It seems I also have a problem with readahead_index() needing compensation
too.  I'm guessing that's more of a problem, however, as I think that's
expected to refer to the beginning of the current batch.

David
Matthew Wilcox Jan. 24, 2022, 3:58 p.m. UTC | #5
On Mon, Jan 24, 2022 at 03:46:27PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > Would it be possible to make readahead_count() do:
> > > 
> > > 	return rac->_nr_pages - rac->_batch_count;
> > > 
> > > maybe?
> > 
> > Yes, I think that would make sense.  Do we also need to change
> > readhead_length()?  It seems to me that it's only ever called once at
> > initialisation, so it should be possible to keep the two in sync.
> > Can you write & test such a patch?  I'll support it going upstream
> > (either by taking it myself or giving you a R-b to take it through your
> > tree).
> 
> It seems I also have a problem with readahead_index() needing compensation
> too.  I'm guessing that's more of a problem, however, as I think that's
> expected to refer to the beginning of the current batch.

Well, that's the problem isn't it?  You're expecting to mutate the state
and then refer to its previous state instead of its current state,
whereas the other users refer to the current state instead of the
previous state.  Can't you pull readahead_index() out of the ractl
ahead of the mutation?
David Howells Jan. 24, 2022, 4:25 p.m. UTC | #6
Matthew Wilcox <willy@infradead.org> wrote:

> Well, that's the problem isn't it?  You're expecting to mutate the state
> and then refer to its previous state instead of its current state,

No.  I *don't* want to see the previous state.  That's where the problem is:
The previous state isn't cleared up until the point I ask for a new batch -
but I need to query the ractl to find the remaining window before I can ask
for a new batch.

The first pass through the loop is, in effect, substantially different to the
second and subsequent passes.

> whereas the other users refer to the current state instead of the
> previous state.

Fuse has exactly the same issues:

		nr_pages = readahead_count(rac) - nr_pages;
		if (nr_pages > max_pages)
			nr_pages = max_pages;
		if (nr_pages == 0)
			break;
		ia = fuse_io_alloc(NULL, nr_pages);
		...
		nr_pages = __readahead_batch(rac, ap->pages, nr_pages);

It needs to find out how many pages it needs so that it can allocate its page
array before it can get a new batch, but the ractl is holding the previous
state.

> Can't you pull readahead_index() out of the ractl ahead of the mutation?

I'm not sure what you mean by that.  What I'm now doing is:

	while ((nr_pages = readahead_count(ractl) - ractl->_batch_count)) {
		...
		pgoff_t index = readahead_index(ractl) + ractl->_batch_count;
		...
		rc = cifs_fscache_query_occupancy(
			ractl->mapping->host, index, nr_pages,
		...
		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
						   &rsize, credits);
		...
		nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl));
		nr_pages = min_t(size_t, nr_pages, next_cached - index);
		...
		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
		...
		got = __readahead_batch(ractl, rdata->pages, nr_pages);
		...
	}

I need the count to know how many, if any, pages are remaining; I need the
count and index of the window to find out if fscache has any data; I need the
count to allocate a page array.  Only after all that can I crank the next
batch for the server (assuming I didn't delegate to the cache along the way,
but that calls readahead_page()).

(Yes, I would like to remove this code entirely and just call into netfslib.
I have patches to do that, but it involves quite an overhaul of the cifs
driver, but the above might get cifs caching again more quickly pending that.
Maybe I should just take the easy way here and cache the last state so that I
can compensate in the way fuse does).

David
Matthew Wilcox Jan. 25, 2022, 2:41 p.m. UTC | #7
On Mon, Jan 24, 2022 at 04:25:11PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > Well, that's the problem isn't it?  You're expecting to mutate the state
> > and then refer to its previous state instead of its current state,
> 
> No.  I *don't* want to see the previous state.  That's where the problem is:
> The previous state isn't cleared up until the point I ask for a new batch -
> but I need to query the ractl to find the remaining window before I can ask
> for a new batch.

Oh, right, even worse -- you want to know the _future_ state of the
iterator before you mutate it (it's kind of the same thing though
if you haven't looked at how the iterator works internally).

> > whereas the other users refer to the current state instead of the
> > previous state.
> 
> Fuse has exactly the same issues:

... and yet you refuse to solve it the same way as fuse ...

> > Can't you pull readahead_index() out of the ractl ahead of the mutation?
> 
> I'm not sure what you mean by that.  What I'm now doing is:
> 
> 	while ((nr_pages = readahead_count(ractl) - ractl->_batch_count)) {
> 		...
> 		pgoff_t index = readahead_index(ractl) + ractl->_batch_count;
> 		...
> 		rc = cifs_fscache_query_occupancy(
> 			ractl->mapping->host, index, nr_pages,
> 		...
> 		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
> 						   &rsize, credits);
> 		...
> 		nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl));
> 		nr_pages = min_t(size_t, nr_pages, next_cached - index);
> 		...
> 		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
> 		...
> 		got = __readahead_batch(ractl, rdata->pages, nr_pages);
> 		...
> 	}
> 
> I need the count to know how many, if any, pages are remaining; I need the
> count and index of the window to find out if fscache has any data; I need the
> count to allocate a page array.  Only after all that can I crank the next
> batch for the server (assuming I didn't delegate to the cache along the way,
> but that calls readahead_page()).

The problem is that the other users very much want to refer to the
pre-mutation state.  eg, btrfs:

        while ((nr = readahead_page_batch(rac, pagepool))) {
                u64 contig_start = readahead_pos(rac);
                u64 contig_end = contig_start + readahead_batch_length(rac) - 1;

                contiguous_readpages(pagepool, nr, contig_start, contig_end,
                                &em_cached, &bio_ctrl, &prev_em_start);
        }

The API isn't designed to work the way you want it to work.  So either
we redesign it (... and change all the existing users ...) or you use
it the way that FUSE does, which is admittedly suboptimal, but you're
also saying it's temporary.
diff mbox series

Patch

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 59334be9ed3b..618e4851e240 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4269,8 +4269,6 @@  cifs_readv_complete(struct work_struct *work)
 	for (i = 0; i < rdata->nr_pages; i++) {
 		struct page *page = rdata->pages[i];
 
-		lru_cache_add(page);
-
 		if (rdata->result == 0 ||
 		    (rdata->result == -EAGAIN && got_bytes)) {
 			flush_dcache_page(page);
@@ -4340,7 +4338,6 @@  readpages_fill_pages(struct TCP_Server_Info *server,
 			 * fill them until the writes are flushed.
 			 */
 			zero_user(page, 0, PAGE_SIZE);
-			lru_cache_add(page);
 			flush_dcache_page(page);
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -4350,7 +4347,6 @@  readpages_fill_pages(struct TCP_Server_Info *server,
 			continue;
 		} else {
 			/* no need to hold page hostage */
-			lru_cache_add(page);
 			unlock_page(page);
 			put_page(page);
 			rdata->pages[i] = NULL;
@@ -4393,92 +4389,16 @@  cifs_readpages_copy_into_pages(struct TCP_Server_Info *server,
 	return readpages_fill_pages(server, rdata, iter, iter->count);
 }
 
-static int
-readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
-		    unsigned int rsize, struct list_head *tmplist,
-		    unsigned int *nr_pages, loff_t *offset, unsigned int *bytes)
-{
-	struct page *page, *tpage;
-	unsigned int expected_index;
-	int rc;
-	gfp_t gfp = readahead_gfp_mask(mapping);
-
-	INIT_LIST_HEAD(tmplist);
-
-	page = lru_to_page(page_list);
-
-	/*
-	 * Lock the page and put it in the cache. Since no one else
-	 * should have access to this page, we're safe to simply set
-	 * PG_locked without checking it first.
-	 */
-	__SetPageLocked(page);
-	rc = add_to_page_cache_locked(page, mapping,
-				      page->index, gfp);
-
-	/* give up if we can't stick it in the cache */
-	if (rc) {
-		__ClearPageLocked(page);
-		return rc;
-	}
-
-	/* move first page to the tmplist */
-	*offset = (loff_t)page->index << PAGE_SHIFT;
-	*bytes = PAGE_SIZE;
-	*nr_pages = 1;
-	list_move_tail(&page->lru, tmplist);
-
-	/* now try and add more pages onto the request */
-	expected_index = page->index + 1;
-	list_for_each_entry_safe_reverse(page, tpage, page_list, lru) {
-		/* discontinuity ? */
-		if (page->index != expected_index)
-			break;
-
-		/* would this page push the read over the rsize? */
-		if (*bytes + PAGE_SIZE > rsize)
-			break;
-
-		__SetPageLocked(page);
-		rc = add_to_page_cache_locked(page, mapping, page->index, gfp);
-		if (rc) {
-			__ClearPageLocked(page);
-			break;
-		}
-		list_move_tail(&page->lru, tmplist);
-		(*bytes) += PAGE_SIZE;
-		expected_index++;
-		(*nr_pages)++;
-	}
-	return rc;
-}
-
-static int cifs_readpages(struct file *file, struct address_space *mapping,
-	struct list_head *page_list, unsigned num_pages)
+static void cifs_readahead(struct readahead_control *ractl)
 {
 	int rc;
-	int err = 0;
-	struct list_head tmplist;
-	struct cifsFileInfo *open_file = file->private_data;
-	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(file);
+	struct cifsFileInfo *open_file = ractl->file->private_data;
+	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(ractl->file);
 	struct TCP_Server_Info *server;
 	pid_t pid;
 	unsigned int xid;
 
 	xid = get_xid();
-	/*
-	 * Reads as many pages as possible from fscache. Returns -ENOBUFS
-	 * immediately if the cookie is negative
-	 *
-	 * After this point, every page in the list might have PG_fscache set,
-	 * so we will need to clean that up off of every page we don't use.
-	 */
-	rc = cifs_readpages_from_fscache(mapping->host, mapping, page_list,
-					 &num_pages);
-	if (rc == 0) {
-		free_xid(xid);
-		return rc;
-	}
 
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
 		pid = open_file->pid;
@@ -4489,7 +4409,7 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
 
 	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, file, mapping, num_pages);
+		 __func__, ractl->file, ractl->mapping, readahead_count(ractl));
 
 	/*
 	 * Start with the page at end of list and move it to private
@@ -4502,26 +4422,27 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 	 * the order of declining indexes. When we put the pages in
 	 * the rdata->pages, then we want them in increasing order.
 	 */
-	while (!list_empty(page_list) && !err) {
-		unsigned int i, nr_pages, bytes, rsize;
-		loff_t offset;
-		struct page *page, *tpage;
+	while (readahead_count(ractl) - ractl->_batch_count) {
+		unsigned int i, nr_pages, got, rsize;
+		struct page *page;
 		struct cifs_readdata *rdata;
 		struct cifs_credits credits_on_stack;
 		struct cifs_credits *credits = &credits_on_stack;
 
 		if (open_file->invalidHandle) {
 			rc = cifs_reopen_file(open_file, true);
-			if (rc == -EAGAIN)
-				continue;
-			else if (rc)
+			if (rc) {
+				if (rc == -EAGAIN)
+					continue;
 				break;
+			}
 		}
 
 		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
 						   &rsize, credits);
 		if (rc)
 			break;
+		nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl));
 
 		/*
 		 * Give up immediately if rsize is too small to read an entire
@@ -4529,16 +4450,7 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 		 * reach this point however since we set ra_pages to 0 when the
 		 * rsize is smaller than a cache page.
 		 */
-		if (unlikely(rsize < PAGE_SIZE)) {
-			add_credits_and_wake_if(server, credits, 0);
-			free_xid(xid);
-			return 0;
-		}
-
-		nr_pages = 0;
-		err = readpages_get_pages(mapping, page_list, rsize, &tmplist,
-					 &nr_pages, &offset, &bytes);
-		if (!nr_pages) {
+		if (unlikely(!nr_pages)) {
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
@@ -4546,36 +4458,31 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
 		if (!rdata) {
 			/* best to give up if we're out of mem */
-			list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-				list_del(&page->lru);
-				lru_cache_add(page);
-				unlock_page(page);
-				put_page(page);
-			}
-			rc = -ENOMEM;
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
 
-		rdata->cfile = cifsFileInfo_get(open_file);
-		rdata->server = server;
-		rdata->mapping = mapping;
-		rdata->offset = offset;
-		rdata->bytes = bytes;
-		rdata->pid = pid;
-		rdata->pagesz = PAGE_SIZE;
-		rdata->tailsz = PAGE_SIZE;
+		got = __readahead_batch(ractl, rdata->pages, nr_pages);
+		if (got != nr_pages) {
+			pr_warn("__readahead_batch() returned %u/%u\n",
+				got, nr_pages);
+			nr_pages = got;
+		}
+
+		rdata->nr_pages = nr_pages;
+		rdata->bytes	= readahead_batch_length(ractl);
+		rdata->cfile	= cifsFileInfo_get(open_file);
+		rdata->server	= server;
+		rdata->mapping	= ractl->mapping;
+		rdata->offset	= readahead_pos(ractl);
+		rdata->pid	= pid;
+		rdata->pagesz	= PAGE_SIZE;
+		rdata->tailsz	= PAGE_SIZE;
 		rdata->read_into_pages = cifs_readpages_read_into_pages;
 		rdata->copy_into_pages = cifs_readpages_copy_into_pages;
-		rdata->credits = credits_on_stack;
-
-		list_for_each_entry_safe(page, tpage, &tmplist, lru) {
-			list_del(&page->lru);
-			rdata->pages[rdata->nr_pages++] = page;
-		}
+		rdata->credits	= credits_on_stack;
 
 		rc = adjust_credits(server, &rdata->credits, rdata->bytes);
-
 		if (!rc) {
 			if (rdata->cfile->invalidHandle)
 				rc = -EAGAIN;
@@ -4587,7 +4494,6 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 			add_credits_and_wake_if(server, &rdata->credits, 0);
 			for (i = 0; i < rdata->nr_pages; i++) {
 				page = rdata->pages[i];
-				lru_cache_add(page);
 				unlock_page(page);
 				put_page(page);
 			}
@@ -4597,10 +4503,13 @@  static int cifs_readpages(struct file *file, struct address_space *mapping,
 		}
 
 		kref_put(&rdata->refcount, cifs_readdata_release);
+
+		ractl->_nr_pages -= ractl->_batch_count;
+		ractl->_index += ractl->_batch_count;
+		ractl->_batch_count = 0;
 	}
 
 	free_xid(xid);
-	return rc;
 }
 
 /*
@@ -4924,7 +4833,7 @@  void cifs_oplock_break(struct work_struct *work)
  * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
  * so this method should never be called.
  *
- * Direct IO is not yet supported in the cached mode. 
+ * Direct IO is not yet supported in the cached mode.
  */
 static ssize_t
 cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
@@ -5006,7 +4915,7 @@  static int cifs_set_page_dirty(struct page *page)
 
 const struct address_space_operations cifs_addr_ops = {
 	.readpage = cifs_readpage,
-	.readpages = cifs_readpages,
+	.readahead = cifs_readahead,
 	.writepage = cifs_writepage,
 	.writepages = cifs_writepages,
 	.write_begin = cifs_write_begin,