[041/178] mm: provide filemap_range_needs_writeback() helper

Message ID 20210430055518.KCtlWixJr%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/178] arch/ia64/kernel/head.S: remove duplicate include

Commit Message

Andrew Morton April 30, 2021, 5:55 a.m. UTC
From: Jens Axboe <axboe@kernel.dk>
Subject: mm: provide filemap_range_needs_writeback() helper

Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.

An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town. 
For an async buffered read-like workload, I am normally expecting
_zero_ offloads to a worker thread, but this one had tons of them.  I'd
drop caches and things would look good again, but then a minute later
we'd regress back to using workers.  Turns out that every minute
something was reading parts of the device, which would add page cache
for that inode.  I put patches like these in for our kernel, and the
problem was solved.

Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range.  This causes unnecessary work from the
caller's side, when the IO could have been issued totally fine without
blocking on writeback when there is none.


This patch (of 3):

For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range.  The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT, as we'll fall back to the
slow path (and need to retry) if there's just a clean page cache page in
the range.

Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.
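
For illustration, a minimal sketch of how an IOCB_NOWAIT caller (say, an
O_DIRECT read path with mapping, iocb, count and an error variable in
scope) might then use the helper; this is a sketch, not part of this
patch:

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/* Only give up if the range actually needs writeback. */
		if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
						  iocb->ki_pos + count - 1))
			return -EAGAIN;
	} else {
		/* Blocking callers write out and wait as before. */
		error = filemap_write_and_wait_range(mapping, iocb->ki_pos,
						     iocb->ki_pos + count - 1);
		if (error < 0)
			return error;
	}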

Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/fs.h |    2 ++
 mm/filemap.c       |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

Comments

Linus Torvalds April 30, 2021, 4:50 p.m. UTC | #1
Wait, what?

On Thu, Apr 29, 2021 at 10:55 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> +       if (!mapping_needs_writeback(mapping))
> +               return false;
> +       if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
> +           !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> +               return false;
> +       if (end_byte < start_byte)
> +               return false;

Ok, good, get rid of the trivial cases first.

> +       rcu_read_lock();
> +       xas_for_each(&xas, page, max) {
> +               if (xas_retry(&xas, page))
> +                       continue;
> +               if (xa_is_value(page))
> +                       continue;
> +               if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
> +                       break;
> +       }
> +       rcu_read_unlock();

Whee. This looks very very expensive indeed.

Why is it going to each page, why does it care about locked when the
simple early cases did not?

Why isn't this just using xas_for_each_marked(), or maybe even just a
single "xa_find()", since all it cares about is "is _any_ page.."

Maybe I'm missing something, but this really does look potentially
horrendously slow.

Jens/Willy?

                Linus
Matthew Wilcox May 9, 2021, 7:40 p.m. UTC | #2
On Fri, Apr 30, 2021 at 09:50:29AM -0700, Linus Torvalds wrote:
> Wait, what?
> 
> On Thu, Apr 29, 2021 at 10:55 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > +       if (!mapping_needs_writeback(mapping))
> > +               return false;
> > +       if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
> > +           !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> > +               return false;
> > +       if (end_byte < start_byte)
> > +               return false;
> 
> Ok, good, get rid of the trivial cases first.
> 
> > +       rcu_read_lock();
> > +       xas_for_each(&xas, page, max) {
> > +               if (xas_retry(&xas, page))
> > +                       continue;
> > +               if (xa_is_value(page))
> > +                       continue;
> > +               if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
> > +                       break;
> > +       }
> > +       rcu_read_unlock();
> 
> Whee. This looks very very expensive indeed.
> 
> Why is it going to each page, why does it care about locked when the
> simple early cases did not?

I was hoping Jens would answer this, but I guess he missed it so far.
I don't know why it now cares about whether the page is locked or not.
I presume it's not even the slightest bit slow for Jens because the
common case is that there are no pages in the range, and so we never
check any pages.

> Why isn't this just using xas_for_each_marked(), or maybe even just a
> single "xa_find()", since all it cares about is "is _any_ page.."

We don't have an iterator that looks for either mark A or mark B,
so we'd have to iterate twice, once to look for dirty and once to
look for writeback.  For the common case of there being no pages in
the range, that'd be slower.
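
Roughly, that two-pass version would look like this (sketch only,
untested, with a hypothetical range_has_tag() helper):

	static bool range_has_tag(struct address_space *mapping, pgoff_t first,
				  pgoff_t max, xa_mark_t tag)
	{
		XA_STATE(xas, &mapping->i_pages, first);
		struct page *page;

		rcu_read_lock();
		xas_for_each_marked(&xas, page, max, tag) {
			if (!xas_retry(&xas, page))
				break;
		}
		rcu_read_unlock();
		return page != NULL;
	}

	/* ... and then two scans of the range instead of one: */
	return range_has_tag(mapping, first, max, PAGECACHE_TAG_DIRTY) ||
	       range_has_tag(mapping, first, max, PAGECACHE_TAG_WRITEBACK);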

I could add an iterator that returns the next entry in the given range
that has any mark set (almost trivial; OR the mark bitmaps together
before scanning them).  But I don't know why we care about the Locked
case, so I don't know if it'd be the right thing to do.

xa_find() isn't quite right because that'll return shadow entries, and
we do want to skip those.
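
Concretely (again a sketch only), a bare xa_find() would need its own
loop to skip those value entries:

	pgoff_t index = start_byte >> PAGE_SHIFT;
	struct page *page;

	/* xa_find() also returns shadow (value) entries; skip them by hand. */
	page = xa_find(&mapping->i_pages, &index, max, XA_PRESENT);
	while (page && xa_is_value(page))
		page = xa_find_after(&mapping->i_pages, &index, max, XA_PRESENT);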

Patch

--- a/include/linux/fs.h~mm-provide-filemap_range_needs_writeback-helper
+++ a/include/linux/fs.h
@@ -2878,6 +2878,8 @@  static inline int filemap_fdatawait(stru
 
 extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
 				  loff_t lend);
+extern bool filemap_range_needs_writeback(struct address_space *,
+					  loff_t lstart, loff_t lend);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
--- a/mm/filemap.c~mm-provide-filemap_range_needs_writeback-helper
+++ a/mm/filemap.c
@@ -636,6 +636,49 @@  static bool mapping_needs_writeback(stru
 }
 
 /**
+ * filemap_range_needs_writeback - check if range potentially needs writeback
+ * @mapping:           address space within which to check
+ * @start_byte:        offset in bytes where the range starts
+ * @end_byte:          offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback. Used by O_DIRECT
+ * read/write with IOCB_NOWAIT, to see if the caller needs to do
+ * filemap_write_and_wait_range() before proceeding.
+ *
+ * Return: %true if the caller should do filemap_write_and_wait_range() before
+ * doing O_DIRECT to a page in this range, %false otherwise.
+ */
+bool filemap_range_needs_writeback(struct address_space *mapping,
+				   loff_t start_byte, loff_t end_byte)
+{
+	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
+	pgoff_t max = end_byte >> PAGE_SHIFT;
+	struct page *page;
+
+	if (!mapping_needs_writeback(mapping))
+		return false;
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
+	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+		return false;
+	if (end_byte < start_byte)
+		return false;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, max) {
+		if (xas_retry(&xas, page))
+			continue;
+		if (xa_is_value(page))
+			continue;
+		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
+			break;
+	}
+	rcu_read_unlock();
+	return page != NULL;
+}
+EXPORT_SYMBOL_GPL(filemap_range_needs_writeback);
+
+/**
  * filemap_write_and_wait_range - write out & wait on a file range
  * @mapping:	the address_space for the pages
  * @lstart:	offset in bytes where the range starts