diff mbox series

btrfs: Avoid getting stuck during cyclic writebacks

Message ID 20191003142713.GA2622251@devbig004.ftw2.facebook.com (mailing list archive)
State New, archived
Headers show
Series btrfs: Avoid getting stuck during cyclic writebacks | expand

Commit Message

Tejun Heo Oct. 3, 2019, 2:27 p.m. UTC
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over.  However,
instead of current index + 1, it gets to to the current index itself.

Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.

1. There is a single file which has accumulated enough dirty pages to
   trigger balance_dirty_pages() and the writer appending to the file
   with a series of short writes.

2. bdp kicks in, wakes up background writeback and sleeps.

3. Writeback kicks in and the cursor is on the last page of the dirty
   file.  Writeback is started or skipped if already in progress.  As
   it's EOF, extent_write_cache_pages() returns and the cursor is set
   to done_index which is pointing to the last page.

4. Writeback is done.  Nothing happens till bdp finishes, at which
   point we go back to #1.

This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system.  We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.

Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed.  However, this concern seems bogus.

On each page, the code either:

* Skips and moves onto the next page.

* Fails issue and sets done_index to index + 1.

* Successfully issues and continue to the next page if budget allows
  and not EOF.

IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page.  Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related.  Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.

Fix it by always setting done_index past the current page being
processed.

Note that this problem exists in other writepages too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
---
 fs/btrfs/extent_io.c |   12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

Comments

David Sterba Oct. 8, 2019, 2:23 p.m. UTC | #1
On Thu, Oct 03, 2019 at 07:27:13AM -0700, Tejun Heo wrote:
> During a cyclic writeback, extent_write_cache_pages() uses done_index
> to update the writeback_index after the current run is over.  However,
> instead of current index + 1, it gets to to the current index itself.
> 
> Unfortunately, this, combined with returning on EOF instead of looping
> back, can lead to the following pathlogical behavior.

Tricky stuff.

> 1. There is a single file which has accumulated enough dirty pages to
>    trigger balance_dirty_pages() and the writer appending to the file
>    with a series of short writes.
> 
> 2. bdp kicks in, wakes up background writeback and sleeps.

What does 'bdp' refer to?

> 3. Writeback kicks in and the cursor is on the last page of the dirty
>    file.  Writeback is started or skipped if already in progress.  As
>    it's EOF, extent_write_cache_pages() returns and the cursor is set
>    to done_index which is pointing to the last page.
> 
> 4. Writeback is done.  Nothing happens till bdp finishes, at which
>    point we go back to #1.
> 
> This can almost completely stall out writing back of the file and keep
> the system over dirty threshold for a long time which can mess up the
> whole system.  We encountered this issue in production with a package
> handling application which can reliably reproduce the issue when
> running under tight memory limits.
> 
> Reading the comment in the error handling section, this seems to be to
> avoid accidentally skipping a page in case the write attempt on the
> page doesn't succeed.  However, this concern seems bogus.
> 
> On each page, the code either:
> 
> * Skips and moves onto the next page.
> 
> * Fails issue and sets done_index to index + 1.
> 
> * Successfully issues and continue to the next page if budget allows
>   and not EOF.
> 
> IOW, as long as it's not EOF and there's budget, the code never
> retries writing back the same page.  Only when a page happens to be
> the last page of a particular run, we end up retrying the page, which
> can't possibly guarantee anything data integrity related.  Besides,
> cyclic writes are only used for non-syncing writebacks meaning that
> there's no data integrity implication to begin with.

The code was added in a91326679f2a0a4c239 ("Btrfs: make
mapping->writeback_index point to the last written page") after a user
report in https://www.spinics.net/lists/linux-btrfs/msg52628.html , slow
appends that caused fragmentation

What you describe as the cause is similar, but you're partially
reverting the fix that was supposed to fix it. As there's more code
added by the original patch, the revert won't probably bring back the
bug.

The whole function and states are hard to follow, I agree with your
reasoning about the check being bogus and overall I'd rather see fewer
special cases in the function.

Also the removed comment mentions media errors but this was not the
problem for the original report and is not a common scenario either. So
as long as the fallback in such case is sane (ie. set done = 1 and
exit), I don't see futher problems.

> Fix it by always setting done_index past the current page being
> processed.
> 
> Note that this problem exists in other writepages too.

I can see that write_cache_pages does the same done_index updates.  So
it makes sense that the page walking and writeback index tracking
behaviour is consistent, unless extent_write_cache_pages has diverged
too much.

I'll add the patch to misc-next. Thanks.
Tejun Heo Oct. 8, 2019, 2:42 p.m. UTC | #2
Hello,

On Tue, Oct 08, 2019 at 04:23:22PM +0200, David Sterba wrote:
> > 1. There is a single file which has accumulated enough dirty pages to
> >    trigger balance_dirty_pages() and the writer appending to the file
> >    with a series of short writes.
> > 
> > 2. bdp kicks in, wakes up background writeback and sleeps.
> 
> What does 'bdp' refer to?

Oh, Sorry about that.  balance_dirty_pages().

> > IOW, as long as it's not EOF and there's budget, the code never
> > retries writing back the same page.  Only when a page happens to be
> > the last page of a particular run, we end up retrying the page, which
> > can't possibly guarantee anything data integrity related.  Besides,
> > cyclic writes are only used for non-syncing writebacks meaning that
> > there's no data integrity implication to begin with.
> 
> The code was added in a91326679f2a0a4c239 ("Btrfs: make
> mapping->writeback_index point to the last written page") after a user
> report in https://www.spinics.net/lists/linux-btrfs/msg52628.html , slow
> appends that caused fragmentation

Ah, okay, that makes sense.  That prolly warrants some comments.

> What you describe as the cause is similar, but you're partially
> reverting the fix that was supposed to fix it. As there's more code
> added by the original patch, the revert won't probably bring back the
> bug.
> 
> The whole function and states are hard to follow, I agree with your
> reasoning about the check being bogus and overall I'd rather see fewer
> special cases in the function.
> 
> Also the removed comment mentions media errors but this was not the
> problem for the original report and is not a common scenario either. So
> as long as the fallback in such case is sane (ie. set done = 1 and
> exit), I don't see futher problems.
> 
> > Fix it by always setting done_index past the current page being
> > processed.
> > 
> > Note that this problem exists in other writepages too.
> 
> I can see that write_cache_pages does the same done_index updates.  So
> it makes sense that the page walking and writeback index tracking
> behaviour is consistent, unless extent_write_cache_pages has diverged
> too much.
> 
> I'll add the patch to misc-next. Thanks.

So, in case trying to write the last page is still needed, the thing
necessary to fix this deadlock is avoiding repeating on the last page
too many times / indefinitely as that's what ends up blocking forward
progress - almost all of dirty data is behind the cyclic cursor and
the cursor can't wrap back to get to them.

Thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cceaf05aada2..4905f48587df 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4121,7 +4121,7 @@  static int extent_write_cache_pages(struct address_space *mapping,
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			done_index = page->index;
+			done_index = page->index + 1;
 			/*
 			 * At this point we hold neither the i_pages lock nor
 			 * the page lock: the page may be truncated or
@@ -4156,16 +4156,6 @@  static int extent_write_cache_pages(struct address_space *mapping,
 
 			ret = __extent_writepage(page, wbc, epd);
 			if (ret < 0) {
-				/*
-				 * done_index is set past this page,
-				 * so media errors will not choke
-				 * background writeout for the entire
-				 * file. This has consequences for
-				 * range_cyclic semantics (ie. it may
-				 * not be suitable for data integrity
-				 * writeout).
-				 */
-				done_index = page->index + 1;
 				done = 1;
 				break;
 			}