
mm: don't break integrity writeback on ->writepage() error

Message ID 20181105163613.7542-1-bfoster@redhat.com (mailing list archive)
State New, archived

Commit Message

Brian Foster Nov. 5, 2018, 4:36 p.m. UTC
write_cache_pages() currently breaks out of the writepage loop in
the event of a ->writepage() error. This causes problems for
integrity writeback on XFS in the event of a persistent error as XFS
expects to process every dirty+delalloc page such that it can
discard delalloc blocks when real block allocation fails.  Failure
to handle all delalloc pages leaves the filesystem in an
inconsistent state if the integrity writeback happens to be due to
an unmount, for example.

Update write_cache_pages() to continue processing pages for
integrity writeback regardless of ->writepage() errors. Save the
first encountered error and return it once complete. This
facilitates XFS or any other fs that expects integrity writeback to
process the entire set of dirty pages regardless of errors.
Background writeback continues to exit on the first error
encountered.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This was actually first posted[1] as a patch in XFS to not return errors
from ->writepage() when called via write_cache_pages(). After some
discussion with Dave, it was suggested that this is a
write_cache_pages() bug rather than one in XFS. I think that could go
either way, so I'm floating this patch as an alternative. FWIW, that
same thread also includes a supporting patch for an fstests test[2] that
demonstrates the original problem this patch attempts to resolve.

This applies on top of v4.19 and I've tested it against XFS and ext4
(defaults) and not seen any regressions. Note that it's not clear to me
if ext4 is affected by the same or similar problem and I skipped btrfs
since it seems to duplicate all of the associated writeback code.

Finally, I'm not totally sure about the ->for_sync bit in the error
handling logic. I included it out of caution to try and handle any sort
of potential (->sync_mode == WB_SYNC_NONE && ->for_sync == 1)
combination, but that doesn't appear to be used anywhere that I can see.
Instead, ->for_sync seems more like an exceptional case of ->sync_mode
== WB_SYNC_ALL.

Thoughts?

Brian

[1] https://marc.info/?l=linux-xfs&m=154102085505264&w=2
[2] https://marc.info/?l=fstests&m=154031860022439&w=2

 mm/page-writeback.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)
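
A minimal userspace sketch of the error handling described in the commit
message, for illustration only. Everything here is a hypothetical model:
write_pages_model(), enum model_sync_mode and fail_from() are not kernel
symbols, and the sketch leaves out ->for_sync, AOP_WRITEPAGE_ACTIVATE and
the done_index bookkeeping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

/* Hypothetical per-page callback: returns 0 or a negative errno. */
typedef int (*model_writepage_fn)(size_t index, void *data);

/*
 * Walk npages "dirty pages". For integrity writeback (MODEL_SYNC_ALL),
 * keep going after an error and return the first error seen. For
 * background writeback, stop at the first error. *processed receives the
 * number of pages that were handed to the callback.
 */
static int write_pages_model(size_t npages, enum model_sync_mode mode,
			     model_writepage_fn writepage, void *data,
			     size_t *processed)
{
	int ret = 0;
	size_t i;

	assert(processed != NULL);
	*processed = 0;

	for (i = 0; i < npages; i++) {
		int error = writepage(i, data);

		(*processed)++;
		if (error) {
			if (mode != MODEL_SYNC_ALL) {
				/* Background writeback: bail out right away. */
				ret = error;
				break;
			}
			/* Integrity writeback: remember only the first error. */
			if (!ret)
				ret = error;
		}
	}
	return ret;
}

/*
 * Example callback: the page at index *data fails with -ENOSPC and every
 * later page fails with -EIO, so the "first error" is distinguishable.
 */
static int fail_from(size_t index, void *data)
{
	size_t first_bad = *(const size_t *)data;

	if (index == first_bad)
		return -ENOSPC;
	if (index > first_bad)
		return -EIO;
	return 0;
}
```

With five pages and first_bad == 2, the MODEL_SYNC_ALL case visits all
five pages and returns -ENOSPC, while the MODEL_SYNC_NONE case stops after
the third page with the same error.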

Comments

Andrew Morton Nov. 9, 2018, 11:42 p.m. UTC | #1
On Mon,  5 Nov 2018 11:36:13 -0500 Brian Foster <bfoster@redhat.com> wrote:

> write_cache_pages() currently breaks out of the writepage loop in
> the event of a ->writepage() error. This causes problems for
> integrity writeback on XFS

For the uninitiated, please define the term "integrity writeback". 
Quite carefully ;) I'm not sure what it actually means.  grepping
fs/xfs for "integrity" doesn't reveal anything.

<reads the code>

OK, it appears the term means "to sync data to disk" as opposed to
"periodic dirty memory cleaning".  I guess we don't have particularly
well-established terms for the two concepts.

> in the event of a persistent error as XFS
> expects to process every dirty+delalloc page such that it can
> discard delalloc blocks when real block allocation fails.  Failure
> to handle all delalloc pages leaves the filesystem in an
> inconsistent state if the integrity writeback happens to be due to
> an unmount, for example.
> 
> Update write_cache_pages() to continue processing pages for
> integrity writeback regardless of ->writepage() errors. Save the
> first encountered error and return it once complete. This
> facilitates XFS or any other fs that expects integrity writeback to
> process the entire set of dirty pages regardless of errors.
> Background writeback continues to exit on the first error
> encountered.
> 
> ...
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2156,6 +2156,7 @@ int write_cache_pages(struct address_space *mapping,
>  {
>  	int ret = 0;
>  	int done = 0;
> +	int error;
>  	struct pagevec pvec;
>  	int nr_pages;
>  	pgoff_t uninitialized_var(writeback_index);
> @@ -2236,25 +2237,29 @@ int write_cache_pages(struct address_space *mapping,
>  				goto continue_unlock;
>  
>  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> -			ret = (*writepage)(page, wbc, data);
> -			if (unlikely(ret)) {
> -				if (ret == AOP_WRITEPAGE_ACTIVATE) {
> +			error = (*writepage)(page, wbc, data);
> +			if (unlikely(error)) {
> +				if (error == AOP_WRITEPAGE_ACTIVATE) {
>  					unlock_page(page);
> -					ret = 0;
> -				} else {
> +					error = 0;
> +				} else if (wbc->sync_mode != WB_SYNC_ALL &&
> +					   !wbc->for_sync) {

And here we're determining that it is not a sync-data-to-disk
operation, hence it must be a clean-dirty-pages operation.

This isn't very well-controlled, is it?  It's an inference which was
put together by examining current callers, I assume?

It would be good if we could force callers to be explicit about their
intent here.  But I'm not sure that adding a new writeback_sync_mode is
the way to do this.

At a minimum it would be good to have careful comments in here
explaining what is going on, justifying the above inference, explaining
the xfs requirement (hopefully in a way which isn't xfs-specific).

>  					/*
> -					 * done_index is set past this page,
> -					 * so media errors will not choke
> +					 * done_index is set past this page, so
> +					 * media errors will not choke
>  					 * background writeout for the entire
>  					 * file. This has consequences for
>  					 * range_cyclic semantics (ie. it may
>  					 * not be suitable for data integrity
>  					 * writeout).
>  					 */

Brian Foster Nov. 10, 2018, 3:19 p.m. UTC | #2
On Fri, Nov 09, 2018 at 03:42:51PM -0800, Andrew Morton wrote:
> On Mon,  5 Nov 2018 11:36:13 -0500 Brian Foster <bfoster@redhat.com> wrote:
> 
> > write_cache_pages() currently breaks out of the writepage loop in
> > the event of a ->writepage() error. This causes problems for
> > integrity writeback on XFS
> 
> For the uninitiated, please define the term "integrity writeback". 
> Quite carefully ;) I'm not sure what it actually means.  grepping
> fs/xfs for "integrity" doesn't reveal anything.
> 
> <reads the code>
> 
> OK, it appears the term means "to sync data to disk" as opposed to
> "periodic dirty memory cleaning".  I guess we don't have particularly
> well-established terms for the two concepts.
> 

Indeed. The intent is basically to describe any writeback that is
intended to persist data and do so before returning (i.e., fsync(),
etc.). That was the best term I came across to describe it ("integrity
sync" is used in some of the existing comments), but I can try to be
more descriptive in the commit log.
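
For illustration, a userspace caller that depends on this kind of
writeback might look like the sketch below. write_and_sync() is a
hypothetical helper, not an existing API. The assumption is that the
fsync() call is what requests the persist-before-returning behavior
discussed here, and that a writeback error reaches the caller through its
return value.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to path, then ask the kernel to persist them
 * before returning. Returns 0 on success or a negative errno.
 */
static int write_and_sync(const char *path, const char *buf, size_t len)
{
	ssize_t written;
	int ret = 0;
	int fd;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -errno;

	written = write(fd, buf, len);
	if (written < 0)
		ret = -errno;
	else if ((size_t)written != len)
		ret = -EIO;	/* treat a short write as an error here */

	/* fsync() does not return until the data is on stable storage. */
	if (!ret && fsync(fd) < 0)
		ret = -errno;

	if (close(fd) < 0 && !ret)
		ret = -errno;

	return ret;
}
```

Background writeback, by contrast, has no caller waiting on the result.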

> > in the event of a persistent error as XFS
> > expects to process every dirty+delalloc page such that it can
> > discard delalloc blocks when real block allocation fails.  Failure
> > to handle all delalloc pages leaves the filesystem in an
> > inconsistent state if the integrity writeback happens to be due to
> > an unmount, for example.
> > 
> > Update write_cache_pages() to continue processing pages for
> > integrity writeback regardless of ->writepage() errors. Save the
> > first encountered error and return it once complete. This
> > facilitates XFS or any other fs that expects integrity writeback to
> > process the entire set of dirty pages regardless of errors.
> > Background writeback continues to exit on the first error
> > encountered.
> > 
> > ...
> >
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2156,6 +2156,7 @@ int write_cache_pages(struct address_space *mapping,
> >  {
> >  	int ret = 0;
> >  	int done = 0;
> > +	int error;
> >  	struct pagevec pvec;
> >  	int nr_pages;
> >  	pgoff_t uninitialized_var(writeback_index);
> > @@ -2236,25 +2237,29 @@ int write_cache_pages(struct address_space *mapping,
> >  				goto continue_unlock;
> >  
> >  			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > -			ret = (*writepage)(page, wbc, data);
> > -			if (unlikely(ret)) {
> > -				if (ret == AOP_WRITEPAGE_ACTIVATE) {
> > +			error = (*writepage)(page, wbc, data);
> > +			if (unlikely(error)) {
> > +				if (error == AOP_WRITEPAGE_ACTIVATE) {
> >  					unlock_page(page);
> > -					ret = 0;
> > -				} else {
> > +					error = 0;
> > +				} else if (wbc->sync_mode != WB_SYNC_ALL &&
> > +					   !wbc->for_sync) {
> 
> And here we're determining that it is not a sync-data-to-disk
> operation, hence it must be a clean-dirty-pages operation.
> 
> This isn't very well-controlled, is it?  It's an inference which was
> put together by examining current callers, I assume?
> 

Yeah, sort of. Some of the comments do already explain how WB_SYNC_ALL
refers to "integrity" writeback/sync (above write_cache_pages(), for
example). The ->for_sync thing is more of an inference based on its use
in __writeback_single_inode() and the comment where it is defined.

> It would be good if we could force callers to be explicit about their
> intent here.  But I'm not sure that adding a new writeback_sync_mode is
> the way to do this.
> 

The more I look at it, however, I think I could probably drop the
for_sync bit here. It appears only to be used for a special case of
WB_SYNC_ALL that isn't relevant to this patch, so it only serves to
complicate in this context.
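
To make the difference concrete, here is a small standalone sketch
comparing the check as posted with the simplified one. struct wbc_model
and both helpers are hypothetical and only mirror the two
writeback_control fields involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two sync modes. */
enum wbm_sync_mode { WBM_SYNC_NONE, WBM_SYNC_ALL };

/* Hypothetical mirror of the two writeback_control fields in question. */
struct wbc_model {
	enum wbm_sync_mode sync_mode;
	unsigned int for_sync : 1;
};

/* The check as posted: stop on error unless this looks like a sync. */
static bool stop_as_posted(const struct wbc_model *w)
{
	assert(w != NULL);
	return w->sync_mode != WBM_SYNC_ALL && !w->for_sync;
}

/* The simplified check, with the for_sync test dropped. */
static bool stop_simplified(const struct wbc_model *w)
{
	assert(w != NULL);
	return w->sync_mode != WBM_SYNC_ALL;
}
```

The two helpers disagree only for sync_mode == WBM_SYNC_NONE with
for_sync == 1, which is the combination that does not appear to be used.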

I'm not sure if you had more in mind beyond that..? There are a lot of
knobs on the wbc in general. It might be interesting to see if some of
that could be cleaned up to factor out some of those seemingly bolt-on
knobs, but I'd have to stare at the code more and think about it. Even
still, any such change is probably better as a follow on patch to this
one, which is intended to be an isolated bug fix. Thoughts on any of
that are appreciated.

> At a minimum it would be good to have careful comments in here
> explaining what is going on, justifying the above inference, explaining
> the xfs requirement (hopefully in a way which isn't xfs-specific).
> 

Ok, I can add a comment here that covers such details. Thanks for the
feedback.

Brian

> >  					/*
> > -					 * done_index is set past this page,
> > -					 * so media errors will not choke
> > +					 * done_index is set past this page, so
> > +					 * media errors will not choke
> >  					 * background writeout for the entire
> >  					 * file. This has consequences for
> >  					 * range_cyclic semantics (ie. it may
> >  					 * not be suitable for data integrity
> >  					 * writeout).
> >  					 */
>

Patch

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 84ae9bf5858a..9dbbf9465ff9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2156,6 +2156,7 @@  int write_cache_pages(struct address_space *mapping,
 {
 	int ret = 0;
 	int done = 0;
+	int error;
 	struct pagevec pvec;
 	int nr_pages;
 	pgoff_t uninitialized_var(writeback_index);
@@ -2236,25 +2237,29 @@  int write_cache_pages(struct address_space *mapping,
 				goto continue_unlock;
 
 			trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
-			ret = (*writepage)(page, wbc, data);
-			if (unlikely(ret)) {
-				if (ret == AOP_WRITEPAGE_ACTIVATE) {
+			error = (*writepage)(page, wbc, data);
+			if (unlikely(error)) {
+				if (error == AOP_WRITEPAGE_ACTIVATE) {
 					unlock_page(page);
-					ret = 0;
-				} else {
+					error = 0;
+				} else if (wbc->sync_mode != WB_SYNC_ALL &&
+					   !wbc->for_sync) {
 					/*
-					 * done_index is set past this page,
-					 * so media errors will not choke
+					 * done_index is set past this page, so
+					 * media errors will not choke
 					 * background writeout for the entire
 					 * file. This has consequences for
 					 * range_cyclic semantics (ie. it may
 					 * not be suitable for data integrity
 					 * writeout).
 					 */
+					ret = error;
 					done_index = page->index + 1;
 					done = 1;
 					break;
 				}
+				if (!ret)
+					ret = error;
 			}
 
 			/*