diff mbox series

btrfs: fix race in subpage sync writeback handling

Message ID ff2b56ecb70e4db53de11a019530c768a24f48f1.1744659146.git.josef@toxicpanda.com (mailing list archive)
State New
Headers show
Series btrfs: fix race in subpage sync writeback handling | expand

Commit Message

Josef Bacik April 14, 2025, 7:32 p.m. UTC
While debugging a fs corruption with subpage we noticed a potential race
in how we do tagging for writeback.  Boris came up with the following
diagram to describe the race potential

EB1                                                       EB2
btree_write_cache_pages
  tag_pages_for_writeback
  filemap_get_folios_tag(TOWRITE)
  submit_eb_page
    submit_eb_subpage
      for eb in folio:
        write_one_eb
                                                          set_extent_buffer_dirty
                                                          btrfs_meta_folio_set_dirty
                                                          ...
                                                          filemap_fdatawrite_range
                                                            btree_write_cache_pages
                                                            tag_pages_for_writeback
          folio_lock
          btrfs_meta_folio_clear_dirty
          btrfs_meta_folio_set_writeback // clear TOWRITE
          folio_unlock
                                                            filemap_get_folios_tag(TOWRITE) //missed

The problem here is that we'll call folio_start_writeback() the first
time we initiate writeback on one extent buffer, if we happened to dirty
the extent buffer behind the one we were currently writing in the first
task, and race in as described above, we would miss the TOWRITE tag as
it would have been cleared, and we will never initiate writeback on that
EB.

This is kind of complicated for us, the best thing to do here is to
simply leave the TOWRITE tag in place, and only clear it if we aren't
dirty after we finish processing all the eb's in the folio.

Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
 fs/btrfs/subpage.c   |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

Comments

Boris Burkov April 14, 2025, 7:52 p.m. UTC | #1
On Mon, Apr 14, 2025 at 03:32:34PM -0400, Josef Bacik wrote:
> While debugging a fs corruption with subpage we noticed a potential race
> in how we do tagging for writeback.  Boris came up with the following
> diagram to describe the race potential
> 
> EB1                                                       EB2
> btree_write_cache_pages
>   tag_pages_for_writeback
>   filemap_get_folios_tag(TOWRITE)
>   submit_eb_page
>     submit_eb_subpage
>       for eb in folio:
>         write_one_eb
>                                                           set_extent_buffer_dirty
>                                                           btrfs_meta_folio_set_dirty
>                                                           ...
>                                                           filemap_fdatawrite_range
>                                                             btree_write_cache_pages
>                                                             tag_pages_for_writeback
>           folio_lock
>           btrfs_meta_folio_clear_dirty
>           btrfs_meta_folio_set_writeback // clear TOWRITE
>           folio_unlock
>                                                             filemap_get_folios_tag(TOWRITE) //missed
> 
> The problem here is that we'll call folio_start_writeback() the first
> time we initiate writeback on one extent buffer, if we happened to dirty
> the extent buffer behind the one we were currently writing in the first
> task, and race in as described above, we would miss the TOWRITE tag as
> it would have been cleared, and we will never initiate writeback on that
> EB.
> 
> This is kind of complicated for us, the best thing to do here is to
> simply leave the TOWRITE tag in place, and only clear it if we aren't
> dirty after we finish processing all the eb's in the folio.
> 
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
>  fs/btrfs/subpage.c   |  2 +-
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6cfd286b8bbc..5d09a47c1c28 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
>  		}
>  		free_extent_buffer(eb);
>  	}
> +	/*
> +	 * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> +	 * subpage we use __folio_start_writeback(folio, true), which keeps it
> +	 * from clearing TOWRITE.  This is because we walk the bitmap and
> +	 * process each eb one at a time, and then locking the folio when we
> +	 * process the eb.  We could have somebody dirty behind us, and then
> +	 * subsequently mark this range as TOWRITE.  In that case we must not
> +	 * clear TOWRITE or we will skip writing back the dirty folio.
> +	 *
> +	 * So here lock the folio, if it is clean we know we are done with it,
> +	 * and we can clear TOWRITE.
> +	 */
> +	folio_lock(folio);
> +	if (!folio_test_dirty(folio)) {
> +		XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> +		unsigned long flags;
> +
> +		xas_lock_irqsave(&xas, flags);
> +		xas_load(&xas);
> +		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> +		xas_unlock_irqrestore(&xas, flags);
> +	}
> +	folio_unlock(folio);
>  	return submitted;
>  }
>  
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index d4f019233493..53140a9dad9d 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
>  	spin_lock_irqsave(&subpage->lock, flags);
>  	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
>  	if (!folio_test_writeback(folio))
> -		folio_start_writeback(folio);
> +		__folio_start_writeback(folio, true);
>  	spin_unlock_irqrestore(&subpage->lock, flags);
>  }
>  
> -- 
> 2.48.1
>
Qu Wenruo April 14, 2025, 9:45 p.m. UTC | #2
在 2025/4/15 05:02, Josef Bacik 写道:
> While debugging a fs corruption with subpage we noticed a potential race
> in how we do tagging for writeback.  Boris came up with the following
> diagram to describe the race potential
> 
> EB1                                                       EB2
> btree_write_cache_pages
>    tag_pages_for_writeback
>    filemap_get_folios_tag(TOWRITE)
>    submit_eb_page
>      submit_eb_subpage
>        for eb in folio:
>          write_one_eb
>                                                            set_extent_buffer_dirty
>                                                            btrfs_meta_folio_set_dirty
>                                                            ...
>                                                            filemap_fdatawrite_range
>                                                              btree_write_cache_pages
>                                                              tag_pages_for_writeback
>            folio_lock
>            btrfs_meta_folio_clear_dirty
>            btrfs_meta_folio_set_writeback // clear TOWRITE
>            folio_unlock
>                                                              filemap_get_folios_tag(TOWRITE) //missed
> 
> The problem here is that we'll call folio_start_writeback() the first
> time we initiate writeback on one extent buffer, if we happened to dirty
> the extent buffer behind the one we were currently writing in the first
> task, and race in as described above, we would miss the TOWRITE tag as
> it would have been cleared, and we will never initiate writeback on that
> EB.
> 
> This is kind of complicated for us, the best thing to do here is to
> simply leave the TOWRITE tag in place, and only clear it if we aren't
> dirty after we finish processing all the eb's in the folio.
> 
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>   fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
>   fs/btrfs/subpage.c   |  2 +-
>   2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 6cfd286b8bbc..5d09a47c1c28 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
>   		}
>   		free_extent_buffer(eb);
>   	}
> +	/*
> +	 * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> +	 * subpage we use __folio_start_writeback(folio, true), which keeps it
> +	 * from clearing TOWRITE.  This is because we walk the bitmap and
> +	 * process each eb one at a time, and then locking the folio when we
> +	 * process the eb.  We could have somebody dirty behind us, and then
> +	 * subsequently mark this range as TOWRITE.  In that case we must not
> +	 * clear TOWRITE or we will skip writing back the dirty folio.
> +	 *
> +	 * So here lock the folio, if it is clean we know we are done with it,
> +	 * and we can clear TOWRITE.
> +	 */
> +	folio_lock(folio);
> +	if (!folio_test_dirty(folio)) {
> +		XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> +		unsigned long flags;
> +
> +		xas_lock_irqsave(&xas, flags);
> +		xas_load(&xas);
> +		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> +		xas_unlock_irqrestore(&xas, flags);
> +	}
> +	folio_unlock(folio);
>   	return submitted;
>   }
>   
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index d4f019233493..53140a9dad9d 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
>   	spin_lock_irqsave(&subpage->lock, flags);
>   	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
>   	if (!folio_test_writeback(folio))
> -		folio_start_writeback(folio);
> +		__folio_start_writeback(folio, true);

This looks a little dangerous to me, as for data writeback we rely on 
folio_start_writeback() to clear the TOWRITE tag before this change.

Can we do a test on the dirty bitmap and only call clear TOWRITE tag 
when there is no dirty blocks?

(This also means, we must clear the folio dirty before start writeback, 
there are some exceptions that needs to be addressed)

By that, we do not need the manual tag handling in submit_eb_subpage().

Thanks,
Qu

>   	spin_unlock_irqrestore(&subpage->lock, flags);
>   }
>
Boris Burkov April 15, 2025, 12:40 a.m. UTC | #3
On Tue, Apr 15, 2025 at 07:15:42AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/4/15 05:02, Josef Bacik 写道:
> > While debugging a fs corruption with subpage we noticed a potential race
> > in how we do tagging for writeback.  Boris came up with the following
> > diagram to describe the race potential
> > 
> > EB1                                                       EB2
> > btree_write_cache_pages
> >    tag_pages_for_writeback
> >    filemap_get_folios_tag(TOWRITE)
> >    submit_eb_page
> >      submit_eb_subpage
> >        for eb in folio:
> >          write_one_eb
> >                                                            set_extent_buffer_dirty
> >                                                            btrfs_meta_folio_set_dirty
> >                                                            ...
> >                                                            filemap_fdatawrite_range
> >                                                              btree_write_cache_pages
> >                                                              tag_pages_for_writeback
> >            folio_lock
> >            btrfs_meta_folio_clear_dirty
> >            btrfs_meta_folio_set_writeback // clear TOWRITE
> >            folio_unlock
> >                                                              filemap_get_folios_tag(TOWRITE) //missed
> > 
> > The problem here is that we'll call folio_start_writeback() the first
> > time we initiate writeback on one extent buffer, if we happened to dirty
> > the extent buffer behind the one we were currently writing in the first
> > task, and race in as described above, we would miss the TOWRITE tag as
> > it would have been cleared, and we will never initiate writeback on that
> > EB.
> > 
> > This is kind of complicated for us, the best thing to do here is to
> > simply leave the TOWRITE tag in place, and only clear it if we aren't
> > dirty after we finish processing all the eb's in the folio.
> > 
> > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >   fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
> >   fs/btrfs/subpage.c   |  2 +-
> >   2 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 6cfd286b8bbc..5d09a47c1c28 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> >   		}
> >   		free_extent_buffer(eb);
> >   	}
> > +	/*
> > +	 * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> > +	 * subpage we use __folio_start_writeback(folio, true), which keeps it
> > +	 * from clearing TOWRITE.  This is because we walk the bitmap and
> > +	 * process each eb one at a time, and then locking the folio when we
> > +	 * process the eb.  We could have somebody dirty behind us, and then
> > +	 * subsequently mark this range as TOWRITE.  In that case we must not
> > +	 * clear TOWRITE or we will skip writing back the dirty folio.
> > +	 *
> > +	 * So here lock the folio, if it is clean we know we are done with it,
> > +	 * and we can clear TOWRITE.
> > +	 */
> > +	folio_lock(folio);
> > +	if (!folio_test_dirty(folio)) {
> > +		XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> > +		unsigned long flags;
> > +
> > +		xas_lock_irqsave(&xas, flags);
> > +		xas_load(&xas);
> > +		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> > +		xas_unlock_irqrestore(&xas, flags);
> > +	}
> > +	folio_unlock(folio);
> >   	return submitted;
> >   }
> > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > index d4f019233493..53140a9dad9d 100644
> > --- a/fs/btrfs/subpage.c
> > +++ b/fs/btrfs/subpage.c
> > @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
> >   	spin_lock_irqsave(&subpage->lock, flags);
> >   	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
> >   	if (!folio_test_writeback(folio))
> > -		folio_start_writeback(folio);
> > +		__folio_start_writeback(folio, true);
> 
> This looks a little dangerous to me, as for data writeback we rely on
> folio_start_writeback() to clear the TOWRITE tag before this change.

During an extended test of a branch with both fixes, I did in fact see a
deadlock that looked like an extent_write_cache_pages going re-doing
writeback on a page that should have had the mark cleared. I'm not 100%
sure it is this, since it took 60+ runs of the logwrites crash test to hit
the deadlock during the instrumented workload

Anyway, I will take a look at and test your other proposed fix in
Josef's other email.

Thanks,
Boris

> 
> Can we do a test on the dirty bitmap and only call clear TOWRITE tag when
> there is no dirty blocks?
> 
> (This also means, we must clear the folio dirty before start writeback,
> there are some exceptions that needs to be addressed)
> 
> By that, we do not need the manual tag handling in submit_eb_subpage().
> 
> Thanks,
> Qu
> 
> >   	spin_unlock_irqrestore(&subpage->lock, flags);
> >   }
>
Josef Bacik April 15, 2025, 5:22 p.m. UTC | #4
On Tue, Apr 15, 2025 at 07:15:42AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/4/15 05:02, Josef Bacik 写道:
> > While debugging a fs corruption with subpage we noticed a potential race
> > in how we do tagging for writeback.  Boris came up with the following
> > diagram to describe the race potential
> > 
> > EB1                                                       EB2
> > btree_write_cache_pages
> >    tag_pages_for_writeback
> >    filemap_get_folios_tag(TOWRITE)
> >    submit_eb_page
> >      submit_eb_subpage
> >        for eb in folio:
> >          write_one_eb
> >                                                            set_extent_buffer_dirty
> >                                                            btrfs_meta_folio_set_dirty
> >                                                            ...
> >                                                            filemap_fdatawrite_range
> >                                                              btree_write_cache_pages
> >                                                              tag_pages_for_writeback
> >            folio_lock
> >            btrfs_meta_folio_clear_dirty
> >            btrfs_meta_folio_set_writeback // clear TOWRITE
> >            folio_unlock
> >                                                              filemap_get_folios_tag(TOWRITE) //missed
> > 
> > The problem here is that we'll call folio_start_writeback() the first
> > time we initiate writeback on one extent buffer, if we happened to dirty
> > the extent buffer behind the one we were currently writing in the first
> > task, and race in as described above, we would miss the TOWRITE tag as
> > it would have been cleared, and we will never initiate writeback on that
> > EB.
> > 
> > This is kind of complicated for us, the best thing to do here is to
> > simply leave the TOWRITE tag in place, and only clear it if we aren't
> > dirty after we finish processing all the eb's in the folio.
> > 
> > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >   fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
> >   fs/btrfs/subpage.c   |  2 +-
> >   2 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 6cfd286b8bbc..5d09a47c1c28 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> >   		}
> >   		free_extent_buffer(eb);
> >   	}
> > +	/*
> > +	 * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> > +	 * subpage we use __folio_start_writeback(folio, true), which keeps it
> > +	 * from clearing TOWRITE.  This is because we walk the bitmap and
> > +	 * process each eb one at a time, and then locking the folio when we
> > +	 * process the eb.  We could have somebody dirty behind us, and then
> > +	 * subsequently mark this range as TOWRITE.  In that case we must not
> > +	 * clear TOWRITE or we will skip writing back the dirty folio.
> > +	 *
> > +	 * So here lock the folio, if it is clean we know we are done with it,
> > +	 * and we can clear TOWRITE.
> > +	 */
> > +	folio_lock(folio);
> > +	if (!folio_test_dirty(folio)) {
> > +		XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> > +		unsigned long flags;
> > +
> > +		xas_lock_irqsave(&xas, flags);
> > +		xas_load(&xas);
> > +		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> > +		xas_unlock_irqrestore(&xas, flags);
> > +	}
> > +	folio_unlock(folio);
> >   	return submitted;
> >   }
> > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > index d4f019233493..53140a9dad9d 100644
> > --- a/fs/btrfs/subpage.c
> > +++ b/fs/btrfs/subpage.c
> > @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
> >   	spin_lock_irqsave(&subpage->lock, flags);
> >   	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
> >   	if (!folio_test_writeback(folio))
> > -		folio_start_writeback(folio);
> > +		__folio_start_writeback(folio, true);
> 
> This looks a little dangerous to me, as for data writeback we rely on
> folio_start_writeback() to clear the TOWRITE tag before this change.
> 
> Can we do a test on the dirty bitmap and only call clear TOWRITE tag when
> there is no dirty blocks?

This accomplishes the same thing, we only modify the dirty bitmap under the
folio lock, so this is safe.

> 
> (This also means, we must clear the folio dirty before start writeback,
> there are some exceptions that needs to be addressed)
> 

Which is what we're doing, and what we're supposed to do, even not in subpage,
because that's how the tag clearing works for the DIRTY tag, it checks to see if
the folio is dirty and only clears the tag if DIRTY is cleared, and it does this
in folio_start_writeback().

Keep in mind this is a quick patch to fix the bug in a way that we can backport
to all the kernels that are affected, which is basically anything that has
subpage support.  The followup series I'm writing will make this all much more
sane.

Thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6cfd286b8bbc..5d09a47c1c28 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2063,6 +2063,29 @@  static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
 		}
 		free_extent_buffer(eb);
 	}
+	/*
+	 * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
+	 * subpage we use __folio_start_writeback(folio, true), which keeps it
+	 * from clearing TOWRITE.  This is because we walk the bitmap and
+	 * process each eb one at a time, and then locking the folio when we
+	 * process the eb.  We could have somebody dirty behind us, and then
+	 * subsequently mark this range as TOWRITE.  In that case we must not
+	 * clear TOWRITE or we will skip writing back the dirty folio.
+	 *
+	 * So here lock the folio, if it is clean we know we are done with it,
+	 * and we can clear TOWRITE.
+	 */
+	folio_lock(folio);
+	if (!folio_test_dirty(folio)) {
+		XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
+		unsigned long flags;
+
+		xas_lock_irqsave(&xas, flags);
+		xas_load(&xas);
+		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
+		xas_unlock_irqrestore(&xas, flags);
+	}
+	folio_unlock(folio);
 	return submitted;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index d4f019233493..53140a9dad9d 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -454,7 +454,7 @@  void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
 	spin_lock_irqsave(&subpage->lock, flags);
 	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
 	if (!folio_test_writeback(folio))
-		folio_start_writeback(folio);
+		__folio_start_writeback(folio, true);
 	spin_unlock_irqrestore(&subpage->lock, flags);
 }