mbox series

[v2,0/3] Metadata IO error fixes

Message ID cover.1637781110.git.josef@toxicpanda.com (mailing list archive)
Headers show
Series Metadata IO error fixes | expand

Message

Josef Bacik Nov. 24, 2021, 7:14 p.m. UTC
v1->v2:
- I was debugging generic/484 separately because I thought it was data related,
  but it turned out to be metadata related as well, so I've added the patch
  "btrfs: call mapping_set_error() on btree inode with a write error" to the
  series.

--- Original email ---

Hello,

I saw a dmesg failure with generic/281 on our overnight runs.  This turned out
to be because we weren't getting an error back from btrfs_search_slot() even
though we found a metadata block that shouldn't have been uptodate.

The root cause is that write errors on the page clear uptodate on the page, but
not on the extent buffer itself.  Since we rely on that bit to tell wether the
extent buffer is valid or not we don't notice that the eb is bogus when we find
it in cache in a subsequent write, and eventually trip over
assert_eb_page_uptodate() warnings.

This fixes the problem I was seeing, I could easily reproduce by running
generic/281 in a loop a few times.  With these pages I haven't reproduced in 20
loops.  Thanks,

Josef

Josef Bacik (3):
  btrfs: clear extent buffer uptodate when we fail to write it
  btrfs: check the root node for uptodate before returning it
  btrfs: call mapping_set_error() on btree inode with a write error

 fs/btrfs/ctree.c     | 19 +++++++++++++++----
 fs/btrfs/extent_io.c | 14 ++++++++++++++
 2 files changed, 29 insertions(+), 4 deletions(-)

Comments

Nikolay Borisov Nov. 25, 2021, 9:12 a.m. UTC | #1
On 24.11.21 г. 21:14, Josef Bacik wrote:
> v1->v2:
> - I was debugging generic/484 separately because I thought it was data related,
>   but it turned out to be metadata related as well, so I've added the patch
>   "btrfs: call mapping_set_error() on btree inode with a write error" to the
>   series.
> 
> --- Original email ---
> 
> Hello,
> 
> I saw a dmesg failure with generic/281 on our overnight runs.  This turned out
> to be because we weren't getting an error back from btrfs_search_slot() even
> though we found a metadata block that shouldn't have been uptodate.
> 
> The root cause is that write errors on the page clear uptodate on the page, but
> not on the extent buffer itself.  Since we rely on that bit to tell wether the
> extent buffer is valid or not we don't notice that the eb is bogus when we find
> it in cache in a subsequent write, and eventually trip over
> assert_eb_page_uptodate() warnings.
> 
> This fixes the problem I was seeing, I could easily reproduce by running
> generic/281 in a loop a few times.  With these pages I haven't reproduced in 20
> loops.  Thanks,
> 
> Josef
> 
> Josef Bacik (3):
>   btrfs: clear extent buffer uptodate when we fail to write it
>   btrfs: check the root node for uptodate before returning it
>   btrfs: call mapping_set_error() on btree inode with a write error
> 
>  fs/btrfs/ctree.c     | 19 +++++++++++++++----
>  fs/btrfs/extent_io.c | 14 ++++++++++++++
>  2 files changed, 29 insertions(+), 4 deletions(-)
> 


This is stable material as well, right?
David Sterba Nov. 29, 2021, 4:56 p.m. UTC | #2
On Thu, Nov 25, 2021 at 11:12:53AM +0200, Nikolay Borisov wrote:
> 
> 
> On 24.11.21 г. 21:14, Josef Bacik wrote:
> > v1->v2:
> > - I was debugging generic/484 separately because I thought it was data related,
> >   but it turned out to be metadata related as well, so I've added the patch
> >   "btrfs: call mapping_set_error() on btree inode with a write error" to the
> >   series.
> > 
> > --- Original email ---
> > 
> > Hello,
> > 
> > I saw a dmesg failure with generic/281 on our overnight runs.  This turned out
> > to be because we weren't getting an error back from btrfs_search_slot() even
> > though we found a metadata block that shouldn't have been uptodate.
> > 
> > The root cause is that write errors on the page clear uptodate on the page, but
> > not on the extent buffer itself.  Since we rely on that bit to tell wether the
> > extent buffer is valid or not we don't notice that the eb is bogus when we find
> > it in cache in a subsequent write, and eventually trip over
> > assert_eb_page_uptodate() warnings.
> > 
> > This fixes the problem I was seeing, I could easily reproduce by running
> > generic/281 in a loop a few times.  With these pages I haven't reproduced in 20
> > loops.  Thanks,
> > 
> > Josef
> > 
> > Josef Bacik (3):
> >   btrfs: clear extent buffer uptodate when we fail to write it
> >   btrfs: check the root node for uptodate before returning it
> >   btrfs: call mapping_set_error() on btree inode with a write error
> > 
> >  fs/btrfs/ctree.c     | 19 +++++++++++++++----
> >  fs/btrfs/extent_io.c | 14 ++++++++++++++
> >  2 files changed, 29 insertions(+), 4 deletions(-)
> > 
> 
> 
> This is stable material as well, right?

Yes, tags added.
David Sterba Nov. 29, 2021, 4:56 p.m. UTC | #3
On Wed, Nov 24, 2021 at 02:14:22PM -0500, Josef Bacik wrote:
> v1->v2:
> - I was debugging generic/484 separately because I thought it was data related,
>   but it turned out to be metadata related as well, so I've added the patch
>   "btrfs: call mapping_set_error() on btree inode with a write error" to the
>   series.
> 
> --- Original email ---
> 
> Hello,
> 
> I saw a dmesg failure with generic/281 on our overnight runs.  This turned out
> to be because we weren't getting an error back from btrfs_search_slot() even
> though we found a metadata block that shouldn't have been uptodate.
> 
> The root cause is that write errors on the page clear uptodate on the page, but
> not on the extent buffer itself.  Since we rely on that bit to tell wether the
> extent buffer is valid or not we don't notice that the eb is bogus when we find
> it in cache in a subsequent write, and eventually trip over
> assert_eb_page_uptodate() warnings.
> 
> This fixes the problem I was seeing, I could easily reproduce by running
> generic/281 in a loop a few times.  With these pages I haven't reproduced in 20
> loops.  Thanks,
> 
> Josef
> 
> Josef Bacik (3):
>   btrfs: clear extent buffer uptodate when we fail to write it
>   btrfs: check the root node for uptodate before returning it
>   btrfs: call mapping_set_error() on btree inode with a write error

Added to misc-next, thanks.