diff mbox

Btrfs: kill BUG_ON in do_relocation

Message ID 1473870467-18721-1-git-send-email-bo.li.liu@oracle.com (mailing list archive)
State Superseded
Headers show

Commit Message

Liu Bo Sept. 14, 2016, 4:27 p.m. UTC
While updating btree, we try to push items between sibling
nodes/leaves in order to keep height as low as possible.
But we don't memset the original places with zero when
pushing items so that we could end up leaving stale content
in nodes/leaves.  One may read the above stale content by
increasing btree blocks' @nritems.

One case I've come across is that in fs tree, a leaf has two
parent nodes, hence running balance ends up with processing
this leaf with two parent nodes, but it can only reach the
valid parent node through btrfs_search_slot, so it'd be like,

do_relocation
    for P in all parent nodes of block A:
        if !P->eb:
            btrfs_search_slot(key);   --> get path from P to A.
        if lowest:
            BUG_ON(A->bytenr != bytenr of A recorded in P);
        btrfs_cow_block(P, A);   --> change A's bytenr in P.

After btrfs_cow_block, P has the new bytenr of A, but with the
same @key, we get the same path again, and get panic by BUG_ON.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/relocation.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Josef Bacik Sept. 14, 2016, 5:13 p.m. UTC | #1
On 09/14/2016 12:27 PM, Liu Bo wrote:
> While updating btree, we try to push items between sibling
> nodes/leaves in order to keep height as low as possible.
> But we don't memset the original places with zero when
> pushing items so that we could end up leaving stale content
> in nodes/leaves.  One may read the above stale content by
> increasing btree blocks' @nritems.
>

Ok this sounds really bad.  Is this as bad as I think it sounds?  We should 
probably fix this like right now right?

> One case I've come across is that in fs tree, a leaf has two
> parent nodes, hence running balance ends up with processing
> this leaf with two parent nodes, but it can only reach the
> valid parent node through btrfs_search_slot, so it'd be like,
>
> do_relocation
>     for P in all parent nodes of block A:
>         if !P->eb:
>             btrfs_search_slot(key);   --> get path from P to A.
>         if lowest:
>             BUG_ON(A->bytenr != bytenr of A recorded in P);
>         btrfs_cow_block(P, A);   --> change A's bytenr in P.
>
> After btrfs_cow_block, P has the new bytenr of A, but with the
> same @key, we get the same path again, and get panic by BUG_ON.
>

Ok so this happens because of the problem you described above right?  Because 
this shouldn't actually happen.  We should go down the next parent and still get 
to the original block where A->bytenr == node->bytenr.  So I'm all for killing 
this BUG_ON(), but the problem description is wrong.  We need to keep this 
scenario from happening in the first place.  And then we kill this BUG_ON() 
because it can happen if there is FS corruption or something.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Sept. 14, 2016, 5:29 p.m. UTC | #2
On 09/14/2016 01:13 PM, Josef Bacik wrote:
> On 09/14/2016 12:27 PM, Liu Bo wrote:
>> While updating btree, we try to push items between sibling
>> nodes/leaves in order to keep height as low as possible.
>> But we don't memset the original places with zero when
>> pushing items so that we could end up leaving stale content
>> in nodes/leaves.  One may read the above stale content by
>> increasing btree blocks' @nritems.
>>
>
> Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> should probably fix this like right now right?

He's bumping @nritems with a fuzzer I think?  As in this happens when 
someone forces it (or via some other bug) but not in normal operations.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Sept. 14, 2016, 5:31 p.m. UTC | #3
On 09/14/2016 01:29 PM, Chris Mason wrote:
>
>
> On 09/14/2016 01:13 PM, Josef Bacik wrote:
>> On 09/14/2016 12:27 PM, Liu Bo wrote:
>>> While updating btree, we try to push items between sibling
>>> nodes/leaves in order to keep height as low as possible.
>>> But we don't memset the original places with zero when
>>> pushing items so that we could end up leaving stale content
>>> in nodes/leaves.  One may read the above stale content by
>>> increasing btree blocks' @nritems.
>>>
>>
>> Ok this sounds really bad.  Is this as bad as I think it sounds?  We
>> should probably fix this like right now right?
>
> He's bumping @nritems with a fuzzer I think?  As in this happens when someone
> forces it (or via some other bug) but not in normal operations.
>

Oh ok if this happens with a fuzzer than this is fine, but I'd rather do -EIO so 
we know this is something bad with the fs.  And change the changelog to make it 
explicit that this is the result of fs corruption, not normal operation.  Then 
you can add

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Sept. 14, 2016, 6:16 p.m. UTC | #4
On Wed, Sep 14, 2016 at 01:13:34PM -0400, Josef Bacik wrote:
> On 09/14/2016 12:27 PM, Liu Bo wrote:
> > While updating btree, we try to push items between sibling
> > nodes/leaves in order to keep height as low as possible.
> > But we don't memset the original places with zero when
> > pushing items so that we could end up leaving stale content
> > in nodes/leaves.  One may read the above stale content by
> > increasing btree blocks' @nritems.
> > 
> 
> Ok this sounds really bad.  Is this as bad as I think it sounds?  We should
> probably fix this like right now right?

Yeah, I'm preparing two patches to address it.

> 
> > One case I've come across is that in fs tree, a leaf has two
> > parent nodes, hence running balance ends up with processing
> > this leaf with two parent nodes, but it can only reach the
> > valid parent node through btrfs_search_slot, so it'd be like,
> > 
> > do_relocation
> >     for P in all parent nodes of block A:
> >         if !P->eb:
> >             btrfs_search_slot(key);   --> get path from P to A.
> >         if lowest:
> >             BUG_ON(A->bytenr != bytenr of A recorded in P);
> >         btrfs_cow_block(P, A);   --> change A's bytenr in P.
> > 
> > After btrfs_cow_block, P has the new bytenr of A, but with the
> > same @key, we get the same path again, and get panic by BUG_ON.
> > 
> 
> Ok so this happens because of the problem you described above right?
> Because this shouldn't actually happen.  We should go down the next parent
> and still get to the original block where A->bytenr == node->bytenr.

After bumping block 55279616's @nritems from 320 to 492,

fs tree key (FS_TREE ROOT_ITEM 0) 
node 55230464 level 2 items 4 free 489 generation 11 owner 5
fs uuid 03737dfb-8087-4923-b058-2ec629bf39bd
chunk uuid d586e037-9d50-4332-9b3a-fa2344d210e1
        key (256 INODE_ITEM 0) block 55279616 (3374) gen 11 nr 0
        key (257 DIR_INDEX 24879) block 56410112 (3443) gen 11 nr 1
        key (10404 INODE_ITEM 0) block 60391424 (3686) gen 11 nr 2
        key (34068 INODE_ITEM 0) block 55246848 (3372) gen 11 nr 3
node 55279616 level 1 items 492 free 1 generation 11 owner 5
        ...
        key (257 DIR_INDEX 24625) block 44335104 (2706) gen 9 nr 319
        key (2100 INODE_ITEM 0) block 30425088 (1857) gen 7 nr 320
        key (2148 INODE_ITEM 0) block 30441472 (1858) gen 7 nr 321
node 56410112 level 1 items 311 free 182 generation 11 owner 5
fs uuid 03737dfb-8087-4923-b058-2ec629bf39bd
chunk uuid d586e037-9d50-4332-9b3a-fa2344d210e1
        ...
        key (2052 INODE_ITEM 0) block 30408704 (1856) gen 7 nr 137
        key (2100 INODE_ITEM 0) block 30425088 (1857) gen 7 nr 138
        key (2148 INODE_ITEM 0) block 30441472 (1858) gen 7 nr 139
        ...

If we search fs tree with disk key (2100 INODE_ITEM 0), we always get
into node block 56410112, not node block 55279616 after bin_search in
the top level, so leaf block 30408704 never gets both parent.

>  So I'm
> all for killing this BUG_ON(), but the problem description is wrong.  We
> need to keep this scenario from happening in the first place.  And then we
> kill this BUG_ON() because it can happen if there is FS corruption or
> something.  Thanks,

We really should, we can prevent it from happening by checking btree
node's validation although it is not as easy as checking a leaf.

The description is what I got from debugging, just wanna show how we
come to the BUG_ON since it's not straighforward at all.

But yes, I'll update the description that this problem is due to fs
corruption.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Sept. 14, 2016, 6:19 p.m. UTC | #5
On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
> On 09/14/2016 01:29 PM, Chris Mason wrote:
> > 
> > 
> > On 09/14/2016 01:13 PM, Josef Bacik wrote:
> > > On 09/14/2016 12:27 PM, Liu Bo wrote:
> > > > While updating btree, we try to push items between sibling
> > > > nodes/leaves in order to keep height as low as possible.
> > > > But we don't memset the original places with zero when
> > > > pushing items so that we could end up leaving stale content
> > > > in nodes/leaves.  One may read the above stale content by
> > > > increasing btree blocks' @nritems.
> > > > 
> > > 
> > > Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> > > should probably fix this like right now right?
> > 
> > He's bumping @nritems with a fuzzer I think?  As in this happens when someone
> > forces it (or via some other bug) but not in normal operations.
> > 
> 
> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do
> -EIO so we know this is something bad with the fs. 

-EIO may be more appropriate to be given while reading btree blocks and
checking their validation?

> And change the changelog
> to make it explicit that this is the result of fs corruption, not normal
> operation.  Then you can add
> 
> Reviewed-by: Josef Bacik <jbacik@fb.com>

OK, make sense.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Sept. 15, 2016, 6:58 p.m. UTC | #6
On 09/15/2016 03:01 PM, Liu Bo wrote:
> On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote:
>> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
>>> On 09/14/2016 01:29 PM, Chris Mason wrote:
>>>>
>>>>
>>>> On 09/14/2016 01:13 PM, Josef Bacik wrote:
>>>>> On 09/14/2016 12:27 PM, Liu Bo wrote:
>>>>>> While updating btree, we try to push items between sibling
>>>>>> nodes/leaves in order to keep height as low as possible.
>>>>>> But we don't memset the original places with zero when
>>>>>> pushing items so that we could end up leaving stale content
>>>>>> in nodes/leaves.  One may read the above stale content by
>>>>>> increasing btree blocks' @nritems.
>>>>>>
>>>>>
>>>>> Ok this sounds really bad.  Is this as bad as I think it sounds?  We
>>>>> should probably fix this like right now right?
>>>>
>>>> He's bumping @nritems with a fuzzer I think?  As in this happens when someone
>>>> forces it (or via some other bug) but not in normal operations.
>>>>
>>>
>>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do
>>> -EIO so we know this is something bad with the fs.
>>
>> -EIO may be more appropriate to be given while reading btree blocks and
>> checking their validation?
>
> Looks like EIO doesn't fit into this case, either, do we have any errno
> representing 'corrupted filesystem'?

That's EIO.  Sometimes the EIO is big enough we have to abort, but 
really the abort is just adding bonus.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Sept. 15, 2016, 7:01 p.m. UTC | #7
On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote:
> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
> > On 09/14/2016 01:29 PM, Chris Mason wrote:
> > > 
> > > 
> > > On 09/14/2016 01:13 PM, Josef Bacik wrote:
> > > > On 09/14/2016 12:27 PM, Liu Bo wrote:
> > > > > While updating btree, we try to push items between sibling
> > > > > nodes/leaves in order to keep height as low as possible.
> > > > > But we don't memset the original places with zero when
> > > > > pushing items so that we could end up leaving stale content
> > > > > in nodes/leaves.  One may read the above stale content by
> > > > > increasing btree blocks' @nritems.
> > > > > 
> > > > 
> > > > Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> > > > should probably fix this like right now right?
> > > 
> > > He's bumping @nritems with a fuzzer I think?  As in this happens when someone
> > > forces it (or via some other bug) but not in normal operations.
> > > 
> > 
> > Oh ok if this happens with a fuzzer than this is fine, but I'd rather do
> > -EIO so we know this is something bad with the fs. 
> 
> -EIO may be more appropriate to be given while reading btree blocks and
> checking their validation?

Looks like EIO doesn't fit into this case, either, do we have any errno
representing 'corrupted filesystem'?

Thanks,

-liubo

> 
> > And change the changelog
> > to make it explicit that this is the result of fs corruption, not normal
> > operation.  Then you can add
> > 
> > Reviewed-by: Josef Bacik <jbacik@fb.com>
> 
> OK, make sense.
> 
> Thanks,
> 
> -liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Sept. 19, 2016, 6:01 p.m. UTC | #8
On Thu, Sep 15, 2016 at 02:58:12PM -0400, Chris Mason wrote:
> 
> 
> On 09/15/2016 03:01 PM, Liu Bo wrote:
> > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote:
> >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
> >>> On 09/14/2016 01:29 PM, Chris Mason wrote:
> >>>>
> >>>>
> >>>> On 09/14/2016 01:13 PM, Josef Bacik wrote:
> >>>>> On 09/14/2016 12:27 PM, Liu Bo wrote:
> >>>>>> While updating btree, we try to push items between sibling
> >>>>>> nodes/leaves in order to keep height as low as possible.
> >>>>>> But we don't memset the original places with zero when
> >>>>>> pushing items so that we could end up leaving stale content
> >>>>>> in nodes/leaves.  One may read the above stale content by
> >>>>>> increasing btree blocks' @nritems.
> >>>>>>
> >>>>>
> >>>>> Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> >>>>> should probably fix this like right now right?
> >>>>
> >>>> He's bumping @nritems with a fuzzer I think?  As in this happens when someone
> >>>> forces it (or via some other bug) but not in normal operations.
> >>>>
> >>>
> >>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do
> >>> -EIO so we know this is something bad with the fs.
> >>
> >> -EIO may be more appropriate to be given while reading btree blocks and
> >> checking their validation?
> >
> > Looks like EIO doesn't fit into this case, either, do we have any errno
> > representing 'corrupted filesystem'?
> 
> That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> really the abort is just adding bonus.

I think we misuse the EIO where we should really return EFSCORRUPTED
that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
really a message that the hardware is bad.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Sept. 19, 2016, 11:11 p.m. UTC | #9
On Mon, Sep 19, 2016 at 08:01:05PM +0200, David Sterba wrote:
> On Thu, Sep 15, 2016 at 02:58:12PM -0400, Chris Mason wrote:
> > 
> > 
> > On 09/15/2016 03:01 PM, Liu Bo wrote:
> > > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote:
> > >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
> > >>> On 09/14/2016 01:29 PM, Chris Mason wrote:
> > >>>>
> > >>>>
> > >>>> On 09/14/2016 01:13 PM, Josef Bacik wrote:
> > >>>>> On 09/14/2016 12:27 PM, Liu Bo wrote:
> > >>>>>> While updating btree, we try to push items between sibling
> > >>>>>> nodes/leaves in order to keep height as low as possible.
> > >>>>>> But we don't memset the original places with zero when
> > >>>>>> pushing items so that we could end up leaving stale content
> > >>>>>> in nodes/leaves.  One may read the above stale content by
> > >>>>>> increasing btree blocks' @nritems.
> > >>>>>>
> > >>>>>
> > >>>>> Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> > >>>>> should probably fix this like right now right?
> > >>>>
> > >>>> He's bumping @nritems with a fuzzer I think?  As in this happens when someone
> > >>>> forces it (or via some other bug) but not in normal operations.
> > >>>>
> > >>>
> > >>> Oh ok if this happens with a fuzzer than this is fine, but I'd rather do
> > >>> -EIO so we know this is something bad with the fs.
> > >>
> > >> -EIO may be more appropriate to be given while reading btree blocks and
> > >> checking their validation?
> > >
> > > Looks like EIO doesn't fit into this case, either, do we have any errno
> > > representing 'corrupted filesystem'?
> > 
> > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > really the abort is just adding bonus.
> 
> I think we misuse the EIO where we should really return EFSCORRUPTED
> that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> really a message that the hardware is bad.

I love this idea, but one quick question, when returning EUCLEAN, what
message do users get? 

"#define EUCLEAN         117     /* Structure needs cleaning */"

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Sept. 20, 2016, 8:03 a.m. UTC | #10
On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > > really the abort is just adding bonus.
> > 
> > I think we misuse the EIO where we should really return EFSCORRUPTED
> > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > really a message that the hardware is bad.
> 
> I love this idea, but one quick question, when returning EUCLEAN, what
> message do users get? 
> 
> "#define EUCLEAN         117     /* Structure needs cleaning */"

strerror(EUCLEAN) -> "Structure needs cleaning"
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Sept. 20, 2016, 5:59 p.m. UTC | #11
On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote:
> On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > > > really the abort is just adding bonus.
> > > 
> > > I think we misuse the EIO where we should really return EFSCORRUPTED
> > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > > really a message that the hardware is bad.
> > 
> > I love this idea, but one quick question, when returning EUCLEAN, what
> > message do users get? 
> > 
> > "#define EUCLEAN         117     /* Structure needs cleaning */"
> 
> strerror(EUCLEAN) -> "Structure needs cleaning"

Hmm, if I was the user, I'm not sure how to deal with "Structure needs cleaning", so still need to take a glance at dmesg log.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Sept. 21, 2016, 8:14 a.m. UTC | #12
On Tue, Sep 20, 2016 at 10:59:59AM -0700, Liu Bo wrote:
> On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote:
> > On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > > > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > > > > really the abort is just adding bonus.
> > > > 
> > > > I think we misuse the EIO where we should really return EFSCORRUPTED
> > > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > > > really a message that the hardware is bad.
> > > 
> > > I love this idea, but one quick question, when returning EUCLEAN, what
> > > message do users get? 
> > > 
> > > "#define EUCLEAN         117     /* Structure needs cleaning */"
> > 
> > strerror(EUCLEAN) -> "Structure needs cleaning"
> 
> Hmm, if I was the user, I'm not sure how to deal with "Structure needs
> cleaning", so still need to take a glance at dmesg log.

I understand that, it's not descriptive at all. We could remap it to EIO
once it goes outside of btrfs module, so it would be only internal
error.  The commit that introduces it to xfs is 10 years old,
da2f4d679c8070ba5b6a920281e495917b293aa0 and mentions something like
that.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 680e234..bce8fea 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2723,7 +2723,14 @@  static int do_relocation(struct btrfs_trans_handle *trans,
 
 		bytenr = btrfs_node_blockptr(upper->eb, slot);
 		if (lowest) {
-			BUG_ON(bytenr != node->bytenr);
+			if (bytenr != node->bytenr) {
+				btrfs_err(root->fs_info,
+		"lowest leaf/node mismatch: bytenr %llu node->bytenr %llu slot %d upper %llu",
+					  bytenr, node->bytenr, slot,
+					  upper->eb->start);
+				err = -EEXIST;
+				goto next;
+			}
 		} else {
 			if (node->eb->start == bytenr)
 				goto next;