diff mbox

Btrfs: fix the corruption by reading stale btree blocks

Message ID 1526405857-88702-1-git-send-email-bo.liu@linux.alibaba.com (mailing list archive)
State New, archived
Headers show

Commit Message

Liu Bo May 15, 2018, 5:37 p.m. UTC
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finish reading child.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Filipe Manana May 15, 2018, 5:45 p.m. UTC | #1
On Tue, May 15, 2018 at 6:37 PM, Liu Bo <bo.liu@linux.alibaba.com> wrote:
> If a btree block, aka. extent buffer, is not available in the extent
> buffer cache, it'll be read out from the disk instead, i.e.
>
> btrfs_search_slot()
>   read_block_for_search()  # hold parent and its lock, go to read child
>     btrfs_release_path()
>     read_tree_block()  # read child
>
> Unfortunately, the parent lock got released before reading child, so
> commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> used 0 as parent transid to read the child block.  It forces
> read_tree_block() not to check if parent transid is different with the
> generation id of the child that it reads out from disk.
>
> A simple PoC is included in btrfs/124,
>
> 0. A two-disk raid1 btrfs,
>
> 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
>
> 2. Mount this filesystem and put it in use, after a while, device tree's
>    root got COW but block A hasn't been allocated/overwritten yet.
>
> 3. Umount it and reload the btrfs module to remove both disks from the
>    global @fs_devices list.
>
> 4. mount -odegraded dev1 and write some data, so now block A is allocated
>    to be a leaf in checksum tree.  Note that only dev1 has the latest
>    metadata of this filesystem.
>
> 5. Umount it and mount it again normally (with both disks), since raid1
>    can pick up one disk by the writer task's pid, if btrfs_search_slot()
>    needs to read block A, dev2 which does NOT have the latest metadata
>    might be read for block A, then we got a stale block A.
>
> 6. As parent transid is not checked, block A is marked as uptodate and
>    put into the extent buffer cache, so the future search won't bother
>    to read disk again, which means it'll make changes on this stale
>    one and make it dirty and flush it onto disk.
>
> To avoid the problem, parent transid needs to be passed to
> read_tree_block().
>
> In order to get a valid parent transid, we need to hold the parent's
> lock until finish reading child.
>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, great finding and explanation.

> ---
>  fs/btrfs/ctree.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 3fd44835b386..b3f6f300e492 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2436,10 +2436,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>         if (p->reada != READA_NONE)
>                 reada_for_search(fs_info, p, level, slot, key->objectid);
>
> -       btrfs_release_path(p);
> -
>         ret = -EAGAIN;
> -       tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
> +       tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
>                               &first_key);
>         if (!IS_ERR(tmp)) {
>                 /*
> @@ -2454,6 +2452,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>         } else {
>                 ret = PTR_ERR(tmp);
>         }
> +
> +       btrfs_release_path(p);
>         return ret;
>  }
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo May 16, 2018, 1:21 a.m. UTC | #2
On 2018年05月16日 01:37, Liu Bo wrote:
> If a btree block, aka. extent buffer, is not available in the extent
> buffer cache, it'll be read out from the disk instead, i.e.
> 
> btrfs_search_slot()
>   read_block_for_search()  # hold parent and its lock, go to read child
>     btrfs_release_path()
>     read_tree_block()  # read child
> 
> Unfortunately, the parent lock got released before reading child, so
> commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> used 0 as parent transid to read the child block.  It forces
> read_tree_block() not to check if parent transid is different with the
> generation id of the child that it reads out from disk.
> 
> A simple PoC is included in btrfs/124,
> 
> 0. A two-disk raid1 btrfs,
> 
> 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> 
> 2. Mount this filesystem and put it in use, after a while, device tree's
>    root got COW but block A hasn't been allocated/overwritten yet.
> 
> 3. Umount it and reload the btrfs module to remove both disks from the
>    global @fs_devices list.
> 
> 4. mount -odegraded dev1 and write some data, so now block A is allocated
>    to be a leaf in checksum tree.  Note that only dev1 has the latest
>    metadata of this filesystem.
> 
> 5. Umount it and mount it again normally (with both disks), since raid1
>    can pick up one disk by the writer task's pid, if btrfs_search_slot()
>    needs to read block A, dev2 which does NOT have the latest metadata
>    might be read for block A, then we got a stale block A.
> 
> 6. As parent transid is not checked, block A is marked as uptodate and
>    put into the extent buffer cache, so the future search won't bother
>    to read disk again, which means it'll make changes on this stale
>    one and make it dirty and flush it onto disk.
> 
> To avoid the problem, parent transid needs to be passed to
> read_tree_block().
> 
> In order to get a valid parent transid, we need to hold the parent's
> lock until finish reading child.

Thanks for the detailed explanation.

It explains the first_key check error reported, thanks a lot!

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/btrfs/ctree.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 3fd44835b386..b3f6f300e492 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2436,10 +2436,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>  	if (p->reada != READA_NONE)
>  		reada_for_search(fs_info, p, level, slot, key->objectid);
>  
> -	btrfs_release_path(p);
> -
>  	ret = -EAGAIN;
> -	tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
> +	tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
>  			      &first_key);
>  	if (!IS_ERR(tmp)) {
>  		/*
> @@ -2454,6 +2452,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>  	} else {
>  		ret = PTR_ERR(tmp);
>  	}
> +
> +	btrfs_release_path(p);
>  	return ret;
>  }
>  
>
Nikolay Borisov May 16, 2018, 8 a.m. UTC | #3
On 15.05.2018 20:37, Liu Bo wrote:
> If a btree block, aka. extent buffer, is not available in the extent
> buffer cache, it'll be read out from the disk instead, i.e.
> 
> btrfs_search_slot()
>   read_block_for_search()  # hold parent and its lock, go to read child
>     btrfs_release_path()
>     read_tree_block()  # read child
> 
> Unfortunately, the parent lock got released before reading child, so
> commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> used 0 as parent transid to read the child block.  It forces
> read_tree_block() not to check if parent transid is different with the
> generation id of the child that it reads out from disk.
> 
> A simple PoC is included in btrfs/124,
> 
> 0. A two-disk raid1 btrfs,
> 
> 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> 
> 2. Mount this filesystem and put it in use, after a while, device tree's
>    root got COW but block A hasn't been allocated/overwritten yet.
> 
> 3. Umount it and reload the btrfs module to remove both disks from the
>    global @fs_devices list.
> 
> 4. mount -odegraded dev1 and write some data, so now block A is allocated
>    to be a leaf in checksum tree.  Note that only dev1 has the latest
>    metadata of this filesystem.
> 
> 5. Umount it and mount it again normally (with both disks), since raid1
>    can pick up one disk by the writer task's pid, if btrfs_search_slot()
>    needs to read block A, dev2 which does NOT have the latest metadata
>    might be read for block A, then we got a stale block A.
> 
> 6. As parent transid is not checked, block A is marked as uptodate and
>    put into the extent buffer cache, so the future search won't bother
>    to read disk again, which means it'll make changes on this stale
>    one and make it dirty and flush it onto disk.
> 
> To avoid the problem, parent transid needs to be passed to
> read_tree_block().
> 
> In order to get a valid parent transid, we need to hold the parent's
> lock until finish reading child.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

Doesn't this warrant a stable tag and
Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Anand Jain May 16, 2018, 10:28 a.m. UTC | #4
On 05/16/2018 01:37 AM, Liu Bo wrote:
> If a btree block, aka. extent buffer, is not available in the extent
> buffer cache, it'll be read out from the disk instead, i.e.
> 
> btrfs_search_slot()
>    read_block_for_search()  # hold parent and its lock, go to read child
>      btrfs_release_path()
>      read_tree_block()  # read child
> 
> Unfortunately, the parent lock got released before reading child, so
> commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> used 0 as parent transid to read the child block.  It forces
> read_tree_block() not to check if parent transid is different with the
> generation id of the child that it reads out from disk.
> 
> A simple PoC is included in btrfs/124,
> 
> 0. A two-disk raid1 btrfs,
> 
> 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> 
> 2. Mount this filesystem and put it in use, after a while, device tree's
>     root got COW but block A hasn't been allocated/overwritten yet.
> 
> 3. Umount it and reload the btrfs module to remove both disks from the
>     global @fs_devices list.
> 
> 4. mount -odegraded dev1 and write some data, so now block A is allocated
>     to be a leaf in checksum tree.  Note that only dev1 has the latest
>     metadata of this filesystem.
> 
> 5. Umount it and mount it again normally (with both disks), since raid1
>     can pick up one disk by the writer task's pid, if btrfs_search_slot()
>     needs to read block A, dev2 which does NOT have the latest metadata
>     might be read for block A, then we got a stale block A.
> 
> 6. As parent transid is not checked, block A is marked as uptodate and
>     put into the extent buffer cache, so the future search won't bother
>     to read disk again, which means it'll make changes on this stale
>     one and make it dirty and flush it onto disk.
> 
> To avoid the problem, parent transid needs to be passed to
> read_tree_block().
> 
> In order to get a valid parent transid, we need to hold the parent's
> lock until finish reading child.

  Thanks for the patch. I was trying to fix this as well.  I have
  a test case for this btrfs/161 which I just posted to the mailing
  list. And I find it still fails.

Thanks, Anand

> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>   fs/btrfs/ctree.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 3fd44835b386..b3f6f300e492 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2436,10 +2436,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>   	if (p->reada != READA_NONE)
>   		reada_for_search(fs_info, p, level, slot, key->objectid);
>   
> -	btrfs_release_path(p);
> -
>   	ret = -EAGAIN;
> -	tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
> +	tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
>   			      &first_key);
>   	if (!IS_ERR(tmp)) {
>   		/*
> @@ -2454,6 +2452,8 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>   	} else {
>   		ret = PTR_ERR(tmp);
>   	}
> +
> +	btrfs_release_path(p);
>   	return ret;
>   }
>   
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 16, 2018, 2:12 p.m. UTC | #5
On Wed, May 16, 2018 at 11:00:14AM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.05.2018 20:37, Liu Bo wrote:
> > If a btree block, aka. extent buffer, is not available in the extent
> > buffer cache, it'll be read out from the disk instead, i.e.
> > 
> > btrfs_search_slot()
> >   read_block_for_search()  # hold parent and its lock, go to read child
> >     btrfs_release_path()
> >     read_tree_block()  # read child
> > 
> > Unfortunately, the parent lock got released before reading child, so
> > commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> > used 0 as parent transid to read the child block.  It forces
> > read_tree_block() not to check if parent transid is different with the
> > generation id of the child that it reads out from disk.
> > 
> > A simple PoC is included in btrfs/124,
> > 
> > 0. A two-disk raid1 btrfs,
> > 
> > 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> > 
> > 2. Mount this filesystem and put it in use, after a while, device tree's
> >    root got COW but block A hasn't been allocated/overwritten yet.
> > 
> > 3. Umount it and reload the btrfs module to remove both disks from the
> >    global @fs_devices list.
> > 
> > 4. mount -odegraded dev1 and write some data, so now block A is allocated
> >    to be a leaf in checksum tree.  Note that only dev1 has the latest
> >    metadata of this filesystem.
> > 
> > 5. Umount it and mount it again normally (with both disks), since raid1
> >    can pick up one disk by the writer task's pid, if btrfs_search_slot()
> >    needs to read block A, dev2 which does NOT have the latest metadata
> >    might be read for block A, then we got a stale block A.
> > 
> > 6. As parent transid is not checked, block A is marked as uptodate and
> >    put into the extent buffer cache, so the future search won't bother
> >    to read disk again, which means it'll make changes on this stale
> >    one and make it dirty and flush it onto disk.
> > 
> > To avoid the problem, parent transid needs to be passed to
> > read_tree_block().
> > 
> > In order to get a valid parent transid, we need to hold the parent's
> > lock until finish reading child.
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> 
> Doesn't this warrant a stable tag and
> Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")

The patch will not apply to < 4.16 as it depends on the addition of
&first_key check from 581c1760415c4, but yes we need it for stable.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo May 18, 2018, 1:15 p.m. UTC | #6
On Wed, May 16, 2018 at 04:12:21PM +0200, David Sterba wrote:
> On Wed, May 16, 2018 at 11:00:14AM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 15.05.2018 20:37, Liu Bo wrote:
> > > If a btree block, aka. extent buffer, is not available in the extent
> > > buffer cache, it'll be read out from the disk instead, i.e.
> > > 
> > > btrfs_search_slot()
> > >   read_block_for_search()  # hold parent and its lock, go to read child
> > >     btrfs_release_path()
> > >     read_tree_block()  # read child
> > > 
> > > Unfortunately, the parent lock got released before reading child, so
> > > commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
> > > used 0 as parent transid to read the child block.  It forces
> > > read_tree_block() not to check if parent transid is different with the
> > > generation id of the child that it reads out from disk.
> > > 
> > > A simple PoC is included in btrfs/124,
> > > 
> > > 0. A two-disk raid1 btrfs,
> > > 
> > > 1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
> > > 
> > > 2. Mount this filesystem and put it in use, after a while, device tree's
> > >    root got COW but block A hasn't been allocated/overwritten yet.
> > > 
> > > 3. Umount it and reload the btrfs module to remove both disks from the
> > >    global @fs_devices list.
> > > 
> > > 4. mount -odegraded dev1 and write some data, so now block A is allocated
> > >    to be a leaf in checksum tree.  Note that only dev1 has the latest
> > >    metadata of this filesystem.
> > > 
> > > 5. Umount it and mount it again normally (with both disks), since raid1
> > >    can pick up one disk by the writer task's pid, if btrfs_search_slot()
> > >    needs to read block A, dev2 which does NOT have the latest metadata
> > >    might be read for block A, then we got a stale block A.
> > > 
> > > 6. As parent transid is not checked, block A is marked as uptodate and
> > >    put into the extent buffer cache, so the future search won't bother
> > >    to read disk again, which means it'll make changes on this stale
> > >    one and make it dirty and flush it onto disk.
> > > 
> > > To avoid the problem, parent transid needs to be passed to
> > > read_tree_block().
> > > 
> > > In order to get a valid parent transid, we need to hold the parent's
> > > lock until finish reading child.
> > > 
> > > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > 
> > Doesn't this warrant a stable tag and
> > Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
> 
> The patch will not apply to < 4.16 as it depends on the addition of
> &first_key check from 581c1760415c4, but yes we need it for stable.

True, we can easily fix the conflicts by removing @first_key, can't
we?

The problem does exist before it.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba May 18, 2018, 2:06 p.m. UTC | #7
On Fri, May 18, 2018 at 09:15:16PM +0800, Liu Bo wrote:
> > > Doesn't this warrant a stable tag and
> > > Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
> > 
> > The patch will not apply to < 4.16 as it depends on the addition of
> > &first_key check from 581c1760415c4, but yes we need it for stable.
> 
> True, we can easily fix the conflicts by removing @first_key, can't
> we?

Yes the backport is easy but still needs to be done manually and we
can't and don't expect the stable maintainers to resolve the conflicts
but rather send a adapter version ourselves. Nikolay promised to do
that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nikolay Borisov May 18, 2018, 2:12 p.m. UTC | #8
On 18.05.2018 17:06, David Sterba wrote:
> On Fri, May 18, 2018 at 09:15:16PM +0800, Liu Bo wrote:
>>>> Doesn't this warrant a stable tag and
>>>> Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
>>>
>>> The patch will not apply to < 4.16 as it depends on the addition of
>>> &first_key check from 581c1760415c4, but yes we need it for stable.
>>
>> True, we can easily fix the conflicts by removing @first_key, can't
>> we?
> 
> Yes the backport is easy but still needs to be done manually and we
> can't and don't expect the stable maintainers to resolve the conflicts
> but rather send a adapter version ourselves. Nikolay promised to do
> that.

Once the patch gets sent to Linus I will backport it for 4.4/4.9 and 4.14

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 3fd44835b386..b3f6f300e492 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2436,10 +2436,8 @@  noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 	if (p->reada != READA_NONE)
 		reada_for_search(fs_info, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(fs_info, blocknr, 0, parent_level - 1,
+	tmp = read_tree_block(fs_info, blocknr, gen, parent_level - 1,
 			      &first_key);
 	if (!IS_ERR(tmp)) {
 		/*
@@ -2454,6 +2452,8 @@  noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 	} else {
 		ret = PTR_ERR(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }