[v2,3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen

Message ID 20210125194210.24071-4-roman.anasal@bdsu.de
State New, archived
Series btrfs: send: correctly recreate changed inodes

Commit Message

Roman Anasal | BDSU Jan. 25, 2021, 7:42 p.m. UTC
This is analogous to the preceding patch ("btrfs: send: fix invalid
commands for inodes with changed type but same gen") but for changed
rdev:

When doing an incremental send, if a new inode has the same number as an
inode in the parent subvolume and was created in the same generation but
has a different rdev, it is not detected as changed and thus not
recreated. This leads to incorrect results on the receiver: the inode
keeps the rdev of the inode in the parent subvolume, or the receive even
fails when the ref is also unchanged.

This case does not happen when doing incremental sends with snapshots
that are kept read-only by the user all the time, but it may happen if
- a snapshot was modified in the same transaction as its parent after it
  was created
- the subvol used as parent was created independently from the sent subvol

Example reproducers:

  # case 1: same ino at same path
  btrfs subvolume create subvol1
  btrfs subvolume create subvol2
  mknod subvol1/a c 1 3
  mknod subvol2/a c 1 5
  btrfs property set subvol1 ro true
  btrfs property set subvol2 ro true
  btrfs send -p subvol1 subvol2 | btrfs receive --dump

The produced tree state here is:
  |-- subvol1
  |   `-- a         (ino 257, c 1,3)
  |
  `-- subvol2
      `-- a         (ino 257, c 1,5)

Where subvol1/a and subvol2/a are character devices with differing minor
numbers but same inode number and same generation.

Example output of the receive command:
  At subvol subvol2
  snapshot        ./subvol2                       uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9
  utimes          ./subvol2/                      atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
  link            ./subvol2/a                     dest=a
  unlink          ./subvol2/a
  utimes          ./subvol2/                      atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
  utimes          ./subvol2/a                     atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000

=> the `link` command causes the receiver to fail with:
   ERROR: link a -> a failed: File exists
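
The failure can be reproduced outside of btrfs, assuming the receiver applies the `link` command with a plain link(2) call (an assumption here, not something this patch states). The sketch below uses a hypothetical helper, `link_onto_existing_name()`, that creates a scratch file and tries to hard-link it onto its own, already existing, name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create a scratch file "a", then try to hard-link it onto its own name,
 * which is what "link ./subvol2/a dest=a" effectively asks for.
 * Returns the errno left by link(2), or 0 if the call unexpectedly succeeded.
 */
static int link_onto_existing_name(void)
{
	char dir[] = "/tmp/send-link-XXXXXX";
	char path[64];
	int fd, err = 0;

	if (!mkdtemp(dir))
		return errno;
	snprintf(path, sizeof(path), "%s/a", dir);

	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
	assert(fd >= 0);
	close(fd);

	/* The new name already exists, so link(2) refuses to replace it. */
	if (link(path, path) < 0)
		err = errno;

	/* Clean up the scratch file and directory. */
	unlink(path);
	rmdir(dir);
	return err;
}
```

On Linux the helper returns `EEXIST`, whose message text is "File exists", matching the error printed by the receiver.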

Second example:
  # case 2: same ino at different path
  btrfs subvolume create subvol1
  btrfs subvolume create subvol2
  mknod subvol1/a c 1 3
  mknod subvol2/b c 1 5
  btrfs property set subvol1 ro true
  btrfs property set subvol2 ro true
  btrfs send -p subvol1 subvol2 | btrfs receive --dump

The produced tree state here is:
  |-- subvol1
  |   `-- a         (ino 257, c 1,3)
  |
  `-- subvol2
      `-- b         (ino 257, c 1,5)

Where subvol1/a and subvol2/b are character devices with differing minor
numbers but same inode number and same generation.

Example output of the receive command:
  At subvol subvol2
  snapshot        ./subvol2                       uuid=1c175819-8b97-0046-a20e-5f95e37cbd40 transid=13 parent_uuid=bad4a908-21b4-6f40-9a08-6b0768346725 parent_transid=13
  utimes          ./subvol2/                      atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000
  link            ./subvol2/b                     dest=a
  unlink          ./subvol2/a
  utimes          ./subvol2/                      atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000
  utimes          ./subvol2/b                     atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000

=> subvol1/a is renamed to subvol2/b instead of being recreated with the
   updated rdev, which results in the received subvol2/b having the wrong
   minor number:

  257 crw-r--r--. 1 root root 1, 3 Jan 25 17:18 subvol2/b
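
To make the "1, 3" versus "1, 5" difference concrete, the sketch below builds both device numbers with glibc's `makedev()` and compares them. The helpers `node_rdev()` and `rdev_changed()` are hypothetical. This uses the userspace encoding of `dev_t`; the kernel and the on-disk inode item use their own encoding, so treat it as an illustration of the comparison only.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/sysmacros.h>	/* makedev(), major(), minor() on glibc */
#include <sys/types.h>		/* dev_t */

/* Build the device number of a node created with "mknod <path> c maj min". */
static dev_t node_rdev(unsigned int maj, unsigned int min)
{
	dev_t dev = makedev(maj, min);

	/* The encoding must round-trip, otherwise the comparison below is moot. */
	assert(major(dev) == maj && minor(dev) == min);
	return dev;
}

/* True if two device nodes refer to different devices. */
static bool rdev_changed(dev_t parent, dev_t sent)
{
	return parent != sent;
}
```

With `node_rdev(1, 3)` for subvol1/a and `node_rdev(1, 5)` for subvol2/b, `rdev_changed()` returns true, which is the condition the patch adds to the recreate check.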

Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
---
v2:
  - add this patch to also handle changed rdev
---
 fs/btrfs/send.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

Comments

Filipe Manana Jan. 25, 2021, 8:51 p.m. UTC | #1
On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <roman.anasal@bdsu.de> wrote:
>
> This is analogous to the preceding patch ("btrfs: send: fix invalid
> commands for inodes with changed type but same gen") but for changed
> rdev:
>
> When doing an incremental send, if a new inode has the same number as an
> inode in the parent subvolume, was created with the same generation but
> has differing rdev it will not be detected as changed and thus not
> recreated. This will lead to incorrect results on the receiver where the
> inode will keep the rdev of the inode in the parent subvolume or even
> fail when also the ref is unchanged.
>
> This case does not happen when doing incremental sends with snapshots
> that are kept read-only by the user all the time, but it may happen if
> - a snapshot was modified in the same transaction as its parent after it
>   was created
> - the subvol used as parent was created independently from the sent subvol
>
> Example reproducers:
>
>   # case 1: same ino at same path
>   btrfs subvolume create subvol1
>   btrfs subvolume create subvol2
>   mknod subvol1/a c 1 3
>   mknod subvol2/a c 1 5
>   btrfs property set subvol1 ro true
>   btrfs property set subvol2 ro true
>   btrfs send -p subvol1 subvol2 | btrfs receive --dump
>
> The produced tree state here is:
>   |-- subvol1
>   |   `-- a         (ino 257, c 1,3)
>   |
>   `-- subvol2
>       `-- a         (ino 257, c 1,5)
>
> Where subvol1/a and subvol2/a are character devices with differing minor
> numbers but same inode number and same generation.
>
> Example output of the receive command:
>   At subvol subvol2
>   snapshot        ./subvol2                       uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9
>   utimes          ./subvol2/                      atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
>   link            ./subvol2/a                     dest=a
>   unlink          ./subvol2/a
>   utimes          ./subvol2/                      atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
>   utimes          ./subvol2/a                     atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
>
> => the `link` command causes the receiver to fail with:
>    ERROR: link a -> a failed: File exists
>
> Second example:
>   # case 2: same ino at different path
>   btrfs subvolume create subvol1
>   btrfs subvolume create subvol2
>   mknod subvol1/a c 1 3
>   mknod subvol2/b c 1 5
>   btrfs property set subvol1 ro true
>   btrfs property set subvol2 ro true
>   btrfs send -p subvol1 subvol2 | btrfs receive --dump

As I've told you before for the v1 patchset from a week or two ago,
this is not a supported scenario for incremental sends.
Incremental sends are meant to be used on RO snapshots of the same
subvolume, and those snapshots must never be changed after they were
created.

Incremental sends were simply not designed for these cases, and can
never be guaranteed to work with such cases.

The bug is not having incremental sends fail right away, with an
explicit error message, when the send and parent roots aren't RO
snapshots of the same subvolume.
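
A rough userspace sketch of such an up-front check is shown below. Everything in it is an assumption made for illustration: `struct subvol_info` is not a real btrfs structure, and "both roots are read-only and carry the same non-zero parent UUID" is only one possible reading of "RO snapshots of the same subvolume". It also cannot tell whether a snapshot was modified after it was created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16

/* Hypothetical summary of a subvolume root; not a real btrfs structure. */
struct subvol_info {
	uint8_t uuid[UUID_SIZE];
	uint8_t parent_uuid[UUID_SIZE];	/* all zero if not created as a snapshot */
	bool read_only;
};

static bool uuid_is_zero(const uint8_t *uuid)
{
	static const uint8_t zero[UUID_SIZE];

	return memcmp(uuid, zero, UUID_SIZE) == 0;
}

/*
 * Assumed rule: both roots must be read-only and must be snapshots of the
 * same source subvolume, i.e. share the same non-zero parent UUID.
 */
static bool valid_incremental_pair(const struct subvol_info *parent,
				   const struct subvol_info *send)
{
	assert(parent && send);

	if (!parent->read_only || !send->read_only)
		return false;
	if (uuid_is_zero(parent->parent_uuid) || uuid_is_zero(send->parent_uuid))
		return false;
	return memcmp(parent->parent_uuid, send->parent_uuid, UUID_SIZE) == 0;
}
```

Under this assumed rule, the two reproducers would be rejected up front, because subvolumes made with `btrfs subvolume create` are not snapshots of a common source.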

>
> The produced tree state here is:
>   |-- subvol1
>   |   `-- a         (ino 257, c 1,3)
>   |
>   `-- subvol2
>       `-- b         (ino 257, c 1,5)
>
> Where subvol1/a and subvol2/b are character devices with differing minor
> numbers but same inode number and same generation.
>
> Example output of the receive command:
>   At subvol subvol2
>   snapshot        ./subvol2                       uuid=1c175819-8b97-0046-a20e-5f95e37cbd40 transid=13 parent_uuid=bad4a908-21b4-6f40-9a08-6b0768346725 parent_transid=13
>   utimes          ./subvol2/                      atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000
>   link            ./subvol2/b                     dest=a
>   unlink          ./subvol2/a
>   utimes          ./subvol2/                      atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000
>   utimes          ./subvol2/b                     atime=2021-01-25T17:18:46+0000 mtime=2021-01-25T17:18:46+0000 ctime=2021-01-25T17:18:46+0000
>
> => subvol1/a is renamed to subvol2/b instead of recreated to updated
>    rdev which results in received subvol2/b having the wrong minor
>    number:
>
>   257 crw-r--r--. 1 root root 1, 3 Jan 25 17:18 subvol2/b
>
> Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
> ---
> v2:
>   - add this patch to also handle changed rdev
> ---
>  fs/btrfs/send.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index c8b1f441f..ef544525f 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -6263,6 +6263,7 @@ static int changed_inode(struct send_ctx *sctx,
>         struct btrfs_inode_item *right_ii = NULL;
>         u64 left_gen = 0;
>         u64 right_gen = 0;
> +       u64 left_rdev, right_rdev;
>         u64 left_type, right_type;
>
>         sctx->cur_ino = key->objectid;
> @@ -6285,6 +6286,8 @@ static int changed_inode(struct send_ctx *sctx,
>                                 struct btrfs_inode_item);
>                 left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
>                                 left_ii);
> +               left_rdev = btrfs_inode_rdev(sctx->left_path->nodes[0],
> +                               left_ii);
>         } else {
>                 right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
>                                 sctx->right_path->slots[0],
> @@ -6300,6 +6303,9 @@ static int changed_inode(struct send_ctx *sctx,
>                 right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>                                 right_ii);
>
> +               right_rdev = btrfs_inode_rdev(sctx->right_path->nodes[0],
> +                               right_ii);
> +
>                 left_type = S_IFMT & btrfs_inode_mode(
>                                 sctx->left_path->nodes[0], left_ii);
>                 right_type = S_IFMT & btrfs_inode_mode(
> @@ -6310,7 +6316,8 @@ static int changed_inode(struct send_ctx *sctx,
>                  * the inode as deleted+reused because it would generate a
>                  * stream that tries to delete/mkdir the root dir.
>                  */
> -               if ((left_gen != right_gen || left_type != right_type) &&
> +               if ((left_gen != right_gen || left_type != right_type ||
> +                   left_rdev != right_rdev) &&
>                     sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
>                         sctx->cur_inode_recreated = 1;
>         }
> @@ -6350,8 +6357,7 @@ static int changed_inode(struct send_ctx *sctx,
>                                 sctx->left_path->nodes[0], left_ii);
>                 sctx->cur_inode_mode = btrfs_inode_mode(
>                                 sctx->left_path->nodes[0], left_ii);
> -               sctx->cur_inode_rdev = btrfs_inode_rdev(
> -                               sctx->left_path->nodes[0], left_ii);
> +               sctx->cur_inode_rdev = left_rdev;
>                 if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
>                         ret = send_create_inode_if_needed(sctx);
>         } else if (result == BTRFS_COMPARE_TREE_DELETED) {
> @@ -6396,8 +6402,7 @@ static int changed_inode(struct send_ctx *sctx,
>                                         sctx->left_path->nodes[0], left_ii);
>                         sctx->cur_inode_mode = btrfs_inode_mode(
>                                         sctx->left_path->nodes[0], left_ii);
> -                       sctx->cur_inode_rdev = btrfs_inode_rdev(
> -                                       sctx->left_path->nodes[0], left_ii);
> +                       sctx->cur_inode_rdev = left_rdev;
>                         ret = send_create_inode_if_needed(sctx);
>                         if (ret < 0)
>                                 goto out;
> --
> 2.26.2
>
Roman Anasal | BDSU Jan. 26, 2021, 7:19 p.m. UTC | #2
On Monday, 25.01.2021, at 20:51 +0000, Filipe Manana wrote:
> On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <roman.anasal@bdsu.de>
> wrote:
> > This is analogous to the preceding patch ("btrfs: send: fix invalid
> > commands for inodes with changed type but same gen") but for
> > changed
> > rdev:
> > 
> > When doing an incremental send, if a new inode has the same number
> > as an
> > inode in the parent subvolume, was created with the same generation
> > but
> > has differing rdev it will not be detected as changed and thus not
> > recreated. This will lead to incorrect results on the receiver
> > where the
> > inode will keep the rdev of the inode in the parent subvolume or
> > even
> > fail when also the ref is unchanged.
> > 
> > This case does not happen when doing incremental sends with
> > snapshots
> > that are kept read-only by the user all the time, but it may happen
> > if
> > - a snapshot was modified in the same transaction as its parent
> > after it
> >   was created
> > - the subvol used as parent was created independently from the sent
> > subvol
> > 
> > Example reproducers:
> > 
> >   # case 1: same ino at same path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/a c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > The produced tree state here is:
> >   |-- subvol1
> >   |   `-- a         (ino 257, c 1,3)
> >   |
> >   `-- subvol2
> >       `-- a         (ino 257, c 1,5)
> > 
> > Where subvol1/a and subvol2/a are character devices with differing
> > minor
> > numbers but same inode number and same generation.
> > 
> > Example output of the receive command:
> >   At subvol subvol2
> >   snapshot        ./subvol2                       uuid=7513941c-
> > 4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-
> > 2548-9e39-048c7fdbec99 parent_transid=9
> >   utimes          ./subvol2/                      atime=2021-01-
> > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > 25T17:14:36+0000
> >   link            ./subvol2/a                     dest=a
> >   unlink          ./subvol2/a
> >   utimes          ./subvol2/                      atime=2021-01-
> > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > 25T17:14:36+0000
> >   utimes          ./subvol2/a                     atime=2021-01-
> > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > 25T17:14:36+0000
> > 
> > => the `link` command causes the receiver to fail with:
> >    ERROR: link a -> a failed: File exists
> > 
> > Second example:
> >   # case 2: same ino at different path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/b c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> As I've told you before for the v1 patchset from a week or two ago,
> this is not a supported scenario for incremental sends.
> Incremental sends are meant to be used on RO snapshots of the same
> subvolume, and those snapshots must never be changed after they were
> created.
> 
> Incremental sends were simply not designed for these cases, and can
> never be guaranteed to work with such cases.
> 
> The bug is not having incremental sends fail right away, with an
> explicit error message, when the send and parent roots aren't RO
> snapshots of the same subvolume.

I am sorry, I didn't want to anger you or to appear to be just stubborn
by posting this.

As I wrote in the cover letter I am aware that this is not a supported
use case and I understand that that makes the patches likely to be
rejected.
As I said, the reason I _wrote_ the patches was simply to learn more about
the btrfs code and its internals and see if I would be able to
understand it well enough. The reason I _submitted_ them was just to
document what I found out so others could have a look into it, and just
in case it may be useful at a later time.

I also don't want to claim that these will add full support for sending
unrelated roots - they don't! They only handle those very specific edge
cases I found, which are currently _possible_, although still not
supported.

I took a deeper look into the rest to see if it could be supported:
the comparing algorithm actually works fine, even with completely
unrelated subvolumes (i.e. btrfs_compare_trees, changed_cb,
changed_inode etc.), but the processing of the changes (i.e.
process_recorded_refs etc.) is heavily based on (ino, gen) as the
identifying handle, which cannot be changed without a high risk of
regressions - just as you said in your earlier comments - since side
effects of any changes are hard to see or understand without a very
deep understanding of the whole code; which is why I didn't even try to
touch those parts.
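
As a toy model of what "(ino, gen) as the identifying handle" means (purely illustrative, with the hypothetical names `struct ref_view` and `classify_ref()`; the real logic in process_recorded_refs is far more involved):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum ref_action { REF_UNCHANGED, REF_RENAMED, REF_NEW_INODE };

/* Hypothetical, flattened view of one directory entry and the inode behind it. */
struct ref_view {
	uint64_t ino;
	uint64_t gen;
	const char *name;
};

static enum ref_action classify_ref(const struct ref_view *parent,
				    const struct ref_view *sent)
{
	assert(parent && sent && parent->name && sent->name);

	/* Different (ino, gen): treated as a different inode. */
	if (parent->ino != sent->ino || parent->gen != sent->gen)
		return REF_NEW_INODE;

	/* Same (ino, gen): assumed to be the same inode, whatever else changed. */
	return strcmp(parent->name, sent->name) == 0 ? REF_UNCHANGED
						     : REF_RENAMED;
}
```

In this model subvol1/a and subvol2/b from case 2 (same ino 257, same generation, different name) come out as `REF_RENAMED`, which corresponds to the link + unlink sequence seen in the stream.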

I apologize if I appeared to be stubborn or ignorant of your feedback!
That really wasn't my intent.


Filipe Manana Jan. 27, 2021, 10:53 a.m. UTC | #3
On Tue, Jan 26, 2021 at 7:19 PM Roman Anasal | BDSU
<roman.anasal@bdsu.de> wrote:
>
> On Monday, 25.01.2021, at 20:51 +0000, Filipe Manana wrote:
> > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <roman.anasal@bdsu.de>
> > wrote:
> > > This is analogous to the preceding patch ("btrfs: send: fix invalid
> > > commands for inodes with changed type but same gen") but for
> > > changed
> > > rdev:
> > >
> > > When doing an incremental send, if a new inode has the same number
> > > as an
> > > inode in the parent subvolume, was created with the same generation
> > > but
> > > has differing rdev it will not be detected as changed and thus not
> > > recreated. This will lead to incorrect results on the receiver
> > > where the
> > > inode will keep the rdev of the inode in the parent subvolume or
> > > even
> > > fail when also the ref is unchanged.
> > >
> > > This case does not happen when doing incremental sends with
> > > snapshots
> > > that are kept read-only by the user all the time, but it may happen
> > > if
> > > - a snapshot was modified in the same transaction as its parent
> > > after it
> > >   was created
> > > - the subvol used as parent was created independently from the sent
> > > subvol
> > >
> > > Example reproducers:
> > >
> > >   # case 1: same ino at same path
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume create subvol2
> > >   mknod subvol1/a c 1 3
> > >   mknod subvol2/a c 1 5
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > >
> > > The produced tree state here is:
> > >   |-- subvol1
> > >   |   `-- a         (ino 257, c 1,3)
> > >   |
> > >   `-- subvol2
> > >       `-- a         (ino 257, c 1,5)
> > >
> > > Where subvol1/a and subvol2/a are character devices with differing
> > > minor
> > > numbers but same inode number and same generation.
> > >
> > > Example output of the receive command:
> > >   At subvol subvol2
> > >   snapshot        ./subvol2                       uuid=7513941c-
> > > 4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-
> > > 2548-9e39-048c7fdbec99 parent_transid=9
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > > 25T17:14:36+0000
> > >   link            ./subvol2/a                     dest=a
> > >   unlink          ./subvol2/a
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > > 25T17:14:36+0000
> > >   utimes          ./subvol2/a                     atime=2021-01-
> > > 25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-
> > > 25T17:14:36+0000
> > >
> > > => the `link` command causes the receiver to fail with:
> > >    ERROR: link a -> a failed: File exists
> > >
> > > Second example:
> > >   # case 2: same ino at different path
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume create subvol2
> > >   mknod subvol1/a c 1 3
> > >   mknod subvol2/b c 1 5
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> >
> > As I've told you before for the v1 patchset from a week or two ago,
> > this is not a supported scenario for incremental sends.
> > Incremental sends are meant to be used on RO snapshots of the same
> > subvolume, and those snapshots must never be changed after they were
> > created.
> >
> > Incremental sends were simply not designed for these cases, and can
> > never be guaranteed to work with such cases.
> >
> > The bug is not having incremental sends fail right away, with an
> > explicit error message, when the send and parent roots aren't RO
> > snapshots of the same subvolume.
>
> I am sorry, I didn't want to anger you or to appear to be just stubborn
> by posting this.
>
> As I wrote in the cover letter I am aware that this is not a supported
> use case and I understand that that makes the patches likely to be
> rejected.

Ok, now I got the cover letter and the remaining v2 patches.
Vger has been having some lag this week, only got the mails during the
last evening.

Thanks.

Roman Anasal | BDSU Jan. 31, 2021, 3:52 p.m. UTC | #4
On Mon, Jan 25, 2021 at 20:51 +0000 Filipe Manana wrote:
> On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <roman.anasal@bdsu.de>
> wrote:
> > Second example:
> >   # case 2: same ino at different path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/b c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> As I've told you before for the v1 patchset from a week or two ago,
> this is not a supported scenario for incremental sends.
> Incremental sends are meant to be used on RO snapshots of the same
> subvolume, and those snapshots must never be changed after they were
> created.
> 
> Incremental sends were simply not designed for these cases, and can
> never be guaranteed to work with such cases.
> 
> The bug is not having incremental sends fail right away, with an
> explicit error message, when the send and parent roots aren't RO
> snapshots of the same subvolume.

Since this should be fixed then I'd like to propose to add the
following check:

The inodes of the subvolumes' root directories (ino
BTRFS_FIRST_FREE_OBJECTID = 256) must have the same generation.

Since create_subvol() will always commit the transaction, i.e.
increment the generation, no two _independently_ created subvolumes can
be created within the same generation (are there race conditions
possible here?).
Taking a snapshot of a subvolume does not modify the generation of the
root dir inode. Also it is not possible to change or delete/re-create
the root directory of a subvolume since this would delete the subvolume
itself.


So having two subvolumes with root directories created with different
generations means they were created independently and can not share a
common ancestor. Doing an incremental send with them is unsafe and thus
must return an error.
With the root directories at the same generation though the subvolumes
are based on a common ancestor which is a requirement for a safe
incremental send.

Are my assumptions and my understanding here correct? Then this check
would catch most of the unsafe parents.
If so I could have a shot at a patch for this if you'd like me to?


This check still does not solve the second edge case though, when
snapshots are modified afterwards and diverge independently from one
another. For this I still see no good solution besides a new on-disk
flag whether a snapshot was *ever* set to ro=false. But with that I'm
not sure how to (not) inherit that flag in a safe way ...
Filipe Manana Feb. 2, 2021, 11:56 a.m. UTC | #5
On Sun, Jan 31, 2021 at 3:52 PM Roman Anasal | BDSU
<roman.anasal@bdsu.de> wrote:
>
> On Mon, Jan 25, 2021 at 20:51 +0000 Filipe Manana wrote:
> > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <roman.anasal@bdsu.de>
> > wrote:
> > > Second example:
> > >   # case 2: same ino at different path
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume create subvol2
> > >   mknod subvol1/a c 1 3
> > >   mknod subvol2/b c 1 5
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> >
> > As I've told you before for the v1 patchset from a week or two ago,
> > this is not a supported scenario for incremental sends.
> > Incremental sends are meant to be used on RO snapshots of the same
> > subvolume, and those snapshots must never be changed after they were
> > created.
> >
> > Incremental sends were simply not designed for these cases, and can
> > never be guaranteed to work with such cases.
> >
> > The bug is not having incremental sends fail right away, with an
> > explicit error message, when the send and parent roots aren't RO
> > snapshots of the same subvolume.
>
> Since this should be fixed then I'd like to propose to add the
> following check:
>
> The inodes of the subvolumes' root directories (ino
> BTRFS_FIRST_FREE_OBJECTID = 256) must have the same generation.
>
> Since create_subvol() will always commit the transaction, i.e.
> increment the generation, no two _independently_ created subvolumes can
> be created within the same generation (are there race conditions
> possible here?).

That is currently true, but the ability to skip the transaction commit
when creating a subvolume has been discussed and proposed.
Boris sent a proposal patch for that a few months ago.

I don't think that should be assumed. Avoiding the transaction commit,
either by default or optionally, is something that makes sense.
Plus for a case like snapshots, we can actually batch the creation of
several ones in a single transaction.

> Taking a snapshot of a subvolume does not modify the generation of the
> root dir inode. Also it is not possible to change or delete/re-create
> the root directory of a subvolume since this would delete the subvolume
> itself.
>
>
> So having two subvolumes with root directories created with different
> generations means they were created independently and can not share a
> common ancestor. Doing an incremental send with them is unsafe and thus
> must return an error.
> With the root directories at the same generation though the subvolumes
> are based on a common ancestor which is a requirement for a safe
> incremental send.
>
> Are my assumptions and my understanding here correct? Then this check
> would catch most of the unsafe parents.
> If so I could have a shot at a patch for this if you'd like me to?

That is too complex and makes too many assumptions.

To check if two roots are snapshots of the same subvolume (the send
and parent roots), you can simply check if they have non-null uuids in
the "parent_uuid" field of their root items and that they match.
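
A minimal sketch of that check, written as plain userspace C rather than
kernel or btrfs-progs code. The struct and function names (root_info,
same_snapshot_source) are hypothetical and exist only for illustration;
the only thing taken from the description above is that both roots must
carry a non-null parent_uuid and that the two values must match.

```c
#include <stdbool.h>
#include <string.h>

#define UUID_SIZE 16

/* Hypothetical stand-in for the part of a root item this check reads. */
struct root_info {
	unsigned char parent_uuid[UUID_SIZE];
};

/* True if the UUID is all zeroes, i.e. the root is not a snapshot. */
static bool uuid_is_null(const unsigned char *uuid)
{
	static const unsigned char zero[UUID_SIZE];

	return memcmp(uuid, zero, UUID_SIZE) == 0;
}

/*
 * True only if both roots have a non-null parent_uuid and the two match,
 * i.e. both look like snapshots of the same subvolume.
 */
static bool same_snapshot_source(const struct root_info *send_root,
				 const struct root_info *parent_root)
{
	if (uuid_is_null(send_root->parent_uuid) ||
	    uuid_is_null(parent_root->parent_uuid))
		return false;
	return memcmp(send_root->parent_uuid, parent_root->parent_uuid,
		      UUID_SIZE) == 0;
}
```

Two independently created subvolumes both have a null parent_uuid, so the
sketch rejects them even though the two fields compare equal.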

While this is more straightforward to do in the kernel, I would prefer
to have it in btrfs-progs, because:

1) In btrfs-progs we can explicitly print an informative error message
to the user, while in the kernel you can only return an errno value
and log something to dmesg/syslog, which is much less user friendly;

2) The check would be on by default but could be skipped with some new
flag - this is just being conservative to avoid breaking any existing
workflows we might not be aware of.
    In particular I'm thinking about people using "btrfs send" with -c
and omitting -p, in which case btrfs-progs selects one of the -c roots
to be used as the parent root, but the selected root might not be a
snapshot of the same subvolume as the send root.
    Then maybe one day that option to skip the check would be removed,
after we are more sure no one is using or really needs such workflows.

>
>
> This check still does not solve the second edge case though, when
> snapshots are modified afterwards and diverge independently form one
> another. For this I still see no good solution besides a new on-disk
> flag whether a snapshot was *ever* set to ro=false. But with that I'm
> not sure how to (not) inherit that flag in a safe way ...

I'm afraid there's nothing, codewise, to do about that case.

Setting some flag on the root to make it unusable for send in case it
was ever RW would break send in at least one way:

During a receive we create the root as RW, apply the send stream and
then change the root to RO.
After such change, it would mean we could not send the received
snapshot anymore. There's no way to make sure that only btrfs-receive
can do that, since anyone can use the ioctl.

Perhaps all that needs to be done is to document this well in the man
pages and wiki in case it's not already there.

Thanks.
Roman Anasal | BDSU Feb. 3, 2021, 4:20 p.m. UTC | #6
On Tue, 2021-02-02 at 11:56 +0000, Filipe Manana wrote:
> On Sun, Jan 31, 2021 at 3:52 PM Roman Anasal | BDSU
> <roman.anasal@bdsu.de> wrote:
> > On Mon, Jan 25, 2021 at 20:51 +0000 Filipe Manana wrote:
> > > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal <
> > > roman.anasal@bdsu.de>
> > > wrote:
> > > > Second example:
> > > >   # case 2: same ino at different path
> > > >   btrfs subvolume create subvol1
> > > >   btrfs subvolume create subvol2
> > > >   mknod subvol1/a c 1 3
> > > >   mknod subvol2/b c 1 5
> > > >   btrfs property set subvol1 ro true
> > > >   btrfs property set subvol2 ro true
> > > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > > 
> > > As I've told you before for the v1 patchset from a week or two
> > > ago,
> > > this is not a supported scenario for incremental sends.
> > > Incremental sends are meant to be used on RO snapshots of the
> > > same
> > > subvolume, and those snapshots must never be changed after they
> > > were
> > > created.
> > > 
> > > Incremental sends were simply not designed for these cases, and
> > > can
> > > never be guaranteed to work with such cases.
> > > 
> > > The bug is not having incremental sends fail right away, with an
> > > explicit error message, when the send and parent roots aren't RO
> > > snapshots of the same subvolume.
> > 
> > Since this should be fixed then I'd like to propose to add the
> > following check:
> > 
> > The inodes of the subvolumes' root directories (ino
> > BTRFS_FIRST_FREE_OBJECTID = 256) must have the same generation.
> > 
> > Since create_subvol() will always commit the transaction, i.e.
> > increment the generation, no two _independently_ created subvolumes
> > can
> > be created within the same generation (are there race conditions
> > possible here?).
> 
> That is currently true, but it has been discussed and proposed the
> ability to skip the transaction commit when creating a subvolume
> Boris sent a proposal patch for that a few months ago.

Ah, okay then, if this may change in the future then this idea isn't
safe and should be dismissed.


> I don't think that should be assumed. Avoiding the transaction
> commit,
> either by default or optionally, is something that makes sense.
> Plus for a case like snapshots, we can actually batch the creation of
> several ones in a single transaction.
> 
> > Taking a snapshot of a subvolume does not modify the generation of
> > the
> > root dir inode. Also it is not possible to change or delete/re-
> > create
> > the root directory of a subvolume since this would delete the
> > subvolume
> > itself.
> > 
> > 
> > So having two subvolumes with root directories created with
> > different
> > generations means they were created independently and can not share
> > a
> > common ancestor. Doing an incremental send with them is unsafe and
> > thus
> > must return an error.
> > With the root directories at the same generation though the
> > subvolumes
> > are based on a common ancestor which is a requirement for a safe
> > incremental send.
> > 
> > Are my assumptions and my understanding here correct? Then this
> > check
> > would catch most of the unsafe parents.
> > If so I could have a shot at a patch for this if you'd like me to?
> 
> That is too complex and makes too many assumptions.
> 
> To check if two roots are snapshots of the same subvolume (the send
> and parent roots), you can simply check if they have non-null uuids
> in
> the "parent_uuid" field of their root items and that they match.

I thought of this, too, but see it break in some scenarios I'd expect
it to work, mostly with "chains" of snapshots as they happen on a
receiving side.

Consider this scenario:

   btrfs subvolume create /subvol/
   # modify /subvol
   btrfs subvolume snapshot -r /subvol/ /snapshots/snap1
   # modify /subvol
   btrfs subvolume snapshot -r /subvol/ /snapshots/snap2
   # modify /subvol
   btrfs subvolume snapshot -r /subvol/ /snapshots/snap3

I.e. have a single RW subvolume and take incremental snapshots of it.

   cd /snapshots/
   btrfs send snap1 | btrfs receive /mnt/backups/
   btrfs send -p snap1 snap2 | btrfs receive /mnt/backups/
   btrfs send -p snap2 snap3 | btrfs receive /mnt/backups/

I.e. incrementally send the snapshots to another btrfs volume.

   cd /mnt/backups
   btrfs subvolume delete snap2
   btrfs send snap1 | btrfs receive /mnt/backups2/
   btrfs send -p snap1 snap3 | btrfs receive /mnt/backups2/

I.e. delete the intermediate snapshot snap2 and incrementally send
snap1 and snap3 from the receiving filesystem to yet another btrfs
filesystem.

The last command would fail since snap3 was based on snap2 which was
based on snap1; so neither is snap1 the (direct) parent of snap3 nor do
they share a common (direct) parent nor would it be possible to
reconstruct their relation by walking the chain since snap2 no
longer exists.

While on the original filesystem all snapshots have the same parent, on
the receiving volume it is a chain:

original volume:

        subvolume
        ^   ^   ^
       /    |    \
   snap1  snap2  snap3

receiving volume:

   snap1 <- snap2 <- snap3


So for this to work it would probably require another attribute
"original subvol UUID" for the root of the ancestry tree...
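
To make the failure concrete, here is a small model of the receiving
volume in plain C. It is only a sketch: small integers stand in for the
16-byte UUIDs, 0 means "no parent", the model assumes the parent links
contain no cycles, and all names (subvol, directly_related,
reaches_ancestor) are made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: integers stand in for UUIDs, 0 = no parent. */
struct subvol {
	int uuid;
	int parent_uuid;
};

/* Look up a subvolume by UUID among those that still exist. */
static const struct subvol *find_subvol(const struct subvol *vols, size_t n,
					int uuid)
{
	for (size_t i = 0; i < n; i++)
		if (vols[i].uuid == uuid)
			return &vols[i];
	return NULL;
}

/* Direct relation: same non-null parent, or one is the other's parent. */
static bool directly_related(const struct subvol *a, const struct subvol *b)
{
	if (a->parent_uuid && a->parent_uuid == b->parent_uuid)
		return true;
	return a->parent_uuid == b->uuid || b->parent_uuid == a->uuid;
}

/* Walk parent_uuid links from 'from'; true if 'ancestor' is reached. */
static bool reaches_ancestor(const struct subvol *vols, size_t n,
			     const struct subvol *from,
			     const struct subvol *ancestor)
{
	const struct subvol *cur = from;

	while (cur && cur->parent_uuid) {
		if (cur->parent_uuid == ancestor->uuid)
			return true;
		/* Becomes NULL once a link in the chain was deleted. */
		cur = find_subvol(vols, n, cur->parent_uuid);
	}
	return false;
}
```

With snap1 = {1, 0}, snap2 = {2, 1} and snap3 = {3, 2}, snap1 and snap3
are not directly related, and once snap2 is removed from the array the
walk from snap3 can no longer reach snap1 either.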


> While this is more straightforward to do in the kernel, I would
> prefer
> to have it in btrfs-progs, because:
> 
> 1) In btrfs-progs we can explicitly print an informative error
> message
> to the user, while in the kernel you can only return an errno value
> and log something dmesg/syslog, which is much less user friendly;

I was thinking about implementing it in the kernel as an (additional)
check to block unsafe sends regardless of the user space tool (are
there any besides btrfs-progs?); but proper handling and an explanatory
error message must definitely be implemented in btrfs-progs.


> 2) The check would be on by default but could be skipped with some
> new
> flag - this is just being conservative to avoid breaking any existing
> workflows we might not be aware of.
>     In particular I'm thinking about people using "btrfs send" with
> -c
> and omitting -p, in which case btrfs-progs selects one of the -c
> roots
> to be used as the parent root,
>     but the selected root might not be a snapshot of the same
> subvolume as the send root.
>     Then maybe one day that option to skip the check would be
> removed,
> after we are more sure no one is using or really needs such
> workflows.

The way I read find_good_parent() it will only select a clone source as
parent if it is the parent subvolume of the send subvolume [1] or if
both have the same parent [2]?
Which makes sense since selecting a snapshot of an unrelated subvolume
would be unsafe.

[1] https://github.com/kdave/btrfs-progs/blob/273380d98f4412ae8b0f35ad69debf682e48c6bd/cmds/send.c#L118
[2] https://github.com/kdave/btrfs-progs/blob/273380d98f4412ae8b0f35ad69debf682e48c6bd/cmds/send.c#L131

> > 
> > This check still does not solve the second edge case though, when
> > snapshots are modified afterwards and diverge independently form
> > one
> > another. For this I still see no good solution besides a new on-
> > disk
> > flag whether a snapshot was *ever* set to ro=false. But with that
> > I'm
> > not sure how to (not) inherit that flag in a safe way ...
> 
> I'm afraid there's nothing, codewise, to do about that case.
> 
> Setting some flag on the root to make it unusable for send in case it
> was ever RW would break send in at least one way:
> 
> During a receive we create the root as RW, apply the send stream and
> then change the root to RO.
> After such change, it would mean we could not send the received
> snapshot anymore. There's no way to make sure that only btrfs-receive
> can do that, since anyone can use the ioctl.

Another case where allowing to switch to RW before send would be
desirable: make snapshot RW, delete files you don't need anymore, make
RO again, send to backup disk.
Only deleting files/inodes should even be safe now.


> Perhaps all that needs to be done is to document this well in the man
> pages and wiki in case it's not already there.

Yes. Since these are all very unlikely edge cases and reliably
detecting them without false positives is hard, just explicitly
documenting them is probably the best solution.

> 
> Thanks.
>

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index c8b1f441f..ef544525f 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6263,6 +6263,7 @@  static int changed_inode(struct send_ctx *sctx,
 	struct btrfs_inode_item *right_ii = NULL;
 	u64 left_gen = 0;
 	u64 right_gen = 0;
+	u64 left_rdev, right_rdev;
 	u64 left_type, right_type;
 
 	sctx->cur_ino = key->objectid;
@@ -6285,6 +6286,8 @@  static int changed_inode(struct send_ctx *sctx,
 				struct btrfs_inode_item);
 		left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
 				left_ii);
+		left_rdev = btrfs_inode_rdev(sctx->left_path->nodes[0],
+				left_ii);
 	} else {
 		right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
 				sctx->right_path->slots[0],
@@ -6300,6 +6303,9 @@  static int changed_inode(struct send_ctx *sctx,
 		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
 				right_ii);
 
+		right_rdev = btrfs_inode_rdev(sctx->right_path->nodes[0],
+				right_ii);
+
 		left_type = S_IFMT & btrfs_inode_mode(
 				sctx->left_path->nodes[0], left_ii);
 		right_type = S_IFMT & btrfs_inode_mode(
@@ -6310,7 +6316,8 @@  static int changed_inode(struct send_ctx *sctx,
 		 * the inode as deleted+reused because it would generate a
 		 * stream that tries to delete/mkdir the root dir.
 		 */
-		if ((left_gen != right_gen || left_type != right_type) &&
+		if ((left_gen != right_gen || left_type != right_type ||
+		    left_rdev != right_rdev) &&
 		    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
 			sctx->cur_inode_recreated = 1;
 	}
@@ -6350,8 +6357,7 @@  static int changed_inode(struct send_ctx *sctx,
 				sctx->left_path->nodes[0], left_ii);
 		sctx->cur_inode_mode = btrfs_inode_mode(
 				sctx->left_path->nodes[0], left_ii);
-		sctx->cur_inode_rdev = btrfs_inode_rdev(
-				sctx->left_path->nodes[0], left_ii);
+		sctx->cur_inode_rdev = left_rdev;
 		if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
 			ret = send_create_inode_if_needed(sctx);
 	} else if (result == BTRFS_COMPARE_TREE_DELETED) {
@@ -6396,8 +6402,7 @@  static int changed_inode(struct send_ctx *sctx,
 					sctx->left_path->nodes[0], left_ii);
 			sctx->cur_inode_mode = btrfs_inode_mode(
 					sctx->left_path->nodes[0], left_ii);
-			sctx->cur_inode_rdev = btrfs_inode_rdev(
-					sctx->left_path->nodes[0], left_ii);
+			sctx->cur_inode_rdev = left_rdev;
 			ret = send_create_inode_if_needed(sctx);
 			if (ret < 0)
 				goto out;
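
The condition this patch extends can be read in isolation as a small
predicate. The following is a standalone userspace sketch, not the kernel
code: inode_view, inode_recreated, TYPE_MASK and TYPE_CHR are hypothetical
names, the two constants use the Linux values of S_IFMT and S_IFCHR, and
the rdev values in the usage note are arbitrary placeholders.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inode number of a subvolume's root directory, as quoted in the thread. */
#define FIRST_FREE_OBJECTID 256ULL
/* Assumed to equal S_IFMT and S_IFCHR on Linux. */
#define TYPE_MASK 0170000
#define TYPE_CHR  0020000

/* Hypothetical reduced view of the inode item fields the check reads. */
struct inode_view {
	uint64_t gen;
	uint64_t mode;
	uint64_t rdev;
};

/*
 * Mirrors the patched condition: an inode present in both roots is
 * treated as deleted and recreated if its generation, file type or rdev
 * differs, except for the root directory, which is never recreated.
 */
static bool inode_recreated(uint64_t ino, const struct inode_view *left,
			    const struct inode_view *right)
{
	uint64_t left_type = left->mode & TYPE_MASK;
	uint64_t right_type = right->mode & TYPE_MASK;

	if (ino == FIRST_FREE_OBJECTID)
		return false;
	return left->gen != right->gen || left_type != right_type ||
	       left->rdev != right->rdev;
}
```

For the "case 1" reproducer, both inodes 257 share a generation and the
character device type, so only the rdev comparison makes the predicate
return true; before this patch such a pair would not have been flagged.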