diff mbox series

[2/2] btrfs: send: fix invalid commands for inodes with changed type but same gen

Message ID 20210111190243.4152-3-roman.anasal@bdsu.de (mailing list archive)
State New, archived
Headers show
Series btrfs: send: correctly recreate changed inodes | expand

Commit Message

Roman Anasal | BDSU Jan. 11, 2021, 7:02 p.m. UTC
When doing a send, if a new inode has the same number as an inode in the
parent subvolume it will only be detected as to be recreated when the
genrations differ. But with inodes in the same generation the emitted
commands will cause the receiver to fail.

This case does not happen when doing incremental sends with snapshots
that are kept read-only by the user all the time, but it may happen if
- a snapshot was modified after it was created
- the subvol used as parent was created independently from the sent subvol

Example reproducers:

  # case 1: same ino at same path
  btrfs subvolume create subvol1
  btrfs subvolume create subvol2
  mkdir subvol1/a
  touch subvol2/a
  btrfs property set subvol1 ro true
  btrfs property set subvol2 ro true
  btrfs send -p subvol1 subvol2 | btrfs receive --dump

The produced tree state here is:
  |-- subvol1
  |   `-- a/        (ino 257)
  |
  `-- subvol2
      `-- a         (ino 257)

  Where subvol1/a/ is a directory and subvol2/a is a file with the same
  inode number and same generation.

Example output of the receive command:
  At subvol subvol2
  snapshot        ./subvol2                       uuid=19d2be0a-5af1-fa44-9b3f-f21815178d00 transid=9 parent_uuid=1bac8b12-ddb2-6441-8551-700456991785 parent_transid=9
  utimes          ./subvol2/                      atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000
  link            ./subvol2/a                     dest=a
  unlink          ./subvol2/a
  utimes          ./subvol2/                      atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000
  chmod           ./subvol2/a                     mode=644
  utimes          ./subvol2/a                     atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000

=> the `link` command causes the receiver to fail with:
   ERROR: link a -> a failed: File exists

Second example:
  # case 2: same ino at different path
  btrfs subvolume create subvol1
  btrfs subvolume create subvol2
  mkdir subvol1/a
  touch subvol2/b
  btrfs property set subvol1 ro true
  btrfs property set subvol2 ro true
  btrfs send -p subvol1 subvol2 | btrfs receive --dump

The produced tree state here is:
  |-- subvol1
  |   `-- a/        (ino 257)
  |
  `-- subvol2
      `-- b         (ino 257)

  Where subvol1/a/ is a directory and subvol2/b is a file with the same
  inode number and same generation.

Example output of the receive command:
  At subvol subvol2
  snapshot        ./subvol2                       uuid=ea93c47a-5f47-724f-8a43-e15ce745aef0 transid=20 parent_uuid=f03578ef-5bca-1445-a480-3df63677fddf parent_transid=20
  utimes          ./subvol2/                      atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000
  link            ./subvol2/b                     dest=a
  unlink          ./subvol2/a
  utimes          ./subvol2/                      atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000
  chmod           ./subvol2/b                     mode=644
  utimes          ./subvol2/b                     atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000

=> the `link` command causes the receiver to fail with:
   ERROR: link b -> a failed: Operation not permitted

Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
---
 fs/btrfs/send.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

Comments

Andrei Borzenkov Jan. 11, 2021, 7:30 p.m. UTC | #1
11.01.2021 22:02, Roman Anasal пишет:
> When doing a send, if a new inode has the same number as an inode in the
> parent subvolume it will only be detected as to be recreated when the
> genrations differ. But with inodes in the same generation the emitted
> commands will cause the receiver to fail.
> 
> This case does not happen when doing incremental sends with snapshots
> that are kept read-only by the user all the time, but it may happen if
> - a snapshot was modified after it was created
> - the subvol used as parent was created independently from the sent subvol
> 
> Example reproducers:
> 
>   # case 1: same ino at same path
>   btrfs subvolume create subvol1
>   btrfs subvolume create subvol2
>   mkdir subvol1/a
>   touch subvol2/a

What happens in case of

echo foo > subvol1/a
echo bar > subvol2/a

?

>   btrfs property set subvol1 ro true
>   btrfs property set subvol2 ro true
>   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> The produced tree state here is:
>   |-- subvol1
>   |   `-- a/        (ino 257)
>   |
>   `-- subvol2
>       `-- a         (ino 257)
> 
>   Where subvol1/a/ is a directory and subvol2/a is a file with the same
>   inode number and same generation.
> 
> Example output of the receive command:
>   At subvol subvol2
>   snapshot        ./subvol2                       uuid=19d2be0a-5af1-fa44-9b3f-f21815178d00 transid=9 parent_uuid=1bac8b12-ddb2-6441-8551-700456991785 parent_transid=9
>   utimes          ./subvol2/                      atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000
>   link            ./subvol2/a                     dest=a
>   unlink          ./subvol2/a
>   utimes          ./subvol2/                      atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000
>   chmod           ./subvol2/a                     mode=644
>   utimes          ./subvol2/a                     atime=2021-01-11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-11T13:41:36+0000
> 
> => the `link` command causes the receiver to fail with:
>    ERROR: link a -> a failed: File exists
> 
> Second example:
>   # case 2: same ino at different path
>   btrfs subvolume create subvol1
>   btrfs subvolume create subvol2
>   mkdir subvol1/a
>   touch subvol2/b
>   btrfs property set subvol1 ro true
>   btrfs property set subvol2 ro true
>   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> The produced tree state here is:
>   |-- subvol1
>   |   `-- a/        (ino 257)
>   |
>   `-- subvol2
>       `-- b         (ino 257)
> 
>   Where subvol1/a/ is a directory and subvol2/b is a file with the same
>   inode number and same generation.
> 
> Example output of the receive command:
>   At subvol subvol2
>   snapshot        ./subvol2                       uuid=ea93c47a-5f47-724f-8a43-e15ce745aef0 transid=20 parent_uuid=f03578ef-5bca-1445-a480-3df63677fddf parent_transid=20
>   utimes          ./subvol2/                      atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000
>   link            ./subvol2/b                     dest=a
>   unlink          ./subvol2/a
>   utimes          ./subvol2/                      atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000
>   chmod           ./subvol2/b                     mode=644
>   utimes          ./subvol2/b                     atime=2021-01-11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-11T13:58:00+0000
> 
> => the `link` command causes the receiver to fail with:
>    ERROR: link b -> a failed: Operation not permitted
> 
> Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
> ---
>  fs/btrfs/send.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 420371c1d..33ae48442 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -6299,12 +6299,18 @@ static int changed_inode(struct send_ctx *sctx,
>  		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>  				right_ii);
>  
> +		u64 left_type = S_IFMT & btrfs_inode_mode(
> +				sctx->left_path->nodes[0], left_ii);
> +		u64 right_type = S_IFMT & btrfs_inode_mode(
> +				sctx->right_path->nodes[0], right_ii);
> +
> +
>  		/*
>  		 * The cur_ino = root dir case is special here. We can't treat
>  		 * the inode as deleted+reused because it would generate a
>  		 * stream that tries to delete/mkdir the root dir.
>  		 */
> -		if (left_gen != right_gen &&
> +		if ((left_gen != right_gen || left_type != right_type) &&
>  		    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
>  			sctx->cur_inode_recreated = 1;
>  	}
> @@ -6359,10 +6365,10 @@ static int changed_inode(struct send_ctx *sctx,
>  	} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
>  		/*
>  		 * We need to do some special handling in case the inode was
> -		 * reported as changed with a changed generation number. This
> -		 * means that the original inode was deleted and new inode
> -		 * reused the same inum. So we have to treat the old inode as
> -		 * deleted and the new one as new.
> +		 * reported as changed with a changed generation number or
> +		 * changed inode type. This means that the original inode was
> +		 * deleted and new inode reused the same inum. So we have to
> +		 * treat the old inode as deleted and the new one as new.
>  		 */
>  		if (sctx->cur_inode_recreated) {
>  			/*
>
Roman Anasal | BDSU Jan. 11, 2021, 8:53 p.m. UTC | #2
On Mon, 2021-01-11 at 22:30 +0300, Andrei Borzenkov wrote:
> 11.01.2021 22:02, Roman Anasal пишет:
> > When doing a send, if a new inode has the same number as an inode
> > in the
> > parent subvolume it will only be detected as to be recreated when
> > the
> > genrations differ. But with inodes in the same generation the
> > emitted
> > commands will cause the receiver to fail.
> > 
> > This case does not happen when doing incremental sends with
> > snapshots
> > that are kept read-only by the user all the time, but it may happen
> > if
> > - a snapshot was modified after it was created
> > - the subvol used as parent was created independently from the sent
> > subvol
> > 
> > Example reproducers:
> > 
> >   # case 1: same ino at same path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mkdir subvol1/a
> >   touch subvol2/a
> 
> What happens in case of
> 
> echo foo > subvol1/a
> echo bar > subvol2/a
> 
> ?

`echo foo > subvol1/a` wouldn't work since subvol1/a is a directory.
However, replacing both preceding lines with:

  mkdir subvol1/a
  echo bar > subvol2/a

=> same as before with/without the patch, plus an additional write
command for the content

And with both lines as

  echo foo > subvol1/a
  echo bar > subvol2/a

=> produces an invalid stream (with and without this patch) with a
`link a -> a` followed by `unlink a` as seen in the example output
quoted further below.

So there is another bug here. The conditions for this seem to roughly
be: changed/new inode with same number, type and generation, since
reordering the commands so the inodes have different generations
produces a valid stream.

Changing the name to subvol2/b like in the second case below produces a
valid stream with a `link b -> a` followed by `unlink a`.

I'll take a look into this, too, and if possible provide another patch
for that case.


> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > The produced tree state here is:
> >   |-- subvol1
> >   |   `-- a/        (ino 257)
> >   |
> >   `-- subvol2
> >       `-- a         (ino 257)
> > 
> >   Where subvol1/a/ is a directory and subvol2/a is a file with the
> > same
> >   inode number and same generation.
> > 
> > Example output of the receive command:
> >   At subvol subvol2
> >   snapshot        ./subvol2                       uuid=19d2be0a-
> > 5af1-fa44-9b3f-f21815178d00 transid=9 parent_uuid=1bac8b12-ddb2-
> > 6441-8551-700456991785 parent_transid=9
> >   utimes          ./subvol2/                      atime=2021-01-
> > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > 11T13:41:36+0000
> >   link            ./subvol2/a                     dest=a
> >   unlink          ./subvol2/a
> >   utimes          ./subvol2/                      atime=2021-01-
> > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > 11T13:41:36+0000
> >   chmod           ./subvol2/a                     mode=644
> >   utimes          ./subvol2/a                     atime=2021-01-
> > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > 11T13:41:36+0000
> > 
> > => the `link` command causes the receiver to fail with:
> >    ERROR: link a -> a failed: File exists
> > 
> > Second example:
> >   # case 2: same ino at different path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mkdir subvol1/a
> >   touch subvol2/b
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > The produced tree state here is:
> >   |-- subvol1
> >   |   `-- a/        (ino 257)
> >   |
> >   `-- subvol2
> >       `-- b         (ino 257)
> > 
> >   Where subvol1/a/ is a directory and subvol2/b is a file with the
> > same
> >   inode number and same generation.
> > 
> > Example output of the receive command:
> >   At subvol subvol2
> >   snapshot        ./subvol2                       uuid=ea93c47a-
> > 5f47-724f-8a43-e15ce745aef0 transid=20 parent_uuid=f03578ef-5bca-
> > 1445-a480-3df63677fddf parent_transid=20
> >   utimes          ./subvol2/                      atime=2021-01-
> > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > 11T13:58:00+0000
> >   link            ./subvol2/b                     dest=a
> >   unlink          ./subvol2/a
> >   utimes          ./subvol2/                      atime=2021-01-
> > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > 11T13:58:00+0000
> >   chmod           ./subvol2/b                     mode=644
> >   utimes          ./subvol2/b                     atime=2021-01-
> > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > 11T13:58:00+0000
> > 
> > => the `link` command causes the receiver to fail with:
> >    ERROR: link b -> a failed: Operation not permitted
> > 
> > Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
David Sterba Jan. 11, 2021, 9:15 p.m. UTC | #3
On Mon, Jan 11, 2021 at 08:02:43PM +0100, Roman Anasal wrote:
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -6299,12 +6299,18 @@ static int changed_inode(struct send_ctx *sctx,
>  		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
>  				right_ii);
>  
> +		u64 left_type = S_IFMT & btrfs_inode_mode(
> +				sctx->left_path->nodes[0], left_ii);
> +		u64 right_type = S_IFMT & btrfs_inode_mode(
> +				sctx->right_path->nodes[0], right_ii);

Minor note, we don't use the declarations mixed with code, so the
variables need to be declared separatelly, but I can fix that unless
there's another reason to update and resend the patches.
kernel test robot Jan. 11, 2021, 9:19 p.m. UTC | #4
Hi Roman,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on btrfs/next v5.11-rc3 next-20210111]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Roman-Anasal/btrfs-send-correctly-recreate-changed-inodes/20210112-030653
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/2993e0e565f9215fc3e4cedd9b1b9bc8df6dbdad
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Roman-Anasal/btrfs-send-correctly-recreate-changed-inodes/20210112-030653
        git checkout 2993e0e565f9215fc3e4cedd9b1b9bc8df6dbdad
        # save the attached .config to linux build tree
        make W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   fs/btrfs/send.c: In function 'changed_inode':
>> fs/btrfs/send.c:6289:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
    6289 |   u64 left_type = S_IFMT & btrfs_inode_mode(
         |   ^~~


vim +6289 fs/btrfs/send.c

  6243	
  6244	static int changed_inode(struct send_ctx *sctx,
  6245				 enum btrfs_compare_tree_result result)
  6246	{
  6247		int ret = 0;
  6248		struct btrfs_key *key = sctx->cmp_key;
  6249		struct btrfs_inode_item *left_ii = NULL;
  6250		struct btrfs_inode_item *right_ii = NULL;
  6251		u64 left_gen = 0;
  6252		u64 right_gen = 0;
  6253	
  6254		sctx->cur_ino = key->objectid;
  6255		sctx->cur_inode_recreated = 0;
  6256		sctx->cur_inode_last_extent = (u64)-1;
  6257		sctx->cur_inode_next_write_offset = 0;
  6258		sctx->ignore_cur_inode = false;
  6259	
  6260		/*
  6261		 * Set send_progress to current inode. This will tell all get_cur_xxx
  6262		 * functions that the current inode's refs are not updated yet. Later,
  6263		 * when process_recorded_refs is finished, it is set to cur_ino + 1.
  6264		 */
  6265		sctx->send_progress = sctx->cur_ino;
  6266	
  6267		if (result == BTRFS_COMPARE_TREE_NEW ||
  6268		    result == BTRFS_COMPARE_TREE_CHANGED) {
  6269			left_ii = btrfs_item_ptr(sctx->left_path->nodes[0],
  6270					sctx->left_path->slots[0],
  6271					struct btrfs_inode_item);
  6272			left_gen = btrfs_inode_generation(sctx->left_path->nodes[0],
  6273					left_ii);
  6274		} else {
  6275			right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
  6276					sctx->right_path->slots[0],
  6277					struct btrfs_inode_item);
  6278			right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
  6279					right_ii);
  6280		}
  6281		if (result == BTRFS_COMPARE_TREE_CHANGED) {
  6282			right_ii = btrfs_item_ptr(sctx->right_path->nodes[0],
  6283					sctx->right_path->slots[0],
  6284					struct btrfs_inode_item);
  6285	
  6286			right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
  6287					right_ii);
  6288	
> 6289			u64 left_type = S_IFMT & btrfs_inode_mode(
  6290					sctx->left_path->nodes[0], left_ii);
  6291			u64 right_type = S_IFMT & btrfs_inode_mode(
  6292					sctx->right_path->nodes[0], right_ii);
  6293	
  6294	
  6295			/*
  6296			 * The cur_ino = root dir case is special here. We can't treat
  6297			 * the inode as deleted+reused because it would generate a
  6298			 * stream that tries to delete/mkdir the root dir.
  6299			 */
  6300			if ((left_gen != right_gen || left_type != right_type) &&
  6301			    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
  6302				sctx->cur_inode_recreated = 1;
  6303		}
  6304	
  6305		/*
  6306		 * Normally we do not find inodes with a link count of zero (orphans)
  6307		 * because the most common case is to create a snapshot and use it
  6308		 * for a send operation. However other less common use cases involve
  6309		 * using a subvolume and send it after turning it to RO mode just
  6310		 * after deleting all hard links of a file while holding an open
  6311		 * file descriptor against it or turning a RO snapshot into RW mode,
  6312		 * keep an open file descriptor against a file, delete it and then
  6313		 * turn the snapshot back to RO mode before using it for a send
  6314		 * operation. So if we find such cases, ignore the inode and all its
  6315		 * items completely if it's a new inode, or if it's a changed inode
  6316		 * make sure all its previous paths (from the parent snapshot) are all
  6317		 * unlinked and all other the inode items are ignored.
  6318		 */
  6319		if (result == BTRFS_COMPARE_TREE_NEW ||
  6320		    result == BTRFS_COMPARE_TREE_CHANGED) {
  6321			u32 nlinks;
  6322	
  6323			nlinks = btrfs_inode_nlink(sctx->left_path->nodes[0], left_ii);
  6324			if (nlinks == 0) {
  6325				sctx->ignore_cur_inode = true;
  6326				if (result == BTRFS_COMPARE_TREE_CHANGED)
  6327					ret = btrfs_unlink_all_paths(sctx);
  6328				goto out;
  6329			}
  6330		}
  6331	
  6332		if (result == BTRFS_COMPARE_TREE_NEW) {
  6333			sctx->cur_inode_gen = left_gen;
  6334			sctx->cur_inode_new = 1;
  6335			sctx->cur_inode_deleted = 0;
  6336			sctx->cur_inode_size = btrfs_inode_size(
  6337					sctx->left_path->nodes[0], left_ii);
  6338			sctx->cur_inode_mode = btrfs_inode_mode(
  6339					sctx->left_path->nodes[0], left_ii);
  6340			sctx->cur_inode_rdev = btrfs_inode_rdev(
  6341					sctx->left_path->nodes[0], left_ii);
  6342			if (sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
  6343				ret = send_create_inode_if_needed(sctx);
  6344		} else if (result == BTRFS_COMPARE_TREE_DELETED) {
  6345			sctx->cur_inode_gen = right_gen;
  6346			sctx->cur_inode_new = 0;
  6347			sctx->cur_inode_deleted = 1;
  6348			sctx->cur_inode_size = btrfs_inode_size(
  6349					sctx->right_path->nodes[0], right_ii);
  6350			sctx->cur_inode_mode = btrfs_inode_mode(
  6351					sctx->right_path->nodes[0], right_ii);
  6352		} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
  6353			/*
  6354			 * We need to do some special handling in case the inode was
  6355			 * reported as changed with a changed generation number or
  6356			 * changed inode type. This means that the original inode was
  6357			 * deleted and new inode reused the same inum. So we have to
  6358			 * treat the old inode as deleted and the new one as new.
  6359			 */
  6360			if (sctx->cur_inode_recreated) {
  6361				/*
  6362				 * First, process the inode as if it was deleted.
  6363				 */
  6364				sctx->cur_inode_gen = right_gen;
  6365				sctx->cur_inode_new = 0;
  6366				sctx->cur_inode_deleted = 1;
  6367				sctx->cur_inode_size = btrfs_inode_size(
  6368						sctx->right_path->nodes[0], right_ii);
  6369				sctx->cur_inode_mode = btrfs_inode_mode(
  6370						sctx->right_path->nodes[0], right_ii);
  6371				ret = process_all_refs(sctx,
  6372						BTRFS_COMPARE_TREE_DELETED);
  6373				if (ret < 0)
  6374					goto out;
  6375	
  6376				/*
  6377				 * Now process the inode as if it was new.
  6378				 */
  6379				sctx->cur_inode_gen = left_gen;
  6380				sctx->cur_inode_new = 1;
  6381				sctx->cur_inode_deleted = 0;
  6382				sctx->cur_inode_size = btrfs_inode_size(
  6383						sctx->left_path->nodes[0], left_ii);
  6384				sctx->cur_inode_mode = btrfs_inode_mode(
  6385						sctx->left_path->nodes[0], left_ii);
  6386				sctx->cur_inode_rdev = btrfs_inode_rdev(
  6387						sctx->left_path->nodes[0], left_ii);
  6388				ret = send_create_inode_if_needed(sctx);
  6389				if (ret < 0)
  6390					goto out;
  6391	
  6392				ret = process_all_refs(sctx, BTRFS_COMPARE_TREE_NEW);
  6393				if (ret < 0)
  6394					goto out;
  6395				/*
  6396				 * Advance send_progress now as we did not get into
  6397				 * process_recorded_refs_if_needed in the
  6398				 * cur_inode_recreated case.
  6399				 */
  6400				sctx->send_progress = sctx->cur_ino + 1;
  6401	
  6402				/*
  6403				 * Now process all extents and xattrs of the inode as if
  6404				 * they were all new.
  6405				 */
  6406				ret = process_all_extents(sctx);
  6407				if (ret < 0)
  6408					goto out;
  6409				ret = process_all_new_xattrs(sctx);
  6410				if (ret < 0)
  6411					goto out;
  6412			} else {
  6413				sctx->cur_inode_gen = left_gen;
  6414				sctx->cur_inode_new = 0;
  6415				sctx->cur_inode_recreated = 0;
  6416				sctx->cur_inode_deleted = 0;
  6417				sctx->cur_inode_size = btrfs_inode_size(
  6418						sctx->left_path->nodes[0], left_ii);
  6419				sctx->cur_inode_mode = btrfs_inode_mode(
  6420						sctx->left_path->nodes[0], left_ii);
  6421			}
  6422		}
  6423	
  6424	out:
  6425		return ret;
  6426	}
  6427	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Roman Anasal | BDSU Jan. 11, 2021, 9:31 p.m. UTC | #5
On Mon, 2021-01-11 at 22:15 +0100, David Sterba wrote:
> On Mon, Jan 11, 2021 at 08:02:43PM +0100, Roman Anasal wrote:
> > --- a/fs/btrfs/send.c
> > +++ b/fs/btrfs/send.c
> > @@ -6299,12 +6299,18 @@ static int changed_inode(struct send_ctx
> > *sctx,
> >  		right_gen = btrfs_inode_generation(sctx->right_path-
> > >nodes[0],
> >  				right_ii);
> >  
> > +		u64 left_type = S_IFMT & btrfs_inode_mode(
> > +				sctx->left_path->nodes[0], left_ii);
> > +		u64 right_type = S_IFMT & btrfs_inode_mode(
> > +				sctx->right_path->nodes[0], right_ii);
> 
> Minor note, we don't use the declarations mixed with code, so the
> variables need to be declared separatelly, but I can fix that unless
> there's another reason to update and resend the patches.

Ah yes, I was quite unsure where to put the declarations and then
wanted to put them as close as possible to the usage and decided to put
them in the same "scope".
But just now I realised I was still thinking in the wrong programming
language and there is no "if-block-scope" in C :D

So I updated the commit respectively and can resend if needed.
Filipe Manana Jan. 12, 2021, 11:27 a.m. UTC | #6
On Tue, Jan 12, 2021 at 9:42 AM Roman Anasal | BDSU
<roman.anasal@bdsu.de> wrote:
>
> On Mon, 2021-01-11 at 22:30 +0300, Andrei Borzenkov wrote:
> > 11.01.2021 22:02, Roman Anasal пишет:
> > > When doing a send, if a new inode has the same number as an inode
> > > in the
> > > parent subvolume it will only be detected as to be recreated when
> > > the
> > > genrations differ. But with inodes in the same generation the
> > > emitted
> > > commands will cause the receiver to fail.
> > >
> > > This case does not happen when doing incremental sends with
> > > snapshots
> > > that are kept read-only by the user all the time, but it may happen
> > > if
> > > - a snapshot was modified after it was created
> > > - the subvol used as parent was created independently from the sent
> > > subvol
> > >
> > > Example reproducers:
> > >
> > >   # case 1: same ino at same path
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume create subvol2
> > >   mkdir subvol1/a
> > >   touch subvol2/a
> >
> > What happens in case of
> >
> > echo foo > subvol1/a
> > echo bar > subvol2/a
> >
> > ?
>
> `echo foo > subvol1/a` wouldn't work since subvol1/a is a directory.
> However, replacing both preceding lines with:
>
>   mkdir subvol1/a
>   echo bar > subvol2/a
>
> => same as before with/without the patch, plus an additional write
> command for the content
>
> And with both lines as
>
>   echo foo > subvol1/a
>   echo bar > subvol2/a
>
> => produces an invalid stream (with and without this patch) with a
> `link a -> a` followed by `unlink a` as seen in the example output
> quoted further below.
>
> So there is another bug here. The conditions for this seem to roughly
> be: changed/new inode with same number, type and generation, since
> reordering the commands so the inodes have different generations
> produces a valid stream.
>
> Changing the name to subvol2/b like in the second case below produces a
> valid stream with a `link b -> a` followed by `unlink a`.
>
> I'll take a look into this, too, and if possible provide another patch
> for that case.
>
>
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump

You get all these issues because you are using an incremental send in
a way it's not supposed to.
You are using two subvolumes that are completely unrelated.

My surprise here is that we actually allow a user to try that, instead
of giving an error complaining that subvol1 and subvol2 aren't
snapshots of the same subvolume.

An incremental is supposed to work on snapshots of the same subvolume
(hence, have the same parent uuid).
That's what the entire code relies on to work correctly, and that's
what makes sense - to compute and send the differences between two
points in time of a subvolume.
That's why the code base assumes that inodes with the same number and
same generation must refer to the same inode in both the parent and
the send root.

What I think that needs to be answered is:

1) Are there actually people using incremental sends in that way?
(It's the first time I see such use case)

2) If so, why? That is completely unreliable, not only it can lead to
failure to apply the streams, but can result in all sorts of weirdness
(logical inconsistencies, etc) if applying such streams doesn't cause
an error.

3) Making sure such use cases work reliably would require many, many
changes to the send implementation, as it goes against what it
currently expects.
    Snapshot a subvolume, change the subvolume, snapshot it again,
then use both snapshots for the incremental send, that's the expected
scenario.

In other words, what I think we should have is a check that forbids
using two roots for an incremental send that are not snapshots of the
same subvolume (have different parent uuids).

Thanks.


> > >
> > > The produced tree state here is:
> > >   |-- subvol1
> > >   |   `-- a/        (ino 257)
> > >   |
> > >   `-- subvol2
> > >       `-- a         (ino 257)
> > >
> > >   Where subvol1/a/ is a directory and subvol2/a is a file with the
> > > same
> > >   inode number and same generation.
> > >
> > > Example output of the receive command:
> > >   At subvol subvol2
> > >   snapshot        ./subvol2                       uuid=19d2be0a-
> > > 5af1-fa44-9b3f-f21815178d00 transid=9 parent_uuid=1bac8b12-ddb2-
> > > 6441-8551-700456991785 parent_transid=9
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > > 11T13:41:36+0000
> > >   link            ./subvol2/a                     dest=a
> > >   unlink          ./subvol2/a
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > > 11T13:41:36+0000
> > >   chmod           ./subvol2/a                     mode=644
> > >   utimes          ./subvol2/a                     atime=2021-01-
> > > 11T13:41:36+0000 mtime=2021-01-11T13:41:36+0000 ctime=2021-01-
> > > 11T13:41:36+0000
> > >
> > > => the `link` command causes the receiver to fail with:
> > >    ERROR: link a -> a failed: File exists
> > >
> > > Second example:
> > >   # case 2: same ino at different path
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume create subvol2
> > >   mkdir subvol1/a
> > >   touch subvol2/b
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > >
> > > The produced tree state here is:
> > >   |-- subvol1
> > >   |   `-- a/        (ino 257)
> > >   |
> > >   `-- subvol2
> > >       `-- b         (ino 257)
> > >
> > >   Where subvol1/a/ is a directory and subvol2/b is a file with the
> > > same
> > >   inode number and same generation.
> > >
> > > Example output of the receive command:
> > >   At subvol subvol2
> > >   snapshot        ./subvol2                       uuid=ea93c47a-
> > > 5f47-724f-8a43-e15ce745aef0 transid=20 parent_uuid=f03578ef-5bca-
> > > 1445-a480-3df63677fddf parent_transid=20
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > > 11T13:58:00+0000
> > >   link            ./subvol2/b                     dest=a
> > >   unlink          ./subvol2/a
> > >   utimes          ./subvol2/                      atime=2021-01-
> > > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > > 11T13:58:00+0000
> > >   chmod           ./subvol2/b                     mode=644
> > >   utimes          ./subvol2/b                     atime=2021-01-
> > > 11T13:58:00+0000 mtime=2021-01-11T13:58:00+0000 ctime=2021-01-
> > > 11T13:58:00+0000
> > >
> > > => the `link` command causes the receiver to fail with:
> > >    ERROR: link b -> a failed: Operation not permitted
> > >
> > > Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
Graham Cobb Jan. 12, 2021, 12:07 p.m. UTC | #7
On 12/01/2021 11:27, Filipe Manana wrote:
> ...
> In other words, what I think we should have is a check that forbids
> using two roots for an incremental send that are not snapshots of the
> same subvolume (have different parent uuids).

Are you suggesting that rule should also apply for clone sources (-c
option)? Or are clone sources handled differently from the parent in the
code?
Filipe Manana Jan. 12, 2021, 12:30 p.m. UTC | #8
On Tue, Jan 12, 2021 at 12:08 PM Graham Cobb <g.btrfs@cobb.uk.net> wrote:
>
> On 12/01/2021 11:27, Filipe Manana wrote:
> > ...
> > In other words, what I think we should have is a check that forbids
> > using two roots for an incremental send that are not snapshots of the
> > same subvolume (have different parent uuids).
>
> Are you suggesting that rule should also apply for clone sources (-c
> option)? Or are clone sources handled differently from the parent in the
> code?

No, I'm not. The clone sources are different, and are used only to
find extents for cloning and not for computing differences between
snapshots.

The exception is when -p is not passed to 'btrfs send' and one or more
roots are passed to it with -c, in which case btrfs-progs determines
the "best parent" - so it would check as well if the chosen parent is
a snapshot of the same subvolume. My suggestion was making this on the
kernel side, just in case there are other users of the ioctl other
than btrfs-progs.
Roman Anasal | BDSU Jan. 12, 2021, 1:10 p.m. UTC | #9
On Tue, 2021-01-12 at 11:27 +0000, Filipe Manana wrote:
> You get all these issues because you are using an incremental send in
> a way it's not supposed to.
> You are using two subvolumes that are completely unrelated.
Yes, I am aware of that and know that this is not a designed for use
case.

> My surprise here is that we actually allow a user to try that,
> instead
> of giving an error complaining that subvol1 and subvol2 aren't
> snapshots of the same subvolume.
This could be one way of solving it. But this is already harder than it
sounds, since the same issue may also happen *even when* the subvolumes
share a parent/are related, here an example reproducer:

  btrfs subvolume create subvol1
  btrfs subvolume snapshot subvol1 subvol2
  mkdir subvol1/a
  echo foo > subvol2/a
  btrfs property set subvol1 ro true
  btrfs property set subvol2 ro true
  btrfs send -p subvol1 subvol2 | btrfs receive --dump

This will produce a stream that tries to write data into the cloned
directory inode.

So we not only had to check the parent relationships but also ensure
that the descendant snapshot was not modified, i.e. was never set to
ro=false.
As far as I know there is no flag tracking this on-disk? So this would
need additional work, too.


> An incremental is supposed to work on snapshots of the same subvolume
> (hence, have the same parent uuid).
> That's what the entire code relies on to work correctly, and that's
> what makes sense - to compute and send the differences between two
> points in time of a subvolume.
> That's why the code base assumes that inodes with the same number and
> same generation must refer to the same inode in both the parent and
> the send root.
> 
> What I think that needs to be answered is:
> 
> 1) Are there actually people using incremental sends in that way?
> (It's the first time I see such use case)
Well, I did (:
But admittedly in a kind of experimental setup testing out the limits
of btrfs.

> 2) If so, why? That is completely unreliable, not only it can lead to
> failure to apply the streams, but can result in all sorts of
> weirdness
> (logical inconsistencies, etc) if applying such streams doesn't cause
> an error.
In my case I was moving around subvolumes between multiple disks and
deduplicating as much as possible. Trying to preserve already done
deduplication and purging some intermediate subvolumes I ended up using
"unrelated" subvolumes as parents.

Thinking of it, maybe just using them as clone sources would have just
worked? That would then of course produce much larger streams since
*all* meta data had to be transfered.


> 3) Making sure such use cases work reliably would require many, many
> changes to the send implementation, as it goes against what it
> currently expects.
>     Snapshot a subvolume, change the subvolume, snapshot it again,
> then use both snapshots for the incremental send, that's the expected
> scenario.
I actually don't think that it is really that much work since besides
from some edge cases it already *does* work - I tried ;)


> In other words, what I think we should have is a check that forbids
> using two roots for an incremental send that are not snapshots of the
> same subvolume (have different parent uuids).
I'd like to argue against that:
   1. I don't think allowing this requires that much work
   2. explicitly forbiding it requires work, too (and maybe even changes
      to the on-disk format?)
   3. fixing bugs for this unexpected use case will probably also fix bugs
      for the expected scenario which may only happen in very rare and
      extremly unlikely - though still possible - cases and thus make the
      code overall more resilient:

For example the assumption that inodes with the same number must refer
to the same inode doesn't even hold for direct snapshots - almost
always it does but in some rare conditions it doesn't, which is why
there already is an additional check for that.

This already caused bugs before. And the bug I hit with my unexpected
use of btrfs-send and originally set out to find you had just fixed a
few weeks before [1], i.e. if you hadn't fixed it I would - because of
the unexpected use.
The bug my patch fixes I only discovered by reading the code and having
my scenario in mind.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/commit/?id=0b3f407e6728d990ae1630a02c7b952c21c288d3

That's why I think the work fixing arising bugs for that case is better
invested than in trying to just block it completely.
But given its very rare occurence I would agree to not put any priority
on it.

> Thanks.
>
Filipe Manana Jan. 12, 2021, 1:54 p.m. UTC | #10
On Tue, Jan 12, 2021 at 1:10 PM Roman Anasal | BDSU
<roman.anasal@bdsu.de> wrote:
>
> On Tue, 2021-01-12 at 11:27 +0000, Filipe Manana wrote:
> > You get all these issues because you are using an incremental send in
> > a way it's not supposed to.
> > You are using two subvolumes that are completely unrelated.
> Yes, I am aware of that and know that this is not a designed for use
> case.
>
> > My surprise here is that we actually allow a user to try that,
> > instead
> > of giving an error complaining that subvol1 and subvol2 aren't
> > snapshots of the same subvolume.
> This could be one way of solving it. But this is already harder than it
> sounds, since the same issue may also happen *even when* the subvolumes
> share a parent/are related, here an example reproducer:
>
>   btrfs subvolume create subvol1
>   btrfs subvolume snapshot subvol1 subvol2
>   mkdir subvol1/a
>   echo foo > subvol2/a
>   btrfs property set subvol1 ro true
>   btrfs property set subvol2 ro true
>   btrfs send -p subvol1 subvol2 | btrfs receive --dump
>
> This will produce a stream that tries to write data into the cloned
> directory inode.

Indeed, it will.

That use case is a bit different however, the snapshot starts RW, is
changed and then turned RO and used for an incremental send.

But the idea of an incremental send was always to use snapshots from
the same subvolume and the snapshots were always RO - once created you
didn't change them.
Otherwise you can never ensure full consistency.

So in the end it's not that much different from the case of using
unrelated subvolumes or snapshots as in the patch's changelog.


>
> So we not only had to check the parent relationships but also ensure
> that the descendant snapshot was not modified, i.e. was never set to
> ro=false.
> As far as I know there is no flag tracking this on-disk? So this would
> need additional work, too.

There isn't.

Maybe all we need for now is to document how the incremental send is
supposed to be used.
Adding a flag to mark if a snapshot was ever RW shouldn't be too hard.

>
>
> > An incremental is supposed to work on snapshots of the same subvolume
> > (hence, have the same parent uuid).
> > That's what the entire code relies on to work correctly, and that's
> > what makes sense - to compute and send the differences between two
> > points in time of a subvolume.
> > That's why the code base assumes that inodes with the same number and
> > same generation must refer to the same inode in both the parent and
> > the send root.
> >
> > What I think that needs to be answered is:
> >
> > 1) Are there actually people using incremental sends in that way?
> > (It's the first time I see such use case)
> Well, I did (:
> But admittedly in a kind of experimental setup testing out the limits
> of btrfs.
>
> > 2) If so, why? That is completely unreliable, not only it can lead to
> > failure to apply the streams, but can result in all sorts of
> > weirdness
> > (logical inconsistencies, etc) if applying such streams doesn't cause
> > an error.
> In my case I was moving around subvolumes between multiple disks and
> deduplicating as much as possible. Trying to preserve already done
> deduplication and purging some intermediate subvolumes I ended up using
> "unrelated" subvolumes as parents.
>
> Thinking of it, maybe just using them as clone sources would have just
> worked? That would then of course produce much larger streams since
> *all* meta data had to be transfered.
>
>
> > 3) Making sure such use cases work reliably would require many, many
> > changes to the send implementation, as it goes against what it
> > currently expects.
> >     Snapshot a subvolume, change the subvolume, snapshot it again,
> > then use both snapshots for the incremental send, that's the expected
> > scenario.
> I actually don't think that it is really that much work since besides
> from some edge cases it already *does* work - I tried ;)
>
>
> > In other words, what I think we should have is a check that forbids
> > using two roots for an incremental send that are not snapshots of the
> > same subvolume (have different parent uuids).
> I'd like to argue against that:
>    1. I don't think allowing this requires that much work

For the cases you tried it surely didn't.

>    2. explicitly forbiding it requires work, too (and maybe even changes
>       to the on-disk format?)
>    3. fixing bugs for this unexpected use case will probably also fix bugs
>       for the expected scenario which may only happen in very rare and
>       extremly unlikely - though still possible - cases and thus make the
>       code overall more resilient:

I'd have to disagree with that.

Having pretty much being the only one, apart from Robbie Ko who did
solve some hard similar problems,
solving this kind of problems since 2013, it's far from trivial work
and there's always one more case to solve.

Going through the change logs and send specific fstests should give an
idea of why I don't think it's that trivial to add support for these
use cases.

My concern is that this will require a lot more work for a use case
that is not standard, it was not designed for,
and this always adds the risk of introducing regressions for the
expected and typical use cases.

So I really don't find it compelling to add support for cases that
send was designed for - unless there are indeed users for them and
there is a good reason why they can't use the standard way (use
snapshots of the same subvolume and never modify the snapshots).

>
> For example the assumption that inodes with the same number must refer
> to the same inode doesn't even hold for direct snapshots - almost
> always it does but in some rare conditions it doesn't, which is why
> there already is an additional check for that.

The assumptions everywhere take into account inode number and
generation, not just the number.
Any place that uses only the number to check if it's the same inode,
then it's almost certainly a bug (a notable exception would be the
pending directory renames iirc).

>
> This already caused bugs before. And the bug I hit with my unexpected
> use of btrfs-send and originally set out to find you had just fixed a
> few weeks before [1], i.e. if you hadn't fixed it I would - because of
> the unexpected use.
> The bug my patch fixes I only discovered by reading the code and having
> my scenario in mind.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/commit/?id=0b3f407e6728d990ae1630a02c7b952c21c288d3
>
> That's why I think the work fixing arising bugs for that case is better
> invested than in trying to just block it completely.
> But given its very rare occurence I would agree to not put any priority
> on it.

Well, I'm sure there were plenty of bug fixes for the intended use
cases that also fixed failures with the non-intended use cases.

Thanks.

>
> > Thanks.
> >
Roman Anasal | BDSU Jan. 12, 2021, 2:37 p.m. UTC | #11
On Tue, 2021-01-12 at 13:54 +0000, Filipe Manana wrote:
> On Tue, Jan 12, 2021 at 1:10 PM Roman Anasal | BDSU
> <roman.anasal@bdsu.de> wrote:
> > On Tue, 2021-01-12 at 11:27 +0000, Filipe Manana wrote:
> > > You get all these issues because you are using an incremental
> > > send in
> > > a way it's not supposed to.
> > > You are using two subvolumes that are completely unrelated.
> > Yes, I am aware of that and know that this is not a designed for
> > use
> > case.
> > 
> > > My surprise here is that we actually allow a user to try that,
> > > instead
> > > of giving an error complaining that subvol1 and subvol2 aren't
> > > snapshots of the same subvolume.
> > This could be one way of solving it. But this is already harder
> > than it
> > sounds, since the same issue may also happen *even when* the
> > subvolumes
> > share a parent/are related, here an example reproducer:
> > 
> >   btrfs subvolume create subvol1
> >   btrfs subvolume snapshot subvol1 subvol2
> >   mkdir subvol1/a
> >   echo foo > subvol2/a
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > This will produce a stream that tries to write data into the cloned
> > directory inode.
> 
> Indeed, it will.
> 
> That use case is a bit different however, the snapshot starts RW, is
> changed and then turned RO and used for an incremental send.
> 
> But the idea of an incremental send was always to use snapshots from
> the same subvolume and the snapshots were always RO - once created
> you
> didn't change them.
> Otherwise you can never ensure full consistency.
> 
> So in the end it's not that much different from the case of using
> unrelated subvolumes or snapshots as in the patch's changelog.

That's what I wanted to demonstrate: it's basically the same case - but
it is much harder/impossible to detect by the send code as long as
there is no flag on the subvolume whether it was ever rw.


> > So we not only had to check the parent relationships but also
> > ensure
> > that the descendant snapshot was not modified, i.e. was never set
> > to
> > ro=false.
> > As far as I know there is no flag tracking this on-disk? So this
> > would
> > need additional work, too.
> 
> There isn't.
> 
> Maybe all we need for now is to document how the incremental send is
> supposed to be used.
> Adding a flag to mark if a snapshot was ever RW shouldn't be too
> hard.
> 
> > 
> > > An incremental is supposed to work on snapshots of the same
> > > subvolume
> > > (hence, have the same parent uuid).
> > > That's what the entire code relies on to work correctly, and
> > > that's
> > > what makes sense - to compute and send the differences between
> > > two
> > > points in time of a subvolume.
> > > That's why the code base assumes that inodes with the same number
> > > and
> > > same generation must refer to the same inode in both the parent
> > > and
> > > the send root.
> > > 
> > > What I think that needs to be answered is:
> > > 
> > > 1) Are there actually people using incremental sends in that way?
> > > (It's the first time I see such use case)
> > Well, I did (:
> > But admittedly in a kind of experimental setup testing out the
> > limits
> > of btrfs.
> > 
> > > 2) If so, why? That is completely unreliable, not only it can
> > > lead to
> > > failure to apply the streams, but can result in all sorts of
> > > weirdness
> > > (logical inconsistencies, etc) if applying such streams doesn't
> > > cause
> > > an error.
> > In my case I was moving around subvolumes between multiple disks
> > and
> > deduplicating as much as possible. Trying to preserve already done
> > deduplication and purging some intermediate subvolumes I ended up
> > using
> > "unrelated" subvolumes as parents.
> > 
> > Thinking of it, maybe just using them as clone sources would have
> > just
> > worked? That would then of course produce much larger streams since
> > *all* meta data had to be transfered.
> > 
> > 
> > > 3) Making sure such use cases work reliably would require many,
> > > many
> > > changes to the send implementation, as it goes against what it
> > > currently expects.
> > >     Snapshot a subvolume, change the subvolume, snapshot it
> > > again,
> > > then use both snapshots for the incremental send, that's the
> > > expected
> > > scenario.
> > I actually don't think that it is really that much work since
> > besides
> > from some edge cases it already *does* work - I tried ;)
> > 
> > 
> > > In other words, what I think we should have is a check that
> > > forbids
> > > using two roots for an incremental send that are not snapshots of
> > > the
> > > same subvolume (have different parent uuids).
> > I'd like to argue against that:
> >    1. I don't think allowing this requires that much work
> 
> For the cases you tried it surely didn't.
> 
> >    2. explicitly forbiding it requires work, too (and maybe even
> > changes
> >       to the on-disk format?)
> >    3. fixing bugs for this unexpected use case will probably also
> > fix bugs
> >       for the expected scenario which may only happen in very rare
> > and
> >       extremly unlikely - though still possible - cases and thus
> > make the
> >       code overall more resilient:
> 
> I'd have to disagree with that.
> 
> Having pretty much being the only one, apart from Robbie Ko who did
> solve some hard similar problems,
> solving this kind of problems since 2013, it's far from trivial work
> and there's always one more case to solve.
> 
> Going through the change logs and send specific fstests should give
> an
> idea of why I don't think it's that trivial to add support for these
> use cases.
> 
> My concern is that this will require a lot more work for a use case
> that is not standard, it was not designed for,
> and this always adds the risk of introducing regressions for the
> expected and typical use cases.
> 
> So I really don't find it compelling to add support for cases that
> send was designed for - unless there are indeed users for them and
> there is a good reason why they can't use the standard way (use
> snapshots of the same subvolume and never modify the snapshots).

I didn't want to propose adding and supporting a whole new use case
here. In my understanding this case already is handled for the most
part by the existing code and there should only be a few edge cases
left which should be catchable by additional checks.

Looking at this very patch: all it basically does is add an additional
check if the inode has the same type in the snapshot and the parent and
if not branches in the existing code for recreating the inode.
For the expected scenario this is (incidentally) covered by checking if
the generation changed. So the additional check is no problem here and
redundant at most.

If any further fixes for the unsupported cases are only possible with
major reworks I totally agree with you: the risk of breaking the main
case outweihts the benefits.


But limited to this concrete patch: would you consider to include it?


> > For example the assumption that inodes with the same number must
> > refer
> > to the same inode doesn't even hold for direct snapshots - almost
> > always it does but in some rare conditions it doesn't, which is why
> > there already is an additional check for that.
> 
> The assumptions everywhere take into account inode number and
> generation, not just the number.
> Any place that uses only the number to check if it's the same inode,
> then it's almost certainly a bug (a notable exception would be the
> pending directory renames iirc).
> 
> > This already caused bugs before. And the bug I hit with my
> > unexpected
> > use of btrfs-send and originally set out to find you had just fixed
> > a
> > few weeks before [1], i.e. if you hadn't fixed it I would - because
> > of
> > the unexpected use.
> > The bug my patch fixes I only discovered by reading the code and
> > having
> > my scenario in mind.
> > 
> > [1] 
> > https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/commit/?id=0b3f407e6728d990ae1630a02c7b952c21c288d3
> > 
> > That's why I think the work fixing arising bugs for that case is
> > better
> > invested than in trying to just block it completely.
> > But given its very rare occurence I would agree to not put any
> > priority
> > on it.
> 
> Well, I'm sure there were plenty of bug fixes for the intended use
> cases that also fixed failures with the non-intended use cases.

Of course, absolutely!

> 
> Thanks.
> 
> > > Thanks.
> > > 
> 
>
Filipe Manana Jan. 12, 2021, 3:08 p.m. UTC | #12
On Tue, Jan 12, 2021 at 2:37 PM Roman Anasal | BDSU
<roman.anasal@bdsu.de> wrote:
>
> On Tue, 2021-01-12 at 13:54 +0000, Filipe Manana wrote:
> > On Tue, Jan 12, 2021 at 1:10 PM Roman Anasal | BDSU
> > <roman.anasal@bdsu.de> wrote:
> > > On Tue, 2021-01-12 at 11:27 +0000, Filipe Manana wrote:
> > > > You get all these issues because you are using an incremental
> > > > send in
> > > > a way it's not supposed to.
> > > > You are using two subvolumes that are completely unrelated.
> > > Yes, I am aware of that and know that this is not a designed for
> > > use
> > > case.
> > >
> > > > My surprise here is that we actually allow a user to try that,
> > > > instead
> > > > of giving an error complaining that subvol1 and subvol2 aren't
> > > > snapshots of the same subvolume.
> > > This could be one way of solving it. But this is already harder
> > > than it
> > > sounds, since the same issue may also happen *even when* the
> > > subvolumes
> > > share a parent/are related, here an example reproducer:
> > >
> > >   btrfs subvolume create subvol1
> > >   btrfs subvolume snapshot subvol1 subvol2
> > >   mkdir subvol1/a
> > >   echo foo > subvol2/a
> > >   btrfs property set subvol1 ro true
> > >   btrfs property set subvol2 ro true
> > >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > >
> > > This will produce a stream that tries to write data into the cloned
> > > directory inode.
> >
> > Indeed, it will.
> >
> > That use case is a bit different however, the snapshot starts RW, is
> > changed and then turned RO and used for an incremental send.
> >
> > But the idea of an incremental send was always to use snapshots from
> > the same subvolume and the snapshots were always RO - once created
> > you
> > didn't change them.
> > Otherwise you can never ensure full consistency.
> >
> > So in the end it's not that much different from the case of using
> > unrelated subvolumes or snapshots as in the patch's changelog.
>
> That's what I wanted to demonstrate: it's basically the same case - but
> it is much harder/impossible to detect by the send code as long as
> there is no flag on the subvolume whether it was ever rw.

Yes, the only problem I see here is that we don't prevent the unsupported cases.
Detecting if the send and parent roots are snapshots of the same
subvolume is easy.

Checking if a snapshot was ever RW before is not, we would need a flag.
This is already known to cause problems, and has happened and been
reported in the past - mostly someone changes a snapshot from RO to
RW, changes it, sets it back to RO and uses it for an incremental send
later.

These all fall within the same category: not being able to ensure
consistency of the roots being used for an incremental send.

It's unfortunate such cases aren't forbidden with an explicit error.

>
>
> > > So we not only had to check the parent relationships but also
> > > ensure
> > > that the descendant snapshot was not modified, i.e. was never set
> > > to
> > > ro=false.
> > > As far as I know there is no flag tracking this on-disk? So this
> > > would
> > > need additional work, too.
> >
> > There isn't.
> >
> > Maybe all we need for now is to document how the incremental send is
> > supposed to be used.
> > Adding a flag to mark if a snapshot was ever RW shouldn't be too
> > hard.
> >
> > >
> > > > An incremental is supposed to work on snapshots of the same
> > > > subvolume
> > > > (hence, have the same parent uuid).
> > > > That's what the entire code relies on to work correctly, and
> > > > that's
> > > > what makes sense - to compute and send the differences between
> > > > two
> > > > points in time of a subvolume.
> > > > That's why the code base assumes that inodes with the same number
> > > > and
> > > > same generation must refer to the same inode in both the parent
> > > > and
> > > > the send root.
> > > >
> > > > What I think that needs to be answered is:
> > > >
> > > > 1) Are there actually people using incremental sends in that way?
> > > > (It's the first time I see such use case)
> > > Well, I did (:
> > > But admittedly in a kind of experimental setup testing out the
> > > limits
> > > of btrfs.
> > >
> > > > 2) If so, why? That is completely unreliable, not only it can
> > > > lead to
> > > > failure to apply the streams, but can result in all sorts of
> > > > weirdness
> > > > (logical inconsistencies, etc) if applying such streams doesn't
> > > > cause
> > > > an error.
> > > In my case I was moving around subvolumes between multiple disks
> > > and
> > > deduplicating as much as possible. Trying to preserve already done
> > > deduplication and purging some intermediate subvolumes I ended up
> > > using
> > > "unrelated" subvolumes as parents.
> > >
> > > Thinking of it, maybe just using them as clone sources would have
> > > just
> > > worked? That would then of course produce much larger streams since
> > > *all* meta data had to be transfered.
> > >
> > >
> > > > 3) Making sure such use cases work reliably would require many,
> > > > many
> > > > changes to the send implementation, as it goes against what it
> > > > currently expects.
> > > >     Snapshot a subvolume, change the subvolume, snapshot it
> > > > again,
> > > > then use both snapshots for the incremental send, that's the
> > > > expected
> > > > scenario.
> > > I actually don't think that it is really that much work since
> > > besides
> > > from some edge cases it already *does* work - I tried ;)
> > >
> > >
> > > > In other words, what I think we should have is a check that
> > > > forbids
> > > > using two roots for an incremental send that are not snapshots of
> > > > the
> > > > same subvolume (have different parent uuids).
> > > I'd like to argue against that:
> > >    1. I don't think allowing this requires that much work
> >
> > For the cases you tried it surely didn't.
> >
> > >    2. explicitly forbiding it requires work, too (and maybe even
> > > changes
> > >       to the on-disk format?)
> > >    3. fixing bugs for this unexpected use case will probably also
> > > fix bugs
> > >       for the expected scenario which may only happen in very rare
> > > and
> > >       extremly unlikely - though still possible - cases and thus
> > > make the
> > >       code overall more resilient:
> >
> > I'd have to disagree with that.
> >
> > Having pretty much being the only one, apart from Robbie Ko who did
> > solve some hard similar problems,
> > solving this kind of problems since 2013, it's far from trivial work
> > and there's always one more case to solve.
> >
> > Going through the change logs and send specific fstests should give
> > an
> > idea of why I don't think it's that trivial to add support for these
> > use cases.
> >
> > My concern is that this will require a lot more work for a use case
> > that is not standard, it was not designed for,
> > and this always adds the risk of introducing regressions for the
> > expected and typical use cases.
> >
> > So I really don't find it compelling to add support for cases that
> > send was designed for - unless there are indeed users for them and
> > there is a good reason why they can't use the standard way (use
> > snapshots of the same subvolume and never modify the snapshots).
>
> I didn't want to propose adding and supporting a whole new use case
> here. In my understanding this case already is handled for the most
> part by the existing code and there should only be a few edge cases
> left which should be catchable by additional checks.

Well, that's what change is doing, adding support for cases which we
can't ensure are going to work all the time.

If we change the logic here, at , then later we'll realize we will
need to change the logic somewhere else too - as after this change
having the same inode number and generation can actually mean a new
inode, and other places do such check - or worse, introducing a
regression for the cases where send was designed to work. Not to
mention that future code changes, for the supported scenarios, can
become more complex because of this.

>
> Looking at this very patch: all it basically does is add an additional
> check if the inode has the same type in the snapshot and the parent and
> if not branches in the existing code for recreating the inode.
> For the expected scenario this is (incidentally) covered by checking if
> the generation changed. So the additional check is no problem here and
> redundant at most.
>
> If any further fixes for the unsupported cases are only possible with
> major reworks I totally agree with you: the risk of breaking the main
> case outweihts the benefits.
>
>
> But limited to this concrete patch: would you consider to include it?

I wouldn't, for the reasons outlined before in this thread (which also
makes the first patch unnecessary).

However I'm all for making send fail with an error for the unsupported
cases and updating the documentation (man page, wiki, etc) in case
it's not explicit or it is missing something.

Thanks.

>
>
> > > For example the assumption that inodes with the same number must
> > > refer
> > > to the same inode doesn't even hold for direct snapshots - almost
> > > always it does but in some rare conditions it doesn't, which is why
> > > there already is an additional check for that.
> >
> > The assumptions everywhere take into account inode number and
> > generation, not just the number.
> > Any place that uses only the number to check if it's the same inode,
> > then it's almost certainly a bug (a notable exception would be the
> > pending directory renames iirc).
> >
> > > This already caused bugs before. And the bug I hit with my
> > > unexpected
> > > use of btrfs-send and originally set out to find you had just fixed
> > > a
> > > few weeks before [1], i.e. if you hadn't fixed it I would - because
> > > of
> > > the unexpected use.
> > > The bug my patch fixes I only discovered by reading the code and
> > > having
> > > my scenario in mind.
> > >
> > > [1]
> > > https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/commit/?id=0b3f407e6728d990ae1630a02c7b952c21c288d3
> > >
> > > That's why I think the work fixing arising bugs for that case is
> > > better
> > > invested than in trying to just block it completely.
> > > But given its very rare occurence I would agree to not put any
> > > priority
> > > on it.
> >
> > Well, I'm sure there were plenty of bug fixes for the intended use
> > cases that also fixed failures with the non-intended use cases.
>
> Of course, absolutely!
>
> >
> > Thanks.
> >
> > > > Thanks.
> > > >
> >
> >
diff mbox series

Patch

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 420371c1d..33ae48442 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6299,12 +6299,18 @@  static int changed_inode(struct send_ctx *sctx,
 		right_gen = btrfs_inode_generation(sctx->right_path->nodes[0],
 				right_ii);
 
+		u64 left_type = S_IFMT & btrfs_inode_mode(
+				sctx->left_path->nodes[0], left_ii);
+		u64 right_type = S_IFMT & btrfs_inode_mode(
+				sctx->right_path->nodes[0], right_ii);
+
+
 		/*
 		 * The cur_ino = root dir case is special here. We can't treat
 		 * the inode as deleted+reused because it would generate a
 		 * stream that tries to delete/mkdir the root dir.
 		 */
-		if (left_gen != right_gen &&
+		if ((left_gen != right_gen || left_type != right_type) &&
 		    sctx->cur_ino != BTRFS_FIRST_FREE_OBJECTID)
 			sctx->cur_inode_recreated = 1;
 	}
@@ -6359,10 +6365,10 @@  static int changed_inode(struct send_ctx *sctx,
 	} else if (result == BTRFS_COMPARE_TREE_CHANGED) {
 		/*
 		 * We need to do some special handling in case the inode was
-		 * reported as changed with a changed generation number. This
-		 * means that the original inode was deleted and new inode
-		 * reused the same inum. So we have to treat the old inode as
-		 * deleted and the new one as new.
+		 * reported as changed with a changed generation number or
+		 * changed inode type. This means that the original inode was
+		 * deleted and new inode reused the same inum. So we have to
+		 * treat the old inode as deleted and the new one as new.
 		 */
 		if (sctx->cur_inode_recreated) {
 			/*