diff mbox series

VFS/BTRFS/NFSD: provide more unique inode number for btrfs export

Message ID 162881913686.1695.12479588032010502384@noble.neil.brown.name (mailing list archive)
State New, archived
Headers show
Series VFS/BTRFS/NFSD: provide more unique inode number for btrfs export | expand

Commit Message

NeilBrown Aug. 13, 2021, 1:45 a.m. UTC
[[This patch is a minimal patch which addresses the current problems
  with nfsd and btrfs, in a way which I think is most supportable, least
  surprising, and least likely to impact any future attempts to more
  completely fix the btrfs file-identify problem]]

BTRFS does not provide unique inode numbers across a filesystem.
It *does* provide unique inode numbers within a subvolume and
uses synthetic device numbers for different subvolumes to ensure
uniqueness for device+inode.

nfsd cannot use these varying device numbers.  If nfsd were to
synthesise different stable filesystem ids to give to the client, that
would cause subvolumes to appear in the mount table on the client, even
though they don't appear in the mount table on the server.  Also, NFSv3
doesn't support changing the filesystem id without a new explicit
mount on the client (this is partially supported in practice, but
violates the protocol specification).

So currently, the roots of all subvolumes report the same inode number
in the same filesystem to NFS clients and tools like 'find' notice that
a directory has the same identity as an ancestor, and so refuse to
enter that directory.

This patch allows btrfs (or any filesystem) to provide a 64bit number
that can be xored with the inode number to make the number more unique.
Rather than the client being certain to see duplicates, with this patch
it is possible but extremely rare.

The number that btrfs provides is a swab64() version of the subvolume
identifier.  This has most entropy in the high bits (the low bits of the
subvolume identifier), while the inode has most entropy in the low bits.
The result will always be unique within a subvolume, and will almost
always be unique across the filesystem.
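
As a concrete illustration (a standalone sketch with made-up example
values, not part of the patch itself), the combination behaves like
this:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t subvol_id = 257;	/* example subvolume id */
		uint64_t ino = 256;		/* example inode number */
		/* __builtin_bswap64() stands in for the kernel's swab64() */
		uint64_t uniq = __builtin_bswap64(subvol_id);

		printf("uniquifier: 0x%016llx\n", (unsigned long long)uniq);
		printf("fileid:     0x%016llx\n",
		       (unsigned long long)(uniq ^ ino));
		return 0;
	}

This prints 0x0101000000000000 and 0x0101000000000100: the subvolume id
lands in the otherwise mostly-idle high bits, while the low bits still
carry the inode number unchanged.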

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/btrfs/inode.c     |  4 ++++
 fs/nfsd/nfs3xdr.c    | 17 ++++++++++++++++-
 fs/nfsd/nfs4xdr.c    |  9 ++++++++-
 fs/nfsd/xdr3.h       |  2 ++
 include/linux/stat.h | 17 +++++++++++++++++
 5 files changed, 47 insertions(+), 2 deletions(-)

Comments

Josef Bacik Aug. 13, 2021, 2:55 p.m. UTC | #1
On 8/12/21 9:45 PM, NeilBrown wrote:
> 
> [[This patch is a minimal patch which addresses the current problems
>    with nfsd and btrfs, in a way which I think is most supportable, least
>    surprising, and least likely to impact any future attempts to more
>    completely fix the btrfs file-identify problem]]
> 
> BTRFS does not provide unique inode numbers across a filesystem.
> It *does* provide unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
> 
> nfsd cannot use these varying device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit
> mount on the client (this is partially supported in practice, but
> violates the protocol specification).
> 
> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
> 
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
> 
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifier), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
> 

This is a reasonable approach to me; it solves the problem without being overly 
complicated and side-steps the thornier issues around how we deal with 
subvolumes.  I'll leave it up to the maintainers of the other fs'es to 
weigh in, but for me you can add

Acked-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Goffredo Baroncelli Aug. 15, 2021, 7:39 a.m. UTC | #2
Hi All,

On 8/13/21 3:45 AM, NeilBrown wrote:
> 
> [[This patch is a minimal patch which addresses the current problems
>    with nfsd and btrfs, in a way which I think is most supportable, least
>    surprising, and least likely to impact any future attempts to more
>    completely fix the btrfs file-identify problem]]
> 
> BTRFS does not provide unique inode numbers across a filesystem.
> It *does* provide unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
> 
> nfsd cannot use these varying device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit
> mount on the client (this is partially supported in practice, but
> violates the protocol specification).

I am sure that this was discussed already, but I was unable to find any trace of the discussion.
But if the problem is the collision between the inode numbers of different subvolumes in the nfsd export, would it be simpler to truncate the export at the subvolume boundary? It would be more coherent with the current behavior of vfs+nfsd.

In fact in btrfs a subvolume is a complete filesystem, with its "own synthetic" device. We may or may not like this solution, but it is the one most aligned with the unix standard, where each filesystem has a pair (device, inode-set). NFS (by default) avoids crossing the boundary between filesystems. So why should this
be different in BTRFS?

Maybe an opt-in option would solve the backward compatibility concern (i.e. to avoid problems with setups which rely on the current behaviour).

> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
> 
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
> 
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifier), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>   fs/btrfs/inode.c     |  4 ++++
>   fs/nfsd/nfs3xdr.c    | 17 ++++++++++++++++-
>   fs/nfsd/nfs4xdr.c    |  9 ++++++++-
>   fs/nfsd/xdr3.h       |  2 ++
>   include/linux/stat.h | 17 +++++++++++++++++
>   5 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0117d867ecf8..989fdf2032d5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>   	generic_fillattr(&init_user_ns, inode, stat);
>   	stat->dev = BTRFS_I(inode)->root->anon_dev;
>   
> +	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
> +		stat->ino_uniquifier =
> +			swab64(BTRFS_I(inode)->root->root_key.objectid);
> +
>   	spin_lock(&BTRFS_I(inode)->lock);
>   	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
>   	inode_bytes = inode_get_bytes(inode);
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..669e2437362a 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>   {
>   	struct user_namespace *userns = nfsd_user_namespace(rqstp);
>   	__be32 *p;
> +	u64 ino;
>   	u64 fsid;
>   
>   	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
> @@ -377,7 +378,10 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>   	p = xdr_encode_hyper(p, fsid);
>   
>   	/* fileid */
> -	p = xdr_encode_hyper(p, stat->ino);
> +	ino = stat->ino;
> +	if (stat->ino_uniquifier && stat->ino_uniquifier != ino)
> +		ino ^= stat->ino_uniquifier;
> +	p = xdr_encode_hyper(p, ino);
>   
>   	p = encode_nfstime3(p, &stat->atime);
>   	p = encode_nfstime3(p, &stat->mtime);
> @@ -1151,6 +1155,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
>   	if (xdr_stream_encode_item_present(xdr) < 0)
>   		return false;
>   	/* fileid */
> +	if (!resp->dir_have_uniquifier) {
> +		struct kstat stat;
> +		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
> +			resp->dir_ino_uniquifier = stat.ino_uniquifier;
> +		else
> +			resp->dir_ino_uniquifier = 0;
> +		resp->dir_have_uniquifier = 1;
> +	}
> +	if (resp->dir_ino_uniquifier &&
> +	    resp->dir_ino_uniquifier != ino)
> +		ino ^= resp->dir_ino_uniquifier;
>   	if (xdr_stream_encode_u64(xdr, ino) < 0)
>   		return false;
>   	/* name */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..ddccf849c29c 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3114,10 +3114,14 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>   					fhp->fh_handle.fh_size);
>   	}
>   	if (bmval0 & FATTR4_WORD0_FILEID) {
> +		u64 ino = stat.ino;
> +		if (stat.ino_uniquifier &&
> +		    stat.ino_uniquifier != stat.ino)
> +			ino ^= stat.ino_uniquifier;
>   		p = xdr_reserve_space(xdr, 8);
>   		if (!p)
>   			goto out_resource;
> -		p = xdr_encode_hyper(p, stat.ino);
> +		p = xdr_encode_hyper(p, ino);
>   	}
>   	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
>   		p = xdr_reserve_space(xdr, 8);
> @@ -3285,6 +3289,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>   			if (err)
>   				goto out_nfserr;
>   			ino = parent_stat.ino;
> +			if (parent_stat.ino_uniquifier &&
> +			    parent_stat.ino_uniquifier != ino)
> +				ino ^= parent_stat.ino_uniquifier;
>   		}
>   		p = xdr_encode_hyper(p, ino);
>   	}
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 933008382bbe..b4f9f3c71f72 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -179,6 +179,8 @@ struct nfsd3_readdirres {
>   	struct xdr_buf		dirlist;
>   	struct svc_fh		scratch;
>   	struct readdir_cd	common;
> +	u64			dir_ino_uniquifier;
> +	int			dir_have_uniquifier;
>   	unsigned int		cookie_offset;
>   	struct svc_rqst *	rqstp;
>   
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index fff27e603814..a5188f42ed81 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -46,6 +46,23 @@ struct kstat {
>   	struct timespec64 btime;			/* File creation time */
>   	u64		blocks;
>   	u64		mnt_id;
> +	/*
> +	 * BTRFS does not provide unique inode numbers within a filesystem,
> +	 * depending on a synthetic 'dev' to provide uniqueness.
> +	 * NFSd cannot make use of this 'dev' number so clients often see
> +	 * duplicate inode numbers.
> +	 * For BTRFS, 'ino' is unlikely to use the high bits.  It puts
> +	 * another number in ino_uniquifier which:
> +	 * - has most entropy in the high bits
> +	 * - is different precisely when 'dev' is different
> +	 * - is stable across unmount/remount
> +	 * NFSd can xor this with 'ino' to get a substantially more unique
> +	 * number for reporting to the client.
> +	 * The ino_uniquifier for a directory can reasonably be applied
> +	 * to inode numbers reported by the readdir filldir callback.
> +	 * It is NOT currently exported to user-space.
> +	 */
> +	u64		ino_uniquifier;
>   };

Why not rename "ino_uniquifier" to "ino_and_subvolume" and leave to the filesystem the work of combining the inode and the subvolume-id?

I am worried that the logic is split between the filesystem, which synthesizes the ino_uniquifier, and NFS, which combines it with the inode. I am thinking that this combination is filesystem specific; for BTRFS it is a simple xor, but for another filesystem it may be a more complex operation, so leaving one half in the filesystem and the other half in NFS seems not optimal if other filesystems need to use ino_uniquifier.
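
To make the idea concrete, here is a sketch of the filesystem-side
variant (purely illustrative: "ino_and_subvolume" is the proposed name
from above, and the xor is just the combination the patch already uses
for btrfs; another filesystem could substitute its own):

	/* hypothetical: btrfs_getattr() reporting an already-combined
	 * number instead of a separate uniquifier field */
	u64 objectid = BTRFS_I(inode)->root->root_key.objectid;

	stat->ino_and_subvolume = stat->ino;
	if (objectid != BTRFS_FS_TREE_OBJECTID)
		stat->ino_and_subvolume ^= swab64(objectid);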


>   #endif
>
Roman Mamedov Aug. 15, 2021, 7:35 p.m. UTC | #3
On Sun, 15 Aug 2021 09:39:08 +0200
Goffredo Baroncelli <kreijack@libero.it> wrote:

> I am sure that this was discussed already, but I was unable to find any
> trace of the discussion. But if the problem is the collision between the
> inode numbers of different subvolumes in the nfsd export, would it be
> simpler to truncate the export at the subvolume boundary? It would be more
> coherent with the current behavior of vfs+nfsd.

See this bugreport thread which started it all:
https://www.spinics.net/lists/linux-btrfs/msg111172.html

In there the reporting user replied that it is strongly not feasible for them
to export each individual snapshot.

> In fact in btrfs a subvolume is a complete filesystem, with its "own
> synthetic" device. We may or may not like this solution, but it is the one
> most aligned with the unix standard, where each filesystem has a pair
> (device, inode-set). NFS (by default) avoids crossing the boundary between
> filesystems. So why should this be different in BTRFS?

From the user point of view subvolumes are basically directories; that they
are "complete filesystems"* is merely a low-level implementation detail.

* well except they are not, as you cannot 'dd' a subvolume to another
blockdevice.

> Why not rename "ino_uniquifier" to "ino_and_subvolume" and leave to the
> filesystem the work of combining the inode and the subvolume-id?
>
> I am worried that the logic is split between the filesystem, which
> synthesizes the ino_uniquifier, and NFS, which combines it with the inode.
> I am thinking that this combination is filesystem specific; for BTRFS it is
> a simple xor, but for another filesystem it may be a more complex operation,
> so leaving one half in the filesystem and the other half in NFS seems not
> optimal if other filesystems need to use ino_uniquifier.

I wondered a bit myself, what are the downsides of just doing the
uniquification inside Btrfs, not leaving that to NFSD?

I mean not even adding the extra stat field, just return the inode itself with
that already applied. Surely cannot be any worse collision-wise, than
different subvolumes straight up having the same inode numbers as right now?

Or is it a performance concern, always doing more work, for something which
only NFSD has needed so far.
Goffredo Baroncelli Aug. 15, 2021, 9:03 p.m. UTC | #4
On 8/15/21 9:35 PM, Roman Mamedov wrote:
> On Sun, 15 Aug 2021 09:39:08 +0200
> Goffredo Baroncelli <kreijack@libero.it> wrote:
> 
>> I am sure that this was discussed already, but I was unable to find any
>> trace of the discussion. But if the problem is the collision between the
>> inode numbers of different subvolumes in the nfsd export, would it be
>> simpler to truncate the export at the subvolume boundary? It would be more
>> coherent with the current behavior of vfs+nfsd.
> 
> See this bugreport thread which started it all:
> https://www.spinics.net/lists/linux-btrfs/msg111172.html
> 
> In there the reporting user replied that it is strongly not feasible for them
> to export each individual snapshot.

Thanks for pointing that.

However, looking at the 'exports' man page, it seems that NFS already has an
option to cover these cases: 'crossmnt'.

If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
the parent AND the child filesystem with two separate mounts, so there is no problem of inode collision.

I tested it by mounting two nested ext4 filesystems, and there isn't any problem of inode collision
(even though there are two different files with the same inode number).

---------
# mount -o loop disk2 test3/
# echo 123 >test3/one
# mkdir test3/test4
# sudo mount -o loop disk3 test3/test4/
# echo 123 >test3/test4/one
# ls -liR test3/
test3/:
total 24
11 drwx------ 2 root  root  16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:29 one
  2 drwxr-xrwx 3 root  root   4096 Aug 15 22:46 test4

test3/test4:
total 20
11 drwx------ 2 root  root  16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:46 one

# egrep test3 /etc/exports
/tmp/test3 *(rw,no_subtree_check,crossmnt)

# mount 192.168.1.27:/tmp/test3 3
ls -lRi 3
3:
total 24
11 drwx------ 2 root  root  16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:29 one
  2 drwxr-xrwx 3 root  root   4096 Aug 15 22:46 test4

3/test4:
total 20
11 drwx------ 2 root  root  16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo     4 Aug 15 22:46 one

# mount | egrep 192
192.168.1.27:/tmp/test3 on /tmp/3 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)
192.168.1.27:/tmp/test3/test4 on /tmp/3/test4 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)


---------------

I tried mounting with "nfsvers=3" as well, and it seems to work.

However the tests above are related to ext4; in fact it doesn't work with btrfs, but I think this is more an implementation problem than a strategy problem.
What I mean is that NFS already has a way to mount different parts of the fs-tree with different mounts (from a client POV). I think that this strategy should be used when NFSd exports a BTRFS filesystem:
- if 'crossmnt' is NOT passed, the export should end at the subvolume boundary (or allow inode collisions)
- if 'crossmnt' is passed, the client should automatically mount each nested subvolume as a separate mount

> 
>> In fact in btrfs a subvolume is a complete filesystem, with its "own
>> synthetic" device. We may or may not like this solution, but it is the one
>> most aligned with the unix standard, where each filesystem has a pair
>> (device, inode-set). NFS (by default) avoids crossing the boundary between
>> filesystems. So why should this be different in BTRFS?
> 
>  From the user point of view subvolumes are basically directories; that they
> are "complete filesystems"* is merely a low-level implementation detail.
> 
> * well except they are not, as you cannot 'dd' a subvolume to another
> blockdevice.
> 
>> Why not rename "ino_uniquifier" to "ino_and_subvolume" and leave to the
>> filesystem the work of combining the inode and the subvolume-id?
>>
>> I am worried that the logic is split between the filesystem, which
>> synthesizes the ino_uniquifier, and NFS, which combines it with the inode.
>> I am thinking that this combination is filesystem specific; for BTRFS it is
>> a simple xor, but for another filesystem it may be a more complex operation,
>> so leaving one half in the filesystem and the other half in NFS seems not
>> optimal if other filesystems need to use ino_uniquifier.
> 
> I wondered a bit myself, what are the downsides of just doing the
> uniquification inside Btrfs, not leaving that to NFSD?
> 
> I mean not even adding the extra stat field, just return the inode itself with
> that already applied. Surely cannot be any worse collision-wise, than
> different subvolumes straight up having the same inode numbers as right now?
> 
> Or is it a performance concern, always doing more work, for something which
> only NFSD has needed so far.
>
NeilBrown Aug. 15, 2021, 9:53 p.m. UTC | #5
On Mon, 16 Aug 2021, kreijack@inwind.it wrote:
> On 8/15/21 9:35 PM, Roman Mamedov wrote:
> > On Sun, 15 Aug 2021 09:39:08 +0200
> > Goffredo Baroncelli <kreijack@libero.it> wrote:
> > 
> >> I am sure that this was discussed already, but I was unable to find any
> >> trace of the discussion. But if the problem is the collision between the
> >> inode numbers of different subvolumes in the nfsd export, would it be
> >> simpler to truncate the export at the subvolume boundary? It would be
> >> more coherent with the current behavior of vfs+nfsd.
> > 
> > See this bugreport thread which started it all:
> > https://www.spinics.net/lists/linux-btrfs/msg111172.html
> > 
> > In there the reporting user replied that it is strongly not feasible for them
> > to export each individual snapshot.
> 
> Thanks for pointing that.
> 
> However, looking at the 'exports' man page, it seems that NFS already has an
> option to cover these cases: 'crossmnt'.
>
> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
> the parent AND the child filesystem with two separate mounts, so there is no problem of inode collision.

As you acknowledged, you haven't read the whole back-story.  Maybe you
should.

https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/
https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/

The flow of conversation does sometimes jump between threads.

I'm very happy to respond to your questions after you've absorbed all that.

NeilBrown
NeilBrown Aug. 15, 2021, 10:17 p.m. UTC | #6
On Mon, 16 Aug 2021, Roman Mamedov wrote:
> 
> I wondered a bit myself, what are the downsides of just doing the
> uniquification inside Btrfs, not leaving that to NFSD?
> 
> I mean not even adding the extra stat field, just return the inode itself with
> that already applied. Surely cannot be any worse collision-wise, than
> different subvolumes straight up having the same inode numbers as right now?
> 
> Or is it a performance concern, always doing more work, for something which
> only NFSD has needed so far.

Any change in behaviour will have unexpected consequences.  I think the
btrfs maintainers' perspective is that they don't want to change
behaviour if they don't have to (which is reasonable) and that currently
they don't have to (which probably means that users aren't complaining
loudly enough).

NFS export of BTRFS is already demonstrably broken and users are
complaining loudly enough that I can hear them ....  though I think it
has been broken like this for 10 years, so I wonder why I didn't hear
them before.

If something is perceived as broken, then a behaviour change that
appears to fix it is more easily accepted.

However, having said that I now see that my latest patch is not ideal.
It changes the inode numbers associated with filehandles of objects in
the non-root subvolume.  This will cause the Linux NFS client to treat
the object as 'stale'.  For most objects this is a transient annoyance.
Reopen the file or restart the process and all should be well again.
However if the inode number of the mount point changes, you will need to
unmount and remount.  That is somewhat more of an annoyance.

There are a few ways to handle this more gracefully.

1/ We could get btrfs to hand out new filehandles as well as new inode
numbers, but still accept the old filehandles.  Then we could make the
inode number reported be based on the filehandle.  This would be nearly
seamless but rather clumsy to code.  I'm not *very* keen on this idea,
but it is worth keeping in mind.

2/ We could add a btrfs mount option to control whether the uniquifier
was set or not.  This would allow the sysadmin to choose when to manage
any breakage.  I think this is my preference, but Josef has declared an
aversion to mount options.

3/ We could add a module parameter to nfsd to control whether the
uniquifier is merged in.  This again gives the sysadmin control, and it
can be done despite any aversion from btrfs maintainers.  But I'd need
to overcome any aversion from the nfsd maintainers, and I don't know how
strong that would be yet. (A new export option isn't really appropriate.
It is much more work to add an export option than to add a mount option).
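
For what it's worth, option 3 could be quite small.  A sketch, with the
parameter and helper names invented for illustration:

	/* hypothetical nfsd module parameter gating the uniquifier */
	static bool ino_uniquifier = true;
	module_param(ino_uniquifier, bool, 0644);
	MODULE_PARM_DESC(ino_uniquifier,
			 "xor kstat.ino_uniquifier into inode numbers sent to clients");

	/* used wherever the patch currently applies the xor */
	static u64 nfsd_fileid(const struct kstat *stat)
	{
		u64 ino = stat->ino;

		if (ino_uniquifier &&
		    stat->ino_uniquifier && stat->ino_uniquifier != ino)
			ino ^= stat->ino_uniquifier;
		return ino;
	}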

I don't know.... maybe I should try harder to like option 1, or at least
verify if it works as expected and see how ugly the code really is.

NeilBrown
Goffredo Baroncelli Aug. 17, 2021, 7:34 p.m. UTC | #7
On 8/15/21 11:53 PM, NeilBrown wrote:
> On Mon, 16 Aug 2021, kreijack@inwind.it wrote:
>> On 8/15/21 9:35 PM, Roman Mamedov wrote:
>>> On Sun, 15 Aug 2021 09:39:08 +0200
>>> Goffredo Baroncelli <kreijack@libero.it> wrote:
>>>
>>>> I am sure that this was discussed already, but I was unable to find any
>>>> trace of the discussion. But if the problem is the collision between the
>>>> inode numbers of different subvolumes in the nfsd export, would it be
>>>> simpler to truncate the export at the subvolume boundary? It would be
>>>> more coherent with the current behavior of vfs+nfsd.
>>>
>>> See this bugreport thread which started it all:
>>> https://www.spinics.net/lists/linux-btrfs/msg111172.html
>>>
>>> In there the reporting user replied that it is strongly not feasible for them
>>> to export each individual snapshot.
>>
>> Thanks for pointing that.
>>
>> However, looking at the 'exports' man page, it seems that NFS already has an
>> option to cover these cases: 'crossmnt'.
>>
>> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
>> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
>> the parent AND the child filesystem with two separate mounts, so there is no problem of inode collision.
> 
> As you acknowledged, you haven't read the whole back-story.  Maybe you
> should.
> 
> https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/
> https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
> https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/
> 
> The flow of conversation does sometimes jump between threads.
> 
> I'm very happy to respond to your questions after you've absorbed all that.

Hi Neil,

I read the other threads, and I still have the opinion that the nfsd crossmnt behavior would be a good solution for the btrfs subvolumes.
> 
> NeilBrown
>
NeilBrown Aug. 17, 2021, 9:39 p.m. UTC | #8
On Wed, 18 Aug 2021, kreijack@inwind.it wrote:
> On 8/15/21 11:53 PM, NeilBrown wrote:
> > On Mon, 16 Aug 2021, kreijack@inwind.it wrote:
> >> On 8/15/21 9:35 PM, Roman Mamedov wrote:

> >>
> >> However, looking at the 'exports' man page, it seems that NFS already has an
> >> option to cover these cases: 'crossmnt'.
> >>
> >> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
> >> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
> >> the parent AND the child filesystem with two separate mounts, so there is no problem of inode collision.
> > 
> > As you acknowledged, you haven't read the whole back-story.  Maybe you
> > should.
> > 
> > https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/
> > https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
> > https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/
> > 
> > The flow of conversation does sometimes jump between threads.
> > 
> > I'm very happy to respond to your questions after you've absorbed all that.
> 
> Hi Neil,
> 
> I read the other threads, and I still have the opinion that the nfsd
> crossmnt behavior would be a good solution for the btrfs subvolumes.

Thanks for reading it all.  Let me join the dots for you.

"crossmnt" doesn't currently work because "subvolumes" aren't mount
points.

We could change btrfs so that subvolumes *are* mountpoints.  They would
have to be automounts.  I posted patches to do that.  They were broadly
rejected because people have many thousands of submounts that are
concurrently active and so /proc/mounts would be multiple megabytes in
size and working with it would become impractical.  Also, non-privileged
users can create subvols, and may want the path names to remain private.
But these subvols would appear in the mount table and so would no longer
be private.

Alternately we could change the "crossmnt" functionality to treat a
change of st_dev as though it were a mount point.  I posted patches to
do this too.  This hits the same sort of problems in a different way.
If NFSD reports that it has crossed a "mount" by providing a different
filesystem-id to the client, then the client will create a new mount
point which will appear in /proc/mounts.  It might be less likely that
many thousands of subvolumes are accessed over NFS than locally, but it
is still entirely possible.  I don't want the NFS client to suffer a
problem that btrfs doesn't impose locally.  And 'private' subvolumes
could again appear on a public list if they were accessed via NFS.

Thanks,
NeilBrown
Wang Yugui Aug. 18, 2021, 2:54 p.m. UTC | #9
Hi,

We use 'swab64' to combine 'subvol id' and 'inode' into 64 bits in this
patch.

case1:
'subvol id': 16bit => 64K, a little small because the subvol id is
always increasing?
'inode':	48bit * 4K per node, this is big enough.

case2:
'subvol id': 24bit => 16M,  this is big enough.
'inode':	40bit * 4K per node => 4 PB.  this is a little small?

Is there a way to 'bit-swap' the subvol id, rather than the current byte-swap?

If not, maybe it is a better balance if we combine a 22-bit subvol id and
a 42-bit inode?
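
For illustration, a fixed split like the cases above is just a shift
and a mask (sketch only; the helper name is made up, the split width is
exactly the number under discussion, and subvol ids beyond that width
would wrap and collide, which is the tradeoff):

	/* hypothetical fixed split: subvol id in the high bits,
	 * inode number in the low INO_BITS bits */
	#define INO_BITS 40
	static inline u64 combine_fixed(u64 subvol_id, u64 ino)
	{
		return (subvol_id << INO_BITS) |
		       (ino & ((1ULL << INO_BITS) - 1));
	}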

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/08/18

> 
> [[This patch is a minimal patch which addresses the current problems
>   with nfsd and btrfs, in a way which I think is most supportable, least
>   surprising, and least likely to impact any future attempts to more
>   completely fix the btrfs file-identify problem]]
> 
> BTRFS does not provide unique inode numbers across a filesystem.
> It *does* provide unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
> 
> nfsd cannot use these varying device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit
> mount on the client (this is partially supported in practice, but
> violates the protocol specification).
> 
> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
> 
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
> 
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifier), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/btrfs/inode.c     |  4 ++++
>  fs/nfsd/nfs3xdr.c    | 17 ++++++++++++++++-
>  fs/nfsd/nfs4xdr.c    |  9 ++++++++-
>  fs/nfsd/xdr3.h       |  2 ++
>  include/linux/stat.h | 17 +++++++++++++++++
>  5 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0117d867ecf8..989fdf2032d5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
>  	generic_fillattr(&init_user_ns, inode, stat);
>  	stat->dev = BTRFS_I(inode)->root->anon_dev;
>  
> +	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
> +		stat->ino_uniquifier =
> +			swab64(BTRFS_I(inode)->root->root_key.objectid);
> +
>  	spin_lock(&BTRFS_I(inode)->lock);
>  	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
>  	inode_bytes = inode_get_bytes(inode);
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..669e2437362a 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  {
>  	struct user_namespace *userns = nfsd_user_namespace(rqstp);
>  	__be32 *p;
> +	u64 ino;
>  	u64 fsid;
>  
>  	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
> @@ -377,7 +378,10 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  	p = xdr_encode_hyper(p, fsid);
>  
>  	/* fileid */
> -	p = xdr_encode_hyper(p, stat->ino);
> +	ino = stat->ino;
> +	if (stat->ino_uniquifier && stat->ino_uniquifier != ino)
> +		ino ^= stat->ino_uniquifier;
> +	p = xdr_encode_hyper(p, ino);
>  
>  	p = encode_nfstime3(p, &stat->atime);
>  	p = encode_nfstime3(p, &stat->mtime);
> @@ -1151,6 +1155,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
>  	if (xdr_stream_encode_item_present(xdr) < 0)
>  		return false;
>  	/* fileid */
> +	if (!resp->dir_have_uniquifier) {
> +		struct kstat stat;
> +		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
> +			resp->dir_ino_uniquifier = stat.ino_uniquifier;
> +		else
> +			resp->dir_ino_uniquifier = 0;
> +		resp->dir_have_uniquifier = 1;
> +	}
> +	if (resp->dir_ino_uniquifier &&
> +	    resp->dir_ino_uniquifier != ino)
> +		ino ^= resp->dir_ino_uniquifier;
>  	if (xdr_stream_encode_u64(xdr, ino) < 0)
>  		return false;
>  	/* name */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..ddccf849c29c 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -3114,10 +3114,14 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  					fhp->fh_handle.fh_size);
>  	}
>  	if (bmval0 & FATTR4_WORD0_FILEID) {
> +		u64 ino = stat.ino;
> +		if (stat.ino_uniquifier &&
> +		    stat.ino_uniquifier != stat.ino)
> +			ino ^= stat.ino_uniquifier;
>  		p = xdr_reserve_space(xdr, 8);
>  		if (!p)
>  			goto out_resource;
> -		p = xdr_encode_hyper(p, stat.ino);
> +		p = xdr_encode_hyper(p, ino);
>  	}
>  	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
>  		p = xdr_reserve_space(xdr, 8);
> @@ -3285,6 +3289,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  			if (err)
>  				goto out_nfserr;
>  			ino = parent_stat.ino;
> +			if (parent_stat.ino_uniquifier &&
> +			    parent_stat.ino_uniquifier != ino)
> +				ino ^= parent_stat.ino_uniquifier;
>  		}
>  		p = xdr_encode_hyper(p, ino);
>  	}
> diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
> index 933008382bbe..b4f9f3c71f72 100644
> --- a/fs/nfsd/xdr3.h
> +++ b/fs/nfsd/xdr3.h
> @@ -179,6 +179,8 @@ struct nfsd3_readdirres {
>  	struct xdr_buf		dirlist;
>  	struct svc_fh		scratch;
>  	struct readdir_cd	common;
> +	u64			dir_ino_uniquifier;
> +	int			dir_have_uniquifier;
>  	unsigned int		cookie_offset;
>  	struct svc_rqst *	rqstp;
>  
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index fff27e603814..a5188f42ed81 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -46,6 +46,23 @@ struct kstat {
>  	struct timespec64 btime;			/* File creation time */
>  	u64		blocks;
>  	u64		mnt_id;
> +	/*
> +	 * BTRFS does not provide unique inode numbers within a filesystem,
> +	 * depending on a synthetic 'dev' to provide uniqueness.
> +	 * NFSd cannot make use of this 'dev' number so clients often see
> +	 * duplicate inode numbers.
> +	 * For BTRFS, 'ino' is unlikely to use the high bits.  It puts
> +	 * another number in ino_uniquifier which:
> +	 * - has most entropy in the high bits
> +	 * - is different precisely when 'dev' is different
> +	 * - is stable across unmount/remount
> +	 * NFSd can xor this with 'ino' to get a substantially more unique
> +	 * number for reporting to the client.
> +	 * The ino_uniquifier for a directory can reasonably be applied
> +	 * to inode numbers reported by the readdir filldir callback.
> +	 * It is NOT currently exported to user-space.
> +	 */
> +	u64		ino_uniquifier;
>  };
>  
>  #endif
> -- 
> 2.32.0
Goffredo Baroncelli Aug. 18, 2021, 5:24 p.m. UTC | #10
On 8/17/21 11:39 PM, NeilBrown wrote:
> On Wed, 18 Aug 2021, kreijack@inwind.it wrote:
>> On 8/15/21 11:53 PM, NeilBrown wrote:
>>> On Mon, 16 Aug 2021, kreijack@inwind.it wrote:
>>>> On 8/15/21 9:35 PM, Roman Mamedov wrote:
> 
>>>>
>>>> However, looking at the 'exports' man page, it seems that NFS already has an
>>>> option to cover these cases: 'crossmnt'.
>>>>
>>>> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already
>>>> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mounts
>>>> the parent AND the child filesystem with two separate mounts, so there is no problem of inode collision.
>>>
>>> As you acknowledged, you haven't read the whole back-story.  Maybe you
>>> should.
>>>
>>> https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/
>>> https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
>>> https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/
>>>
>>> The flow of conversation does sometimes jump between threads.
>>>
>>> I'm very happy to respond to your questions after you've absorbed all that.
>>
>> Hi Neil,
>>
>> I read the other threads, and I still have the opinion that the nfsd
>> crossmnt behavior would be a good solution for the btrfs subvolumes.
> 
> Thanks for reading it all.  Let me join the dots for you.
> 
[...]
> 
> Alternately we could change the "crossmnt" functionality to treat a
> change of st_dev as though it were a mount point.  I posted patches to
> do this too.  This hits the same sort of problems in a different way.
> If NFSD reports that it has crossed a "mount" by providing a different
> filesystem-id to the client, then the client will create a new mount
> point which will appear in /proc/mounts.  

Yes, this is my proposal.

> It might be less likely that
> many thousands of subvolumes are accessed over NFS than locally, but it
> is still entirely possible.  

I don't think that it would be so unlikely. Think about a file indexer
and/or a 'find' command run in the folder that contains the snapshots...

> I don't want the NFS client to suffer a
> problem that btrfs doesn't impose locally.  

The solution is not easy. In fact we are trying to map a u64 x u64 space onto a u64 space. The truth is that we
cannot guarantee that a collision will not happen. We can only say that for a fresh filesystem it is nearly
impossible, but for an aged filesystem it is unlikely yet possible.

We already faced a real case where we exhausted the inode space on 32 bit arches. What are the chances that the count of subvolumes ever created is greater than 2^24 and the inode number is greater than 2^40? The likelihood is low but not 0...

Some random thoughts:
- the new inode numbers are created by merging the original inode number (in the lower bits) and the object-id of the subvolume (in the higher bits). We could add a warning when these bits overlap:

	if (stat->ino_uniquifier &&
	    fls64(stat->ino) > __ffs64(stat->ino_uniquifier))
		printk("NFSD: Warning, possible inode collision...\n");

(fls64() gives the position of the inode's highest set bit and __ffs64() the position of the uniquifier's lowest set bit, so the warning fires when the two ranges touch.)

Smarter heuristics can be developed, like doing the check once at mount time against the maximum inode value and the maximum subvolume value....

- for the inode number it is an expensive operation (even though it exists/existed for 32-bit processors), but we could reuse the object-id after it is freed

- I think that we could add an option to nfsd or btrfs (not a default behavior) to avoid crossing the subvolume boundary

> And 'private' subvolumes
> could again appear on a public list if they were accessed via NFS.

(Wrongly) I never considered a similar scenario. However I think that these could be anonymized using an alias (the name of the path to mount is passed by nfsd, so it could create an alias that will be recognized by nfsd when the client requires it... complex but doable...)

> 
> Thanks,
> NeilBrown
>
NeilBrown Aug. 18, 2021, 9:46 p.m. UTC | #11
On Thu, 19 Aug 2021, Wang Yugui wrote:
> Hi,
> 
> We use 'swab64' to combine 'subvol id' and 'inode' into 64 bits in this
> patch.
> 
> case1:
> 'subvol id': 16bit => 64K, a little small because the subvol id is
> always increasing?
> 'inode':	48bit * 4K per node, this is big enough.
> 
> case2:
> 'subvol id': 24bit => 16M,  this is big enough.
> 'inode':	40bit * 4K per node => 4 PB.  this is a little small?

I don't know what point you are trying to make with the above.

> 
> Is there a way to 'bit-swap' the subvol id, rather than the current byte-swap?

Sure:
   u64 new = 0;
   for (i = 0; i < 64; i++) {
        new = (new << 1) | (old & 1);
        old >>= 1;
   }

but would it gain anything significant?

Remember what the goal is.  Most apps don't care at all about duplicate
inode numbers - only a few do, and they only care about a few inodes.
The only bug I actually have a report of is caused by a directory having
the same inode as an ancestor.  i.e.  in lots of cases, duplicate inode
numbers won't be noticed.

The behaviour of btrfs over NFS RELIABLY causes exactly this behaviour
of a directory having the same inode number as an ancestor.  The root of
a subtree will *always* do this.  If we JUST changed the inode numbers
of the roots of subtrees, then most observed problems would go away.  It
would change from "trivial to reproduce" to "rarely happens".  The patch
I actually propose makes it much more unlikely than that.  Even if
duplicate inode numbers do happen, the chance of them being noticed is
infinitesimal.  Given that, there is no point in minor tweaks unless
they can make duplicate inode numbers IMPOSSIBLE.

> 
> If not, maybe it is a better balance if we combine a 22-bit subvol id and
> a 42-bit inode?

This would be better except when it is worse.  We cannot know which will
happen more often.

As long as BTRFS allows object-ids and root-ids combined to use more
than 64 bits there can be no perfect solution.  There are many possible
solutions that will be close to perfect in practice.  swab64() is the
simplest that I could think of.  Picking any arbitrary cut-off (22/42,
24/40, ...) is unlikely to be better, and could is some circumstances be
worse.

My preference would be for btrfs to start re-using old object-ids and
root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
total number of bits does not exceed 64.  Unfortunately the maintainers
seem reluctant to even consider this.

NeilBrown
Zygo Blaxell Aug. 19, 2021, 2:19 a.m. UTC | #12
On Thu, Aug 19, 2021 at 07:46:22AM +1000, NeilBrown wrote:
> On Thu, 19 Aug 2021, Wang Yugui wrote:
> > Hi,
> > 
> > We use 'swab64' to combine 'subvol id' and 'inode' into 64 bits in this
> > patch.
> > 
> > case1:
> > 'subvol id': 16bit => 64K, a little small because the subvol id is
> > always increasing?
> > 'inode':	48bit * 4K per node, this is big enough.
> > 
> > case2:
> > 'subvol id': 24bit => 16M,  this is big enough.
> > 'inode':	40bit * 4K per node => 4 PB.  this is a little small?
> 
> I don't know what point you are trying to make with the above.
> 
> > 
> > Is there a way to 'bit-swap' the subvol id, rather than the current byte-swap?
> 
> Sure:
>    u64 new = 0;
>    for (i = 0; i < 64; i++) {
>         new = (new << 1) | (old & 1);
>         old >>= 1;
>    }
> 
> but would it gain anything significant?
> 
> Remember what the goal is.  Most apps don't care at all about duplicate
> inode numbers - only a few do, and they only care about a few inodes.
> The only bug I actually have a report of is caused by a directory having
> the same inode as an ancestor.  i.e.  in lots of cases, duplicate inode
> numbers won't be noticed.

rsync -H and cpio's hardlink detection can be badly confused.  They will
think distinct files with the same inode number are hardlinks.  This could
be bad if you were making backups (though if you're making backups over
NFS, you are probably doing something that could be done better in a
different way).

> The behaviour of btrfs over NFS RELIABLY causes exactly this behaviour
> of a directory having the same inode number as an ancestor.  The root of
> a subtree will *always* do this.  If we JUST changed the inode numbers
> of the roots of subtrees, then most observed problems would go away.  It
> would change from "trivial to reproduce" to "rarely happens".  The patch
> I actually propose makes it much more unlikely than that.  Even if
> duplicate inode numbers do happen, the chance of them being noticed is
> infinitesimal.  Given that, there is no point in minor tweaks unless
> they can make duplicate inode numbers IMPOSSIBLE.

That's a good argument.  I have a different one with the same conclusion.

40 bit inodes would take about 20 years to collide with 24-bit subvols--if
you are creating an average of 1742 inodes every second.  Also at the
same time you have to be creating a subvol every 37 seconds to occupy
the colliding 25th bit of the subvol ID.  Only the highest inode number
in any subvol counts--if your inode creation is spread out over several
different subvols, you'll need to make inodes even faster.
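
(For anyone checking the arithmetic: 20 years is about 631,152,000
seconds, so 2^40 inodes / 631,152,000 s ~= 1742 inodes per second, and
631,152,000 s / 2^24 subvols ~= 37.6 seconds per subvol.)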

For reference, my high scores are 17 inodes per second and a subvol
every 595 seconds (averaged over 1 year).  Burst numbers are much higher,
but one has to spend some time _reading_ the files now and then.

I've encountered other btrfs users with two orders of magnitude higher
inode creation rates than mine.  They are barely squeaking under the
20-year line--or they would be, if they were creating snapshots 50 times
faster than they do today.

Use cases that have the highest inode creation rates (like /tmp) tend
to get more specialized storage solutions (like tmpfs).

Cloud fleets do have higher average inode creation rates, but their
filesystems have much shorter lifespans than 20 years, so the delta on
both sides of the ratio cancels out.

If this hack is only used for NFS, it gives us some time to come up
with a better solution.  (On the other hand, we had 14 years already,
and here we are...)

> > If not, maybe it is a better balance if we combine a 22-bit subvol id and
> > a 42-bit inode?
> 
> This would be better except when it is worse.  We cannot know which will
> happen more often.
> 
> As long as BTRFS allows object-ids and root-ids combined to use more
> than 64 bits there can be no perfect solution.  There are many possible
> solutions that will be close to perfect in practice.  swab64() is the
> simplest that I could think of.  Picking any arbitrary cut-off (22/42,
> 24/40, ...) is unlikely to be better, and could is some circumstances be
> worse.
> 
> My preference would be for btrfs to start re-using old object-ids and
> root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> total number of bits does not exceed 64.  Unfortunately the maintainers
> seem reluctant to even consider this.

It was considered, implemented in 2011, and removed in 2020.  Rationale
is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
deprecation of mount option inode_cache".  It made file creation slower,
and consumed disk space, iops, and memory to run.  Nobody used it.
Newer on-disk data structure versions (free space tree, 2015) didn't
bother implementing inode_cache's storage requirement.

> NeilBrown
Amir Goldstein Aug. 19, 2021, 8:01 a.m. UTC | #13
On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 16 Aug 2021, Roman Mamedov wrote:
> >
> > I wondered a bit myself, what are the downsides of just doing the
> > uniquification inside Btrfs, not leaving that to NFSD?
> >
> > I mean not even adding the extra stat field, just return the inode itself with
> > that already applied. Surely cannot be any worse collision-wise, than
> > different subvolumes straight up having the same inode numbers as right now?
> >
> > Or is it a performance concern, always doing more work, for something which
> > only NFSD has needed so far.
>
> Any change in behaviour will have unexpected consequences.  I think the
> btrfs maintainers' perspective is that they don't want to change
> behaviour if they don't have to (which is reasonable) and that currently
> they don't have to (which probably means that users aren't complaining
> loudly enough).
>
> NFS export of BTRFS is already demonstrably broken and users are
> complaining loudly enough that I can hear them ....  though I think it
> has been broken like this for 10 years, so I wonder why I didn't hear
> them before.
>
> If something is perceived as broken, then a behaviour change that
> appears to fix it is more easily accepted.
>
> However, having said that I now see that my latest patch is not ideal.
> It changes the inode numbers associated with filehandles of objects in
> the non-root subvolume.  This will cause the Linux NFS client to treat
> the object as 'stale'.  For most objects this is a transient annoyance.
> Reopen the file or restart the process and all should be well again.
> However if the inode number of the mount point changes, you will need to
> unmount and remount.  That is somewhat more of an annoyance.
>
> There are a few ways to handle this more gracefully.
>
> 1/ We could get btrfs to hand out new filehandles as well as new inode
> numbers, but still accept the old filehandles.  Then we could make the
> inode number reported be based on the filehandle.  This would be nearly
> seamless but rather clumsy to code.  I'm not *very* keen on this idea,
> but it is worth keeping in mind.
>

So objects would change their inode number after nfs inode cache is
evicted and while nfs filesystem is mounted. That does not sound ideal.

But I am a bit confused about the problem.
If the export is of the btrfs root, then the nfs client cannot access any
subvolumes (right?) - that was the bug report, so the value of inode
numbers in non-root subvolumes is not an issue.
If the export is of a non-root subvolume, then why bother changing anything
at all? Is there a need to traverse into sub-sub-volumes?

> 2/ We could add a btrfs mount option to control whether the uniquifier
> was set or not.  This would allow the sysadmin to choose when to manage
> any breakage.  I think this is my preference, but Josef has declared an
> aversion to mount options.
>
> 3/ We could add a module parameter to nfsd to control whether the
> uniquifier is merged in.  This again gives the sysadmin control, and it
> can be done despite any aversion from btrfs maintainers.  But I'd need
> to overcome any aversion from the nfsd maintainers, and I don't know how
> strong that would be yet. (A new export option isn't really appropriate.
> It is much more work to add an export option than to add a mount option).
>

That is too bad, because IMO from a user's POV a "fsid=btrfsroot" or "cross-subvol"
export option would have been a nice way to describe and opt in to this new
functionality.

But let's consider for a moment the consequences of enabling this functionality
automatically whenever exporting a btrfs root volume without "crossmnt":

1. Objects inside a subvol that are inaccessible(?) with current nfs/nfsd
    without "crossmnt" will become accessible after enabling the feature -
    this will match the user experience of accessing btrfs on the host
2. The inode numbers of the newly accessible objects would not match the inode
    numbers on the host fs (no big deal?)
3. The inode numbers of objects in a snapshot would not match the inode
    numbers of the original (pre-snapshot) objects (acceptable tradeoff for
    being able to access the snapshot objects without bloating /proc/mounts?)
4. The inode numbers of objects in a subvol observed via this "cross-subvol"
    export would not match the inode numbers of the same objects observed
    via an individual subvol export
5. st_ino conflicts are possible when multiplexing subvol id and inode number.
    overlayfs resolved those conflicts by allocating an inode number from a
    reserved non-persistent inode range, which may cause objects to change
    their inode number during the lifetime of the filesystem (sensible
    tradeoff?)

I think that #4 is a bit hard to swallow and #3 is borderline acceptable...
Both are quite hard to document and to set expectations for as a non-opt-in
change of behavior when exporting a btrfs root.

IMO, an nfsd module parameter will give some control and therefore is
a must, but it won't make life easier to document and set user expectations
when the semantics are not clearly stated in the exports table.

You claim that "A new export option isn't really appropriate."
but your only argument is that "It is much more work to add
an export option than to add a mount option".

With all due respect, for this particular challenge with all the
constraints involved, this sounds like a pretty weak argument.

Surely, adding an export option is easier than slowly changing all
userspace tools to understand subvolumes, a solution that you had
previously brought up?

Can you elaborate some more about your aversion to a new
export option?

Thanks,
Amir.
NeilBrown Aug. 20, 2021, 2:54 a.m. UTC | #14
On Thu, 19 Aug 2021, Zygo Blaxell wrote:
> On Thu, Aug 19, 2021 at 07:46:22AM +1000, NeilBrown wrote:
> > 
> > Remember what the goal is.  Most apps don't care at all about duplicate
> > inode numbers - only a few do, and they only care about a few inodes.
> > The only bug I actually have a report of is caused by a directory having
> > the same inode as an ancestor.  i.e.  in lots of cases, duplicate inode
> > numbers won't be noticed.
> 
> rsync -H and cpio's hardlink detection can be badly confused.  They will
> think distinct files with the same inode number are hardlinks.  This could
> be bad if you were making backups (though if you're making backups over
> NFS, you are probably doing something that could be done better in a
> different way).

Yes, they could get confused.  inode numbers remain unique within a
"subvolume" so you would need to do at backup of multiple subtrees to
hit a problem.  Certainly possible, but probably less common.

> 
> 40 bit inodes would take about 20 years to collide with 24-bit subvols--if
> you are creating an average of 1742 inodes every second.  Also at the
> same time you have to be creating a subvol every 37 seconds to occupy
> the colliding 25th bit of the subvol ID.  Only the highest inode number
> in any subvol counts--if your inode creation is spread out over several
> different subvols, you'll need to make inodes even faster.
> 
> For reference, my high scores are 17 inodes per second and a subvol
> every 595 seconds (averaged over 1 year).  Burst numbers are much higher,
> but one has to spend some time _reading_ the files now and then.
> 
> I've encountered other btrfs users with two orders of magnitude higher
> inode creation rates than mine.  They are barely squeaking under the
> 20-year line--or they would be, if they were creating snapshots 50 times
> faster than they do today.

I do like seeing concrete numbers, thanks.  How many of these inodes and
subvols remain undeleted?  Supposing inode numbers were reused, how many
bits might you need?


> > My preference would be for btrfs to start re-using old object-ids and
> > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> > total number of bits does not exceed 64.  Unfortunately the maintainers
> > seem reluctant to even consider this.
> 
> It was considered, implemented in 2011, and removed in 2020.  Rationale
> is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
> deprecation of mount option inode_cache".  It made file creation slower,
> and consumed disk space, iops, and memory to run.  Nobody used it.
> Newer on-disk data structure versions (free space tree, 2015) didn't
> bother implementing inode_cache's storage requirement.

Yes, I saw that.  Providing reliable functionality certainly can impact
performance and consume disk-space.  That isn't an excuse for not doing
it. 
I suspect that carefully tuned code could result in typical creation
times being unchanged, and mean creation times suffering only a tiny
cost.  Using "max+1" when the creation rate is particularly high might
be a reasonable part of managing costs.
Storage cost need not be worse than the cost of tracking free blocks
on the device.
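
For illustration only, a minimal userspace sketch of that two-mode
policy (btrfs has no such allocator today; all names here are invented):

#include <stddef.h>
#include <stdint.h>

struct ino_alloc {
	uint64_t max_used;       /* highest number ever handed out */
	uint64_t *free_list;     /* freed numbers available for reuse */
	size_t nfree;
	unsigned recent_creates; /* creations in the current interval */
	unsigned burst_limit;    /* above this, skip reuse to stay fast */
};

static uint64_t ino_alloc_next(struct ino_alloc *a)
{
	/* Under a creation burst, or with nothing to reuse, fall back
	 * to "max+1"; otherwise hand back a previously freed number. */
	if (a->recent_creates++ >= a->burst_limit || a->nfree == 0)
		return ++a->max_used;
	return a->free_list[--a->nfree];
}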

"Nobody used it" is odd.  It implies it would have to be explicitly
enabled, and all it would provide anyone is sane behaviour.  Who would
imagine that to be an optional extra?

NeilBrown
NeilBrown Aug. 20, 2021, 3:21 a.m. UTC | #15
On Thu, 19 Aug 2021, Amir Goldstein wrote:
> On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote:
> >
> > There are a few ways to handle this more gracefully.
> >
> > 1/ We could get btrfs to hand out new filehandles as well as new inode
> > numbers, but still accept the old filehandles.  Then we could make the
> > inode number reported be based on the filehandle.  This would be nearly
> > seamless but rather clumsy to code.  I'm not *very* keen on this idea,
> > but it is worth keeping in mind.
> >
> 
> So objects would change their inode number after nfs inode cache is
> evicted and while nfs filesystem is mounted. That does not sound ideal.

No.  Almost all filehandle lookups happen in the context of some other
filehandle.  If the provided context is an old-style filehandle, we
provide an old-style filehandle for the lookup.  There is already code
in nfsd to support this (as we have in the past changed how filesystems
are identified).

It would only be if the mountpoint filehandle (which is fetched without
that context) went out of cache that inode numbers would change.  That
would mean that the filesystem (possibly an automount) was unmounted.
When it was remounted it could have a different device number anyway, so
having different inode numbers would be of little consequence.
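
In outline, the selection looks something like this (field and constant
names are illustrative, not the real nfsd ones):

#include <stdbool.h>

enum fh_flavour { FH_OLD_STYLE, FH_NEW_STYLE };

/* A lookup performed under a parent filehandle inherits that handle's
 * flavour; a mountpoint filehandle arrives with no parent context and
 * so gets the new style. */
static enum fh_flavour result_flavour(bool have_parent_fh,
				      enum fh_flavour parent_flavour)
{
	if (!have_parent_fh)
		return FH_NEW_STYLE;
	return parent_flavour;
}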

> 
> But I am a bit confused about the problem.
> If the export is of the btrfs root, then nfs client cannot access any
> subvolumes (right?) - that was the bug report, so the value of inode
> numbers in non-root subvolumes is not an issue.

Not correct.  All objects in the filesystem are fully accessible.  The
only problem is that some pairs of objects have the same inode number.
This causes some programs like 'find' and 'du' to behave differently to
expectations.  They will refuse to even look in a subvolume, because it
looks like doing so could cause an infinite loop.  The values of inode
numbers in non-root subvolumes is EXACTLY the issue.

> If export is of non-root subvolume, then why bother changing anything
> at all? Is there a need to traverse into sub-sub-volumes?
> 
> > 2/ We could add a btrfs mount option to control whether the uniquifier
> > was set or not.  This would allow the sysadmin to choose when to manage
> > any breakage.  I think this is my preference, but Josef has declared an
> > aversion to mount options.
> >
> > 3/ We could add a module parameter to nfsd to control whether the
> > uniquifier is merged in.  This again gives the sysadmin control, and it
> > can be done despite any aversion from btrfs maintainers.  But I'd need
> > to overcome any aversion from the nfsd maintainers, and I don't know how
> > strong that would be yet. (A new export option isn't really appropriate.
> It is much more work to add an export option than to add a mount option).
> >
> 
> That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol"
> export option would have been a nice way to describe and opt-in to this new
> functionality.
> 
> But let's consider for a moment the consequences of enabling this functionality
> automatically whenever exporting a btrfs root volume without "crossmnt":
> 
> 1. Objects inside a subvol that are inaccessible(?) with current
>     nfs/nfsd without "crossmnt" will become accessible after enabling
>     the feature - this will match the user experience of accessing
>     btrfs on the host

Not correct - as above.

> 2. The inode numbers of the newly accessible objects would not match the inode
>     numbers on the host fs (no big deal?)

Unlikely to be a problem.  Inode numbers have no meaning beyond the facts
that:
  - they are stable for the lifetime of the object
  - they are unique within a filesystem (except btrfs lies about
    filesystems)
  - they are not zero

The facts only need to be equally true on the NFS server and client.

> 3. The inode numbers of objects in a snapshot would not match the inode
>     numbers of the original (pre-snapshot) objects (acceptable tradeoff for
>     being able to access the snapshot objects without bloating /proc/mounts?)

This also should not be a problem.  Files in different snapshots are
different things that happen to share storage (like reflinks).
Comparing inode numbers between places which report different st_dev
does not fit within the meaning of inode numbers.
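
In C terms, the only comparison that is ever meaningful is the pair:

#include <stdbool.h>
#include <sys/stat.h>

/* Two stat() results name the same object only if *both* the device
 * and the inode number match; st_ino alone means nothing across
 * different st_dev values (snapshots, subvolumes, other filesystems). */
static bool same_object(const struct stat *a, const struct stat *b)
{
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}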

> 4. The inode numbers of objects in a subvol observed via this "cross-subvol"
>     export would not match the inode numbers of the same objects observed
>     via an individual subvol export

The device number would differ too, so the relative values of the inode
numbers would be irrelevant.

> 5. st_ino conflicts are possible when multiplexing subvol id and inode number.
>     overlayfs resolved those conflicts by allocating an inode number from a
>     reserved non-persistent inode range, which may cause objects to change
>     their inode number during the lifetime of the filesystem (sensible
>     tradeoff?)
> 
> I think that #4 is a bit hard to swallow and #3 is borderline acceptable...
> Both are quite hard to document and to set expectations as a non-opt-in
> change of behavior when exporting btrfs root.
> 
> IMO, an nfsd module parameter will give some control and therefore is
> a must, but it won't make life easier to document and set user expectations
> when the semantics are not clearly stated in the exports table.
> 
> You claim that "A new export option isn't really appropriate."
> but your only argument is that "It is much more work to add
> an export option than to add a mount option".
> 
> With all due respect, for this particular challenge with all the
> constraints involved, this sounds like a pretty weak argument.
> 
> Surely, adding an export option is easier than slowly changing all
> userspace tools to understand subvolumes? a solution that you had
> previously brought up.
> 
> Can you elaborate some more about your aversion to a new
> export option.

Export options are bits in a 32bit word - so both user-space and kernel
need to agree on names for them.  We are currently using 18, so there is
room to grow.  It is a perfectly reasonable way to implement sensible
features.  It is, I think, a poor way to implement hacks to work around
misfeatures in filesystems.
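
For context, those bits are plain #defines shared between nfsd and
user-space (include/uapi/linux/nfsd/export.h); the first below is a
real flag, the second is a hypothetical bit such an option would have
to claim permanently:

#define NFSEXP_READONLY		0x00000001	/* existing flag */
/* hypothetical: once shipped, this bit is part of the ABI forever */
#define NFSEXP_INO_UNIQUIFIER	0x00100000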

This is the core of my dislike for adding an export option.  Using one
effectively admits that what btrfs is doing is a valid thing to do.  I
don't think it is.  I don't think we want any other filesystem developer
to think that they can emulate the behaviour because support is already
provided.

If we add any configuration to support btrfs, I would much prefer it to
be implemented in fs/btrfs, and if not, then with loud warnings that it
works around a deficiency in btrfs.
  /sys/module/nfsd/parameters/btrfs_export_workaround

Thanks,
NeilBrown
Amir Goldstein Aug. 20, 2021, 6:23 a.m. UTC | #16
On Fri, Aug 20, 2021 at 6:22 AM NeilBrown <neilb@suse.de> wrote:
>
> On Thu, 19 Aug 2021, Amir Goldstein wrote:
> > On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote:
> > >
> > > There are a few ways to handle this more gracefully.
> > >
> > > 1/ We could get btrfs to hand out new filehandles as well as new inode
> > > numbers, but still accept the old filehandles.  Then we could make the
> > > inode number reported be based on the filehandle.  This would be nearly
> > > seamless but rather clumsy to code.  I'm not *very* keen on this idea,
> > > but it is worth keeping in mind.
> > >
> >
> > So objects would change their inode number after nfs inode cache is
> > evicted and while nfs filesystem is mounted. That does not sound ideal.
>
> No.  Almost all filehandle lookups happen in the context of some other
> filehandle.  If the provided context is an old-style filehandle, we
> provide an old-style filehandle for the lookup.  There is already code
> in nfsd to support this (as we have in the past changed how filesystems
> are identified).
>

I see. This is nice!
It almost sounds like "inode mapped mounts" :-)


> It would only be if the mountpoint filehandle (which is fetched without
> that context) went out of cache that inode numbers would change.  That
> would mean that the filesystem (possibly an automount) was unmounted.
> When it was remounted it could have a different device number anyway, so
> having different inode numbers would be of little consequence.
>
> >
> > But I am a bit confused about the problem.
> > If the export is of the btrfs root, then nfs client cannot access any
> > subvolumes (right?) - that was the bug report, so the value of inode
> > numbers in non-root subvolumes is not an issue.
>
> Not correct.  All objects in the filesystem are fully accessible.  The
> only problem is that some pairs of objects have the same inode number.
> This causes some programs like 'find' and 'du' to behave differently to
> expectations.  They will refuse to even look in a subvolume, because it
> looks like doing so could cause an infinite loop.  The values of inode
> numbers in non-root subvolumes is EXACTLY the issue.
>

Thanks for clarifying.

> > If export is of non-root subvolume, then why bother changing anything
> > at all? Is there a need to traverse into sub-sub-volumes?
> >
> > > 2/ We could add a btrfs mount option to control whether the uniquifier
> > > was set or not.  This would allow the sysadmin to choose when to manage
> > > any breakage.  I think this is my preference, but Josef has declared an
> > > aversion to mount options.
> > >
> > > 3/ We could add a module parameter to nfsd to control whether the
> > > uniquifier is merged in.  This again gives the sysadmin control, and it
> > > can be done despite any aversion from btrfs maintainers.  But I'd need
> > > to overcome any aversion from the nfsd maintainers, and I don't know how
> > > strong that would be yet. (A new export option isn't really appropriate.
> > > It is much more work to add an export option than to add a mount option).
> > >
> >
> > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol"
> > export option would have been a nice way to describe and opt-in to this new
> > functionality.
> >
> > But let's consider for a moment the consequences of enabling this functionality
> > automatically whenever exporting a btrfs root volume without "crossmnt":
> >
> > 1. Objects inside a subvol that are inaccessible(?) with current
> >     nfs/nfsd without "crossmnt" will become accessible after enabling
> >     the feature - this will match the user experience of accessing
> >     btrfs on the host
>
> Not correct - as above.
>
> > 2. The inode numbers of the newly accessible objects would not match the inode
> >     numbers on the host fs (no big deal?)
>
> Unlikely to be a problem.  Inode numbers have no meaning beyond the facts
> that:
>   - they are stable for the lifetime of the object
>   - they are unique within a filesystem (except btrfs lies about
>     filesystems)
>   - they are not zero
>
> The facts only need to be equally true on the NFS server and client..
>
> > 3. The inode numbers of objects in a snapshot would not match the inode
> >     numbers of the original (pre-snapshot) objects (acceptable tradeoff for
> >     being able to access the snapshot objects without bloating /proc/mounts?)
>
> This also should not be a problem.  Files in different snapshots are
> different things that happen to share storage (like reflinks).
> Comparing inode numbers between places which report different st_dev
> does not fit within the meaning of inode numbers.
>
> > 4. The inode numbers of objects in a subvol observed via this "cross-subvol"
> >     export would not match the inode numbers of the same objects observed
> >     via an individual subvol export
>
> The device number would differ too, so the relative values of the inode
> numbers would be irrelevant.
>
> > 5. st_ino conflicts are possible when multiplexing subvol id and inode number.
> >     overlayfs resolved those conflicts by allocating an inode number from a
> >     reserved non-persistent inode range, which may cause objects to change
> >     their inode number during the lifetime of the filesystem (sensible
> >     tradeoff?)
> >
> > I think that #4 is a bit hard to swallow and #3 is borderline acceptable...
> > Both are quite hard to document and to set expectations as a non-opt-in
> > change of behavior when exporting btrfs root.
> >
> > IMO, an nfsd module parameter will give some control and therefore is
> > a must, but it won't make life easier to document and set user expectations
> > when the semantics are not clearly stated in the exports table.
> >
> > You claim that "A new export option isn't really appropriate."
> > but your only argument is that "It is much more work to add
> > an export option than to add a mount option".
> >
> > With all due respect, for this particular challenge with all the
> > constraints involved, this sounds like a pretty weak argument.
> >
> > Surely, adding an export option is easier than slowly changing all
> > userspace tools to understand subvolumes? a solution that you had
> > previously brought up.
> >
> > Can you elaborate some more about your aversion to a new
> > export option.
>
> Export options are bits in a 32bit word - so both user-space and kernel
> need to agree on names for them.  We are currently using 18, so there is
> room to grow.  It is a perfectly reasonable way to implement sensible
> features.  It is, I think, a poor way to implement hacks to work around
> misfeatures in filesystems.
>
> This is the core of my dislike for adding an export option.  Using one
> effectively admits that what btrfs is doing is a valid thing to do.  I
> don't think it is.  I don't think we want any other filesystem developer
> to think that they can emulate the behaviour because support is already
> provided.
>
> If we add any configuration to support btrfs, I would much prefer it to
> be implemented in fs/btrfs, and if not, then with loud warnings that it
> works around a deficiency in btrfs.
>   /sys/module/nfsd/parameters/btrfs_export_workaround
>

Thanks for clarifying.
I now understand how "hacky" the workaround in nfsd is.

Naive question: could this behavior be implemented in btrfs as you
prefer, but in a way that only serves nfsd and NEW nfs mounts for
that matter (i.e. only NEW filehandles)?

Meaning passing some hint in ->getattr() and ->iterate_shared(),
sort of like idmapped mount does for uid/gid.

Thanks,
Amir.
Zygo Blaxell Aug. 22, 2021, 7:29 p.m. UTC | #17
On Fri, Aug 20, 2021 at 12:54:17PM +1000, NeilBrown wrote:
> On Thu, 19 Aug 2021, Zygo Blaxell wrote:
> > 40 bit inodes would take about 20 years to collide with 24-bit subvols--if
> > you are creating an average of 1742 inodes every second.  Also at the
> > same time you have to be creating a subvol every 37 seconds to occupy
> > the colliding 25th bit of the subvol ID.  Only the highest inode number
> > in any subvol counts--if your inode creation is spread out over several
> > different subvols, you'll need to make inodes even faster.
> > 
> > For reference, my high scores are 17 inodes per second and a subvol
> > every 595 seconds (averaged over 1 year).  Burst numbers are much higher,
> > but one has to spend some time _reading_ the files now and then.
> > 
> > I've encountered other btrfs users with two orders of magnitude higher
> > inode creation rates than mine.  They are barely squeaking under the
> > 20-year line--or they would be, if they were creating snapshots 50 times
> > faster than they do today.
> 
> I do like seeing concrete numbers, thanks.  How many of these inodes and
> subvols remain undeleted?  Supposing inode numbers were reused, how many
> bits might you need?

Number of existing inodes is filesystem size divided by average inode
size, about 30 million inodes per terabyte for build servers, give or
take an order of magnitude per project.  That does put 1 << 32 inodes in
the range of current disk sizes, which motivated the inode_cache feature.

Number of existing subvols stays below 1 << 14.  It's usually some
near-constant multiple of the filesystem age (if it is not limited more
by capacity) because it's not trivial to move a subvol structure from
one filesystem to another.

The main constraint on the product of both numbers is filesystem size.
If that limit is reached, we often see that lower subvol numbers correlate
with higher inode numbers and vice versa; otherwise both keep growing until
they hit the size limit or some user-chosen limit (e.g. "we just don't
need more than the last 300 builds online at any time").

For build and backup use cases (which both heavily use snapshots) there is
no incentive to delete snapshots other than to avoid eventually running
out of space.  There is also no incentive to increase filesystem size
to accommodate extra snapshots, as long as there is room for some minimal
useful number of snapshots, the original subvols, and some free space.

So we get snapshots in numbers that are roughly:

	min(age_of_filesystem * snapshot_creation_rate, filesystem_capacity / average_subvol_unique_data_size)
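
Or, as a direct transcription (illustrative only):

static double expected_snapshots(double fs_age, double snap_rate,
				 double capacity, double avg_subvol_unique)
{
	double by_rate  = fs_age * snap_rate;
	double by_space = capacity / avg_subvol_unique;
	return by_rate < by_space ? by_rate : by_space;
}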

Subvol IDs are not reusable.  They are embedded in shared object ownership
metadata, and persist for some time after subvols are deleted.

> > > My preference would be for btrfs to start re-using old object-ids and
> > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> > > total number of bits does not exceed 64.  Unfortunately the maintainers
> > > seem reluctant to even consider this.
> > 
> > It was considered, implemented in 2011, and removed in 2020.  Rationale
> > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
> > deprecation of mount option inode_cache".  It made file creation slower,
> > and consumed disk space, iops, and memory to run.  Nobody used it.
> > Newer on-disk data structure versions (free space tree, 2015) didn't
> > bother implementing inode_cache's storage requirement.
> 
> > Yes, I saw that.  Providing reliable functionality certainly can impact
> performance and consume disk-space.  That isn't an excuse for not doing
> it. 
> I suspect that carefully tuned code could result in typical creation
> times being unchanged, and mean creation times suffering only a tiny
> cost.  Using "max+1" when the creation rate is particularly high might
> be a reasonable part of managing costs.
> Storage cost need not be worse than the cost of tracking free blocks
> on the device.

The cost of _tracking_ free object IDs is trivial compared to the cost
of _reusing_ an object ID on btrfs.

If btrfs doesn't reuse object numbers, btrfs can append new objects
to the last partially filled leaf.  If there are shared metadata pages
(i.e. snapshots), btrfs unshares a handful of pages once, and then future
writes use densely packed new pages and delayed allocation without having
to read anything.

If btrfs reuses object numbers, the filesystem has to pack new objects
into random previously filled metadata leaf nodes, so there are a lot
of read-modify-writes scattered over old metadata pages, which spreads
the working set around and reduces cache usage efficiency (i.e. uses
more RAM).  If there are snapshots, each shared page that is modified
for the first time after the snapshot comes with two-orders-of-magnitude
worst-case write multipliers.

The two-algorithm scheme (switching from "reuse freed inode" to "max+1"
under load) would be forced into the "max+1" mode half the time by a
daily workload of alternating git checkouts and builds.  It would save
only one bit of inode namespace over the lifetime of the filesystem.

> "Nobody used it" is odd.  It implies it would have to be explicitly
> enabled, and all it would provide anyone is sane behaviour.  Who would
> imagine that to be an optional extra.

It always had to be explicitly enabled.  It was initially a workaround
for 32-bit ino_t that was limiting a few users, but ino_t got better
and the need for inode_cache went away.

NFS (particularly NFSv2) might be the use case inode_cache has been
waiting for.  btrfs has an i_version field for NFSv4, so it's not like
there's no precedent for adding features in btrfs to support NFS.

On the other hand, the cost of ino_cache gets worse with snapshots,
and the benefit in practice takes years to decades to become relevant.
Users who are exporting snapshots over NFS are likely to be especially
averse to using inode_cache.

> NeilBrown
> 
>
Wang Yugui Aug. 23, 2021, 12:57 a.m. UTC | #18
Hi,

> rsync -H and cpio's hardlink detection can be badly confused.  They will
> think distinct files with the same inode number are hardlinks.  This could
> be bad if you were making backups (though if you're making backups over
> NFS, you are probably doing something that could be done better in a
> different way).

Will 'rsync -x' and 'find -mount/-xdev' fail to work in a
snapper config?
snapper is a very important use case.

There is not yet an option like '-mount/-xdev' for '/bin/cp',
but one may come soon.

I thought the first patchset (crossmnt in the nfs client) was the right way,
because in most cases a subvol is a filesystem, not a directory.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/08/23
NeilBrown Aug. 23, 2021, 5:51 a.m. UTC | #19
On Mon, 23 Aug 2021, Zygo Blaxell wrote:
> 
> Subvol IDs are not reusable.  They are embedded in shared object ownership
> metadata, and persist for some time after subvols are deleted.

Hmmm...  that's interesting.  Makes some sense too.  I did wonder how
ownership across multiple snapshots was tracked.

> 
> > > > My preference would be for btrfs to start re-using old object-ids and
> > > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
> > > > total number of bits does not exceed 64.  Unfortunately the maintainers
> > > > seem reluctant to even consider this.
> > > 
> > > It was considered, implemented in 2011, and removed in 2020.  Rationale
> > > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start
> > > deprecation of mount option inode_cache".  It made file creation slower,
> > > and consumed disk space, iops, and memory to run.  Nobody used it.
> > > Newer on-disk data structure versions (free space tree, 2015) didn't
> > > bother implementing inode_cache's storage requirement.
> > 
> > Yes, I saw that.  Providing reliable functionality certainly can impact
> > performance and consume disk-space.  That isn't an excuse for not doing
> > it. 
> > I suspect that carefully tuned code could result in typical creation
> > times being unchanged, and mean creation times suffering only a tiny
> > cost.  Using "max+1" when the creation rate is particularly high might
> > be a reasonable part of managing costs.
> > Storage cost need not be worse than the cost of tracking free blocks
> > on the device.
> 
> The cost of _tracking_ free object IDs is trivial compared to the cost
> of _reusing_ an object ID on btrfs.

I hadn't thought of that.

> 
> If btrfs doesn't reuse object numbers, btrfs can append new objects
> to the last partially filled leaf.  If there are shared metadata pages
> (i.e. snapshots), btrfs unshares a handful of pages once, and then future
> writes use densely packed new pages and delayed allocation without having
> to read anything.
> 
> If btrfs reuses object numbers, the filesystem has to pack new objects
> into random previously filled metadata leaf nodes, so there are a lot
> of read-modify-writes scattered over old metadata pages, which spreads
> the working set around and reduces cache usage efficiency (i.e. uses
> more RAM).  If there are snapshots, each shared page that is modified
> for the first time after the snapshot comes with two-orders-of-magnitude
> worst-case write multipliers.

I don't really follow that .... but I'll take your word for it for now.

> 
> The two-algorithm scheme (switching from "reuse freed inode" to "max+1"
> under load) would be forced into the "max+1" mode half the time by a
> daily workload of alternating git checkouts and builds.  It would save
> only one bit of inode namespace over the lifetime of the filesystem.
> 
> > "Nobody used it" is odd.  It implies it would have to be explicitly
> > enabled, and all it would provide anyone is sane behaviour.  Who would
> > imagine that to be an optional extra.
> 
> It always had to be explicitly enabled.  It was initially a workaround
> for 32-bit ino_t that was limiting a few users, but ino_t got better
> and the need for inode_cache went away.
> 
> NFS (particularly NFSv2) might be the use case inode_cache has been
> waiting for.  btrfs has an i_version field for NFSv4, so it's not like
> there's no precedent for adding features in btrfs to support NFS.

NFSv2 is not worth any effort.  NFSv4 is.  NFSv3 ... some, but not a lot.

> 
> On the other hand, the cost of ino_cache gets worse with snapshots,
> and the benefit in practice takes years to decades to become relevant.
> Users who are exporting snapshots over NFS are likely to be especially
> averse to using inode_cache.

That's the real killer.  Everything will work fine for years until it
doesn't.  And once it doesn't ....  what do you do?

Thanks a lot for all this background info.  I've found it to be very
helpful for my general understanding.

Thanks,
NeilBrown
NeilBrown Aug. 23, 2021, 11:22 p.m. UTC | #20
On Mon, 23 Aug 2021, Zygo Blaxell wrote:
...
> 
> Subvol IDs are not reusable.  They are embedded in shared object ownership
> metadata, and persist for some time after subvols are deleted.
...
> 
> The cost of _tracking_ free object IDs is trivial compared to the cost
> of _reusing_ an object ID on btrfs.

One possible approach to these two objections is to decouple inode
numbers from object ids.
The inode number becomes just another piece of metadata stored in the
inode.
struct btrfs_inode_item has four spare u64s, so we could use one of
those.
struct btrfs_dir_item would need to store the inode number too.  What
is location.offset used for?  Would a diritem ever point to a non-zero
offset?  Could the 'offset' be used to store the inode number?

This could even be added to existing filesystems I think.  It might not
be easy to re-use inode numbers smaller than the largest at the time the
extension was added, but newly added inode numbers could be reused after
they were deleted.
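
A compilable sketch of the idea (the layout is hypothetical; the real
struct btrfs_inode_item in fs/btrfs/ctree.h keeps "reserved[4]" after
"sequence"):

#include <stdint.h>

typedef uint64_t le64;	/* stand-in for the kernel's __le64 */

/* Hypothetical tail of btrfs_inode_item: one spare u64 carries the
 * user-visible inode number, so the B-tree object id no longer has
 * to double as st_ino. */
struct inode_item_tail {
	le64 sequence;
	le64 st_ino;		/* was reserved[0] */
	le64 reserved[3];	/* still spare */
};

/* On pre-existing filesystems the spare fields are zero, so fall back
 * to the object id; newly created inodes fill st_ino explicitly. */
static uint64_t inode_number(const struct inode_item_tail *t,
			     uint64_t objectid)
{
	return t->st_ino ? t->st_ino : objectid;
}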

Just a thought...

NeilBrown
Zygo Blaxell Aug. 25, 2021, 2:06 a.m. UTC | #21
On Tue, Aug 24, 2021 at 09:22:05AM +1000, NeilBrown wrote:
> On Mon, 23 Aug 2021, Zygo Blaxell wrote:
> ...
> > 
> > Subvol IDs are not reusable.  They are embedded in shared object ownership
> > metadata, and persist for some time after subvols are deleted.
> ...
> > 
> > The cost of _tracking_ free object IDs is trivial compared to the cost
> > of _reusing_ an object ID on btrfs.
> 
> One possible approach to these two objections is to decouple inode
> numbers from object ids.

This would be reasonable for subvol IDs (I thought of it earlier in this
thread, but didn't mention it because I wasn't going to be the first to
open that worm can ;).

There aren't very many subvol IDs and they're not used as frequently
as inodes, so a lookup table to remap them to smaller numbers to save
st_ino bit-space wouldn't be unreasonably expensive.  If we stop right
here and use the [some_zeros:reversed_subvol:inode] bit-packing scheme
you proposed for NFS, that seems like a reasonable plan.  It would have
48 bits of usable inode number space, ~440000 file creates per second
for 20 years with up to 65535 snapshots, the same number of bits that
ZFS has in its inodes.
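
A sketch of that packing, assuming a 16/48 split (the field widths and
the details of the bit-reversal are my reading of the scheme):

#include <stdint.h>

static uint64_t pack_fileid(uint16_t mapped_subvol, uint64_t ino)
{
	uint16_t rev = 0;

	/* Reverse the 16 subvol bits so that low subvol ids populate
	 * the top of the field, mirroring the swab64() trick in the
	 * patch below. */
	for (int i = 0; i < 16; i++)
		rev |= ((mapped_subvol >> i) & 1u) << (15 - i);

	return ((uint64_t)rev << 48) | (ino & ((1ULL << 48) - 1));
}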

Once that subvol ID mapping tree exists, it could also map subvol inode
numbers to globally unique numbers.  Each tree item would contain a map of
[subvol_inode1..subvol_inode2] that maps the inode numbers in the subvol
into the global inode number space at [global_inode1..global_inode2].
When a snapshot is created, the snapshot gets a copy of all the origin
subvol's inode ranges, but with newly allocated base offsets.  If the
original subvol needs new inodes, it gets a new chunk from the global
inode allocator.  If the snapshot subvol needs new inodes, it gets a
different new chunk from the global allocator.  The minimum chunk might
be a million inodes or so to avoid having to allocate new chunks all the
time, but not so high as to leave the code completely untested (or
testers could just set the minchunk to 1000 inodes).
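
A sketch of one entry in such a mapping tree (structure and names
entirely hypothetical):

#include <stdint.h>

struct ino_range_map {
	uint64_t subvol_id;
	uint64_t subvol_first;	/* first subvol-local inode in the run */
	uint64_t subvol_last;	/* last, inclusive */
	uint64_t global_base;	/* start of the globally allocated chunk */
};

static uint64_t map_to_global(const struct ino_range_map *m, uint64_t ino)
{
	return m->global_base + (ino - m->subvol_first);
}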

The question I have (and why I didn't propose this earlier) is whether
this scheme is any real improvement over dividing the subvol:inode space
by bit packing.  If you have one subvol that has 3 billion existing inodes
in it, every snapshot of that subvol is going to burn up roughly 2^-32 of
the available globally unique inode numbers.  If we burn 3 billion inodes
instead of 4 billion per subvol, it only gets 25% more lifespan for the
filesystem, and the allocation of unique inode spaces and tracking inode
space usage will add cost to every single file creation and snapshot
operation.  If your oldest/biggest subvol only has a million inodes in
it, all of the above is pure cost:  you can create billions of snapshots,
never repeat any object IDs, and never worry about running out.

I'd want to see cost/benefit simulations of:

	this plan,

	the simpler but less efficient bit-packing plan,

	'cp -a --reflink' to a new subvol and start over every 20 years
	when inodes run out,

	and online garbage-collection/renumbering schemes that allow
	users to schedule the inode renumbering costs in overnight
	batches instead of on every inode create.

> The inode number becomes just another piece of metadata stored in the
> inode.
> struct btrfs_inode_item has four spare u64s, so we could use one of
> those.
> struct btrfs_dir_item would need to store the inode number too.  What
> is location.offset used for?  Would a diritem ever point to a non-zero
> offset?  Could the 'offset' be used to store the inode number?

Offset is used to identify subvol roots at the moment, but so far that
means only values 0 and UINT64_MAX are used.  It seems possible to treat
all other values as inode numbers.  Don't quote me on that--I'm not an
expert on this structure.

> This could even be added to existing filesystems I think.  It might not
> be easy to re-use inode numbers smaller than the largest at the time the
> extension was added, but newly added inode numbers could be reused after
> they were deleted.

We'd need a structure to track reusable inode numbers and it would have to
be kept up to date to work, so this feature would necessarily come with an
incompat bit.  Whether you borrow bits from existing structures or make
extended new structures doesn't matter at that point, though obviously
for something as common as inodes it would be bad to make them bigger.

Some of the btrfs userspace API uses inode numbers, but unless I missed
something, it could all be converted to use object numbers directly
instead.

> Just a thought...
> 
> NeilBrown
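
For reference, the effect of the patch below can be demonstrated in
userspace; bswap_64() stands in for the kernel's swab64(), and 256 is
the inode number a btrfs subvolume root typically reports:

#include <stdio.h>
#include <stdint.h>
#include <byteswap.h>

static uint64_t nfs_fileid(uint64_t subvol, uint64_t ino)
{
	uint64_t uniq = bswap_64(subvol);

	/* Same guards as the nfsd hunks: skip a zero uniquifier and
	 * never let the xor wipe out the fileid. */
	if (uniq == 0 || uniq == ino)
		return ino;
	return ino ^ uniq;
}

int main(void)
{
	/* Roots of two subvolumes: both have st_ino 256 on btrfs, yet
	 * they report distinct fileids to the NFS client. */
	printf("%016llx\n", (unsigned long long)nfs_fileid(256, 256));
	printf("%016llx\n", (unsigned long long)nfs_fileid(257, 256));
	return 0;
}
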
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0117d867ecf8..989fdf2032d5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9195,6 +9195,10 @@  static int btrfs_getattr(struct user_namespace *mnt_userns,
 	generic_fillattr(&init_user_ns, inode, stat);
 	stat->dev = BTRFS_I(inode)->root->anon_dev;
 
+	if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
+		stat->ino_uniquifier =
+			swab64(BTRFS_I(inode)->root->root_key.objectid);
+
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
 	inode_bytes = inode_get_bytes(inode);
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 0a5ebc52e6a9..669e2437362a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -340,6 +340,7 @@  svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 {
 	struct user_namespace *userns = nfsd_user_namespace(rqstp);
 	__be32 *p;
+	u64 ino;
 	u64 fsid;
 
 	p = xdr_reserve_space(xdr, XDR_UNIT * 21);
@@ -377,7 +378,10 @@  svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
 	p = xdr_encode_hyper(p, fsid);
 
 	/* fileid */
-	p = xdr_encode_hyper(p, stat->ino);
+	ino = stat->ino;
+	if (stat->ino_uniquifier && stat->ino_uniquifier != ino)
+		ino ^= stat->ino_uniquifier;
+	p = xdr_encode_hyper(p, ino);
 
 	p = encode_nfstime3(p, &stat->atime);
 	p = encode_nfstime3(p, &stat->mtime);
@@ -1151,6 +1155,17 @@  svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name,
 	if (xdr_stream_encode_item_present(xdr) < 0)
 		return false;
 	/* fileid */
+	if (!resp->dir_have_uniquifier) {
+		struct kstat stat;
+		if (fh_getattr(&resp->fh, &stat) == nfs_ok)
+			resp->dir_ino_uniquifier = stat.ino_uniquifier;
+		else
+			resp->dir_ino_uniquifier = 0;
+		resp->dir_have_uniquifier = 1;
+	}
+	if (resp->dir_ino_uniquifier &&
+	    resp->dir_ino_uniquifier != ino)
+		ino ^= resp->dir_ino_uniquifier;
 	if (xdr_stream_encode_u64(xdr, ino) < 0)
 		return false;
 	/* name */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 7abeccb975b2..ddccf849c29c 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -3114,10 +3114,14 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 					fhp->fh_handle.fh_size);
 	}
 	if (bmval0 & FATTR4_WORD0_FILEID) {
+		u64 ino = stat.ino;
+		if (stat.ino_uniquifier &&
+		    stat.ino_uniquifier != stat.ino)
+			ino ^= stat.ino_uniquifier;
 		p = xdr_reserve_space(xdr, 8);
 		if (!p)
 			goto out_resource;
-		p = xdr_encode_hyper(p, stat.ino);
+		p = xdr_encode_hyper(p, ino);
 	}
 	if (bmval0 & FATTR4_WORD0_FILES_AVAIL) {
 		p = xdr_reserve_space(xdr, 8);
@@ -3285,6 +3289,9 @@  nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
 			if (err)
 				goto out_nfserr;
 			ino = parent_stat.ino;
+			if (parent_stat.ino_uniquifier &&
+			    parent_stat.ino_uniquifier != ino)
+				ino ^= parent_stat.ino_uniquifier;
 		}
 		p = xdr_encode_hyper(p, ino);
 	}
diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h
index 933008382bbe..b4f9f3c71f72 100644
--- a/fs/nfsd/xdr3.h
+++ b/fs/nfsd/xdr3.h
@@ -179,6 +179,8 @@  struct nfsd3_readdirres {
 	struct xdr_buf		dirlist;
 	struct svc_fh		scratch;
 	struct readdir_cd	common;
+	u64			dir_ino_uniquifier;
+	int			dir_have_uniquifier;
 	unsigned int		cookie_offset;
 	struct svc_rqst *	rqstp;
 
diff --git a/include/linux/stat.h b/include/linux/stat.h
index fff27e603814..a5188f42ed81 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -46,6 +46,23 @@  struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	/*
+	 * BTRFS does not provide unique inode numbers within a filesystem,
+	 * depending on a synthetic 'dev' to provide uniqueness.
+	 * NFSd cannot make use of this 'dev' number so clients often see
+	 * duplicate inode numbers.
+	 * For BTRFS, 'ino' is unlikely to use the high bits.  It puts
+	 * another number in ino_uniquifier which:
+	 * - has most entropy in the high bits
+	 * - is different precisely when 'dev' is different
+	 * - is stable across unmount/remount
+	 * NFSd can xor this with 'ino' to get a substantially more unique
+	 * number for reporting to the client.
+	 * The ino_uniquifier for a directory can reasonably be applied
+	 * to inode numbers reported by the readdir filldir callback.
+	 * It is NOT currently exported to user-space.
+	 */
+	u64		ino_uniquifier;
 };
 
 #endif