Message ID | 162881913686.1695.12479588032010502384@noble.neil.brown.name (mailing list archive)
---|---
State | New, archived
Series | VFS/BTRFS/NFSD: provide more unique inode number for btrfs export
On 8/12/21 9:45 PM, NeilBrown wrote:
>
> [[This patch is a minimal patch which addresses the current problems
> with nfsd and btrfs, in a way which I think is most supportable, least
> surprising, and least likely to impact any future attempts to more
> completely fix the btrfs file-identify problem]]
>
> BTRFS does not provide unique inode numbers across a filesystem.
> It *does* provide unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
>
> nfsd cannot use these varying device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit
> mount on the client (this is partially supported in practice, but
> violates the protocol specification).
>
> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
>
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
>
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifier), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.

This is a reasonable approach to me: it solves the problem without being
overly complicated and side-steps the thornier issues around how we deal
with subvolumes.  I'll leave it up to the maintainers of the other fs'es
to weigh in, but for me you can add

Acked-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
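For illustration, here is a minimal user-space sketch of the combination
the patch describes: byte-swapping the subvolume id moves its entropy into
the high bits before it is xored with the inode number.  This is not the
kernel code itself; the function name and the sample numbers are made up.

    /*
     * Illustration only: how swab64(subvol_id) ^ ino spreads the
     * subvolume id into the high bits of the fileid sent to the client.
     * fileid_for_client() is a made-up name, not a kernel function.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    static uint64_t fileid_for_client(uint64_t subvol_id, uint64_t ino)
    {
        uint64_t uniquifier = __builtin_bswap64(subvol_id);  /* swab64() */

        /* same guards as the nfsd side of the patch */
        if (uniquifier && uniquifier != ino)
            ino ^= uniquifier;
        return ino;
    }

    int main(void)
    {
        /* inode 257 exists in both subvolume 256 and subvolume 257 */
        printf("%016" PRIx64 "\n", fileid_for_client(256, 257));
        printf("%016" PRIx64 "\n", fileid_for_client(257, 257));
        return 0;
    }

With plain inode numbers both calls would report 257; with the uniquifier
applied, the two fileids differ in their high bytes.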
Hi All,

On 8/13/21 3:45 AM, NeilBrown wrote:
>
> [[This patch is a minimal patch which addresses the current problems
> with nfsd and btrfs, in a way which I think is most supportable, least
> surprising, and least likely to impact any future attempts to more
> completely fix the btrfs file-identify problem]]
>
> BTRFS does not provide unique inode numbers across a filesystem.
> It *does* provide unique inode numbers within a subvolume and
> uses synthetic device numbers for different subvolumes to ensure
> uniqueness for device+inode.
>
> nfsd cannot use these varying device numbers.  If nfsd were to
> synthesise different stable filesystem ids to give to the client, that
> would cause subvolumes to appear in the mount table on the client, even
> though they don't appear in the mount table on the server.  Also, NFSv3
> doesn't support changing the filesystem id without a new explicit
> mount on the client (this is partially supported in practice, but
> violates the protocol specification).

I am sure this was discussed already, but I was unable to find any trace
of that discussion.  If the problem is the collision between the inode
numbers of different subvolumes in the nfsd export, wouldn't it be simpler
to truncate the export at the subvolume boundary?  That would be more
coherent with the current behaviour of vfs+nfsd.

In fact, in btrfs a subvolume is a complete filesystem, with its "own
synthetic" device.  We may or may not like this solution, but it is the
one most aligned with the unix standard, where each filesystem has its
own pair (device, inode-set).  NFS (by default) avoids crossing the
boundary between filesystems.  So why should BTRFS be different?

Maybe an opt-in option would solve the backward-compatibility concern
(i.e. avoid problems for setups which rely on the current behaviour).

> So currently, the roots of all subvolumes report the same inode number
> in the same filesystem to NFS clients and tools like 'find' notice that
> a directory has the same identity as an ancestor, and so refuse to
> enter that directory.
>
> This patch allows btrfs (or any filesystem) to provide a 64bit number
> that can be xored with the inode number to make the number more unique.
> Rather than the client being certain to see duplicates, with this patch
> it is possible but extremely rare.
>
> The number that btrfs provides is a swab64() version of the subvolume
> identifier.  This has most entropy in the high bits (the low bits of the
> subvolume identifier), while the inode has most entropy in the low bits.
> The result will always be unique within a subvolume, and will almost
> always be unique across the filesystem.
> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/btrfs/inode.c | 4 ++++ > fs/nfsd/nfs3xdr.c | 17 ++++++++++++++++- > fs/nfsd/nfs4xdr.c | 9 ++++++++- > fs/nfsd/xdr3.h | 2 ++ > include/linux/stat.h | 17 +++++++++++++++++ > 5 files changed, 47 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0117d867ecf8..989fdf2032d5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, > generic_fillattr(&init_user_ns, inode, stat); > stat->dev = BTRFS_I(inode)->root->anon_dev; > > + if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID) > + stat->ino_uniquifier = > + swab64(BTRFS_I(inode)->root->root_key.objectid); > + > spin_lock(&BTRFS_I(inode)->lock); > delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; > inode_bytes = inode_get_bytes(inode); > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > index 0a5ebc52e6a9..669e2437362a 100644 > --- a/fs/nfsd/nfs3xdr.c > +++ b/fs/nfsd/nfs3xdr.c > @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > { > struct user_namespace *userns = nfsd_user_namespace(rqstp); > __be32 *p; > + u64 ino; > u64 fsid; > > p = xdr_reserve_space(xdr, XDR_UNIT * 21); > @@ -377,7 +378,10 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > p = xdr_encode_hyper(p, fsid); > > /* fileid */ > - p = xdr_encode_hyper(p, stat->ino); > + ino = stat->ino; > + if (stat->ino_uniquifier && stat->ino_uniquifier != ino) > + ino ^= stat->ino_uniquifier; > + p = xdr_encode_hyper(p, ino); > > p = encode_nfstime3(p, &stat->atime); > p = encode_nfstime3(p, &stat->mtime); > @@ -1151,6 +1155,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name, > if (xdr_stream_encode_item_present(xdr) < 0) > return false; > /* fileid */ > + if (!resp->dir_have_uniquifier) { > + struct kstat stat; > + if (fh_getattr(&resp->fh, &stat) == nfs_ok) > + resp->dir_ino_uniquifier = stat.ino_uniquifier; > + else > + resp->dir_ino_uniquifier = 0; > + resp->dir_have_uniquifier = 1; > + } > + if (resp->dir_ino_uniquifier && > + resp->dir_ino_uniquifier != ino) > + ino ^= resp->dir_ino_uniquifier; > if (xdr_stream_encode_u64(xdr, ino) < 0) > return false; > /* name */ > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index 7abeccb975b2..ddccf849c29c 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -3114,10 +3114,14 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > fhp->fh_handle.fh_size); > } > if (bmval0 & FATTR4_WORD0_FILEID) { > + u64 ino = stat.ino; > + if (stat.ino_uniquifier && > + stat.ino_uniquifier != stat.ino) > + ino ^= stat.ino_uniquifier; > p = xdr_reserve_space(xdr, 8); > if (!p) > goto out_resource; > - p = xdr_encode_hyper(p, stat.ino); > + p = xdr_encode_hyper(p, ino); > } > if (bmval0 & FATTR4_WORD0_FILES_AVAIL) { > p = xdr_reserve_space(xdr, 8); > @@ -3285,6 +3289,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > if (err) > goto out_nfserr; > ino = parent_stat.ino; > + if (parent_stat.ino_uniquifier && > + parent_stat.ino_uniquifier != ino) > + ino ^= parent_stat.ino_uniquifier; > } > p = xdr_encode_hyper(p, ino); > } > diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h > index 933008382bbe..b4f9f3c71f72 100644 > --- a/fs/nfsd/xdr3.h > +++ b/fs/nfsd/xdr3.h > @@ -179,6 +179,8 @@ struct nfsd3_readdirres { > struct xdr_buf dirlist; > struct svc_fh scratch; > struct readdir_cd common; > + u64 dir_ino_uniquifier; > + int 
dir_have_uniquifier; > unsigned int cookie_offset; > struct svc_rqst * rqstp; > > diff --git a/include/linux/stat.h b/include/linux/stat.h > index fff27e603814..a5188f42ed81 100644 > --- a/include/linux/stat.h > +++ b/include/linux/stat.h > @@ -46,6 +46,23 @@ struct kstat { > struct timespec64 btime; /* File creation time */ > u64 blocks; > u64 mnt_id; > + /* > + * BTRFS does not provide unique inode numbers within a filesystem, > + * depending on a synthetic 'dev' to provide uniqueness. > + * NFSd cannot make use of this 'dev' number so clients often see > + * duplicate inode numbers. > + * For BTRFS, 'ino' is unlikely to use the high bits. It puts > + * another number in ino_uniquifier which: > + * - has most entropy in the high bits > + * - is different precisely when 'dev' is different > + * - is stable across unmount/remount > + * NFSd can xor this with 'ino' to get a substantially more unique > + * number for reporting to the client. > + * The ino_uniquifier for a directory can reasonably be applied > + * to inode numbers reported by the readdir filldir callback. > + * It is NOT currently exported to user-space. > + */ > + u64 ino_uniquifier; > }; Why don't rename "ino_uniquifier" as "ino_and_subvolume" and leave to the filesystem the work to combine the inode and the subvolume-id ? I am worried that the logic is split between the filesystem, which synthesizes the ino_uniquifier, and to NFS which combine to the inode. I am thinking that this combination is filesystem specific; for BTRFS is a simple xor but for other filesystem may be a more complex operation, so leaving an half in the filesystem and another half to the NFS seems to not optimal if other filesystem needs to use ino_uniquifier. > #endif >
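To make the suggestion concrete, here is a rough, purely hypothetical
sketch of what btrfs_getattr() might do if the filesystem performed the
combination itself, with "ino_and_subvolume" being the proposed (and
non-existent) field name; nfsd would then report that field verbatim
instead of xoring anything:

    /*
     * Hypothetical variant of btrfs_getattr(): the filesystem itself
     * combines subvolume id and inode number, and nfsd would report
     * stat->ino_and_subvolume (a proposed field, not a real one) as-is.
     */
    if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID)
        stat->ino_and_subvolume = stat->ino ^
            swab64(BTRFS_I(inode)->root->root_key.objectid);
    else
        stat->ino_and_subvolume = stat->ino;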
On Sun, 15 Aug 2021 09:39:08 +0200 Goffredo Baroncelli <kreijack@libero.it> wrote: > I am sure that it was discussed already but I was unable to find any track > of this discussion. But if the problem is the collision between the inode > number of different subvolume in the nfd export, is it simpler if the export > is truncated to the subvolume boundary ? It would be more coherent with the > current behavior of vfs+nfsd. See this bugreport thread which started it all: https://www.spinics.net/lists/linux-btrfs/msg111172.html In there the reporting user replied that it is strongly not feasible for them to export each individual snapshot. > In fact in btrfs a subvolume is a complete filesystem, with an "own > synthetic" device. We could like or not this solution, but this solution is > the more aligned to the unix standard, where for each filesystem there is a > pair (device, inode-set). NFS (by default) avoids to cross the boundary > between the filesystems. So why in BTRFS this should be different ? From the user point of view subvolumes are basically directories; that they are "complete filesystems"* is merely a low-level implementation detail. * well except they are not, as you cannot 'dd' a subvolume to another blockdevice. > Why don't rename "ino_uniquifier" as "ino_and_subvolume" and leave to the > filesystem the work to combine the inode and the subvolume-id ? > > I am worried that the logic is split between the filesystem, which > synthesizes the ino_uniquifier, and to NFS which combine to the inode. I am > thinking that this combination is filesystem specific; for BTRFS is a simple > xor but for other filesystem may be a more complex operation, so leaving an > half in the filesystem and another half to the NFS seems to not optimal if > other filesystem needs to use ino_uniquifier. I wondered a bit myself, what are the downsides of just doing the uniquefication inside Btrfs, not leaving that to NFSD? I mean not even adding the extra stat field, just return the inode itself with that already applied. Surely cannot be any worse collision-wise, than different subvolumes straight up having the same inode numbers as right now? Or is it a performance concern, always doing more work, for something which only NFSD has needed so far.
On 8/15/21 9:35 PM, Roman Mamedov wrote:
> On Sun, 15 Aug 2021 09:39:08 +0200
> Goffredo Baroncelli <kreijack@libero.it> wrote:
>
>> I am sure this was discussed already, but I was unable to find any trace
>> of that discussion.  If the problem is the collision between the inode
>> numbers of different subvolumes in the nfsd export, wouldn't it be simpler
>> to truncate the export at the subvolume boundary?  That would be more
>> coherent with the current behaviour of vfs+nfsd.
>
> See this bugreport thread which started it all:
> https://www.spinics.net/lists/linux-btrfs/msg111172.html
>
> In there the reporting user replied that it is strongly not feasible for them
> to export each individual snapshot.

Thanks for pointing that out.

However, looking at the 'exports' man page, it seems that NFS already has
an option to cover these cases: 'crossmnt'.

If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an
already exported one) and the "parent" filesystem is marked as 'crossmnt',
the client mounts the parent AND the child filesystem with two separate
mounts, so there is no problem of inode collision.

I tested it by mounting two nested ext4 filesystems, and there isn't any
problem of inode collision (even though there are two different files
with the same inode number).

---------
# mount -o loop disk2 test3/
# echo 123 >test3/one
# mkdir test3/test4
# sudo mount -o loop disk3 test3/test4/
# echo 123 >test3/test4/one
# ls -liR test3/
test3/:
total 24
11 drwx------ 2 root root 16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo 4 Aug 15 22:29 one
 2 drwxr-xrwx 3 root root 4096 Aug 15 22:46 test4

test3/test4:
total 20
11 drwx------ 2 root root 16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo 4 Aug 15 22:46 one

# egrep test3 /etc/exports
/tmp/test3 *(rw,no_subtree_check,crossmnt)

# mount 192.168.1.27:/tmp/test3 3
ls -lRi 3
3:
total 24
11 drwx------ 2 root root 16384 Aug 15 22:27 lost+found
12 -rw-r--r-- 1 ghigo ghigo 4 Aug 15 22:29 one
 2 drwxr-xrwx 3 root root 4096 Aug 15 22:46 test4

3/test4:
total 20
11 drwx------ 2 root root 16384 Aug 15 22:45 lost+found
12 -rw-r--r-- 1 ghigo ghigo 4 Aug 15 22:46 one

# mount | egrep 192
192.168.1.27:/tmp/test3 on /tmp/3 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)
192.168.1.27:/tmp/test3/test4 on /tmp/3/test4 type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.27,local_lock=none,addr=192.168.1.27)
---------------

I tried mounting with "nfsvers=3" too, and it seems to work.

However, the tests above are on ext4; it doesn't work with btrfs, but I
think this is more an implementation problem than a strategy problem.

What I mean is that NFS already has a way to mount different parts of the
fs-tree with different mounts (from a client POV).  I think this strategy
should be used when NFSd exports a BTRFS filesystem:
- if 'crossmnt' is NOT passed, the export should end at the subvolume
  boundary (or allow inode collisions)
- if 'crossmnt' is passed, the client should automatically mount each
  nested subvolume as a separate mount

>
>> In fact, in btrfs a subvolume is a complete filesystem, with its "own
>> synthetic" device.  We may or may not like this solution, but it is the
>> one most aligned with the unix standard, where each filesystem has its
>> own pair (device, inode-set).  NFS (by default) avoids crossing the
>> boundary between filesystems.  So why should BTRFS be different?
>
> From the user point of view subvolumes are basically directories; that they
> are "complete filesystems"* is merely a low-level implementation detail.
>
> * well except they are not, as you cannot 'dd' a subvolume to another
> blockdevice.
>
>> Why don't rename "ino_uniquifier" as "ino_and_subvolume" and leave to the
>> filesystem the work to combine the inode and the subvolume-id?
>>
>> I am worried that the logic is split between the filesystem, which
>> synthesizes the ino_uniquifier, and to NFS which combine to the inode. I am
>> thinking that this combination is filesystem specific; for BTRFS is a simple
>> xor but for other filesystem may be a more complex operation, so leaving an
>> half in the filesystem and another half to the NFS seems to not optimal if
>> other filesystem needs to use ino_uniquifier.
>
> I wondered a bit myself, what are the downsides of just doing the
> uniquefication inside Btrfs, not leaving that to NFSD?
>
> I mean not even adding the extra stat field, just return the inode itself with
> that already applied.  Surely cannot be any worse collision-wise, than
> different subvolumes straight up having the same inode numbers as right now?
>
> Or is it a performance concern, always doing more work, for something which
> only NFSD has needed so far.
>
On Mon, 16 Aug 2021, kreijack@inwind.it wrote: > On 8/15/21 9:35 PM, Roman Mamedov wrote: > > On Sun, 15 Aug 2021 09:39:08 +0200 > > Goffredo Baroncelli <kreijack@libero.it> wrote: > > > >> I am sure that it was discussed already but I was unable to find any track > >> of this discussion. But if the problem is the collision between the inode > >> number of different subvolume in the nfd export, is it simpler if the export > >> is truncated to the subvolume boundary ? It would be more coherent with the > >> current behavior of vfs+nfsd. > > > > See this bugreport thread which started it all: > > https://www.spinics.net/lists/linux-btrfs/msg111172.html > > > > In there the reporting user replied that it is strongly not feasible for them > > to export each individual snapshot. > > Thanks for pointing that. > > However looking at the 'exports' man page, it seems that NFS has already an > option to cover these cases: 'crossmnt'. > > If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already > exported one) and the "parent" filesystem is marked as 'crossmnt', the client mount > the parent AND the child filesystem with two separate mounts, so there is not problem of inode collision. As you acknowledged, you haven't read the whole back-story. Maybe you should. https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/ https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/ https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/ The flow of conversation does sometimes jump between threads. I'm very happy to respond you questions after you've absorbed all that. NeilBrown
On Mon, 16 Aug 2021, Roman Mamedov wrote:
>
> I wondered a bit myself, what are the downsides of just doing the
> uniquefication inside Btrfs, not leaving that to NFSD?
>
> I mean not even adding the extra stat field, just return the inode itself with
> that already applied.  Surely cannot be any worse collision-wise, than
> different subvolumes straight up having the same inode numbers as right now?
>
> Or is it a performance concern, always doing more work, for something which
> only NFSD has needed so far.

Any change in behaviour will have unexpected consequences.  I think the
btrfs maintainers' perspective is that they don't want to change
behaviour if they don't have to (which is reasonable) and that currently
they don't have to (which probably means that users aren't complaining
loudly enough).

NFS export of BTRFS is already demonstrably broken and users are
complaining loudly enough that I can hear them .... though I think it
has been broken like this for 10 years, so I do wonder why I didn't hear
them before.

If something is perceived as broken, then a behaviour change that
appears to fix it is more easily accepted.

However, having said that, I now see that my latest patch is not ideal.
It changes the inode numbers associated with filehandles of objects in
the non-root subvolume.  This will cause the Linux NFS client to treat
the object as 'stale'.  For most objects this is a transient annoyance.
Reopen the file or restart the process and all should be well again.
However if the inode number of the mount point changes, you will need to
unmount and remount.  That is somewhat more of an annoyance.

There are a few ways to handle this more gracefully.

1/ We could get btrfs to hand out new filehandles as well as new inode
numbers, but still accept the old filehandles.  Then we could make the
inode number reported be based on the filehandle.  This would be nearly
seamless but rather clumsy to code.  I'm not *very* keen on this idea,
but it is worth keeping in mind.

2/ We could add a btrfs mount option to control whether the uniquifier
was set or not.  This would allow the sysadmin to choose when to manage
any breakage.  I think this is my preference, but Josef has declared an
aversion to mount options.

3/ We could add a module parameter to nfsd to control whether the
uniquifier is merged in.  This again gives the sysadmin control, and it
can be done despite any aversion from btrfs maintainers.  But I'd need
to overcome any aversion from the nfsd maintainers, and I don't know how
strong that would be yet.  (A new export option isn't really appropriate.
It is much more work to add an export option than to add a mount option.)

I don't know.... maybe I should try harder to like option 1, or at least
verify if it works as expected and see how ugly the code really is.

NeilBrown
On 8/15/21 11:53 PM, NeilBrown wrote: > On Mon, 16 Aug 2021, kreijack@inwind.it wrote: >> On 8/15/21 9:35 PM, Roman Mamedov wrote: >>> On Sun, 15 Aug 2021 09:39:08 +0200 >>> Goffredo Baroncelli <kreijack@libero.it> wrote: >>> >>>> I am sure that it was discussed already but I was unable to find any track >>>> of this discussion. But if the problem is the collision between the inode >>>> number of different subvolume in the nfd export, is it simpler if the export >>>> is truncated to the subvolume boundary ? It would be more coherent with the >>>> current behavior of vfs+nfsd. >>> >>> See this bugreport thread which started it all: >>> https://www.spinics.net/lists/linux-btrfs/msg111172.html >>> >>> In there the reporting user replied that it is strongly not feasible for them >>> to export each individual snapshot. >> >> Thanks for pointing that. >> >> However looking at the 'exports' man page, it seems that NFS has already an >> option to cover these cases: 'crossmnt'. >> >> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already >> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mount >> the parent AND the child filesystem with two separate mounts, so there is not problem of inode collision. > > As you acknowledged, you haven't read the whole back-story. Maybe you > should. > > https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/ > https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/ > https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/ > > The flow of conversation does sometimes jump between threads. > > I'm very happy to respond you questions after you've absorbed all that. Hi Neil, I read the other threads. And I still have the opinion that the nfsd crossmnt behavior should be a good solution for the btrfs subvolumes. > > NeilBrown >
On Wed, 18 Aug 2021, kreijack@inwind.it wrote: > On 8/15/21 11:53 PM, NeilBrown wrote: > > On Mon, 16 Aug 2021, kreijack@inwind.it wrote: > >> On 8/15/21 9:35 PM, Roman Mamedov wrote: > >> > >> However looking at the 'exports' man page, it seems that NFS has already an > >> option to cover these cases: 'crossmnt'. > >> > >> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an already > >> exported one) and the "parent" filesystem is marked as 'crossmnt', the client mount > >> the parent AND the child filesystem with two separate mounts, so there is not problem of inode collision. > > > > As you acknowledged, you haven't read the whole back-story. Maybe you > > should. > > > > https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/ > > https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/ > > https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/ > > > > The flow of conversation does sometimes jump between threads. > > > > I'm very happy to respond you questions after you've absorbed all that. > > Hi Neil, > > I read the other threads. And I still have the opinion that the nfsd > crossmnt behavior should be a good solution for the btrfs subvolumes. Thanks for reading it all. Let me join the dots for you. "crossmnt" doesn't currently work because "subvolumes" aren't mount points. We could change btrfs so that subvolumes *are* mountpoints. They would have to be automounts. I posted patches to do that. They were broadly rejected because people have many thousands of submounts that are concurrently active and so /proc/mounts would be multiple megabytes is size and working with it would become impractical. Also, non-privileged users can create subvols, and may want the path names to remain private. But these subvols would appear in the mount table and so would no longer be private. Alternately we could change the "crossmnt" functionality to treat a change of st_dev as though it were a mount point. I posted patches to do this too. This hits the same sort of problems in a different way. If NFSD reports that is has crossed a "mount" by providing a different filesystem-id to the client, then the client will create a new mount point which will appear in /proc/mounts. It might be less likely that many thousands of subvolumes are accessed over NFS than locally, but it is still entirely possible. I don't want the NFS client to suffer a problem that btrfs doesn't impose locally. And 'private' subvolumes could again appear on a public list if they were accessed via NFS. Thanks, NeilBrown
Hi, We use 'swab64' to combinate 'subvol id' and 'inode' into 64bit in this patch. case1: 'subvol id': 16bit => 64K, a little small because the subvol id is always increase? 'inode': 48bit * 4K per node, this is big enough. case2: 'subvol id': 24bit => 16M, this is big enough. 'inode': 40bit * 4K per node => 4 PB. this is a little small? Is there a way to 'bit-swap' the subvol id, rather the current byte-swap? If not, maybe it is a better balance if we combinate 22bit subvol id and 42 bit inode? Best Regards Wang Yugui (wangyugui@e16-tech.com) 2021/08/18 > > [[This patch is a minimal patch which addresses the current problems > with nfsd and btrfs, in a way which I think is most supportable, least > surprising, and least likely to impact any future attempts to more > completely fix the btrfs file-identify problem]] > > BTRFS does not provide unique inode numbers across a filesystem. > It *does* provide unique inode numbers with a subvolume and > uses synthetic device numbers for different subvolumes to ensure > uniqueness for device+inode. > > nfsd cannot use these varying device numbers. If nfsd were to > synthesise different stable filesystem ids to give to the client, that > would cause subvolumes to appear in the mount table on the client, even > though they don't appear in the mount table on the server. Also, NFSv3 > doesn't support changing the filesystem id without a new explicit > mount on the client (this is partially supported in practice, but > violates the protocol specification). > > So currently, the roots of all subvolumes report the same inode number > in the same filesystem to NFS clients and tools like 'find' notice that > a directory has the same identity as an ancestor, and so refuse to > enter that directory. > > This patch allows btrfs (or any filesystem) to provide a 64bit number > that can be xored with the inode number to make the number more unique. > Rather than the client being certain to see duplicates, with this patch > it is possible but extremely rare. > > The number than btrfs provides is a swab64() version of the subvolume > identifier. This has most entropy in the high bits (the low bits of the > subvolume identifer), while the inoe has most entropy in the low bits. > The result will always be unique within a subvolume, and will almost > always be unique across the filesystem. 
> > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/btrfs/inode.c | 4 ++++ > fs/nfsd/nfs3xdr.c | 17 ++++++++++++++++- > fs/nfsd/nfs4xdr.c | 9 ++++++++- > fs/nfsd/xdr3.h | 2 ++ > include/linux/stat.h | 17 +++++++++++++++++ > 5 files changed, 47 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0117d867ecf8..989fdf2032d5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, > generic_fillattr(&init_user_ns, inode, stat); > stat->dev = BTRFS_I(inode)->root->anon_dev; > > + if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID) > + stat->ino_uniquifier = > + swab64(BTRFS_I(inode)->root->root_key.objectid); > + > spin_lock(&BTRFS_I(inode)->lock); > delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; > inode_bytes = inode_get_bytes(inode); > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > index 0a5ebc52e6a9..669e2437362a 100644 > --- a/fs/nfsd/nfs3xdr.c > +++ b/fs/nfsd/nfs3xdr.c > @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > { > struct user_namespace *userns = nfsd_user_namespace(rqstp); > __be32 *p; > + u64 ino; > u64 fsid; > > p = xdr_reserve_space(xdr, XDR_UNIT * 21); > @@ -377,7 +378,10 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > p = xdr_encode_hyper(p, fsid); > > /* fileid */ > - p = xdr_encode_hyper(p, stat->ino); > + ino = stat->ino; > + if (stat->ino_uniquifier && stat->ino_uniquifier != ino) > + ino ^= stat->ino_uniquifier; > + p = xdr_encode_hyper(p, ino); > > p = encode_nfstime3(p, &stat->atime); > p = encode_nfstime3(p, &stat->mtime); > @@ -1151,6 +1155,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name, > if (xdr_stream_encode_item_present(xdr) < 0) > return false; > /* fileid */ > + if (!resp->dir_have_uniquifier) { > + struct kstat stat; > + if (fh_getattr(&resp->fh, &stat) == nfs_ok) > + resp->dir_ino_uniquifier = stat.ino_uniquifier; > + else > + resp->dir_ino_uniquifier = 0; > + resp->dir_have_uniquifier = 1; > + } > + if (resp->dir_ino_uniquifier && > + resp->dir_ino_uniquifier != ino) > + ino ^= resp->dir_ino_uniquifier; > if (xdr_stream_encode_u64(xdr, ino) < 0) > return false; > /* name */ > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index 7abeccb975b2..ddccf849c29c 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -3114,10 +3114,14 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > fhp->fh_handle.fh_size); > } > if (bmval0 & FATTR4_WORD0_FILEID) { > + u64 ino = stat.ino; > + if (stat.ino_uniquifier && > + stat.ino_uniquifier != stat.ino) > + ino ^= stat.ino_uniquifier; > p = xdr_reserve_space(xdr, 8); > if (!p) > goto out_resource; > - p = xdr_encode_hyper(p, stat.ino); > + p = xdr_encode_hyper(p, ino); > } > if (bmval0 & FATTR4_WORD0_FILES_AVAIL) { > p = xdr_reserve_space(xdr, 8); > @@ -3285,6 +3289,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > if (err) > goto out_nfserr; > ino = parent_stat.ino; > + if (parent_stat.ino_uniquifier && > + parent_stat.ino_uniquifier != ino) > + ino ^= parent_stat.ino_uniquifier; > } > p = xdr_encode_hyper(p, ino); > } > diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h > index 933008382bbe..b4f9f3c71f72 100644 > --- a/fs/nfsd/xdr3.h > +++ b/fs/nfsd/xdr3.h > @@ -179,6 +179,8 @@ struct nfsd3_readdirres { > struct xdr_buf dirlist; > struct svc_fh scratch; > struct readdir_cd common; > + u64 dir_ino_uniquifier; > + int 
dir_have_uniquifier; > unsigned int cookie_offset; > struct svc_rqst * rqstp; > > diff --git a/include/linux/stat.h b/include/linux/stat.h > index fff27e603814..a5188f42ed81 100644 > --- a/include/linux/stat.h > +++ b/include/linux/stat.h > @@ -46,6 +46,23 @@ struct kstat { > struct timespec64 btime; /* File creation time */ > u64 blocks; > u64 mnt_id; > + /* > + * BTRFS does not provide unique inode numbers within a filesystem, > + * depending on a synthetic 'dev' to provide uniqueness. > + * NFSd cannot make use of this 'dev' number so clients often see > + * duplicate inode numbers. > + * For BTRFS, 'ino' is unlikely to use the high bits. It puts > + * another number in ino_uniquifier which: > + * - has most entropy in the high bits > + * - is different precisely when 'dev' is different > + * - is stable across unmount/remount > + * NFSd can xor this with 'ino' to get a substantially more unique > + * number for reporting to the client. > + * The ino_uniquifier for a directory can reasonably be applied > + * to inode numbers reported by the readdir filldir callback. > + * It is NOT currently exported to user-space. > + */ > + u64 ino_uniquifier; > }; > > #endif > -- > 2.32.0
On 8/17/21 11:39 PM, NeilBrown wrote:
> On Wed, 18 Aug 2021, kreijack@inwind.it wrote:
>> On 8/15/21 11:53 PM, NeilBrown wrote:
>>> On Mon, 16 Aug 2021, kreijack@inwind.it wrote:
>>>> On 8/15/21 9:35 PM, Roman Mamedov wrote:
>>>>
>>>> However looking at the 'exports' man page, it seems that NFS has already an
>>>> option to cover these cases: 'crossmnt'.
>>>>
>>>> If NFSd detects a "child" filesystem (i.e. a filesystem mounted inside an
>>>> already exported one) and the "parent" filesystem is marked as 'crossmnt',
>>>> the client mounts the parent AND the child filesystem with two separate
>>>> mounts, so there is no problem of inode collision.
>>>
>>> As you acknowledged, you haven't read the whole back-story.  Maybe you
>>> should.
>>>
>>> https://lore.kernel.org/linux-nfs/20210613115313.BC59.409509F4@e16-tech.com/
>>> https://lore.kernel.org/linux-nfs/162848123483.25823.15844774651164477866.stgit@noble.brown/
>>> https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/
>>>
>>> The flow of conversation does sometimes jump between threads.
>>>
>>> I'm very happy to respond to your questions after you've absorbed all that.
>>
>> Hi Neil,
>>
>> I read the other threads.  And I still have the opinion that the nfsd
>> crossmnt behavior should be a good solution for the btrfs subvolumes.
>
> Thanks for reading it all.  Let me join the dots for you.
> [...]
>
> Alternately we could change the "crossmnt" functionality to treat a
> change of st_dev as though it were a mount point.  I posted patches to
> do this too.  This hits the same sort of problems in a different way.
> If NFSD reports that it has crossed a "mount" by providing a different
> filesystem-id to the client, then the client will create a new mount
> point which will appear in /proc/mounts.

Yes, this is my proposal.

> It might be less likely that
> many thousands of subvolumes are accessed over NFS than locally, but it
> is still entirely possible.

I don't think it would be so unlikely.  Think about a file indexer and/or
a 'find' command run in the folder that contains the snapshots...

> I don't want the NFS client to suffer a
> problem that btrfs doesn't impose locally.

The solution is not easy.  In fact we are trying to map a u64 x u64 space
into a u64 space.  The truth is that we cannot guarantee that a collision
will never happen.  We can only say that for a fresh filesystem it is
near impossible, but for an aged filesystem it is unlikely yet possible.

We already faced a real case where we exhausted the inode space on 32-bit
arches.  What are the chances that the count of subvolumes ever created
is greater than 2^24 and the inode number is greater than 2^40?  The
likelihood is low, but not 0...

Some random thoughts:

- the new inode numbers are created by merging the original inode number
  (in the lower bits) and the object-id of the subvolume (in the higher
  bits).  We could add a warning when these bits overlap:

      if (fls(stat->ino) >= ffs(stat->ino_uniquifier))
          printk("NFSD: Warning possible inode collision...");

  Smarter heuristics can be developed, like doing the check against the
  maximum value of inode and the maximum value of the subvolume once at
  mount time....

- for the inode number it is an expensive operation (even though it
  exists/existed for the 32-bit processor), but we could reuse the
  object-id after it is freed

- I think that we could add an option to nfsd or btrfs (not a default
  behaviour) to avoid crossing the subvolume boundary

> And 'private' subvolumes
> could again appear on a public list if they were accessed via NFS.

I (wrongly) never considered a similar scenario.  However I think that
these could be anonymized using an alias (the name of the path to mount
is passed by nfsd, so it could create an alias that will be recognized
by nfsd when the client requires it... complex but doable...)

>
> Thanks,
> NeilBrown
>
On Thu, 19 Aug 2021, Wang Yugui wrote:
> Hi,
>
> We use 'swab64' to combinate 'subvol id' and 'inode' into 64bit in this
> patch.
>
> case1:
> 'subvol id': 16bit => 64K, a little small because the subvol id is
> always increase?
> 'inode': 48bit * 4K per node, this is big enough.
>
> case2:
> 'subvol id': 24bit => 16M, this is big enough.
> 'inode': 40bit * 4K per node => 4 PB. this is a little small?

I don't know what point you are trying to make with the above.

>
> Is there a way to 'bit-swap' the subvol id, rather the current byte-swap?

Sure:

    for (i = 0; i < 64; i++) {
        new = (new << 1) | (old & 1);
        old >>= 1;
    }

but would it gain anything significant?

Remember what the goal is.  Most apps don't care at all about duplicate
inode numbers - only a few do, and they only care about a few inodes.
The only bug I actually have a report of is caused by a directory having
the same inode as an ancestor.  i.e. in lots of cases, duplicate inode
numbers won't be noticed.

The behaviour of btrfs over NFS RELIABLY causes exactly this behaviour
of a directory having the same inode number as an ancestor.  The root of
a subtree will *always* do this.  If we JUST changed the inode numbers
of the roots of subtrees, then most observed problems would go away.  It
would change from "trivial to reproduce" to "rarely happens".  The patch
I actually propose makes it much more unlikely than that.  Even if
duplicate inode numbers do happen, the chance of them being noticed is
infinitesimal.  Given that, there is no point in minor tweaks unless
they can make duplicate inode numbers IMPOSSIBLE.

>
> If not, maybe it is a better balance if we combinate 22bit subvol id and
> 42 bit inode?

This would be better except when it is worse.  We cannot know which will
happen more often.

As long as BTRFS allows object-ids and root-ids combined to use more
than 64 bits there can be no perfect solution.  There are many possible
solutions that will be close to perfect in practice.  swab64() is the
simplest that I could think of.  Picking any arbitrary cut-off (22/42,
24/40, ...) is unlikely to be better, and could in some circumstances be
worse.

My preference would be for btrfs to start re-using old object-ids and
root-ids, and to enforce a limit (set at mkfs or tunefs) so that the
total number of bits does not exceed 64.  Unfortunately the maintainers
seem reluctant to even consider this.

NeilBrown
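For anyone wanting to compare the two transforms concretely, here is a
small stand-alone sketch (illustrative only, not from any patch) that
prints the byte-swapped and bit-reversed forms of a few subvolume ids;
both move the id's low-bit entropy into the high bits, away from where
inode numbers grow:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* bit-reverse, i.e. the loop sketched above */
    static uint64_t bitrev64(uint64_t old)
    {
        uint64_t new = 0;
        int i;

        for (i = 0; i < 64; i++) {
            new = (new << 1) | (old & 1);
            old >>= 1;
        }
        return new;
    }

    int main(void)
    {
        uint64_t ids[] = { 5, 256, 257, 1000 };
        unsigned int i;

        for (i = 0; i < sizeof(ids) / sizeof(ids[0]); i++)
            printf("subvol %4" PRIu64 ": swab64 %016" PRIx64 "  bitrev %016" PRIx64 "\n",
                   ids[i], (uint64_t)__builtin_bswap64(ids[i]), bitrev64(ids[i]));
        return 0;
    }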
On Thu, Aug 19, 2021 at 07:46:22AM +1000, NeilBrown wrote: > On Thu, 19 Aug 2021, Wang Yugui wrote: > > Hi, > > > > We use 'swab64' to combinate 'subvol id' and 'inode' into 64bit in this > > patch. > > > > case1: > > 'subvol id': 16bit => 64K, a little small because the subvol id is > > always increase? > > 'inode': 48bit * 4K per node, this is big enough. > > > > case2: > > 'subvol id': 24bit => 16M, this is big enough. > > 'inode': 40bit * 4K per node => 4 PB. this is a little small? > > I don't know what point you are trying to make with the above. > > > > > Is there a way to 'bit-swap' the subvol id, rather the current byte-swap? > > Sure: > for (i=0; i<64; i++) { > new = (new << 1) | (old & 1) > old >>= 1; > } > > but would it gain anything significant? > > Remember what the goal is. Most apps don't care at all about duplicate > inode numbers - only a few do, and they only care about a few inodes. > The only bug I actually have a report of is caused by a directory having > the same inode as an ancestor. i.e. in lots of cases, duplicate inode > numbers won't be noticed. rsync -H and cpio's hardlink detection can be badly confused. They will think distinct files with the same inode number are hardlinks. This could be bad if you were making backups (though if you're making backups over NFS, you are probably doing something that could be done better in a different way). > The behaviour of btrfs over NFS RELIABLY causes exactly this behaviour > of a directory having the same inode number as an ancestor. The root of > a subtree will *always* do this. If we JUST changed the inode numbers > of the roots of subtrees, then most observed problems would go away. It > would change from "trivial to reproduce" to "rarely happens". The patch > I actually propose makes it much more unlikely than that. Even if > duplicate inode numbers do happen, the chance of them being noticed is > infinitesimal. Given that, there is no point in minor tweaks unless > they can make duplicate inode numbers IMPOSSIBLE. That's a good argument. I have a different one with the same conclusion. 40 bit inodes would take about 20 years to collide with 24-bit subvols--if you are creating an average of 1742 inodes every second. Also at the same time you have to be creating a subvol every 37 seconds to occupy the colliding 25th bit of the subvol ID. Only the highest inode number in any subvol counts--if your inode creation is spread out over several different subvols, you'll need to make inodes even faster. For reference, my high scores are 17 inodes per second and a subvol every 595 seconds (averaged over 1 year). Burst numbers are much higher, but one has to spend some time _reading_ the files now and then. I've encountered other btrfs users with two orders of magnitude higher inode creation rates than mine. They are barely squeaking under the 20-year line--or they would be, if they were creating snapshots 50 times faster than they do today. Use cases that have the highest inode creation rates (like /tmp) tend to get more specialized storage solutions (like tmpfs). Cloud fleets do have higher average inode creation rates, but their filesystems have much shorter lifespans than 20 years, so the delta on both sides of the ratio cancels out. If this hack is only used for NFS, it gives us some time to come up with a better solution. (On the other hand, we had 14 years already, and here we are...) > > If not, maybe it is a better balance if we combinate 22bit subvol id and > > 42 bit inode? 
> > This would be better except when it is worse. We cannot know which will > happen more often. > > As long as BTRFS allows object-ids and root-ids combined to use more > than 64 bits there can be no perfect solution. There are many possible > solutions that will be close to perfect in practice. swab64() is the > simplest that I could think of. Picking any arbitrary cut-off (22/42, > 24/40, ...) is unlikely to be better, and could is some circumstances be > worse. > > My preference would be for btrfs to start re-using old object-ids and > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the > total number of bits does not exceed 64. Unfortunately the maintainers > seem reluctant to even consider this. It was considered, implemented in 2011, and removed in 2020. Rationale is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start deprecation of mount option inode_cache". It made file creation slower, and consumed disk space, iops, and memory to run. Nobody used it. Newer on-disk data structure versions (free space tree, 2015) didn't bother implementing inode_cache's storage requirement. > NeilBrown
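A quick back-of-the-envelope check of the figures quoted above, assuming
365.25-day years and nothing else:

    #include <stdio.h>

    int main(void)
    {
        double secs_20y = 20 * 365.25 * 86400;   /* ~6.31e8 seconds      */
        double inodes   = (double)(1ULL << 40);  /* 40-bit inode space   */
        double subvols  = (double)(1ULL << 24);  /* 24-bit subvol space  */

        printf("inodes/sec to fill 2^40 in 20 years:     %.0f\n", inodes / secs_20y);
        printf("seconds/subvol to fill 2^24 in 20 years: %.1f\n", secs_20y / subvols);
        return 0;
    }

This prints roughly 1742 inodes per second and one subvolume every 37.6
seconds, matching the rates in the message above.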
On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote: > > On Mon, 16 Aug 2021, Roman Mamedov wrote: > > > > I wondered a bit myself, what are the downsides of just doing the > > uniquefication inside Btrfs, not leaving that to NFSD? > > > > I mean not even adding the extra stat field, just return the inode itself with > > that already applied. Surely cannot be any worse collision-wise, than > > different subvolumes straight up having the same inode numbers as right now? > > > > Or is it a performance concern, always doing more work, for something which > > only NFSD has needed so far. > > Any change in behaviour will have unexpected consequences. I think the > btrfs maintainers perspective is they they don't want to change > behaviour if they don't have to (which is reasonable) and that currently > they don't have to (which probably means that users aren't complaining > loudly enough). > > NFS export of BTRFS is already demonstrably broken and users are > complaining loudly enough that I can hear them .... though I think it > has been broken like this for 10 years, do I wonder that I didn't hear > them before. > > If something is perceived as broken, then a behaviour change that > appears to fix it is more easily accepted. > > However, having said that I now see that my latest patch is not ideal. > It changes the inode numbers associated with filehandles of objects in > the non-root subvolume. This will cause the Linux NFS client to treat > the object as 'stale' For most objects this is a transient annoyance. > Reopen the file or restart the process and all should be well again. > However if the inode number of the mount point changes, you will need to > unmount and remount. That is more somewhat more of an annoyance. > > There are a few ways to handle this more gracefully. > > 1/ We could get btrfs to hand out new filehandles as well as new inode > numbers, but still accept the old filehandles. Then we could make the > inode number reported be based on the filehandle. This would be nearly > seamless but rather clumsy to code. I'm not *very* keen on this idea, > but it is worth keeping in mind. > So objects would change their inode number after nfs inode cache is evicted and while nfs filesystem is mounted. That does not sound ideal. But I am a bit confused about the problem. If the export is of the btrfs root, then nfs client cannot access any subvolumes (right?) - that was the bug report, so the value of inode numbers in non-root subvolumes is not an issue. If export is of non-root subvolume, then why bother changing anything at all? Is there a need to traverse into sub-sub-volumes? > 2/ We could add a btrfs mount option to control whether the uniquifier > was set or not. This would allow the sysadmin to choose when to manage > any breakage. I think this is my preference, but Josef has declared an > aversion to mount options. > > 3/ We could add a module parameter to nfsd to control whether the > uniquifier is merged in. This again gives the sysadmin control, and it > can be done despite any aversion from btrfs maintainers. But I'd need > to overcome any aversion from the nfsd maintainers, and I don't know how > strong that would be yet. (A new export option isn't really appropriate. > It is much more work to add an export option than the add a mount option). > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol" export option would have been a nice way to describe and opt-in to this new functionality. 
But let's consider for a moment the consequences of enabling this
functionality automatically whenever exporting a btrfs root volume
without "crossmnt":

1. Objects inside a subvol that are inaccessible(?) with current nfs/nfsd
   without "crossmnt" will become accessible after enabling the feature -
   this will match the user experience of accessing btrfs on the host

2. The inode numbers of the newly accessible objects would not match the
   inode numbers on the host fs (no big deal?)

3. The inode numbers of objects in a snapshot would not match the inode
   numbers of the original (pre-snapshot) objects (acceptable tradeoff
   for being able to access the snapshot objects without bloating
   /proc/mounts?)

4. The inode numbers of objects in a subvol observed via this
   "cross-subvol" export would not match the inode numbers of the same
   objects observed via an individual subvol export

5. st_ino conflicts are possible when multiplexing subvol id and inode
   number.  overlayfs resolved those conflicts by allocating an inode
   number from a reserved non-persistent inode range, which may cause
   objects to change their inode number during the lifetime of the
   filesystem (sensible tradeoff?)

I think that #4 is a bit hard to swallow and #3 is borderline
acceptable...  Both are quite hard to document and to set expectations
for as a non-opt-in change of behavior when exporting btrfs root.

IMO, an nfsd module parameter will give some control and therefore is a
must, but it won't make life easier to document and set user expectations
when the semantics are not clearly stated in the exports table.

You claim that "A new export option isn't really appropriate." but your
only argument is that "It is much more work to add an export option than
to add a mount option".

With all due respect, for this particular challenge with all the
constraints involved, this sounds like a pretty weak argument.  Surely,
adding an export option is easier than slowly changing all userspace
tools to understand subvolumes? a solution that you had previously
brought up.

Can you elaborate some more about your aversion to a new export option.

Thanks,
Amir.
On Thu, 19 Aug 2021, Zygo Blaxell wrote: > On Thu, Aug 19, 2021 at 07:46:22AM +1000, NeilBrown wrote: > > > > Remember what the goal is. Most apps don't care at all about duplicate > > inode numbers - only a few do, and they only care about a few inodes. > > The only bug I actually have a report of is caused by a directory having > > the same inode as an ancestor. i.e. in lots of cases, duplicate inode > > numbers won't be noticed. > > rsync -H and cpio's hardlink detection can be badly confused. They will > think distinct files with the same inode number are hardlinks. This could > be bad if you were making backups (though if you're making backups over > NFS, you are probably doing something that could be done better in a > different way). Yes, they could get confused. inode numbers remain unique within a "subvolume" so you would need to do at backup of multiple subtrees to hit a problem. Certainly possible, but probably less common. > > 40 bit inodes would take about 20 years to collide with 24-bit subvols--if > you are creating an average of 1742 inodes every second. Also at the > same time you have to be creating a subvol every 37 seconds to occupy > the colliding 25th bit of the subvol ID. Only the highest inode number > in any subvol counts--if your inode creation is spread out over several > different subvols, you'll need to make inodes even faster. > > For reference, my high scores are 17 inodes per second and a subvol > every 595 seconds (averaged over 1 year). Burst numbers are much higher, > but one has to spend some time _reading_ the files now and then. > > I've encountered other btrfs users with two orders of magnitude higher > inode creation rates than mine. They are barely squeaking under the > 20-year line--or they would be, if they were creating snapshots 50 times > faster than they do today. I do like seeing concrete numbers, thanks. How many of these inodes and subvols remain undeleted? Supposing inode numbers were reused, how many bits might you need? > > My preference would be for btrfs to start re-using old object-ids and > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the > > total number of bits does not exceed 64. Unfortunately the maintainers > > seem reluctant to even consider this. > > It was considered, implemented in 2011, and removed in 2020. Rationale > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start > deprecation of mount option inode_cache". It made file creation slower, > and consumed disk space, iops, and memory to run. Nobody used it. > Newer on-disk data structure versions (free space tree, 2015) didn't > bother implementing inode_cache's storage requirement. Yes, I saw that. Providing reliable functional certainly can impact performance and consume disk-space. That isn't an excuse for not doing it. I suspect that carefully tuned code could result in typical creation times being unchanged, and mean creation times suffering only a tiny cost. Using "max+1" when the creation rate is particularly high might be a reasonable part of managing costs. Storage cost need not be worse than the cost of tracking free blocks on the device. "Nobody used it" is odd. It implies it would have to be explicitly enabled, and all it would provide anyone is sane behaviour. Who would imagine that to be an optional extra. NeilBrown
On Thu, 19 Aug 2021, Amir Goldstein wrote: > On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote: > > > > There are a few ways to handle this more gracefully. > > > > 1/ We could get btrfs to hand out new filehandles as well as new inode > > numbers, but still accept the old filehandles. Then we could make the > > inode number reported be based on the filehandle. This would be nearly > > seamless but rather clumsy to code. I'm not *very* keen on this idea, > > but it is worth keeping in mind. > > > > So objects would change their inode number after nfs inode cache is > evicted and while nfs filesystem is mounted. That does not sound ideal. No. Almost all filehandle lookups happen in the context of some other filehandle. If the provided context is an old-style filehandle, we provide an old-style filehandle for the lookup. There is already code in nfsd to support this (as we have in the past changed how filesystems are identified). It would only be if the mountpoint filehandle (which is fetched without that context) went out of cache that inode numbers would change. That would mean that the filesystem (possibly an automount) was unmounted. When it was remounted it could have a different device number anyway, so having different inode numbers would be of little consequence. > > But I am a bit confused about the problem. > If the export is of the btrfs root, then nfs client cannot access any > subvolumes (right?) - that was the bug report, so the value of inode > numbers in non-root subvolumes is not an issue. Not correct. All objects in the filesystem are fully accessible. The only problem is that some pairs of objects have the same inode number. This causes some programs like 'find' and 'du' to behave differently to expectations. They will refuse to even look in a subvolume, because it looks like doing so could cause an infinite loop. The values of inode numbers in non-root subvolumes is EXACTLY the issue. > If export is of non-root subvolume, then why bother changing anything > at all? Is there a need to traverse into sub-sub-volumes? > > > 2/ We could add a btrfs mount option to control whether the uniquifier > > was set or not. This would allow the sysadmin to choose when to manage > > any breakage. I think this is my preference, but Josef has declared an > > aversion to mount options. > > > > 3/ We could add a module parameter to nfsd to control whether the > > uniquifier is merged in. This again gives the sysadmin control, and it > > can be done despite any aversion from btrfs maintainers. But I'd need > > to overcome any aversion from the nfsd maintainers, and I don't know how > > strong that would be yet. (A new export option isn't really appropriate. > > It is much more work to add an export option than the add a mount option). > > > > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol" > export option would have been a nice way to describe and opt-in to this new > functionality. > > But let's consider for a moment the consequences of enabling this functionality > automatically whenever exporting a btrfs root volume without "crossmnt": > > 1. Objects inside a subvol that are inaccessible(?) with current > nfs/nfsd without > "crossmnt" will become accessible after enabling the feature - > this will match > the user experience of accessing btrfs on the host Not correct - as above. > 2. The inode numbers of the newly accessible objects would not match the inode > numbers on the host fs (no big deal?) Unlikely to be a problem. 
Inode numbers have no meaning beyond the facts that: - they are stable for the lifetime of the object - they are unique within a filesystem (except btrfs lies about filesystems) - they are not zero The facts only need to be equally true on the NFS server and client.. > 3. The inode numbers of objects in a snapshot would not match the inode > numbers of the original (pre-snapshot) objects (acceptable tradeoff for > being able to access the snapshot objects without bloating /proc/mounts?) This also should not be a problem. Files in different snapshots are different things that happen to share storage (like reflinks). Comparing inode numbers between places which report different st_dev does not fit within the meaning of inode numbers. > 4. The inode numbers of objects in a subvol observed via this "cross-subvol" > export would not match the inode numbers of the same objects observed > via an individual subvol export The device number would differ too, so the relative values of the inode numbers would be irrelevant. > 5. st_ino conflicts are possible when multiplexing subvol id and inode number. > overlayfs resolved those conflicts by allocating an inode number from a > reserved non-persistent inode range, which may cause objects to change > their inode number during the lifetime on the filesystem (sensible > tradeoff?) > > I think that #4 is a bit hard to swallow and #3 is borderline acceptable... > Both and quite hard to document and to set expectations as a non-opt-in > change of behavior when exporting btrfs root. > > IMO, an nfsd module parameter will give some control and therefore is > a must, but it won't make life easier to document and set user expectations > when the semantics are not clearly stated in the exports table. > > You claim that "A new export option isn't really appropriate." > but your only argument is that "It is much more work to add > an export option than the add a mount option". > > With all due respect, for this particular challenge with all the > constraints involved, this sounds like a pretty weak argument. > > Surely, adding an export option is easier than slowly changing all > userspace tools to understand subvolumes? a solution that you had > previously brought up. > > Can you elaborate some more about your aversion to a new > export option. Export options are bits in a 32bit word - so both user-space and kernel need to agree on names for them. We are currently using 18, so there is room to grow. It is a perfectly reasonable way to implement sensible features. It is, I think, a poor way to implement hacks to work around misfeatures in filesystems. This is the core of my dislike for adding an export option. Using one effectively admits that what btrfs is doing is a valid thing to do. I don't think it is. I don't think we want any other filesystem developer to think that they can emulate the behaviour because support is already provided. If we add any configuration to support btrfs, I would much prefer it to be implemented in fs/btrfs, and if not, then with loud warnings that it works around a deficiency in btrfs. /sys/modules/nfsd/parameters/btrfs_export_workaround Thanks, NeilBrown
On Fri, Aug 20, 2021 at 6:22 AM NeilBrown <neilb@suse.de> wrote: > > On Thu, 19 Aug 2021, Amir Goldstein wrote: > > On Mon, Aug 16, 2021 at 1:21 AM NeilBrown <neilb@suse.de> wrote: > > > > > > There are a few ways to handle this more gracefully. > > > > > > 1/ We could get btrfs to hand out new filehandles as well as new inode > > > numbers, but still accept the old filehandles. Then we could make the > > > inode number reported be based on the filehandle. This would be nearly > > > seamless but rather clumsy to code. I'm not *very* keen on this idea, > > > but it is worth keeping in mind. > > > > > > > So objects would change their inode number after nfs inode cache is > > evicted and while nfs filesystem is mounted. That does not sound ideal. > > No. Almost all filehandle lookups happen in the context of some other > filehandle. If the provided context is an old-style filehandle, we > provide an old-style filehandle for the lookup. There is already code > in nfsd to support this (as we have in the past changed how filesystems > are identified). > I see. This is nice! It almost sounds like "inode mapped mounts" :-) > It would only be if the mountpoint filehandle (which is fetched without > that context) went out of cache that inode numbers would change. That > would mean that the filesystem (possibly an automount) was unmounted. > When it was remounted it could have a different device number anyway, so > having different inode numbers would be of little consequence. > > > > > But I am a bit confused about the problem. > > If the export is of the btrfs root, then nfs client cannot access any > > subvolumes (right?) - that was the bug report, so the value of inode > > numbers in non-root subvolumes is not an issue. > > Not correct. All objects in the filesystem are fully accessible. The > only problem is that some pairs of objects have the same inode number. > This causes some programs like 'find' and 'du' to behave differently to > expectations. They will refuse to even look in a subvolume, because it > looks like doing so could cause an infinite loop. The values of inode > numbers in non-root subvolumes is EXACTLY the issue. > Thanks for clarifying. > > If export is of non-root subvolume, then why bother changing anything > > at all? Is there a need to traverse into sub-sub-volumes? > > > > > 2/ We could add a btrfs mount option to control whether the uniquifier > > > was set or not. This would allow the sysadmin to choose when to manage > > > any breakage. I think this is my preference, but Josef has declared an > > > aversion to mount options. > > > > > > 3/ We could add a module parameter to nfsd to control whether the > > > uniquifier is merged in. This again gives the sysadmin control, and it > > > can be done despite any aversion from btrfs maintainers. But I'd need > > > to overcome any aversion from the nfsd maintainers, and I don't know how > > > strong that would be yet. (A new export option isn't really appropriate. > > > It is much more work to add an export option than the add a mount option). > > > > > > > That is too bad, because IMO from users POV, "fsid=btrfsroot" or "cross-subvol" > > export option would have been a nice way to describe and opt-in to this new > > functionality. > > > > But let's consider for a moment the consequences of enabling this functionality > > automatically whenever exporting a btrfs root volume without "crossmnt": > > > > 1. Objects inside a subvol that are inaccessible(?) 
with current > > nfs/nfsd without > > "crossmnt" will become accessible after enabling the feature - > > this will match > > the user experience of accessing btrfs on the host > > Not correct - as above. > > > 2. The inode numbers of the newly accessible objects would not match the inode > > numbers on the host fs (no big deal?) > > Unlikely to be a problem. Inode numbers have no meaning beyond the facts > that: > - they are stable for the lifetime of the object > - they are unique within a filesystem (except btrfs lies about > filesystems) > - they are not zero > > The facts only need to be equally true on the NFS server and client.. > > > 3. The inode numbers of objects in a snapshot would not match the inode > > numbers of the original (pre-snapshot) objects (acceptable tradeoff for > > being able to access the snapshot objects without bloating /proc/mounts?) > > This also should not be a problem. Files in different snapshots are > different things that happen to share storage (like reflinks). > Comparing inode numbers between places which report different st_dev > does not fit within the meaning of inode numbers. > > > 4. The inode numbers of objects in a subvol observed via this "cross-subvol" > > export would not match the inode numbers of the same objects observed > > via an individual subvol export > > The device number would differ too, so the relative values of the inode > numbers would be irrelevant. > > > 5. st_ino conflicts are possible when multiplexing subvol id and inode number. > > overlayfs resolved those conflicts by allocating an inode number from a > > reserved non-persistent inode range, which may cause objects to change > > their inode number during the lifetime on the filesystem (sensible > > tradeoff?) > > > > I think that #4 is a bit hard to swallow and #3 is borderline acceptable... > > Both and quite hard to document and to set expectations as a non-opt-in > > change of behavior when exporting btrfs root. > > > > IMO, an nfsd module parameter will give some control and therefore is > > a must, but it won't make life easier to document and set user expectations > > when the semantics are not clearly stated in the exports table. > > > > You claim that "A new export option isn't really appropriate." > > but your only argument is that "It is much more work to add > > an export option than the add a mount option". > > > > With all due respect, for this particular challenge with all the > > constraints involved, this sounds like a pretty weak argument. > > > > Surely, adding an export option is easier than slowly changing all > > userspace tools to understand subvolumes? a solution that you had > > previously brought up. > > > > Can you elaborate some more about your aversion to a new > > export option. > > Export options are bits in a 32bit word - so both user-space and kernel > need to agree on names for them. We are currently using 18, so there is > room to grow. It is a perfectly reasonable way to implement sensible > features. It is, I think, a poor way to implement hacks to work around > misfeatures in filesystems. > > This is the core of my dislike for adding an export option. Using one > effectively admits that what btrfs is doing is a valid thing to do. I > don't think it is. I don't think we want any other filesystem developer > to think that they can emulate the behaviour because support is already > provided. 
> > If we add any configuration to support btrfs, I would much prefer it to > be implemented in fs/btrfs, and if not, then with loud warnings that it > works around a deficiency in btrfs. > /sys/modules/nfsd/parameters/btrfs_export_workaround > Thanks for clarifying. I now understand how "hacky" the workaround in nfsd is. Naive question: could this behavior be implemented in btrfs as you prefer, but in a way that only serves nfsd and NEW nfs mounts for that matter (i.e. only NEW filehandles)? Meaning passing some hint in ->getattr() and ->iterate_shared(), sort of like idmapped mount does for uid/gid. Thanks, Amir.
On Fri, Aug 20, 2021 at 12:54:17PM +1000, NeilBrown wrote: > On Thu, 19 Aug 2021, Zygo Blaxell wrote: > > 40 bit inodes would take about 20 years to collide with 24-bit subvols--if > > you are creating an average of 1742 inodes every second. Also at the > > same time you have to be creating a subvol every 37 seconds to occupy > > the colliding 25th bit of the subvol ID. Only the highest inode number > > in any subvol counts--if your inode creation is spread out over several > > different subvols, you'll need to make inodes even faster. > > > > For reference, my high scores are 17 inodes per second and a subvol > > every 595 seconds (averaged over 1 year). Burst numbers are much higher, > > but one has to spend some time _reading_ the files now and then. > > > > I've encountered other btrfs users with two orders of magnitude higher > > inode creation rates than mine. They are barely squeaking under the > > 20-year line--or they would be, if they were creating snapshots 50 times > > faster than they do today. > > I do like seeing concrete numbers, thanks. How many of these inodes and > subvols remain undeleted? Supposing inode numbers were reused, how many > bits might you need? Number of existing inodes is filesystem size divided by average inode size, about 30 million inodes per terabyte for build servers, give or take an order of magnitude per project. That does put 1 << 32 inodes in the range of current disk sizes, which motivated the inode_cache feature. Number of existing subvols stays below 1 << 14. It's usually some near-constant multiple of the filesystem age (if it is not limited more by capacity) because it's not trivial to move a subvol structure from one filesystem to another. The main constraint on the product of both numbers is filesystem size. If that limit is reached, we often see that lower subvol numbers correlate with higher inode numbers and vice versa; otherwise both keep growing until they hit the size limit or some user-chosen limit (e.g. "we just don't need more than the last 300 builds online at any time"). For build and backup use cases (which both heavily use snapshots) there is no incentive to delete snapshots other than to avoid eventually running out of space. There is also no incentive to increase filesystem size to accommodate extra snapshots, as long as there is room for some minimal useful number of snapshots, the original subvols, and some free space. So we get snapshots in numbers that are roughly: min(age_of_filesystem * snapshot_creation_rate, filesystem_capacity / average_subvol_unique_data_size) Subvol IDs are not reusable. They are embedded in shared object ownership metadata, and persist for some time after subvols are deleted. > > > My preference would be for btrfs to start re-using old object-ids and > > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the > > > total number of bits does not exceed 64. Unfortunately the maintainers > > > seem reluctant to even consider this. > > > > It was considered, implemented in 2011, and removed in 2020. Rationale > > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start > > deprecation of mount option inode_cache". It made file creation slower, > > and consumed disk space, iops, and memory to run. Nobody used it. > > Newer on-disk data structure versions (free space tree, 2015) didn't > > bother implementing inode_cache's storage requirement. > > Yes, I saw that. Providing reliable functional certainly can impact > performance and consume disk-space.
That isn't an excuse for not doing > it. > I suspect that carefully tuned code could result in typical creation > times being unchanged, and mean creation times suffering only a tiny > cost. Using "max+1" when the creation rate is particularly high might > be a reasonable part of managing costs. > Storage cost need not be worse than the cost of tracking free blocks > on the device. The cost of _tracking_ free object IDs is trivial compared to the cost of _reusing_ an object ID on btrfs. If btrfs doesn't reuse object numbers, btrfs can append new objects to the last partially filled leaf. If there are shared metadata pages (i.e. snapshots), btrfs unshares a handful of pages once, and then future writes use densely packed new pages and delayed allocation without having to read anything. If btrfs reuses object numbers, the filesystem has to pack new objects into random previously filled metadata leaf nodes, so there are a lot of read-modify-writes scattered over old metadata pages, which spreads the working set around and reduces cache usage efficiency (i.e. uses more RAM). If there are snapshots, each shared page that is modified for the first time after the snapshot comes with two-orders-of-magnitude worst-case write multipliers. The two-algorithm scheme (switching from "reuse freed inode" to "max+1" under load) would be forced into the "max+1" mode half the time by a daily workload of alternating git checkouts and builds. It would save only one bit of inode namespace over the lifetime of the filesystem. > "Nobody used it" is odd. It implies it would have to be explicitly > enabled, and all it would provide anyone is sane behaviour. Who would > imagine that to be an optional extra. It always had to be explicitly enabled. It was initially a workaround for 32-bit ino_t that was limiting a few users, but ino_t got better and the need for inode_cache went away. NFS (particularly NFSv2) might be the use case inode_cache has been waiting for. btrfs has an i_version field for NFSv4, so it's not like there's no precedent for adding features in btrfs to support NFS. On the other hand, the cost of ino_cache gets worse with snapshots, and the benefit in practice takes years to decades to become relevant. Users who are exporting snapshots over NFS are likely to be especially averse to using inode_cache. > NeilBrown > >
Hi, > rsync -H and cpio's hardlink detection can be badly confused. They will > think distinct files with the same inode number are hardlinks. This could > be bad if you were making backups (though if you're making backups over > NFS, you are probably doing something that could be done better in a > different way). Will 'rsync -x' and 'find -mount/-xdev' fail to work in a snapper config? snapper is a very important use case. There is not yet an option like '-mount/-xdev' for '/bin/cp', but one may come soon. I thought the first patchset (crossmnt in the nfs client) was the right way, because in most cases a subvol is a filesystem, not a directory. Best Regards Wang Yugui (wangyugui@e16-tech.com) 2021/08/23
On Mon, 23 Aug 2021, Zygo Blaxell wrote: > > Subvol IDs are not reusable. They are embedded in shared object ownership > metadata, and persist for some time after subvols are deleted. Hmmm... that's interesting. Makes some sense too. I did wonder how ownership across multiple snapshots was tracked. > > > > > My preference would be for btrfs to start re-using old object-ids and > > > > root-ids, and to enforce a limit (set at mkfs or tunefs) so that the > > > > total number of bits does not exceed 64. Unfortunately the maintainers > > > > seem reluctant to even consider this. > > > > > > It was considered, implemented in 2011, and removed in 2020. Rationale > > > is in commit b547a88ea5776a8092f7f122ddc20d6720528782 "btrfs: start > > > deprecation of mount option inode_cache". It made file creation slower, > > > and consumed disk space, iops, and memory to run. Nobody used it. > > > Newer on-disk data structure versions (free space tree, 2015) didn't > > > bother implementing inode_cache's storage requirement. > > > > Yes, I saw that. Providing reliable functional certainly can impact > > performance and consume disk-space. That isn't an excuse for not doing > > it. > > I suspect that carefully tuned code could result in typical creation > > times being unchanged, and mean creation times suffering only a tiny > > cost. Using "max+1" when the creation rate is particularly high might > > be a reasonable part of managing costs. > > Storage cost need not be worse than the cost of tracking free blocks > > on the device. > > The cost of _tracking_ free object IDs is trivial compared to the cost > of _reusing_ an object ID on btrfs. I hadn't thought of that. > > If btrfs doesn't reuse object numbers, btrfs can append new objects > to the last partially filled leaf. If there are shared metadata pages > (i.e. snapshots), btrfs unshares a handful of pages once, and then future > writes use densely packed new pages and delayed allocation without having > to read anything. > > If btrfs reuses object numbers, the filesystem has to pack new objects > into random previously filled metadata leaf nodes, so there are a lot > of read-modify-writes scattered over old metadata pages, which spreads > the working set around and reduces cache usage efficiency (i.e. uses > more RAM). If there are snapshots, each shared page that is modified > for the first time after the snapshot comes with two-orders-of-magnitude > worst-case write multipliers. I don't really follow that .... but I'll take your word for it for now. > > The two-algorithm scheme (switching from "reuse freed inode" to "max+1" > under load) would be forced into the "max+1" mode half the time by a > daily workload of alternating git checkouts and builds. It would save > only one bit of inode namespace over the lifetime of the filesystem. > > > "Nobody used it" is odd. It implies it would have to be explicitly > > enabled, and all it would provide anyone is sane behaviour. Who would > > imagine that to be an optional extra. > > It always had to be explicitly enabled. It was initially a workaround > for 32-bit ino_t that was limiting a few users, but ino_t got better > and the need for inode_cache went away. > > NFS (particularly NFSv2) might be the use case inode_cache has been > waiting for. btrfs has an i_version field for NFSv4, so it's not like > there's no precedent for adding features in btrfs to support NFS. NFSv2 is not worth any effort. NFSv4 is. NFSv3 ... some, but not a lot. 
> > On the other hand, the cost of ino_cache gets worse with snapshots, > and the benefit in practice takes years to decades to become relevant. > Users who are exporting snapshots over NFS are likely to be especially > averse to using inode_cache. That's the real killer. Everything will work fine for years until it doesn't. And once it doesn't .... what do you do? Thanks a lot for all this background info. I've found it to be very helpful for my general understanding. Thanks, NeilBrown
On Mon, 23 Aug 2021, Zygo Blaxell wrote: ... > > Subvol IDs are not reusable. They are embedded in shared object ownership > metadata, and persist for some time after subvols are deleted. ... > > The cost of _tracking_ free object IDs is trivial compared to the cost > of _reusing_ an object ID on btrfs. One possible approach to these two objections is to decouple inode numbers from object ids. The inode number becomes just another piece of metadata stored in the inode. struct btrfs_inode_item has four spare u64s, so we could use one of those. struct btrfs_dir_item would need to store the inode number too. What is location.offset used for? Would a diritem ever point to a non-zero offset? Could the 'offset' be used to store the inode number? This could even be added to existing filesystems I think. It might not be easy to re-use inode numbers smaller than the largest at the time the extension was added, but newly added inode numbers could be reused after they were deleted. Just a thought... NeilBrown
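To make the shape of that decoupling idea a little more concrete, here is a purely illustrative sketch. The structures and names below (demo_inode_item, demo_dir_item, demo_stat_ino) are hypothetical and deliberately do not reproduce the real btrfs on-disk format; they only show the split between a never-reused tree key and a reusable, user-visible inode number.

#include <stdint.h>

/*
 * Hypothetical structures, not the real btrfs format.  The tree stays
 * keyed by an append-only objectid, while the user-visible inode number
 * is ordinary metadata that could come from a reusable pool.
 */
struct demo_inode_item {
	uint64_t objectid;   /* tree key: monotonic, never reused       */
	uint64_t ino;        /* number reported by stat()/NFS: reusable */
	/* ... remaining inode metadata ... */
};

struct demo_dir_item {
	uint64_t child_objectid;  /* still used for tree lookups        */
	uint64_t child_ino;       /* lets readdir fill d_ino cheaply    */
	char     name[255];
};

/* getattr-style code would report ->ino; filehandles, backrefs and all
 * tree lookups would keep using ->objectid. */
static inline uint64_t demo_stat_ino(const struct demo_inode_item *item)
{
	return item->ino;
}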
On Tue, Aug 24, 2021 at 09:22:05AM +1000, NeilBrown wrote: > On Mon, 23 Aug 2021, Zygo Blaxell wrote: > ... > > > > Subvol IDs are not reusable. They are embedded in shared object ownership > > metadata, and persist for some time after subvols are deleted. > ... > > > > The cost of _tracking_ free object IDs is trivial compared to the cost > > of _reusing_ an object ID on btrfs. > > One possible approach to these two objections is to decouple inode > numbers from object ids. This would be reasonable for subvol IDs (I thought of it earlier in this thread, but didn't mention it because I wasn't going to be the first to open that worm can ;). There aren't very many subvol IDs and they're not used as frequently as inodes, so a lookup table to remap them to smaller numbers to save st_ino bit-space wouldn't be unreasonably expensive. If we stop right here and use the [some_zeros:reversed_subvol:inode] bit-packing scheme you proposed for NFS, that seems like a reasonable plan. It would have 48 bits of usable inode number space, ~440000 file creates per second for 20 years with up to 65535 snapshots, the same number of bits that ZFS has in its inodes. Once that subvol ID mapping tree exists, it could also map subvol inode numbers to globally unique numbers. Each tree item would contain a map of [subvol_inode1..subvol_inode2] that maps the inode numbers in the subvol into the global inode number space at [global_inode1..global_inode2]. When a snapshot is created, the snapshot gets a copy of all the origin subvol's inode ranges, but with newly allocated base offsets. If the original subvol needs new inodes, it gets a new chunk from the global inode allocator. If the snapshot subvol needs new inodes, it gets a different new chunk from the global allocator. The minimum chunk might be a million inodes or so to avoid having to allocate new chunks all the time, but not so high to make the code completely untested (or testers just set the minchunk to 1000 inodes). The question I have (and why I didn't propose this earlier) is whether this scheme is any real improvement over dividing the subvol:inode space by bit packing. If you have one subvol that has 3 billion existing inodes in it, every snapshot of that subvol is going to burn up roughly 2^-32 of the available globally unique inode numbers. If we burn 3 billion inodes instead of 4 billion per subvol, it only gets 25% more lifespan for the filesystem, and the allocation of unique inode spaces and tracking inode space usage will add cost to every single file creation and snapshot operation. If your oldest/biggest subvol only has a million inodes in it, all of the above is pure cost: you can create billions of snapshots, never repeat any object IDs, and never worry about running out. I'd want to see cost/benefit simulations of: this plan, the simpler but less efficient bit-packing plan, 'cp -a --reflink' to a new subvol and start over every 20 years when inodes run out, and online garbage-collection/renumbering schemes that allow users to schedule the inode renumbering costs in overnight batches instead of on every inode create. > The inode number becomes just another piece of metadata stored in the > inode. > struct btrfs_inode_item has four spare u64s, so we could use one of > those. > struct btrfs_dir_item would need to store the inode number too. What > is location.offset used for? Would a diritem ever point to a non-zero > offset? Could the 'offset' be used to store the inode number? 
Offset is used to identify subvol roots at the moment, but so far that means only values 0 and UINT64_MAX are used. It seems possible to treat all other values as inode numbers. Don't quote me on that--I'm not an expert on this structure. > This could even be added to existing filesystems I think. It might not > be easy to re-use inode numbers smaller than the largest at the time the > extension was added, but newly added inode numbers could be reused after > they were deleted. We'd need a structure to track reusable inode numbers and it would have to be kept up to date to work, so this feature would necessarily come with an incompat bit. Whether you borrow bits from existing structures or make extended new structures doesn't matter at that point, though obviously for something as common as inodes it would be bad to make them bigger. Some of the btrfs userspace API uses inode numbers, but unless I missed something, it could all be converted to use object numbers directly instead. > Just a thought... > > NeilBrown
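For illustration only, the [reversed_subvol:inode] packing discussed earlier in this message might look roughly like the sketch below. The 16-bit remapped-subvol / 48-bit inode split is just the example split from this message, not a settled format; reverse_subvol_bits() and pack_fileid() are made-up helper names, and the bit-reversal simply stands in for the swab64() byte-swap used by the patch to keep low subvol numbers in the top bits.

#include <stdint.h>

#define SUBVOL_BITS 16   /* example: at most 65535 live subvols/snapshots */
#define INO_BITS    48   /* example: ~2.8e14 inode numbers per filesystem */

/* Reverse the low SUBVOL_BITS bits so that small remapped subvol numbers
 * land in the highest bits of the packed result. */
static uint64_t reverse_subvol_bits(uint16_t v)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < SUBVOL_BITS; i++)
		if (v & (1u << i))
			r |= 1ull << (SUBVOL_BITS - 1 - i);
	return r;
}

/* Unique as long as no single subvol ever uses more than 2^48 inode
 * numbers and the remapping table never hands out more than 2^16
 * subvol numbers. */
static uint64_t pack_fileid(uint16_t mapped_subvol, uint64_t ino)
{
	return (reverse_subvol_bits(mapped_subvol) << INO_BITS) |
	       (ino & ((1ull << INO_BITS) - 1));
}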
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0117d867ecf8..989fdf2032d5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, generic_fillattr(&init_user_ns, inode, stat); stat->dev = BTRFS_I(inode)->root->anon_dev; + if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID) + stat->ino_uniquifier = + swab64(BTRFS_I(inode)->root->root_key.objectid); + spin_lock(&BTRFS_I(inode)->lock); delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; inode_bytes = inode_get_bytes(inode); diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c index 0a5ebc52e6a9..669e2437362a 100644 --- a/fs/nfsd/nfs3xdr.c +++ b/fs/nfsd/nfs3xdr.c @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, { struct user_namespace *userns = nfsd_user_namespace(rqstp); __be32 *p; + u64 ino; u64 fsid; p = xdr_reserve_space(xdr, XDR_UNIT * 21); @@ -377,7 +378,10 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, p = xdr_encode_hyper(p, fsid); /* fileid */ - p = xdr_encode_hyper(p, stat->ino); + ino = stat->ino; + if (stat->ino_uniquifier && stat->ino_uniquifier != ino) + ino ^= stat->ino_uniquifier; + p = xdr_encode_hyper(p, ino); p = encode_nfstime3(p, &stat->atime); p = encode_nfstime3(p, &stat->mtime); @@ -1151,6 +1155,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name, if (xdr_stream_encode_item_present(xdr) < 0) return false; /* fileid */ + if (!resp->dir_have_uniquifier) { + struct kstat stat; + if (fh_getattr(&resp->fh, &stat) == nfs_ok) + resp->dir_ino_uniquifier = stat.ino_uniquifier; + else + resp->dir_ino_uniquifier = 0; + resp->dir_have_uniquifier = 1; + } + if (resp->dir_ino_uniquifier && + resp->dir_ino_uniquifier != ino) + ino ^= resp->dir_ino_uniquifier; if (xdr_stream_encode_u64(xdr, ino) < 0) return false; /* name */ diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 7abeccb975b2..ddccf849c29c 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -3114,10 +3114,14 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, fhp->fh_handle.fh_size); } if (bmval0 & FATTR4_WORD0_FILEID) { + u64 ino = stat.ino; + if (stat.ino_uniquifier && + stat.ino_uniquifier != stat.ino) + ino ^= stat.ino_uniquifier; p = xdr_reserve_space(xdr, 8); if (!p) goto out_resource; - p = xdr_encode_hyper(p, stat.ino); + p = xdr_encode_hyper(p, ino); } if (bmval0 & FATTR4_WORD0_FILES_AVAIL) { p = xdr_reserve_space(xdr, 8); @@ -3285,6 +3289,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, if (err) goto out_nfserr; ino = parent_stat.ino; + if (parent_stat.ino_uniquifier && + parent_stat.ino_uniquifier != ino) + ino ^= parent_stat.ino_uniquifier; } p = xdr_encode_hyper(p, ino); } diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h index 933008382bbe..b4f9f3c71f72 100644 --- a/fs/nfsd/xdr3.h +++ b/fs/nfsd/xdr3.h @@ -179,6 +179,8 @@ struct nfsd3_readdirres { struct xdr_buf dirlist; struct svc_fh scratch; struct readdir_cd common; + u64 dir_ino_uniquifier; + int dir_have_uniquifier; unsigned int cookie_offset; struct svc_rqst * rqstp; diff --git a/include/linux/stat.h b/include/linux/stat.h index fff27e603814..a5188f42ed81 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -46,6 +46,23 @@ struct kstat { struct timespec64 btime; /* File creation time */ u64 blocks; u64 mnt_id; + /* + * BTRFS does not provide unique inode numbers within a filesystem, + * depending on a synthetic 'dev' to provide uniqueness. 
+ * NFSd cannot make use of this 'dev' number so clients often see + * duplicate inode numbers. + * For BTRFS, 'ino' is unlikely to use the high bits. It puts + * another number in ino_uniquifier which: + * - has most entropy in the high bits + * - is different precisely when 'dev' is different + * - is stable across unmount/remount + * NFSd can xor this with 'ino' to get a substantially more unique + * number for reporting to the client. + * The ino_uniquifier for a directory can reasonably be applied + * to inode numbers reported by the readdir filldir callback. + * It is NOT currently exported to user-space. + */ + u64 ino_uniquifier; }; #endif
[[This patch is a minimal patch which addresses the current problems with nfsd and btrfs, in a way which I think is most supportable, least surprising, and least likely to impact any future attempts to more completely fix the btrfs file-identity problem]] BTRFS does not provide unique inode numbers across a filesystem. It *does* provide unique inode numbers within a subvolume and uses synthetic device numbers for different subvolumes to ensure uniqueness for device+inode. nfsd cannot use these varying device numbers. If nfsd were to synthesise different stable filesystem ids to give to the client, that would cause subvolumes to appear in the mount table on the client, even though they don't appear in the mount table on the server. Also, NFSv3 doesn't support changing the filesystem id without a new explicit mount on the client (this is partially supported in practice, but violates the protocol specification). So currently, the roots of all subvolumes report the same inode number in the same filesystem to NFS clients and tools like 'find' notice that a directory has the same identity as an ancestor, and so refuse to enter that directory. This patch allows btrfs (or any filesystem) to provide a 64bit number that can be xored with the inode number to make the number more unique. Rather than the client being certain to see duplicates, with this patch it is possible but extremely rare. The number that btrfs provides is a swab64() version of the subvolume identifier. This has most entropy in the high bits (the low bits of the subvolume identifier), while the inode has most entropy in the low bits. The result will always be unique within a subvolume, and will almost always be unique across the filesystem. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/btrfs/inode.c | 4 ++++ fs/nfsd/nfs3xdr.c | 17 ++++++++++++++++- fs/nfsd/nfs4xdr.c | 9 ++++++++- fs/nfsd/xdr3.h | 2 ++ include/linux/stat.h | 17 +++++++++++++++++ 5 files changed, 47 insertions(+), 2 deletions(-)
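For illustration, the number the patch ends up reporting can be approximated in a small userspace sketch. This is only a demonstration of the xor scheme described above: the kernel's swab64() is approximated with the compiler byte-swap builtin, the subvolume id and inode number are made-up example values, and the top-level subvolume (which the patch leaves untouched) is ignored.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t subvol_id = 257;    /* made-up subvolume (root) objectid */
	uint64_t ino       = 12345;  /* made-up inode number              */

	/* btrfs side: the uniquifier is the byte-swapped subvolume id,
	 * so its entropy ends up in the high bits. */
	uint64_t uniquifier = __builtin_bswap64(subvol_id);

	/* nfsd side: xor the uniquifier into the reported fileid unless
	 * it is zero or would cancel the inode number entirely. */
	uint64_t fileid = ino;
	if (uniquifier && uniquifier != ino)
		fileid ^= uniquifier;

	printf("subvol %llu ino %llu -> fileid %#llx\n",
	       (unsigned long long)subvol_id,
	       (unsigned long long)ino,
	       (unsigned long long)fileid);
	return 0;
}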