Message ID: 162742539595.32498.13687924366155737575.stgit@noble.brown
Series: expose btrfs subvols in mount table correctly
On Wed, Jul 28, 2021 at 08:37:45AM +1000, NeilBrown wrote:
> We can enhance nfsd to understand that some automounts can be managed.
> "internal mounts", where a filesystem provides an automount point and
> mounts its own directories, can be handled differently by nfsd.
>
> This series addresses all these issues. After a few enhancements to the
> VFS to provide needed support, they enhance exportfs and nfsd to cope
> with the concept of internal mounts, and then enhance btrfs to provide
> them.

I'm half asleep right now; will review and reply in detail tomorrow.
The first impression is that it's going to be brittle; the properties of
a mount really should not depend upon the nature of the mountpoint - too
many ways for that to go wrong.

Anyway, tomorrow...
Hi,

Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in
this patch series?

I tried to backport it to 5.10.x, but it failed to work.
There is no big modification in this 5.10.x backport, and all modified
patches are attached.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28

> There are long-standing problems with btrfs subvols, particularly in
> relation to whether and how they are exposed in the mount table.
>
> - /proc/self/mountinfo reports the major:minor device number for each
>   filesystem, and when a btrfs subvol is explicitly mounted, the number
>   reported is wrong - it does not match what stat() reports for the
>   mountpoint.
>
> - when subvols are not explicitly mounted, they don't appear in
>   mountinfo at all.
>
> Consequences include that a tool which uses stat() to find the dev of
> the filesystem, then searches mountinfo for that filesystem, will not
> find it.
>
> Some tools (e.g. findmnt) appear to have been enhanced to cope with
> this strangeness, but it would be best to make btrfs behave more
> normally.
>
> - nfsd cannot currently see the transition to a subvol, so it reports
>   the main volume and all subvols to the client as being in the same
>   filesystem. As inode numbers are not unique across all subvols, this
>   can confuse clients. In particular, 'find' is likely to report a
>   loop.
>
> subvols can be made to appear in mountinfo using automounts. However,
> nfsd does not cope well with automounts. It assumes all filesystems to
> be exported are already mounted. So adding automounts to btrfs would
> break nfsd.
>
> We can enhance nfsd to understand that some automounts can be managed.
> "internal mounts", where a filesystem provides an automount point and
> mounts its own directories, can be handled differently by nfsd.
>
> This series addresses all these issues. After a few enhancements to the
> VFS to provide needed support, they enhance exportfs and nfsd to cope
> with the concept of internal mounts, and then enhance btrfs to provide
> them.
>
> The NFSv3 support is incomplete. I'm not sure we can make it work
> "perfectly". A normal nfsv3 mount seems to work well enough, but if
> mounted with '-o noac', it loses track of the mounted-on inode number
> and complains about inode numbers changing.
>
> My basic test for these is to mount a btrfs filesystem which contains
> subvols, nfs-export it and mount it with nfsv3 and nfsv4, then run
> 'find' in each of the filesystems and check the contents of
> /proc/self/mountinfo.
>
> The first patch simply fixes the dev number in mountinfo and could
> possibly be tagged for -stable.
>
> NeilBrown
>
> ---
>
> NeilBrown (11):
>       VFS: show correct dev num in mountinfo
>       VFS: allow d_automount to create in-place bind-mount.
>       VFS: pass lookup_flags into follow_down()
>       VFS: export lookup_mnt()
>       VFS: new function: mount_is_internal()
>       nfsd: include a vfsmount in struct svc_fh
>       exportfs: Allow filehandle lookup to cross internal mount points.
>       nfsd: change get_parent_attributes() to nfsd_get_mounted_on()
>       nfsd: Allow filehandle lookup to cross internal mount points.
>       btrfs: introduce mapping function from location to inum
>       btrfs: use automount to bind-mount all subvol roots.
>
>
>  fs/btrfs/btrfs_inode.h   |  12 +++
>  fs/btrfs/inode.c         | 111 ++++++++++++++++++++++++++-
>  fs/btrfs/super.c         |   1 +
>  fs/exportfs/expfs.c      | 100 ++++++++++++++++++++----
>  fs/fhandle.c             |   2 +-
>  fs/internal.h            |   1 -
>  fs/namei.c               |   6 +-
>  fs/namespace.c           |  32 +++++++-
>  fs/nfsd/export.c         |   4 +-
>  fs/nfsd/nfs3xdr.c        |  40 +++++++---
>  fs/nfsd/nfs4proc.c       |   9 ++-
>  fs/nfsd/nfs4xdr.c        | 106 ++++++++++++-------------
>  fs/nfsd/nfsfh.c          |  44 +++++++----
>  fs/nfsd/nfsfh.h          |   3 +-
>  fs/nfsd/nfsproc.c        |   5 +-
>  fs/nfsd/vfs.c            | 162 +++++++++++++++++++++++----------------
>  fs/nfsd/vfs.h            |  12 +--
>  fs/nfsd/xdr4.h           |   2 +-
>  fs/overlayfs/namei.c     |   5 +-
>  fs/xfs/xfs_ioctl.c       |  12 ++-
>  include/linux/exportfs.h |   4 +-
>  include/linux/mount.h    |   4 +
>  include/linux/namei.h    |   2 +-
>  23 files changed, 490 insertions(+), 189 deletions(-)
>
> --
> Signature
Hi,

This patchset works well in 5.14-rc3.

1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
...).

2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
not used.
/dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
/dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
/dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2

This is a visual feature change for btrfs users.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28

> Hi,
>
> Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in
> this patch series?
>
> I tried to backport it to 5.10.x, but it failed to work.
> There is no big modification in this 5.10.x backport, and all modified
> patches are attached.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/07/28
>
> > There are long-standing problems with btrfs subvols, particularly in
> > relation to whether and how they are exposed in the mount table.
> >
> > - /proc/self/mountinfo reports the major:minor device number for each
> >   filesystem, and when a btrfs subvol is explicitly mounted, the number
> >   reported is wrong - it does not match what stat() reports for the
> >   mountpoint.
> >
> > - when subvols are not explicitly mounted, they don't appear in
> >   mountinfo at all.
> >
> > Consequences include that a tool which uses stat() to find the dev of
> > the filesystem, then searches mountinfo for that filesystem, will not
> > find it.
> >
> > Some tools (e.g. findmnt) appear to have been enhanced to cope with
> > this strangeness, but it would be best to make btrfs behave more
> > normally.
> >
> > - nfsd cannot currently see the transition to a subvol, so it reports
> >   the main volume and all subvols to the client as being in the same
> >   filesystem. As inode numbers are not unique across all subvols, this
> >   can confuse clients. In particular, 'find' is likely to report a
> >   loop.
> >
> > subvols can be made to appear in mountinfo using automounts. However,
> > nfsd does not cope well with automounts. It assumes all filesystems to
> > be exported are already mounted. So adding automounts to btrfs would
> > break nfsd.
> >
> > We can enhance nfsd to understand that some automounts can be managed.
> > "internal mounts", where a filesystem provides an automount point and
> > mounts its own directories, can be handled differently by nfsd.
> >
> > This series addresses all these issues. After a few enhancements to the
> > VFS to provide needed support, they enhance exportfs and nfsd to cope
> > with the concept of internal mounts, and then enhance btrfs to provide
> > them.
> >
> > The NFSv3 support is incomplete. I'm not sure we can make it work
> > "perfectly". A normal nfsv3 mount seems to work well enough, but if
> > mounted with '-o noac', it loses track of the mounted-on inode number
> > and complains about inode numbers changing.
> >
> > My basic test for these is to mount a btrfs filesystem which contains
> > subvols, nfs-export it and mount it with nfsv3 and nfsv4, then run
> > 'find' in each of the filesystems and check the contents of
> > /proc/self/mountinfo.
> >
> > The first patch simply fixes the dev number in mountinfo and could
> > possibly be tagged for -stable.
> >
> > NeilBrown
> >
> > ---
> >
> > NeilBrown (11):
> >       VFS: show correct dev num in mountinfo
> >       VFS: allow d_automount to create in-place bind-mount.
> >       VFS: pass lookup_flags into follow_down()
> >       VFS: export lookup_mnt()
> >       VFS: new function: mount_is_internal()
> >       nfsd: include a vfsmount in struct svc_fh
> >       exportfs: Allow filehandle lookup to cross internal mount points.
> >       nfsd: change get_parent_attributes() to nfsd_get_mounted_on()
> >       nfsd: Allow filehandle lookup to cross internal mount points.
> >       btrfs: introduce mapping function from location to inum
> >       btrfs: use automount to bind-mount all subvol roots.
> >
> >
> >  fs/btrfs/btrfs_inode.h   |  12 +++
> >  fs/btrfs/inode.c         | 111 ++++++++++++++++++++++++++-
> >  fs/btrfs/super.c         |   1 +
> >  fs/exportfs/expfs.c      | 100 ++++++++++++++++++++----
> >  fs/fhandle.c             |   2 +-
> >  fs/internal.h            |   1 -
> >  fs/namei.c               |   6 +-
> >  fs/namespace.c           |  32 +++++++-
> >  fs/nfsd/export.c         |   4 +-
> >  fs/nfsd/nfs3xdr.c        |  40 +++++++---
> >  fs/nfsd/nfs4proc.c       |   9 ++-
> >  fs/nfsd/nfs4xdr.c        | 106 ++++++++++++-------------
> >  fs/nfsd/nfsfh.c          |  44 +++++++----
> >  fs/nfsd/nfsfh.h          |   3 +-
> >  fs/nfsd/nfsproc.c        |   5 +-
> >  fs/nfsd/vfs.c            | 162 +++++++++++++++++++++++----------------
> >  fs/nfsd/vfs.h            |  12 +--
> >  fs/nfsd/xdr4.h           |   2 +-
> >  fs/overlayfs/namei.c     |   5 +-
> >  fs/xfs/xfs_ioctl.c       |  12 ++-
> >  include/linux/exportfs.h |   4 +-
> >  include/linux/mount.h    |   4 +
> >  include/linux/namei.h    |   2 +-
> >  23 files changed, 490 insertions(+), 189 deletions(-)
> >
> > --
> > Signature
>
On Wed, 28 Jul 2021, Wang Yugui wrote:
> Hi,
>
> This patchset works well in 5.14-rc3.

Thanks for testing.

> 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> ...).

The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
be permanent.
The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
I think).
This is a bit less of a hack. It is an easily available number that is
fairly unique.

> 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> not used.
> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
>
> This is a visual feature change for btrfs users.

Hopefully it is an improvement. But it is certainly a change that needs
to be carefully considered.

Thanks,
NeilBrown
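The arithmetic here is easy to check against the numbers Wang reported;
a minimal sketch follows, where the helper name is ours (not from the
series) and a 64-bit inode space is assumed:

    /* Dummy "mounted-on" inode number for a subvol root, per the scheme
     * described above: counted down from the top of the inode space. */
    static inline u64 btrfs_dummy_mounted_on_ino(u64 subvol_id)
    {
            return ULONG_MAX - subvol_id; /* 257 -> 18446744073709551358 */
    }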
On Wed, 28 Jul 2021, Wang Yugui wrote:
> Hi,
>
> Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in
> this patch series?

No.

> I tried to backport it to 5.10.x, but it failed to work.
> There is no big modification in this 5.10.x backport, and all modified
> patches are attached.

I'm not surprised, but I doubt there would be a major barrier to making
it work. I'm unlikely to try until I have more positive reviews.

Thanks,
NeilBrown
Hi,

> > I tried to backport it to 5.10.x, but it failed to work.
> > There is no big modification in this 5.10.x backport, and all modified
> > patches are attached.
>
> I'm not surprised, but I doubt there would be a major barrier to making
> it work. I'm unlikely to try until I have more positive reviews.

With two patches as dependencies,

d045465fc6cbfa4acfb5a7d817a7c1a57a078109
  0001-exportfs-Add-a-function-to-return-the-raw-output-fro.patch
2e19d10c1438241de32467637a2a411971547991
  0002-nfsd-Fix-up-nfsd-to-ensure-that-timeout-errors-don-t.patch

the 5.10.x backport becomes much simpler, and it works well now.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28
On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 28 Jul 2021, Wang Yugui wrote:
> > Hi,
> >
> > This patchset works well in 5.14-rc3.
>
> Thanks for testing.
>
> > 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> > to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> > ...).
>
> The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> be permanent.
> The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> I think).
> This is a bit less of a hack. It is an easily available number that is
> fairly unique.
>
> > 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> > not used.
> > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> >
> > This is a visual feature change for btrfs users.
>
> Hopefully it is an improvement. But it is certainly a change that needs
> to be carefully considered.

I think this is behavior people generally expect, but I wonder what
the consequences of this would be with huge numbers of subvolumes. If
there are hundreds or thousands of them (which is quite possible on
SUSE systems, for example, with its auto-snapshotting regime), this
would be a mess, wouldn't it?

Or can we add a way to mark these things to not show up there, or is
there some kind of behavioral change we can make to snapper or other
tools to make them not show up here?
On 28/07/2021 08:01, NeilBrown wrote:
> On Wed, 28 Jul 2021, Wang Yugui wrote:
>> Hi,
>>
>> This patchset works well in 5.14-rc3.
>
> Thanks for testing.
>
>> 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
>> to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
>> ...).
>
> The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> be permanent.
> The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> I think).
> This is a bit less of a hack. It is an easily available number that is
> fairly unique.
>
>> 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
>> not used.
>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
>>
>> This is a visual feature change for btrfs users.
>
> Hopefully it is an improvement. But it is certainly a change that needs
> to be carefully considered.

Would this change the behaviour of findmnt? I have several scripts that
depend on findmnt to select btrfs filesystems. To take a couple of
examples (using the example shown above): my scripts would depend on
'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
subvolume; and another script would depend on 'findmnt -t btrfs
--mountpoint /mnt/test/sub1' providing no output, as the directory is
not an /etc/fstab mount point for a btrfs filesystem.

Maybe findmnt isn't affected? Or maybe the change is worth making
anyway? But it needs to be carefully considered if it breaks existing
user interfaces.
On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> I think this is behavior people generally expect, but I wonder what
> the consequences of this would be with huge numbers of subvolumes. If
> there are hundreds or thousands of them (which is quite possible on
> SUSE systems, for example, with its auto-snapshotting regime), this
> would be a mess, wouldn't it?

I'm surprised that btrfs is special here. Doesn't anyone have thousands
of lvm snapshots? Or is it that they do, but they're not normally
mounted?

--b.
I'm still stuck trying to understand why subvolumes can't get their own
superblocks:

 - Why are the performance issues Josef raises insurmountable? And why
   are they unique to btrfs? (Surely there are other cases where people
   need hundreds or thousands of superblocks?)

 - If filehandle decoding can return a different vfs mount than it's
   passed, why can't it return a different superblock?

--b.

On Wed, Jul 28, 2021 at 08:37:45AM +1000, NeilBrown wrote:
> There are long-standing problems with btrfs subvols, particularly in
> relation to whether and how they are exposed in the mount table.
>
> - /proc/self/mountinfo reports the major:minor device number for each
>   filesystem, and when a btrfs subvol is explicitly mounted, the number
>   reported is wrong - it does not match what stat() reports for the
>   mountpoint.
>
> - when subvols are not explicitly mounted, they don't appear in
>   mountinfo at all.
>
> Consequences include that a tool which uses stat() to find the dev of
> the filesystem, then searches mountinfo for that filesystem, will not
> find it.
>
> Some tools (e.g. findmnt) appear to have been enhanced to cope with
> this strangeness, but it would be best to make btrfs behave more
> normally.
>
> - nfsd cannot currently see the transition to a subvol, so it reports
>   the main volume and all subvols to the client as being in the same
>   filesystem. As inode numbers are not unique across all subvols, this
>   can confuse clients. In particular, 'find' is likely to report a
>   loop.
>
> subvols can be made to appear in mountinfo using automounts. However,
> nfsd does not cope well with automounts. It assumes all filesystems to
> be exported are already mounted. So adding automounts to btrfs would
> break nfsd.
>
> We can enhance nfsd to understand that some automounts can be managed.
> "internal mounts", where a filesystem provides an automount point and
> mounts its own directories, can be handled differently by nfsd.
>
> This series addresses all these issues. After a few enhancements to the
> VFS to provide needed support, they enhance exportfs and nfsd to cope
> with the concept of internal mounts, and then enhance btrfs to provide
> them.
>
> The NFSv3 support is incomplete. I'm not sure we can make it work
> "perfectly". A normal nfsv3 mount seems to work well enough, but if
> mounted with '-o noac', it loses track of the mounted-on inode number
> and complains about inode numbers changing.
>
> My basic test for these is to mount a btrfs filesystem which contains
> subvols, nfs-export it and mount it with nfsv3 and nfsv4, then run
> 'find' in each of the filesystems and check the contents of
> /proc/self/mountinfo.
>
> The first patch simply fixes the dev number in mountinfo and could
> possibly be tagged for -stable.
>
> NeilBrown
>
> ---
>
> NeilBrown (11):
>       VFS: show correct dev num in mountinfo
>       VFS: allow d_automount to create in-place bind-mount.
>       VFS: pass lookup_flags into follow_down()
>       VFS: export lookup_mnt()
>       VFS: new function: mount_is_internal()
>       nfsd: include a vfsmount in struct svc_fh
>       exportfs: Allow filehandle lookup to cross internal mount points.
>       nfsd: change get_parent_attributes() to nfsd_get_mounted_on()
>       nfsd: Allow filehandle lookup to cross internal mount points.
>       btrfs: introduce mapping function from location to inum
>       btrfs: use automount to bind-mount all subvol roots.
>
>
>  fs/btrfs/btrfs_inode.h   |  12 +++
>  fs/btrfs/inode.c         | 111 ++++++++++++++++++++++++++-
>  fs/btrfs/super.c         |   1 +
>  fs/exportfs/expfs.c      | 100 ++++++++++++++++++++----
>  fs/fhandle.c             |   2 +-
>  fs/internal.h            |   1 -
>  fs/namei.c               |   6 +-
>  fs/namespace.c           |  32 +++++++-
>  fs/nfsd/export.c         |   4 +-
>  fs/nfsd/nfs3xdr.c        |  40 +++++++---
>  fs/nfsd/nfs4proc.c       |   9 ++-
>  fs/nfsd/nfs4xdr.c        | 106 ++++++++++++-------------
>  fs/nfsd/nfsfh.c          |  44 +++++++----
>  fs/nfsd/nfsfh.h          |   3 +-
>  fs/nfsd/nfsproc.c        |   5 +-
>  fs/nfsd/vfs.c            | 162 +++++++++++++++++++++++----------------
>  fs/nfsd/vfs.h            |  12 +--
>  fs/nfsd/xdr4.h           |   2 +-
>  fs/overlayfs/namei.c     |   5 +-
>  fs/xfs/xfs_ioctl.c       |  12 ++-
>  include/linux/exportfs.h |   4 +-
>  include/linux/mount.h    |   4 +
>  include/linux/namei.h    |   2 +-
>  23 files changed, 490 insertions(+), 189 deletions(-)
>
> --
> Signature
On 7/28/21 3:35 PM, J. Bruce Fields wrote:
> I'm still stuck trying to understand why subvolumes can't get their own
> superblocks:
>
> - Why are the performance issues Josef raises insurmountable? And why
>   are they unique to btrfs? (Surely there are other cases where people
>   need hundreds or thousands of superblocks?)

I don't think anybody has that many file systems. For btrfs it's a
single file system. Think of syncfs: it's going to walk through all of
the super blocks on the system calling ->sync_fs on each subvol
superblock. Now this isn't a huge deal; we could just have some flag
that says "I'm not real", or even just have anonymous superblocks that
don't get added to the global super_blocks list, and that would address
my main pain points.

The second part is inode reclaim. Again, this particular problem could
be avoided if we had an anonymous superblock that wasn't actually used,
but the inode LRU is per superblock. Now with reclaim, instead of
walking all the inodes, you're walking a bunch of super blocks and then
walking the list of inodes within those super blocks. You're burning
CPU cycles because now, instead of getting big chunks of inodes to
dispose, it's spread out across many super blocks.

The other weird thing is the way we apply pressure to shrinker systems.
We essentially say "try to evict X objects from your list", which means
in this case with lots of subvolumes we'd be evicting waaaaay more
inodes than before, likely impacting performance where you have
workloads that have lots of files open across many subvolumes (which is
what FB does with its containers).

If we want an anonymous superblock per subvolume then the only way it'll
work is if it's not actually tied into anything, and we still use the
primary super block for the whole file system. And if that's what we're
going to do, what's the point of the super block exactly? This approach
that Neil's come up with seems like a reasonable solution to me.
Christoph gets his separation and /proc/self/mountinfo, and we avoid the
scalability headache of a billion super blocks. Thanks,

Josef
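To make the syncfs concern concrete, here is a simplified sketch of the
superblock walk Josef is describing, modelled loosely on iterate_supers()
in fs/super.c; locking and reference counting are elided, so read it as an
illustration of the scaling, not as real kernel code:

    /* With one superblock per subvol, this loop -- and the per-sb inode
     * LRU and shrinker behind it -- scales with the number of subvols
     * rather than the number of mounted filesystems. */
    static void sync_all_supers(struct list_head *super_blocks)
    {
            struct super_block *sb;

            list_for_each_entry(sb, super_blocks, s_list)
                    if (sb->s_op->sync_fs)
                            sb->s_op->sync_fs(sb, 1); /* 1 == wait */
    }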
On Wed, 28 Jul 2021, Neal Gompa wrote:
> On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
> > On Wed, 28 Jul 2021, Wang Yugui wrote:
> > > Hi,
> > >
> > > This patchset works well in 5.14-rc3.
> >
> > Thanks for testing.
> >
> > > 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> > > to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> > > ...).
> >
> > The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> > be permanent.
> > The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> > I think).
> > This is a bit less of a hack. It is an easily available number that is
> > fairly unique.
> >
> > > 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> > > not used.
> > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> > >
> > > This is a visual feature change for btrfs users.
> >
> > Hopefully it is an improvement. But it is certainly a change that needs
> > to be carefully considered.
>
> I think this is behavior people generally expect, but I wonder what
> the consequences of this would be with huge numbers of subvolumes. If
> there are hundreds or thousands of them (which is quite possible on
> SUSE systems, for example, with its auto-snapshotting regime), this
> would be a mess, wouldn't it?

Would there be hundreds or thousands of subvols concurrently being
accessed? The auto-mounted subvols only appear in the mount table while
they are being accessed, and for about 15 minutes after the last access.
I suspect that most subvols are "backup" snapshots which are not being
accessed, and so would not appear.

> Or can we add a way to mark these things to not show up there, or is
> there some kind of behavioral change we can make to snapper or other
> tools to make them not show up here?

Certainly it might make sense to flag these in some way so that tools
can choose to ignore them or handle them specially, just as nfsd needs
to handle them specially. I was considering a "local" mount flag.

NeilBrown

> --
> 真実はいつも一つ!/ Always, there's only one truth!
On Wed, Jul 28, 2021 at 03:14:31PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
>
> I'm surprised that btrfs is special here. Doesn't anyone have thousands
> of lvm snapshots? Or is it that they do, but they're not normally
> mounted?

Unprivileged users can't create lvm snapshots as easily or quickly as
using mkdir (well, ok, mkdir and fssync). lvm doesn't scale very well
past more than a few dozen snapshots of the same original volume, and
performance degrades linearly in the number of snapshots if the original
LV is modified. btrfs is the opposite: users can create and delete as
many snapshots as they like, at a cost more expensive than mkdir but
less expensive than 'cp -a', and users only pay IO costs for writes to
the subvols they modify.

So some btrfs users use snapshots in places where more traditional tools
like 'cp -a' or 'git checkout' are used on other filesystems. E.g. a
build system might make a snapshot of a git working tree containing a
checked-out and built baseline revision, and then it might do a loop
where it makes a snapshot, applies one patch from an integration branch
in the snapshot directory, and incrementally builds there. The next
revision makes a snapshot of its parent revision's subvol and builds the
next patch. If there are merges in the integration branch, then the
builder can go back to parent revisions, create a new snapshot, apply
the patch, and build in a snapshot on both sides of the merge. After
testing picks a winner, the builder can simply delete all the snapshots
except the one for the version that won testing (there is no requirement
to commit the snapshot to the origin LV as in lvm; either can be
destroyed without requiring action to preserve the other). A sketch of
this loop appears after this mail.

You can do a similar thing with overlayfs, but it runs into problems
with all the mount points. In btrfs, the mount points are persistent
because they're built into the filesystem. With overlayfs, you have to
save and restore them so they persist across reboots (unless that
feature has been added since I last looked).

I'm looking at a few machines here, and if all the subvols were visible
to 'df', its output would be somewhere around 3-5 MB. That's too
much--we'd have to hack up df to not show the same btrfs twice...as well
as every monitoring tool that reports free space...which sounds similar
to the problems we're trying to avoid.

Ideally there would be a way to turn this on or off. It is creating a
set of new problems that is the complement of the set we're trying to
fix in this change.

> --b.
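The snapshot-per-revision build loop described above can be made concrete
in the same transcript style used elsewhere in this thread; the directory
and patch names here are invented for illustration:

    user$ btrfs subvolume snapshot baseline rev1
    user$ cd rev1 && git am ../queue/0001-first.patch && make -j && cd ..
    user$ btrfs subvolume snapshot rev1 rev2
    user$ cd rev2 && git am ../queue/0002-second.patch && make -j && cd ..
    ... test, pick a winner, then drop the losers ...
    user$ btrfs subvolume delete rev1

Either snapshot can be deleted independently; unlike lvm, there is no
origin volume that must be preserved or merged into.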
On Wed, 28 Jul 2021, g.btrfs@cobb.uk.net wrote:
> On 28/07/2021 08:01, NeilBrown wrote:
> > On Wed, 28 Jul 2021, Wang Yugui wrote:
> >> Hi,
> >>
> >> This patchset works well in 5.14-rc3.
> >
> > Thanks for testing.
> >
> >> 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> >> to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> >> ...).
> >
> > The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> > be permanent.
> > The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> > I think).
> > This is a bit less of a hack. It is an easily available number that is
> > fairly unique.
> >
> >> 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> >> not used.
> >> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> >> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> >> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> >>
> >> This is a visual feature change for btrfs users.
> >
> > Hopefully it is an improvement. But it is certainly a change that needs
> > to be carefully considered.
>
> Would this change the behaviour of findmnt? I have several scripts that
> depend on findmnt to select btrfs filesystems. To take a couple of
> examples (using the example shown above): my scripts would depend on
> 'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
> subvolume; and another script would depend on 'findmnt -t btrfs
> --mountpoint /mnt/test/sub1' providing no output, as the directory is
> not an /etc/fstab mount point for a btrfs filesystem.

Yes, I think it does change the behaviour of findmnt.
If the sub1 automount has not been triggered,
   findmnt --target /mnt/test/sub1 -o target
will report "/mnt/test". After it has been triggered, it will report
"/mnt/test/sub1".

Similarly, "findmnt -t btrfs --mountpoint /mnt/test/sub1" will report
nothing if the automount hasn't been triggered, and will report full
details of /mnt/test/sub1 if it has.

> Maybe findmnt isn't affected? Or maybe the change is worth making
> anyway? But it needs to be carefully considered if it breaks existing
> user interfaces.

I hope the change is worth making anyway, but breaking findmnt would not
be a popular move. This is unfortunate.... btrfs is "broken" and
people/code have adjusted to that breakage, so that "fixing" it will be
traumatic.

The only way I can find to get findmnt to ignore the new entries in
/proc/self/mountinfo is to trigger a parse error, such as by replacing
the " - " with " -- ", but that causes a parse error message to be
generated and will likely break other tools.
(... or I could check if current->comm is "findmnt", and suppress the
extra entries, but that is even more horrible!!)

A possible option is to change findmnt to explicitly ignore the new
"internal" mounts (unless some option is given) and then delay the
kernel update until that can be rolled out.

Or we could make the new internal mounts invisible in /proc unless some
kernel setting is enabled. Then distros can enable it once all important
tools can cope, and they can easily test correctness without rebooting.

I wonder if correcting the device number reported for explicit subvol
mounts will break anything.... findmnt seems happy with it in my limited
testing. There seems to be a testsuite with util-linux. Maybe I should
try that.

Thanks,
NeilBrown
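For reference, the " - " Neil mentions is the separator between the
variable-length optional-fields list and the filesystem type in each
/proc/self/mountinfo record. An illustrative line in the documented
format (per proc(5), not output from this series):

    36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue

Parsers scan each record for the first " - " token to find where the
optional fields end, which is why doubling the dash (or otherwise
perturbing that separator) breaks every compliant parser, not just
findmnt.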
On Thu, 29 Jul 2021, Zygo Blaxell wrote:
>
> I'm looking at a few machines here, and if all the subvols were visible
> to 'df', its output would be somewhere around 3-5 MB. That's too
> much--we'd have to hack up df to not show the same btrfs twice...as
> well as every monitoring tool that reports free space...which sounds
> similar to the problems we're trying to avoid.

Thanks for providing hard data!! How many of these subvols are actively
used (have a file open) at the same time? Most? About half? Just a few??

Thanks,
NeilBrown
On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> On Wed, 28 Jul 2021, Neal Gompa wrote:
> > On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
> > > On Wed, 28 Jul 2021, Wang Yugui wrote:
> > > > Hi,
> > > >
> > > > This patchset works well in 5.14-rc3.
> > >
> > > Thanks for testing.
> > >
> > > > 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> > > > to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> > > > ...).
> > >
> > > The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> > > be permanent.
> > > The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> > > I think).
> > > This is a bit less of a hack. It is an easily available number that is
> > > fairly unique.
> > >
> > > > 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> > > > not used.
> > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> > > >
> > > > This is a visual feature change for btrfs users.
> > >
> > > Hopefully it is an improvement. But it is certainly a change that needs
> > > to be carefully considered.
> >
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
>
> Would there be hundreds or thousands of subvols concurrently being
> accessed? The auto-mounted subvols only appear in the mount table while
> they are being accessed, and for about 15 minutes after the last access.
> I suspect that most subvols are "backup" snapshots which are not being
> accessed, and so would not appear.

bees dedupes across subvols and polls every few minutes for new data to
dedupe. bees doesn't particularly care where the "src" in the dedupe
call comes from, so it will pick a subvol that has a reference to the
data at random (whichever one comes up first in backref search) for each
dedupe call. There is a cache of open fds on each subvol root so that it
can access files within that subvol using openat(). The cache quickly
populates fully, i.e. it holds an fd to every subvol on the filesystem.
The cache has a 15-minute timeout too, so bees would likely keep the
mount table fully populated at all times.

plocate also uses openat(), and it can also be active on many subvols
simultaneously, though it only runs once a day, and it's reasonable to
exclude all snapshots from plocate for performance reasons.

My bigger concern here is that users on btrfs can currently have private
subvols with secret names, e.g.

	user$ mkdir -m 700 private
	user$ btrfs sub create private/secret
	user$ cd private/secret
	user$ ...do stuff...

Would "secret" now be visible in the very public /proc/mounts every time
the user is doing stuff?

> > Or can we add a way to mark these things to not show up there, or is
> > there some kind of behavioral change we can make to snapper or other
> > tools to make them not show up here?
>
> Certainly it might make sense to flag these in some way so that tools
> can choose to ignore them or handle them specially, just as nfsd needs
> to handle them specially. I was considering a "local" mount flag.

I would definitely want an 'off' switch for this thing until the impact
is better understood.
> NeilBrown
>
> > --
> > 真実はいつも一つ!/ Always, there's only one truth!
On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> > On Wed, 28 Jul 2021, Neal Gompa wrote:
> > > On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
> > > > On Wed, 28 Jul 2021, Wang Yugui wrote:
> > > > > Hi,
> > > > >
> > > > > This patchset works well in 5.14-rc3.
> > > >
> > > > Thanks for testing.
> > > >
> > > > > 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> > > > > to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> > > > > ...).
> > > >
> > > > The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> > > > be permanent.
> > > > The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> > > > I think).
> > > > This is a bit less of a hack. It is an easily available number that is
> > > > fairly unique.
> > > >
> > > > > 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> > > > > not used.
> > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> > > > >
> > > > > This is a visual feature change for btrfs users.
> > > >
> > > > Hopefully it is an improvement. But it is certainly a change that needs
> > > > to be carefully considered.
> > >
> > > I think this is behavior people generally expect, but I wonder what
> > > the consequences of this would be with huge numbers of subvolumes. If
> > > there are hundreds or thousands of them (which is quite possible on
> > > SUSE systems, for example, with its auto-snapshotting regime), this
> > > would be a mess, wouldn't it?
> >
> > Would there be hundreds or thousands of subvols concurrently being
> > accessed? The auto-mounted subvols only appear in the mount table while
> > they are being accessed, and for about 15 minutes after the last access.
> > I suspect that most subvols are "backup" snapshots which are not being
> > accessed, and so would not appear.
>
> bees dedupes across subvols and polls every few minutes for new data to
> dedupe. bees doesn't particularly care where the "src" in the dedupe
> call comes from, so it will pick a subvol that has a reference to the
> data at random (whichever one comes up first in backref search) for each
> dedupe call. There is a cache of open fds on each subvol root so that it
> can access files within that subvol using openat(). The cache quickly
> populates fully, i.e. it holds an fd to every subvol on the filesystem.
> The cache has a 15-minute timeout too, so bees would likely keep the
> mount table fully populated at all times.

OK ... that is very interesting and potentially helpful - thanks.

Localizing these daemons in a separate namespace would stop them from
polluting the public namespace, but I don't know how easy that would be.

Do you know how bees opens these files? Does it use path-names from the
root, or some special btrfs ioctl, or ???
If path-names are not used, it might be possible to suppress the
automount.

> plocate also uses openat(), and it can also be active on many subvols
> simultaneously, though it only runs once a day, and it's reasonable to
> exclude all snapshots from plocate for performance reasons.
>
> My bigger concern here is that users on btrfs can currently have private
> subvols with secret names, e.g.
>
> 	user$ mkdir -m 700 private
> 	user$ btrfs sub create private/secret
> 	user$ cd private/secret
> 	user$ ...do stuff...
>
> Would "secret" now be visible in the very public /proc/mounts every time
> the user is doing stuff?

Yes, the secret would be publicly visible. Unless we hid it.

It is conceivable that the content of /proc/mounts could be limited to
mountpoints where the reading process has 'x' access to the mountpoint.
However, to be really safe we would want to require 'x' access to all
ancestors too, and possibly some 'r' access. That would get
prohibitively expensive.

We could go with "owned by root, or owned by user" maybe.

Thanks,
NeilBrown

> > > Or can we add a way to mark these things to not show up there, or is
> > > there some kind of behavioral change we can make to snapper or other
> > > tools to make them not show up here?
> >
> > Certainly it might make sense to flag these in some way so that tools
> > can choose to ignore them or handle them specially, just as nfsd needs
> > to handle them specially. I was considering a "local" mount flag.
>
> I would definitely want an 'off' switch for this thing until the impact
> is better understood.
>
> > NeilBrown
> >
> > > --
> > > 真実はいつも一つ!/ Always, there's only one truth!
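If the "owned by root, or owned by user" test were pursued, a minimal
kernel-side fragment might look like the sketch below. This is our
illustration of the idea Neil floats, not code from the series, and the
helper name is invented:

    /* Hypothetical filter for internal-mount entries in /proc/mounts:
     * show an entry only if the mountpoint's root inode is owned by
     * root or by the user reading the table. */
    static bool internal_mount_visible(const struct inode *mnt_root)
    {
            return uid_eq(mnt_root->i_uid, GLOBAL_ROOT_UID) ||
                   uid_eq(mnt_root->i_uid, current_fsuid());
    }

This avoids the prohibitively expensive walk of all ancestors that Neil
rules out, at the cost of being a much coarser check.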
On 29/07/2021 02:39, NeilBrown wrote:
> On Wed, 28 Jul 2021, g.btrfs@cobb.uk.net wrote:
>> On 28/07/2021 08:01, NeilBrown wrote:
>>> On Wed, 28 Jul 2021, Wang Yugui wrote:
>>>> Hi,
>>>>
>>>> This patchset works well in 5.14-rc3.
>>>
>>> Thanks for testing.
>>>
>>>> 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
>>>> to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
>>>> ...).
>>>
>>> The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
>>> be permanent.
>>> The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
>>> I think).
>>> This is a bit less of a hack. It is an easily available number that is
>>> fairly unique.
>>>
>>>> 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
>>>> not used.
>>>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
>>>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
>>>> /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
>>>>
>>>> This is a visual feature change for btrfs users.
>>>
>>> Hopefully it is an improvement. But it is certainly a change that needs
>>> to be carefully considered.
>>
>> Would this change the behaviour of findmnt? I have several scripts that
>> depend on findmnt to select btrfs filesystems. To take a couple of
>> examples (using the example shown above): my scripts would depend on
>> 'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
>> subvolume; and another script would depend on 'findmnt -t btrfs
>> --mountpoint /mnt/test/sub1' providing no output, as the directory is
>> not an /etc/fstab mount point for a btrfs filesystem.
>
> Yes, I think it does change the behaviour of findmnt.
> If the sub1 automount has not been triggered,
>    findmnt --target /mnt/test/sub1 -o target
> will report "/mnt/test". After it has been triggered, it will report
> "/mnt/test/sub1".
>
> Similarly, "findmnt -t btrfs --mountpoint /mnt/test/sub1" will report
> nothing if the automount hasn't been triggered, and will report full
> details of /mnt/test/sub1 if it has.
>
>> Maybe findmnt isn't affected? Or maybe the change is worth making
>> anyway? But it needs to be carefully considered if it breaks existing
>> user interfaces.
>
> I hope the change is worth making anyway, but breaking findmnt would not
> be a popular move.

I agree. I use findmnt, but I also use NFS-mounted btrfs disks, so I am
keen to see this deployed. But people who don't maintain their own
scripts and need a third party to change them might disagree!

> This is unfortunate.... btrfs is "broken" and people/code have adjusted
> to that breakage, so that "fixing" it will be traumatic.
>
> The only way I can find to get findmnt to ignore the new entries in
> /proc/self/mountinfo is to trigger a parse error, such as by replacing
> the " - " with " -- ", but that causes a parse error message to be
> generated and will likely break other tools.
> (... or I could check if current->comm is "findmnt", and suppress the
> extra entries, but that is even more horrible!!)
>
> A possible option is to change findmnt to explicitly ignore the new
> "internal" mounts (unless some option is given) and then delay the
> kernel update until that can be rolled out.

That sounds good as a permanent fix for findmnt. Some sort of
'--include-subvols' option. Particularly if it were possible to default
it using an environment variable, so a script can be written to work
with both the old and the new versions of findmnt.
Unfortunately, it won't help any other program which does similar
searches through /proc/self/mountinfo.

How about creating two different files? Say, /proc/self/mountinfo and
/proc/self/mountinfo.internal (better filenames may be available!). The
.internal file could be just the additional internal mounts, or it could
be the complete list.

Or something like /proc/self/mountinfo.without-subvols and
/proc/self/mountinfo.with-subvols, and a sysctl setting to choose which
is made visible as /proc/self/mountinfo.

Graham
On Thu, Jul 29, 2021 at 01:36:06PM +1000, NeilBrown wrote:
> On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> > On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> > > On Wed, 28 Jul 2021, Neal Gompa wrote:
> > > > On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
> > > > > On Wed, 28 Jul 2021, Wang Yugui wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This patchset works well in 5.14-rc3.
> > > > >
> > > > > Thanks for testing.
> > > > >
> > > > > > 1, The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed
> > > > > > to a dynamic dummy inode (18446744073709551358, or 18446744073709551359,
> > > > > > ...).
> > > > >
> > > > > The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
> > > > > be permanent.
> > > > > The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257,
> > > > > I think).
> > > > > This is a bit less of a hack. It is an easily available number that is
> > > > > fairly unique.
> > > > >
> > > > > > 2, btrfs subvol mount info is shown in /proc/mounts, even if nfsd/nfs is
> > > > > > not used.
> > > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test
> > > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub1
> > > > > > /dev/sdc btrfs 94G 3.5M 93G 1% /mnt/test/sub2
> > > > > >
> > > > > > This is a visual feature change for btrfs users.
> > > > >
> > > > > Hopefully it is an improvement. But it is certainly a change that needs
> > > > > to be carefully considered.
> > > >
> > > > I think this is behavior people generally expect, but I wonder what
> > > > the consequences of this would be with huge numbers of subvolumes. If
> > > > there are hundreds or thousands of them (which is quite possible on
> > > > SUSE systems, for example, with its auto-snapshotting regime), this
> > > > would be a mess, wouldn't it?
> > >
> > > Would there be hundreds or thousands of subvols concurrently being
> > > accessed? The auto-mounted subvols only appear in the mount table while
> > > they are being accessed, and for about 15 minutes after the last access.
> > > I suspect that most subvols are "backup" snapshots which are not being
> > > accessed, and so would not appear.
> >
> > bees dedupes across subvols and polls every few minutes for new data to
> > dedupe. bees doesn't particularly care where the "src" in the dedupe
> > call comes from, so it will pick a subvol that has a reference to the
> > data at random (whichever one comes up first in backref search) for each
> > dedupe call. There is a cache of open fds on each subvol root so that it
> > can access files within that subvol using openat(). The cache quickly
> > populates fully, i.e. it holds an fd to every subvol on the filesystem.
> > The cache has a 15-minute timeout too, so bees would likely keep the
> > mount table fully populated at all times.
>
> OK ... that is very interesting and potentially helpful - thanks.
>
> Localizing these daemons in a separate namespace would stop them from
> polluting the public namespace, but I don't know how easy that would be.
>
> Do you know how bees opens these files? Does it use path-names from the
> root, or some special btrfs ioctl, or ???

There's a function in bees that opens a subvol root directory by subvol
id. It walks up the btrfs subvol tree with btrfs ioctls to construct a
path to the root, then down the filesystem tree with other btrfs ioctls
to get filenames for each subvol. The filenames are fed to openat() with
the parent subvol's fd to get a fd for each child subvol's root
directory along the path.
This is recursive and expensive (the fd has to be checked to see if it
matches the subvol, in case some other process renamed it) and called
every time bees wants to open a file, so the fd goes into a cache for
future open-subvol-by-id calls.

For files, bees calls subvol root open to get a fd for the subvol's
root, then calls the btrfs inode-to-path ioctl on that fd to get a list
of names for the inode, then openat(subvol_fd, inode_to_path(inum), ...)
on each name until a fd matching the target subvol and inode is
obtained. File access is driven by data content, so bees cannot easily
predict which files will need to be accessed again in the near future
and which can be closed. The fd cache is a brute-force way to reduce the
number of inode-to-path and open calls.

Upper layers of bees use (subvol, inode) pairs to identify files and
request file descriptors. The lower layers use filenames as an
implementation detail, for compatibility with the kernel API.

> If path-names are not used, it might be possible to suppress the
> automount.

A userspace interface to read and dedupe that doesn't use pathnames or
file descriptors (other than one fd to bind the interface to a
filesystem) would be nice! About half of the bees code is devoted to
emulating that interface using the existing kernel API. Ideally, a
dedupe agent would be able to pass two physical offsets and a length of
identical data to the filesystem without ever opening a file.

While I'm in favor of making bees smaller, this seems like an expensive
way to suppress automounts.

> > plocate also uses openat(), and it can also be active on many subvols
> > simultaneously, though it only runs once a day, and it's reasonable to
> > exclude all snapshots from plocate for performance reasons.
> >
> > My bigger concern here is that users on btrfs can currently have private
> > subvols with secret names, e.g.
> >
> > 	user$ mkdir -m 700 private
> > 	user$ btrfs sub create private/secret
> > 	user$ cd private/secret
> > 	user$ ...do stuff...
> >
> > Would "secret" now be visible in the very public /proc/mounts every time
> > the user is doing stuff?
>
> Yes, the secret would be publicly visible. Unless we hid it.
>
> It is conceivable that the content of /proc/mounts could be limited to
> mountpoints where the reading process has 'x' access to the mountpoint.
> However, to be really safe we would want to require 'x' access to all
> ancestors too, and possibly some 'r' access. That would get
> prohibitively expensive.

And inconsistent, since we don't do that with other mount points, i.e.
outside of btrfs. But we do hide parts of the path names when the
process's filesystem root or namespace changes, so maybe this is more of
that.

> We could go with "owned by root, or owned by user" maybe.
>
> Thanks,
> NeilBrown
>
> > > > Or can we add a way to mark these things to not show up there, or is
> > > > there some kind of behavioral change we can make to snapper or other
> > > > tools to make them not show up here?
> > >
> > > Certainly it might make sense to flag these in some way so that tools
> > > can choose to ignore them or handle them specially, just as nfsd needs
> > > to handle them specially. I was considering a "local" mount flag.
> >
> > I would definitely want an 'off' switch for this thing until the impact
> > is better understood.
> >
> > > NeilBrown
> > >
> > > > --
> > > > 真実はいつも一つ!/ Always, there's only one truth!
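For reference, one building block of the recursive walk described above
is the BTRFS_IOC_INO_LOOKUP ioctl, which resolves a directory inode to a
path within a subvol (bees uses BTRFS_IOC_INO_PATHS for regular files).
A minimal user-space sketch, with error handling trimmed and a helper
name of our own; note this ioctl may require privilege on some kernels:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    /* Resolve directory inode `dirid` to a path relative to the root of
     * the subvol that `fd` belongs to (treeid == 0 means "the subvol of
     * fd"). bees layers caching and re-verification on top of this. */
    static int dir_ino_to_path(int fd, __u64 dirid, char *buf, size_t len)
    {
            struct btrfs_ioctl_ino_lookup_args args;

            memset(&args, 0, sizeof(args));
            args.treeid = 0;
            args.objectid = dirid;
            if (ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args) < 0)
                    return -1;
            strncpy(buf, args.name, len - 1);
            buf[len - 1] = '\0';
            return 0;
    }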
On Thu, Jul 29, 2021 at 11:43:21AM +1000, NeilBrown wrote:
> On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> >
> > I'm looking at a few machines here, and if all the subvols were visible
> > to 'df', its output would be somewhere around 3-5 MB. That's too
> > much--we'd have to hack up df to not show the same btrfs twice...as
> > well as every monitoring tool that reports free space...which sounds
> > similar to the problems we're trying to avoid.
>
> Thanks for providing hard data!! How many of these subvols are actively
> used (have a file open) at the same time? Most? About half? Just a few??

Between 1% and 10% of the subvols have open fds at any time (not
counting bees, which holds an open fd on every subvol most of the time).
The ratio is higher (more active) when the machine has less storage or
more CPU/RAM: we keep idle subvols around longer if we have lots of free
space, or we make more subvols active at the same time if we have lots
of RAM and CPU.

I don't recall if stat() triggers automount, but most of the subvols are
located in a handful of directories. Could a single 'ls -l' command
activate all of their automounts? If so, then we'd be hitting those at
least once every 15 minutes--these directories are browseable, and an
entry point for users. Certainly anything that looks inside those
directories (like certain file-browsing user agents that look for icons
one level down) will trigger automounts as it searches children of the
subvol root.

Some of this might be fixable: I could probably make bees more
parsimonious with subvol root fds, I could probably rework reporting
scripts to avoid touching anything inside subdirectories, and I could
probably rework our subvolume directory layout to avoid accidentally
triggering thousands of automounts. But I'd rather not. I'd rather have
working NFS and a 15-20 line df output, with no new privacy or
scalability concerns.

> Thanks,
> NeilBrown
On Wed, Jul 28, 2021 at 05:30:04PM -0400, Josef Bacik wrote:
> I don't think anybody has that many file systems. For btrfs it's a
> single file system. Think of syncfs: it's going to walk through all of
> the super blocks on the system calling ->sync_fs on each subvol
> superblock. Now this isn't a huge deal; we could just have some flag
> that says "I'm not real", or even just have anonymous superblocks that
> don't get added to the global super_blocks list, and that would address
> my main pain points.

Umm... Aren't the snapshots read-only by definition?

> The second part is inode reclaim. Again, this particular problem could
> be avoided if we had an anonymous superblock that wasn't actually used,
> but the inode LRU is per superblock. Now with reclaim, instead of
> walking all the inodes, you're walking a bunch of super blocks and then
> walking the list of inodes within those super blocks. You're burning
> CPU cycles because now, instead of getting big chunks of inodes to
> dispose, it's spread out across many super blocks.
>
> The other weird thing is the way we apply pressure to shrinker systems.
> We essentially say "try to evict X objects from your list", which means
> in this case with lots of subvolumes we'd be evicting waaaaay more
> inodes than before, likely impacting performance where you have
> workloads that have lots of files open across many subvolumes (which is
> what FB does with its containers).
>
> If we want an anonymous superblock per subvolume then the only way
> it'll work is if it's not actually tied into anything, and we still use
> the primary super block for the whole file system. And if that's what
> we're going to do, what's the point of the super block exactly? This
> approach that Neil's come up with seems like a reasonable solution to
> me. Christoph gets his separation and /proc/self/mountinfo, and we
> avoid the scalability headache of a billion super blocks. Thanks,

AFAICS, we also get arseloads of weird corner cases - in particular,
Neil's suggestions re visibility in /proc/mounts look rather arbitrary.

Al, really disliking the entire series...
I've been pondering all the excellent feedback, and what I have learnt
from examining the code in btrfs, and I have developed a different
perspective.

Maybe "subvol" is a poor choice of name because it conjures up
connections with the Volumes in LVM, and btrfs subvols are very
different things. Btrfs subvols are really just subtrees that can be
treated as a unit for operations like "clone" or "destroy".

As such, they don't really deserve separate st_dev numbers.

Maybe the different st_dev numbers were introduced as a "cheap" way to
extend the size of the inode-number space. Like many "cheap" things, it
has hidden costs.

Maybe objects in different subvols should still be given different inode
numbers. This would be problematic on 32-bit systems, but much less so
on 64-bit systems.

The patch below, which is just a proof-of-concept, changes btrfs to
report a uniform st_dev, and different (64-bit) st_ino in different
subvols.

It has problems:
 - it will break any 32-bit readdir and 32-bit stat. I don't know how
   big a problem that is these days (ino_t in the kernel is "unsigned
   long", not "unsigned long long"; that surprised me).
 - It might break some user-space expectations. One thing I have learnt
   is not to make any assumptions about what other people might expect.

However, it would be quite easy to make this opt-in (or opt-out) with a
mount option, so that people who need the current inode numbers and will
accept the current breakage can keep working.

I think this approach would be a net win for NFS export, whether btrfs
supports it directly or not. I might post a patch which modifies NFS to
intuit improved inode numbers for btrfs exports....

So: how would this break your use-case??

Thanks,
NeilBrown

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0117d867ecf8..8dc58c848502 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6020,6 +6020,37 @@ static int btrfs_opendir(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static u64 btrfs_make_inum(struct btrfs_key *root, struct btrfs_key *ino)
+{
+	u64 rootid = root->objectid;
+	u64 inoid = ino->objectid;
+	int shift = 64-8;
+
+	if (ino->type == BTRFS_ROOT_ITEM_KEY) {
+		/* This is a subvol root found during readdir. */
+		rootid = inoid;
+		inoid = BTRFS_FIRST_FREE_OBJECTID;
+	}
+	if (rootid == BTRFS_FS_TREE_OBJECTID)
+		/* this is main vol, not subvol (I think) */
+		return inoid;
+	/* store the rootid in the high bits of the inum. This
+	 * will break if 32bit inums are required - we cannot know
+	 */
+	while (rootid) {
+		inoid ^= (rootid & 0xff) << shift;
+		rootid >>= 8;
+		shift -= 8;
+	}
+	return inoid;
+}
+
+static inline u64 btrfs_ino_to_inum(struct inode *inode)
+{
+	return btrfs_make_inum(&BTRFS_I(inode)->root->root_key,
+			       &BTRFS_I(inode)->location);
+}
+
 struct dir_entry {
 	u64 ino;
 	u64 offset;
@@ -6045,6 +6076,49 @@ static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx)
 	return 0;
 }
 
+static inline bool btrfs_dir_emit_dot(struct file *file,
+				      struct dir_context *ctx)
+{
+	return ctx->actor(ctx, ".", 1, ctx->pos,
+			btrfs_ino_to_inum(file->f_path.dentry->d_inode),
+			DT_DIR) == 0;
+}
+
+static inline ino_t btrfs_parent_ino(struct dentry *dentry)
+{
+	ino_t res;
+
+	/*
+	 * Don't strictly need d_lock here? If the parent ino could change
+	 * then surely we'd have a deeper race in the caller?
+ */ + spin_lock(&dentry->d_lock); + res = btrfs_ino_to_inum(dentry->d_parent->d_inode); + spin_unlock(&dentry->d_lock); + return res; +} + +static inline bool btrfs_dir_emit_dotdot(struct file *file, + struct dir_context *ctx) +{ + return ctx->actor(ctx, "..", 2, ctx->pos, + btrfs_parent_ino(file->f_path.dentry), DT_DIR) == 0; +} +static inline bool btrfs_dir_emit_dots(struct file *file, + struct dir_context *ctx) +{ + if (ctx->pos == 0) { + if (!btrfs_dir_emit_dot(file, ctx)) + return false; + ctx->pos = 1; + } + if (ctx->pos == 1) { + if (!btrfs_dir_emit_dotdot(file, ctx)) + return false; + ctx->pos = 2; + } + return true; +} static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); @@ -6067,7 +6141,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) bool put = false; struct btrfs_key location; - if (!dir_emit_dots(file, ctx)) + if (!btrfs_dir_emit_dots(file, ctx)) return 0; path = btrfs_alloc_path(); @@ -6136,7 +6210,8 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), &entry->type); btrfs_dir_item_key_to_cpu(leaf, di, &location); - put_unaligned(location.objectid, &entry->ino); + put_unaligned(btrfs_make_inum(&root->root_key, &location), + &entry->ino); put_unaligned(found_key.offset, &entry->offset); entries++; addr += sizeof(struct dir_entry) + name_len; @@ -9193,7 +9268,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, STATX_ATTR_NODUMP); generic_fillattr(&init_user_ns, inode, stat); - stat->dev = BTRFS_I(inode)->root->anon_dev; + stat->ino = btrfs_ino_to_inum(inode); spin_lock(&BTRFS_I(inode)->lock); delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
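As a concrete illustration of the scheme in the patch above, here is a minimal userspace sketch of its XOR mixing loop (the subvol-root and main-volume special cases are dropped; the function name and example values are mine, not from the patch):

    #include <stdint.h>
    #include <stdio.h>

    /* Mirror of the mixing step in btrfs_make_inum() above: the bytes of
     * the subvol (root) objectid are XOR-folded into the high bytes of
     * the inode objectid.
     */
    static uint64_t make_inum(uint64_t rootid, uint64_t inoid)
    {
            int shift = 64 - 8;

            while (rootid) {
                    inoid ^= (rootid & 0xff) << shift;
                    rootid >>= 8;
                    shift -= 8;
            }
            return inoid;
    }

    int main(void)
    {
            /* e.g. subvol (root) id 257, object id 257 -> 0x0101000000000101 */
            printf("%#018llx\n", (unsigned long long)make_inum(257, 257));
            return 0;
    }

Because this is an XOR fold rather than a disjoint bit split, distinct (rootid, objectid) pairs can still collide once object ids grow into the high bytes; the collision concerns raised later in the thread apply here too.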
On 2021/7/30 上午10:36, NeilBrown wrote: > > I've been pondering all the excellent feedback, and what I have learnt > from examining the code in btrfs, and I have developed a different > perspective. Great! Some new developers into the btrfs realm! > > Maybe "subvol" is a poor choice of name because it conjures up > connections with the Volumes in LVM, and btrfs subvols are very different > things. Btrfs subvols are really just subtrees that can be treated as a > unit for operations like "clone" or "destroy". > > As such, they don't really deserve separate st_dev numbers. > > Maybe the different st_dev numbers were introduced as a "cheap" way to > extend the size of the inode-number space. Like many "cheap" things, it > has hidden costs. > > Maybe objects in different subvols should still be given different inode > numbers. This would be problematic on 32bit systems, but much less so on > 64bit systems. > > The patch below, which is just a proof-of-concept, changes btrfs to > report a uniform st_dev, and different (64bit) st_ino in different subvols. > > It has problems: > - it will break any 32bit readdir and 32bit stat. I don't know how big > a problem that is these days (ino_t in the kernel is "unsigned long", > not "unsigned long long"; that surprised me). > - It might break some user-space expectations. One thing I have learnt > is not to make any assumption about what other people might expect. Wouldn't any filesystem boundary check fail to stop at a subvolume boundary? Then it will go through the full btrfs subvolumes/snapshots, which can be super slow. > > However, it would be quite easy to make this opt-in (or opt-out) with a > mount option, so that people who need the current inode numbers and will > accept the current breakage can keep working. > > I think this approach would be a net-win for NFS export, whether BTRFS > supports it directly or not. I might post a patch which modifies NFS to > intuit improved inode numbers for btrfs exports.... Some extra ideas, but not familiar with VFS enough to be sure. Can we generate a "fake" superblock for each subvolume? Like using the subvolume UUID to replace the FSID of each subvolume. Could that mitigate the problem? Thanks, Qu > > So: how would this break your use-case?? > > Thanks, > NeilBrown > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0117d867ecf8..8dc58c848502 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -6020,6 +6020,37 @@ static int btrfs_opendir(struct inode *inode, struct file *file) > return 0; > } > > +static u64 btrfs_make_inum(struct btrfs_key *root, struct btrfs_key *ino) > +{ > + u64 rootid = root->objectid; > + u64 inoid = ino->objectid; > + int shift = 64-8; > + > + if (ino->type == BTRFS_ROOT_ITEM_KEY) { > + /* This is a subvol root found during readdir. */ > + rootid = inoid; > + inoid = BTRFS_FIRST_FREE_OBJECTID; > + } > + if (rootid == BTRFS_FS_TREE_OBJECTID) > + /* this is main vol, not subvol (I think) */ > + return inoid; > + /* store the rootid in the high bits of the inum. 
This > + * will break if 32bit inums are required - we cannot know > + */ > + while (rootid) { > + inoid ^= (rootid & 0xff) << shift; > + rootid >>= 8; > + shift -= 8; > + } > + return inoid; > +} > + > +static inline u64 btrfs_ino_to_inum(struct inode *inode) > +{ > + return btrfs_make_inum(&BTRFS_I(inode)->root->root_key, > + &BTRFS_I(inode)->location); > +} > + > struct dir_entry { > u64 ino; > u64 offset; > @@ -6045,6 +6076,49 @@ static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx) > return 0; > } > > +static inline bool btrfs_dir_emit_dot(struct file *file, > + struct dir_context *ctx) > +{ > + return ctx->actor(ctx, ".", 1, ctx->pos, > + btrfs_ino_to_inum(file->f_path.dentry->d_inode), > + DT_DIR) == 0; > +} > + > +static inline ino_t btrfs_parent_ino(struct dentry *dentry) > +{ > + ino_t res; > + > + /* > + * Don't strictly need d_lock here? If the parent ino could change > + * then surely we'd have a deeper race in the caller? > + */ > + spin_lock(&dentry->d_lock); > + res = btrfs_ino_to_inum(dentry->d_parent->d_inode); > + spin_unlock(&dentry->d_lock); > + return res; > +} > + > +static inline bool btrfs_dir_emit_dotdot(struct file *file, > + struct dir_context *ctx) > +{ > + return ctx->actor(ctx, "..", 2, ctx->pos, > + btrfs_parent_ino(file->f_path.dentry), DT_DIR) == 0; > +} > +static inline bool btrfs_dir_emit_dots(struct file *file, > + struct dir_context *ctx) > +{ > + if (ctx->pos == 0) { > + if (!btrfs_dir_emit_dot(file, ctx)) > + return false; > + ctx->pos = 1; > + } > + if (ctx->pos == 1) { > + if (!btrfs_dir_emit_dotdot(file, ctx)) > + return false; > + ctx->pos = 2; > + } > + return true; > +} > static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) > { > struct inode *inode = file_inode(file); > @@ -6067,7 +6141,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) > bool put = false; > struct btrfs_key location; > > - if (!dir_emit_dots(file, ctx)) > + if (!btrfs_dir_emit_dots(file, ctx)) > return 0; > > path = btrfs_alloc_path(); > @@ -6136,7 +6210,8 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) > put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), > &entry->type); > btrfs_dir_item_key_to_cpu(leaf, di, &location); > - put_unaligned(location.objectid, &entry->ino); > + put_unaligned(btrfs_make_inum(&root->root_key, &location), > + &entry->ino); > put_unaligned(found_key.offset, &entry->offset); > entries++; > addr += sizeof(struct dir_entry) + name_len; > @@ -9193,7 +9268,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, > STATX_ATTR_NODUMP); > > generic_fillattr(&init_user_ns, inode, stat); > - stat->dev = BTRFS_I(inode)->root->anon_dev; > + stat->ino = btrfs_ino_to_inum(inode); > > spin_lock(&BTRFS_I(inode)->lock); > delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; >
On Fri, Jul 30, 2021 at 5:41 AM NeilBrown <neilb@suse.de> wrote: > > > I've been pondering all the excellent feedback, and what I have learnt > from examining the code in btrfs, and I have developed a different > perspective. > > Maybe "subvol" is a poor choice of name because it conjures up > connections with the Volumes in LVM, and btrfs subvols are very different > things. Btrfs subvols are really just subtrees that can be treated as a > unit for operations like "clone" or "destroy". > > As such, they don't really deserve separate st_dev numbers. > > Maybe the different st_dev numbers were introduced as a "cheap" way to > extend the size of the inode-number space. Like many "cheap" things, it > has hidden costs. > > Maybe objects in different subvols should still be given different inode > numbers. This would be problematic on 32bit systems, but much less so on > 64bit systems. > > The patch below, which is just a proof-of-concept, changes btrfs to > report a uniform st_dev, and different (64bit) st_ino in different subvols. > > It has problems: > - it will break any 32bit readdir and 32bit stat. I don't know how big > a problem that is these days (ino_t in the kernel is "unsigned long", > not "unsigned long long"; that surprised me). > - It might break some user-space expectations. One thing I have learnt > is not to make any assumption about what other people might expect. > > However, it would be quite easy to make this opt-in (or opt-out) with a > mount option, so that people who need the current inode numbers and will > accept the current breakage can keep working. > > I think this approach would be a net-win for NFS export, whether BTRFS > supports it directly or not. I might post a patch which modifies NFS to > intuit improved inode numbers for btrfs exports.... > > So: how would this break your use-case?? The simple cases are find -xdev and du -x which expect the st_dev change, but that can be excused if opting in to a unified st_dev namespace. The harder problem is <st_dev;st_ino> collisions, which are not even that hard to hit with an unlimited number of snapshots. The 'diff' tool demonstrates the implications of <st_dev;st_ino> collisions between different objects for userspace. See xfstest overlay/049 for a demonstration. The overlayfs xino feature made a similar change to overlayfs <st_dev;st_ino>, with one big difference - applications expect that all objects in an overlayfs mount will have the same st_dev. Also, overlayfs has prior knowledge of the number of layers, so it is easier to parcel the ino namespace and avoid collisions. Thanks, Amir.
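For reference, the overlayfs xino approach mentioned above amounts to a fixed split of the 64-bit inode number. A rough userspace model, where the bit split and the names are assumptions for illustration, not overlayfs's actual constants:

    #include <stdint.h>

    #define XINO_BITS 8     /* assumed: high bits reserved for the layer (fsid) */

    /* Fixed-split packing: the layer id goes in the high bits, the
     * underlying inode number keeps the low bits.  Returns 0 when the
     * real inode number does not fit in its share of the bits, the
     * overflow case the filesystem must detect and handle.
     */
    uint64_t xino_pack(uint64_t real_ino, uint32_t fsid)
    {
            if (real_ino >> (64 - XINO_BITS))
                    return 0;
            return real_ino | ((uint64_t)fsid << (64 - XINO_BITS));
    }

A split like this works for overlayfs because the number of layers is known up front; the difficulty in this thread is that btrfs has no comparably small bound on subvolume ids.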
On 2021/7/30 下午1:25, Qu Wenruo wrote: > > > On 2021/7/30 上午10:36, NeilBrown wrote: >> >> I've been pondering all the excellent feedback, and what I have learnt >> from examining the code in btrfs, and I have developed a different >> perspective. > > Great! Some new developers into the btrfs realm! > >> >> Maybe "subvol" is a poor choice of name because it conjures up >> connections with the Volumes in LVM, and btrfs subvols are very different >> things. Btrfs subvols are really just subtrees that can be treated as a >> unit for operations like "clone" or "destroy". >> >> As such, they don't really deserve separate st_dev numbers. >> >> Maybe the different st_dev numbers were introduced as a "cheap" way to >> extend the size of the inode-number space. Like many "cheap" things, it >> has hidden costs. Forgot another problem already caused by this st_dev method. Since btrfs uses st_dev to distinguish its inode namespaces, and st_dev is allocated from the anonymous bdev pool, which has a limited size (much smaller than the btrfs subvolume id namespace), it is already causing problems: we can't allocate an anonymous bdev for every subvolume, and subvolume/snapshot creation can fail. Thus it's really time to reconsider how we should export this info to user space. Thanks, Qu >> >> Maybe objects in different subvols should still be given different inode >> numbers. This would be problematic on 32bit systems, but much less so on >> 64bit systems. >> >> The patch below, which is just a proof-of-concept, changes btrfs to >> report a uniform st_dev, and different (64bit) st_ino in different >> subvols. >> >> It has problems: >> - it will break any 32bit readdir and 32bit stat. I don't know how big >> a problem that is these days (ino_t in the kernel is "unsigned long", >> not "unsigned long long"; that surprised me). >> - It might break some user-space expectations. One thing I have learnt >> is not to make any assumption about what other people might expect. > > Wouldn't any filesystem boundary check fail to stop at a subvolume boundary? > > Then it will go through the full btrfs subvolumes/snapshots, which can > be super slow. > >> >> However, it would be quite easy to make this opt-in (or opt-out) with a >> mount option, so that people who need the current inode numbers and will >> accept the current breakage can keep working. >> >> I think this approach would be a net-win for NFS export, whether BTRFS >> supports it directly or not. I might post a patch which modifies NFS to >> intuit improved inode numbers for btrfs exports.... > > Some extra ideas, but not familiar with VFS enough to be sure. > > Can we generate a "fake" superblock for each subvolume? > Like using the subvolume UUID to replace the FSID of each subvolume. > Could that mitigate the problem? > > Thanks, > Qu > >> >> So: how would this break your use-case?? >> >> Thanks, >> NeilBrown >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 0117d867ecf8..8dc58c848502 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -6020,6 +6020,37 @@ static int btrfs_opendir(struct inode *inode, >> struct file *file) >> return 0; >> } >> >> +static u64 btrfs_make_inum(struct btrfs_key *root, struct btrfs_key >> *ino) >> +{ >> + u64 rootid = root->objectid; >> + u64 inoid = ino->objectid; >> + int shift = 64-8; >> + >> + if (ino->type == BTRFS_ROOT_ITEM_KEY) { >> + /* This is a subvol root found during readdir. 
*/ >> + rootid = inoid; >> + inoid = BTRFS_FIRST_FREE_OBJECTID; >> + } >> + if (rootid == BTRFS_FS_TREE_OBJECTID) >> + /* this is main vol, not subvol (I think) */ >> + return inoid; >> + /* store the rootid in the high bits of the inum. This >> + * will break if 32bit inums are required - we cannot know >> + */ >> + while (rootid) { >> + inoid ^= (rootid & 0xff) << shift; >> + rootid >>= 8; >> + shift -= 8; >> + } >> + return inoid; >> +} >> + >> +static inline u64 btrfs_ino_to_inum(struct inode *inode) >> +{ >> + return btrfs_make_inum(&BTRFS_I(inode)->root->root_key, >> + &BTRFS_I(inode)->location); >> +} >> + >> struct dir_entry { >> u64 ino; >> u64 offset; >> @@ -6045,6 +6076,49 @@ static int btrfs_filldir(void *addr, int >> entries, struct dir_context *ctx) >> return 0; >> } >> >> +static inline bool btrfs_dir_emit_dot(struct file *file, >> + struct dir_context *ctx) >> +{ >> + return ctx->actor(ctx, ".", 1, ctx->pos, >> + btrfs_ino_to_inum(file->f_path.dentry->d_inode), >> + DT_DIR) == 0; >> +} >> + >> +static inline ino_t btrfs_parent_ino(struct dentry *dentry) >> +{ >> + ino_t res; >> + >> + /* >> + * Don't strictly need d_lock here? If the parent ino could change >> + * then surely we'd have a deeper race in the caller? >> + */ >> + spin_lock(&dentry->d_lock); >> + res = btrfs_ino_to_inum(dentry->d_parent->d_inode); >> + spin_unlock(&dentry->d_lock); >> + return res; >> +} >> + >> +static inline bool btrfs_dir_emit_dotdot(struct file *file, >> + struct dir_context *ctx) >> +{ >> + return ctx->actor(ctx, "..", 2, ctx->pos, >> + btrfs_parent_ino(file->f_path.dentry), DT_DIR) == 0; >> +} >> +static inline bool btrfs_dir_emit_dots(struct file *file, >> + struct dir_context *ctx) >> +{ >> + if (ctx->pos == 0) { >> + if (!btrfs_dir_emit_dot(file, ctx)) >> + return false; >> + ctx->pos = 1; >> + } >> + if (ctx->pos == 1) { >> + if (!btrfs_dir_emit_dotdot(file, ctx)) >> + return false; >> + ctx->pos = 2; >> + } >> + return true; >> +} >> static int btrfs_real_readdir(struct file *file, struct dir_context >> *ctx) >> { >> struct inode *inode = file_inode(file); >> @@ -6067,7 +6141,7 @@ static int btrfs_real_readdir(struct file *file, >> struct dir_context *ctx) >> bool put = false; >> struct btrfs_key location; >> >> - if (!dir_emit_dots(file, ctx)) >> + if (!btrfs_dir_emit_dots(file, ctx)) >> return 0; >> >> path = btrfs_alloc_path(); >> @@ -6136,7 +6210,8 @@ static int btrfs_real_readdir(struct file *file, >> struct dir_context *ctx) >> put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), >> &entry->type); >> btrfs_dir_item_key_to_cpu(leaf, di, &location); >> - put_unaligned(location.objectid, &entry->ino); >> + put_unaligned(btrfs_make_inum(&root->root_key, &location), >> + &entry->ino); >> put_unaligned(found_key.offset, &entry->offset); >> entries++; >> addr += sizeof(struct dir_entry) + name_len; >> @@ -9193,7 +9268,7 @@ static int btrfs_getattr(struct user_namespace >> *mnt_userns, >> STATX_ATTR_NODUMP); >> >> generic_fillattr(&init_user_ns, inode, stat); >> - stat->dev = BTRFS_I(inode)->root->anon_dev; >> + stat->ino = btrfs_ino_to_inum(inode); >> >> spin_lock(&BTRFS_I(inode)->lock); >> delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; >>
On Fri, Jul 30, 2021 at 8:33 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2021/7/30 下午1:25, Qu Wenruo wrote: > > > > > > On 2021/7/30 上午10:36, NeilBrown wrote: > >> > >> I've been pondering all the excellent feedback, and what I have learnt > >> from examining the code in btrfs, and I have developed a different > >> perspective. > > > > Great! Some new developers into the btrfs realm! > > > >> > >> Maybe "subvol" is a poor choice of name because it conjures up > >> connections with the Volumes in LVM, and btrfs subvols are very different > >> things. Btrfs subvols are really just subtrees that can be treated as a > >> unit for operations like "clone" or "destroy". > >> > >> As such, they don't really deserve separate st_dev numbers. > >> > >> Maybe the different st_dev numbers were introduced as a "cheap" way to > >> extend to size of the inode-number space. Like many "cheap" things, it > >> has hidden costs. > > Forgot another problem already caused by this st_dev method. > > Since btrfs uses st_dev to distinguish them its inode name space, and > st_dev is allocated using anonymous bdev, and the anonymous bdev poor > has limited size (much smaller than btrfs subvolume id name space), it's > already causing problems like we can't allocate enough anonymous bdev > for each subvolume, and failed to create subvolume/snapshot. > How about creating a major dev for btrfs subvolumes to start with. Then at least there is a possibility for administrative reservation of st_dev values for subvols that need persistent <st_dev;st_ino> By default subvols get assigned a minor dynamically as today and with opt-in (e.g. for small short lived btrfs filesystems), the unified st_dev approach can be used, possibly by providing an upper limit to the inode numbers on the filesystem, similar to xfs -o inode32 mount option. Thanks, Amir.
On Fri, 30 Jul 2021, Qu Wenruo wrote: > > On 2021/7/30 上午10:36, NeilBrown wrote: > > > > I've been pondering all the excellent feedback, and what I have learnt > > from examining the code in btrfs, and I have developed a different > > perspective. > > Great! Some new developers into the btrfs realm! :-) > > > > > Maybe "subvol" is a poor choice of name because it conjures up > > connections with the Volumes in LVM, and btrfs subvols are very different > > things. Btrfs subvols are really just subtrees that can be treated as a > > unit for operations like "clone" or "destroy". > > > > As such, they don't really deserve separate st_dev numbers. > > > > Maybe the different st_dev numbers were introduced as a "cheap" way to > > extend the size of the inode-number space. Like many "cheap" things, it > > has hidden costs. > > > > Maybe objects in different subvols should still be given different inode > > numbers. This would be problematic on 32bit systems, but much less so on > > 64bit systems. > > > > The patch below, which is just a proof-of-concept, changes btrfs to > > report a uniform st_dev, and different (64bit) st_ino in different subvols. > > > > It has problems: > > - it will break any 32bit readdir and 32bit stat. I don't know how big > > a problem that is these days (ino_t in the kernel is "unsigned long", > > not "unsigned long long"; that surprised me). > > - It might break some user-space expectations. One thing I have learnt > > is not to make any assumption about what other people might expect. > > Wouldn't any filesystem boundary check fail to stop at a subvolume boundary? You mean like "du -x"?? Yes. You would lose the misleading illusion that there are multiple filesystems. That is one user-expectation that would need to be addressed before people opt in. > > Then it will go through the full btrfs subvolumes/snapshots, which can > be super slow. > > > > > However, it would be quite easy to make this opt-in (or opt-out) with a > > mount option, so that people who need the current inode numbers and will > > accept the current breakage can keep working. > > > > I think this approach would be a net-win for NFS export, whether BTRFS > > supports it directly or not. I might post a patch which modifies NFS to > > intuit improved inode numbers for btrfs exports.... > > Some extra ideas, but not familiar with VFS enough to be sure. > > Can we generate a "fake" superblock for each subvolume? I don't see how that would help. Either subvols are like filesystems and appear in /proc/mounts, or they aren't like filesystems and don't get different st_dev. Either of these outcomes can be achieved without fake superblocks. If you really need BTRFS subvols to have some properties of filesystems but not all, then you are in for a whole world of pain. Maybe btrfs subvols should be treated more like XFS "managed trees". At least there you have precedent and someone else to share the pain. Maybe we should train people to use "quota" to check the usage of a subvol, rather than "du" (which will stop working with my patch if it contains refs to other subvols) or "df" (which already doesn't work), or "btrfs df". > Like using the subvolume UUID to replace the FSID of each subvolume. > Could that mitigate the problem? Which problem, exactly? My first approach to making subvols work on NFS took essentially that approach. It was seen (quite reasonably) as a hack to work around poor behaviour in btrfs. 
Given that NFS has always seen all of a btrfs filesystem as having a uniform fsid, I'm now of the opinion that we don't want to change that, but should just fix the duplicate-inode-number problem. If I could think of some way for NFSD to see different inode numbers than the VFS, I would push hard for fixing nfsd by giving it more sane inode numbers. Thanks, NeilBrown
On Fri, 30 Jul 2021, Qu Wenruo wrote: > > On 2021/7/30 下午1:25, Qu Wenruo wrote: > > > > > > On 2021/7/30 上午10:36, NeilBrown wrote: > >> > >> I've been pondering all the excellent feedback, and what I have learnt > >> from examining the code in btrfs, and I have developed a different > >> perspective. > > > > Great! Some new developers into the btrfs realm! > > > >> > >> Maybe "subvol" is a poor choice of name because it conjures up > >> connections with the Volumes in LVM, and btrfs subvols are very different > >> things. Btrfs subvols are really just subtrees that can be treated as a > >> unit for operations like "clone" or "destroy". > >> > >> As such, they don't really deserve separate st_dev numbers. > >> > >> Maybe the different st_dev numbers were introduced as a "cheap" way to > >> extend the size of the inode-number space. Like many "cheap" things, it > >> has hidden costs. > > Forgot another problem already caused by this st_dev method. > > Since btrfs uses st_dev to distinguish its inode namespaces, and > st_dev is allocated from the anonymous bdev pool, which has a limited size > (much smaller than the btrfs subvolume id namespace), it is already > causing problems: we can't allocate an anonymous bdev for every subvolume, > and subvolume/snapshot creation can fail. What sort of numbers do you see in practice? How many subvolumes and how many inodes per subvolume? If we allocated some number of bits to each, with over-allocation to allow for growth, could we fit both into 64 bits? NeilBrown > > Thus it's really time to reconsider how we should export this info to > user space. > > Thanks, > Qu >
On Fri, 30 Jul 2021, Al Viro wrote: > On Wed, Jul 28, 2021 at 05:30:04PM -0400, Josef Bacik wrote: > > > I don't think anybody has that many file systems. For btrfs it's a single > > file system. Think of syncfs, it's going to walk through all of the super > > blocks on the system calling ->sync_fs on each subvol superblock. Now this > > isn't a huge deal, we could just have some flag that says "I'm not real" or > > even just have anonymous superblocks that don't get added to the global > > super_blocks list, and that would address my main pain points. > > Umm... Aren't the snapshots read-only by definition? No, though they can be. Subvols can be created empty, or duplicated from an existing subvol. Any subvol can be written to, using copy-on-write of course. NeilBrown
On 2021/7/30 下午2:00, NeilBrown wrote: > On Fri, 30 Jul 2021, Qu Wenruo wrote: >> >> On 2021/7/30 下午1:25, Qu Wenruo wrote: >>> >>> >>> On 2021/7/30 上午10:36, NeilBrown wrote: >>>> >>>> I've been pondering all the excellent feedback, and what I have learnt >>>> from examining the code in btrfs, and I have developed a different >>>> perspective. >>> >>> Great! Some new developers into the btrfs realm! >>> >>>> >>>> Maybe "subvol" is a poor choice of name because it conjures up >>>> connections with the Volumes in LVM, and btrfs subvols are very different >>>> things. Btrfs subvols are really just subtrees that can be treated as a >>>> unit for operations like "clone" or "destroy". >>>> >>>> As such, they don't really deserve separate st_dev numbers. >>>> >>>> Maybe the different st_dev numbers were introduced as a "cheap" way to >>>> extend the size of the inode-number space. Like many "cheap" things, it >>>> has hidden costs. >> >> Forgot another problem already caused by this st_dev method. >> >> Since btrfs uses st_dev to distinguish its inode namespaces, and >> st_dev is allocated from the anonymous bdev pool, which has a limited size >> (much smaller than the btrfs subvolume id namespace), it is already >> causing problems: we can't allocate an anonymous bdev for every subvolume, >> and subvolume/snapshot creation can fail. > > What sort of numbers do you see in practice? How many subvolumes and how > many inodes per subvolume? Normally the number of "live"(*) subvolumes is below the minor dev number range (1<<20), thus not a big deal. *: "Live" here means the subvolume has been accessed at least once. A subvolume that exists but has never been accessed doesn't get an anonymous bdev number allocated. But (1<<20) is really small compared to some real-world users. Thus we have had reports of this problem, and changed the timing of bdev number allocation. > If we allocated some number of bits to each, with over-allocation to > allow for growth, could we fit both into 64 bits? I don't think it's even possible, as currently we use u32 for dev_t, which is already way below the theoretical limit (U64_MAX - 512). Thus AFAIK there is no real way to solve it right now. Thanks, Qu > > NeilBrown > > >> >> Thus it's really time to reconsider how we should export this info to >> user space. >> >> Thanks, >> Qu >>
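For context on the (1<<20) figure: the kernel-internal dev_t packs a 12-bit major and a 20-bit minor into 32 bits, and anonymous bdevs all share a single major, so the pool tops out at roughly a million minors regardless of how many subvolumes exist. These are the kernel's own macros from include/linux/kdev_t.h:

    #define MINORBITS       20
    #define MINORMASK       ((1U << MINORBITS) - 1)

    #define MAJOR(dev)      ((unsigned int) ((dev) >> MINORBITS))
    #define MINOR(dev)      ((unsigned int) ((dev) & MINORMASK))
    #define MKDEV(ma,mi)    (((ma) << MINORBITS) | (mi))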
On 2021/7/30 下午1:58, NeilBrown wrote: > On Fri, 30 Jul 2021, Qu Wenruo wrote: >> >> On 2021/7/30 上午10:36, NeilBrown wrote: >>> >>> I've been pondering all the excellent feedback, and what I have learnt >>> from examining the code in btrfs, and I have developed a different >>> perspective. >> >> Great! Some new developers into the btrfs realm! > > :-) > >> >>> >>> Maybe "subvol" is a poor choice of name because it conjures up >>> connections with the Volumes in LVM, and btrfs subvols are very different >>> things. Btrfs subvols are really just subtrees that can be treated as a >>> unit for operations like "clone" or "destroy". >>> >>> As such, they don't really deserve separate st_dev numbers. >>> >>> Maybe the different st_dev numbers were introduced as a "cheap" way to >>> extend the size of the inode-number space. Like many "cheap" things, it >>> has hidden costs. >>> >>> Maybe objects in different subvols should still be given different inode >>> numbers. This would be problematic on 32bit systems, but much less so on >>> 64bit systems. >>> >>> The patch below, which is just a proof-of-concept, changes btrfs to >>> report a uniform st_dev, and different (64bit) st_ino in different subvols. >>> >>> It has problems: >>> - it will break any 32bit readdir and 32bit stat. I don't know how big >>> a problem that is these days (ino_t in the kernel is "unsigned long", >>> not "unsigned long long"; that surprised me). >>> - It might break some user-space expectations. One thing I have learnt >>> is not to make any assumption about what other people might expect. >> >> Wouldn't any filesystem boundary check fail to stop at a subvolume boundary? > > You mean like "du -x"?? Yes. You would lose the misleading illusion > that there are multiple filesystems. That is one user-expectation that > would need to be addressed before people opt in. OK, I forgot it's an opt-in feature, so it's less of an impact. But it can still sometimes be problematic. E.g. if the user wants to put some git code into one subvolume, while exporting another subvolume through NFS. Then the user has to opt in, causing the git subvolume to lose the ability to detect subvolume boundaries, right? This is more concerning as most btrfs users won't want to explicitly prepare another separate btrfs filesystem. > >> >> Then it will go through the full btrfs subvolumes/snapshots, which can >> be super slow. >> >>> >>> However, it would be quite easy to make this opt-in (or opt-out) with a >>> mount option, so that people who need the current inode numbers and will >>> accept the current breakage can keep working. >>> >>> I think this approach would be a net-win for NFS export, whether BTRFS >>> supports it directly or not. I might post a patch which modifies NFS to >>> intuit improved inode numbers for btrfs exports.... >> >> Some extra ideas, but not familiar with VFS enough to be sure. >> >> Can we generate a "fake" superblock for each subvolume? > > I don't see how that would help. Either subvols are like filesystems > and appear in /proc/mounts, or they aren't like filesystems and don't > get different st_dev. Either of these outcomes can be achieved without > fake superblocks. If you really need BTRFS subvols to have some > properties of filesystems but not all, then you are in for a whole world > of pain. I guess it's time we pay for the hacks... > > Maybe btrfs subvols should be treated more like XFS "managed trees". At > least there you have precedent and someone else to share the pain. 
> Maybe we should train people to use "quota" to check the usage of a > subvol, Well, btrfs quota has its own pain... > rather than "du" (which will stop working with my patch if it > contains refs to other subvols) or "df" (which already doesn't work), or > "btrfs df". BTW, since XFS has a similar feature (not completely sure about XFS though), I guess in the long run it may be worthwhile to give the VFS some way to treat such a concept: not a full volume, but still distinct trees. > >> Like using the subvolume UUID to replace the FSID of each subvolume. >> Could that mitigate the problem? > > Which problem, exactly? My first approach to making subvols work on NFS > took essentially that approach. It was seen (quite reasonably) as a > hack to work around poor behaviour in btrfs. > > Given that NFS has always seen all of a btrfs filesystem as having a > uniform fsid, I'm now of the opinion that we don't want to change that, > but should just fix the duplicate-inode-number problem. > > If I could think of some way for NFSD to see different inode numbers > than the VFS, I would push hard for fixing nfsd by giving it more sane inode > numbers. Really not familiar with NFS/VFS, thus some ideas from me may sound super crazy. Is it possible for nfsd to detect such a "subvolume" concept on its own, e.g. by checking st_dev and the fsid returned from statfs()? Then if nfsd finds some boundary which has a different st_dev, but the same fsid as its parent, it knows it's a "subvolume"-like concept. Then do some local inode number mapping inside nfsd? Like using the highest 20 bits for different subvolumes, with the remaining 44 bits for real inode numbers. Of course, this is still a workaround... Thanks, Qu > > Thanks, > NeilBrown >
On Fri, 30 Jul 2021, Qu Wenruo wrote: > > > > You mean like "du -x"?? Yes. You would lose the misleading illusion > > that there are multiple filesystems. That is one user-expectation that > > would need to be addressed before people opt in. > > OK, I forgot it's an opt-in feature, so it's less of an impact. The hope would have to be that everyone would eventually opt in once all issues were understood. > > Really not familiar with NFS/VFS, thus some ideas from me may sound > super crazy. > > Is it possible for nfsd to detect such a "subvolume" concept on its > own, e.g. by checking st_dev and the fsid returned from statfs()? > > Then if nfsd finds some boundary which has a different st_dev, but the same > fsid as its parent, it knows it's a "subvolume"-like concept. > > Then do some local inode number mapping inside nfsd? > Like using the highest 20 bits for different subvolumes, with the > remaining 44 bits for real inode numbers. > > Of course, this is still a workaround... Yes, it would certainly be possible to add some hacks to nfsd to fix the immediate problem, and we could probably even create some well-defined interfaces into btrfs to extract the required information so that it wasn't too hackish. Maybe that is what we will have to do. But I'd rather not hack NFSD while there is any chance that a more complete solution will be found. I'm not quite ready to give up on the idea of squeezing all btrfs inodes into a 64bit number space. 24 bits of subvol and 40 bits of inode? Make the split a mkfs or mount option? Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol (which has ever been accessed) has a mapping from the top 32 bits of the objectid to the top 32 bits of the inode number. We don't need something that is theoretically perfect (that's not possible anyway as we don't have 64 bits of device numbers). We just need something that is practical and scales adequately. If you have petabytes of storage, it is reasonable to spend a gigabyte of memory on a lookup table(?). If we can make inode numbers unique, we can possibly leave the st_dev changing at subvols so that "du -x" works as currently expected. One thought I had was to use a strong hash to combine the subvol object id and the inode object id into a 64bit number. What is the chance of a collision in practice :-) Thanks, NeilBrown
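A minimal sketch of the fixed 24/40 split suggested above, assuming the split is chosen at mkfs/mount time (the names and the zero-return error convention are mine):

    #include <stdint.h>
    #include <stdio.h>

    #define SUBVOL_BITS 24                  /* assumed split */
    #define INO_BITS    (64 - SUBVOL_BITS)

    /* Pack (subvol id, object id) into a single 64-bit inode number.
     * Returns 0 when either field overflows its share of the bits,
     * the case the kernel would have to turn into an error.
     */
    static uint64_t pack_ino(uint64_t subvol, uint64_t objectid)
    {
            if (subvol >> SUBVOL_BITS || objectid >> INO_BITS)
                    return 0;
            return (subvol << INO_BITS) | objectid;
    }

    int main(void)
    {
            /* subvol 257, object id 12345 -> 0x0000010100003039 */
            printf("%#018llx\n", (unsigned long long)pack_ino(257, 12345));
            return 0;
    }

The 2^32-chunk variant floated in the same mail would replace the fixed shift with a small per-subvol table, filled in on first access, mapping the top 32 bits of the objectid to the top 32 bits of the inode number.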
On 2021/7/30 下午2:53, NeilBrown wrote: > On Fri, 30 Jul 2021, Qu Wenruo wrote: >>> >>> You mean like "du -x"?? Yes. You would lose the misleading illusion >>> that there are multiple filesystems. That is one user-expectation that >>> would need to be addressed before people opt in. >> >> OK, I forgot it's an opt-in feature, so it's less of an impact. > > The hope would have to be that everyone would eventually opt in once all > issues were understood. > >> >> Really not familiar with NFS/VFS, thus some ideas from me may sound >> super crazy. >> >> Is it possible for nfsd to detect such a "subvolume" concept on its >> own, e.g. by checking st_dev and the fsid returned from statfs()? >> >> Then if nfsd finds some boundary which has a different st_dev, but the same >> fsid as its parent, it knows it's a "subvolume"-like concept. >> >> Then do some local inode number mapping inside nfsd? >> Like using the highest 20 bits for different subvolumes, with the >> remaining 44 bits for real inode numbers. >> >> Of course, this is still a workaround... > > Yes, it would certainly be possible to add some hacks to nfsd to fix the > immediate problem, and we could probably even create some well-defined > interfaces into btrfs to extract the required information so that it > wasn't too hackish. > > Maybe that is what we will have to do. But I'd rather not hack NFSD > while there is any chance that a more complete solution will be found. > > I'm not quite ready to give up on the idea of squeezing all btrfs inodes > into a 64bit number space. 24 bits of subvol and 40 bits of inode? > Make the split a mkfs or mount option? Btrfs used to have a subvolume number limit in the past, for different reasons. In that case, the subvolume number was limited to 48 bits, which is still too large to avoid conflicts. For inode numbers there is really no limit except the 256 ~ (U64)-256 range. Considering all these numbers are almost U64, conflicts would be unavoidable AFAIK. > Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol > (which has ever been accessed) has a mapping from the top 32 bits of the > objectid to the top 32 bits of the inode number. > > We don't need something that is theoretically perfect (that's not > possible anyway as we don't have 64 bits of device numbers). We just > need something that is practical and scales adequately. If you have > petabytes of storage, it is reasonable to spend a gigabyte of memory on > a lookup table(?). Could such squishing-all-inodes-into-one-namespace work be done in a more generic way? E.g. let each fs with a "subvolume"-like feature provide an interface to do that. Despite that, I still hope to have a way to distinguish the "subvolume" boundary. If completely inside btrfs, it's pretty simple to locate a subvolume boundary. All subvolumes have the same inode number, 256. Maybe we could reserve some special "squished" inode number to indicate a boundary inside a filesystem. E.g. reserve (u64)-1 as a special indicator for subvolume boundaries. As most filesystems reserve super-high inode numbers anyway. > > If we can make inode numbers unique, we can possibly leave the st_dev > changing at subvols so that "du -x" works as currently expected. > > One thought I had was to use a strong hash to combine the subvol object > id and the inode object id into a 64bit number. What is the chance of > a collision in practice :-) But with just 64 bits, conflicts will happen anyway... Thanks, Qu > > Thanks, > NeilBrown >
On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote: > OK, I forgot it's an opt-in feature, so it's less of an impact. > > But it can still sometimes be problematic. > > E.g. if the user wants to put some git code into one subvolume, while > exporting another subvolume through NFS. > > Then the user has to opt in, causing the git subvolume to lose the > ability to detect subvolume boundaries, right? Totally naive question: is it possible to treat different subvolumes differently, and give the user some choice at subvolume creation time of how this new boundary should behave? It seems like there are some conflicting priorities that can only be resolved by someone who knows the intended use case. --b.
On 7/30/21 11:17 AM, J. Bruce Fields wrote: > On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote: >> OK, forgot it's an opt-in feature, then it's less an impact. >> >> But it can still sometimes be problematic. >> >> E.g. if the user want to put some git code into one subvolume, while >> export another subvolume through NFS. >> >> Then the user has to opt-in, affecting the git subvolume to lose the >> ability to determine subvolume boundary, right? > > Totally naive question: is it be possible to treat different subvolumes > differently, and give the user some choice at subvolume creation time > how this new boundary should behave? > > It seems like there are some conflicting priorities that can only be > resolved by someone who knows the intended use case. > This is the crux of the problem. We have no real interfaces or anything to deal with this sort of paradigm. We do the st_dev thing because that's the most common way that tools like find or rsync use to determine they've wandered into a "different" volume. This exists specifically because of usescases like Zygo's, where he's taking thousands of snapshots and manually excluding them from find/rsync is just not reasonable. We have no good way to give the user information about what's going on, we just have these old shitty interfaces. I asked our guys about filling up /proc/self/mountinfo with our subvolumes and they had a heart attack because we have around 2-4k subvolumes on machines, and with monitoring stuff in place we regularly read /proc/self/mountinfo to determine what's mounted and such. And then there's NFS which needs to know that it's walked into a new inode space. This is all super shitty, and mostly exists because we don't have a good way to expose to the user wtf is going on. Personally I would be ok with simply disallowing NFS to wander into subvolumes from an exported fs. If you want to export subvolumes then export them individually, otherwise if you walk into a subvolume from NFS you simply get an empty directory. This doesn't solve the mountinfo problem where a user may want to figure out which subvol they're in, but this is where I think we could address the issue with better interfaces. Or perhaps Neil's idea to have a common major number with a different minor number for every subvol. Either way this isn't as simple as shoehorning it into automount and being done with it, we need to take a step back and think about how should this actually look, taking into account we've got 12 years of having Btrfs deployed with existing usecases that expect a certain behavior. Thanks, Josef
On 2021-07-30 17:48, Josef Bacik wrote: > On 7/30/21 11:17 AM, J. Bruce Fields wrote: >> On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote: >>> OK, forgot it's an opt-in feature, then it's less an impact. >>> >>> But it can still sometimes be problematic. >>> >>> E.g. if the user want to put some git code into one subvolume, while >>> export another subvolume through NFS. >>> >>> Then the user has to opt-in, affecting the git subvolume to lose the >>> ability to determine subvolume boundary, right? >> >> Totally naive question: is it be possible to treat different subvolumes >> differently, and give the user some choice at subvolume creation time >> how this new boundary should behave? >> >> It seems like there are some conflicting priorities that can only be >> resolved by someone who knows the intended use case. >> > > This is the crux of the problem. We have no real interfaces or anything > to deal with this sort of paradigm. We do the st_dev thing because > that's the most common way that tools like find or rsync use to > determine they've wandered into a "different" volume. This exists > specifically because of usescases like Zygo's, where he's taking > thousands of snapshots and manually excluding them from find/rsync is > just not reasonable. > > We have no good way to give the user information about what's going on, > we just have these old shitty interfaces. I asked our guys about > filling up /proc/self/mountinfo with our subvolumes and they had a heart > attack because we have around 2-4k subvolumes on machines, and with > monitoring stuff in place we regularly read /proc/self/mountinfo to > determine what's mounted and such. > > And then there's NFS which needs to know that it's walked into a new > inode space. > > This is all super shitty, and mostly exists because we don't have a good > way to expose to the user wtf is going on. > > Personally I would be ok with simply disallowing NFS to wander into > subvolumes from an exported fs. If you want to export subvolumes then > export them individually, otherwise if you walk into a subvolume from > NFS you simply get an empty directory. > > This doesn't solve the mountinfo problem where a user may want to figure > out which subvol they're in, but this is where I think we could address > the issue with better interfaces. Or perhaps Neil's idea to have a > common major number with a different minor number for every subvol. > > Either way this isn't as simple as shoehorning it into automount and > being done with it, we need to take a step back and think about how > should this actually look, taking into account we've got 12 years of > having Btrfs deployed with existing usecases that expect a certain > behavior. Thanks, > > Josef As a user and sysadmin I really appreciate the way Btrfs currently works. We use hourly snapshots which are exposed over Samba as "Previous Versions" to Windows users. This amounts to thousands of snapshots, all user serviceable. A great feature! In Samba world we have a mount option[1] called "noserverino" which lets the client generate unique inode numbers, rather than using the server provided inode numbers. This allows Linux clients to work well against servers exposing subvolumes and snapshots. NFS has really old roots and had to make choices that we don't really have to make today. Can we not provide something similar to mount.cifs that generate unique inode numbers for the clients. This could be either an nfsd export option (such as /mnt/foo *(rw,uniq_inodes)) or a mount option on the clients. 
One worry I have with making subvolumes automount points is that it might break the ability to cp --reflink across that boundary. [1] https://www.samba.org/~ab/output/htmldocs/manpages-3/mount.cifs.8.html
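For comparison, the cifs option referred to above is a client-side mount option; the nfsd-side "uniq_inodes" export option suggested earlier is hypothetical and does not exist today. The server name, share and user below are examples:

    # client generates its own unique inode numbers instead of trusting
    # the server's (see mount.cifs(8))
    mount -t cifs //server/share /mnt/share -o username=alice,noserverino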
On Fri, Jul 30, 2021 at 11:48:15AM -0400, Josef Bacik wrote: > On 7/30/21 11:17 AM, J. Bruce Fields wrote: > > On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote: > > > OK, forgot it's an opt-in feature, then it's less an impact. > > > > > > But it can still sometimes be problematic. > > > > > > E.g. if the user want to put some git code into one subvolume, while > > > export another subvolume through NFS. > > > > > > Then the user has to opt-in, affecting the git subvolume to lose the > > > ability to determine subvolume boundary, right? > > > > Totally naive question: is it be possible to treat different subvolumes > > differently, and give the user some choice at subvolume creation time > > how this new boundary should behave? > > > > It seems like there are some conflicting priorities that can only be > > resolved by someone who knows the intended use case. > > > > This is the crux of the problem. We have no real interfaces or anything to > deal with this sort of paradigm. We do the st_dev thing because that's the > most common way that tools like find or rsync use to determine they've > wandered into a "different" volume. This exists specifically because of > usescases like Zygo's, where he's taking thousands of snapshots and manually > excluding them from find/rsync is just not reasonable. > > We have no good way to give the user information about what's going on, we > just have these old shitty interfaces. I asked our guys about filling up > /proc/self/mountinfo with our subvolumes and they had a heart attack because > we have around 2-4k subvolumes on machines, and with monitoring stuff in > place we regularly read /proc/self/mountinfo to determine what's mounted and > such. > > And then there's NFS which needs to know that it's walked into a new inode space. NFS somehow works surprisingly well without knowing that. I didn't know there was a problem with NFS, despite exporting thousands of btrfs subvols from a single export point for 7 years. Maybe I have some non-default setting in /etc/exports which works around the problems, or maybe I got lucky, and all my use cases are weirdly specific and evade all the bugs by accident? > This is all super shitty, and mostly exists because we don't have a good way > to expose to the user wtf is going on. > > Personally I would be ok with simply disallowing NFS to wander into > subvolumes from an exported fs. If you want to export subvolumes then > export them individually, otherwise if you walk into a subvolume from NFS > you simply get an empty directory. As a present exporter of thousands of btrfs subvols over NFS from single export points, I'm not a fan of this idea. > This doesn't solve the mountinfo problem where a user may want to figure out > which subvol they're in, but this is where I think we could address the > issue with better interfaces. Or perhaps Neil's idea to have a common major > number with a different minor number for every subvol. It's not hard to figure out what subvol you're in. There's an ioctl which tells the subvol ID, and another that tells the name. The problem is that it's btrfs-specific, and no existing software knows how and when to use it (and also it's privileged, but that's easy to fix compared to the other issues). > Either way this isn't as simple as shoehorning it into automount and being > done with it, we need to take a step back and think about how should this > actually look, taking into account we've got 12 years of having Btrfs > deployed with existing usecases that expect a certain behavior. 
> Thanks, > > Josef I think if we got into a time machine, went back 12 years, changed the btrfs behavior, and then returned to the present, then in that alternate history we would all be here today talking about how mountinfo doesn't scale up to what btrfs throws at it, and whether btrfs can opt out of it somehow. Maybe we could have a system call for mount point discovery? Right now, the kernel throws a trail of breadcrumbs into /proc/self/mountinfo, and users use userspace libraries to translate that text blob into actionable information. We could solve problems with scalability and visibility in mountinfo if we only had to provide the information in the context of a single inode (i.e. the inode's parent or child mount points accessible to the caller). So you'd have a call for "get paths for all the mount points below inode X" and another for "get paths for all mount points above inode X", and calls that tell you details about mount points (like what they're mounted on, which filesystem they are part of, what the mount flags are, etc.).
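For concreteness, the "ioctl which tells the subvol ID" mentioned above can be used roughly like this; a minimal sketch, and note that, as the mail says, BTRFS_IOC_INO_LOOKUP is a privileged call:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    #ifndef BTRFS_FIRST_FREE_OBJECTID
    #define BTRFS_FIRST_FREE_OBJECTID 256ULL   /* inode number of a subvol root */
    #endif

    int main(int argc, char **argv)
    {
            struct btrfs_ioctl_ino_lookup_args args;
            int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY);

            if (fd < 0)
                    return 1;
            memset(&args, 0, sizeof(args));
            args.treeid = 0;                    /* 0 means "the tree fd is in" */
            args.objectid = BTRFS_FIRST_FREE_OBJECTID;
            if (ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args) == 0)
                    printf("subvol id %llu\n", (unsigned long long)args.treeid);
            close(fd);
            return 0;
    }

This is essentially the lookup btrfs-progs performs to resolve which tree a path belongs to; the point in the thread stands, though: nothing generic like find or rsync knows to call it.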
On Fri, Jul 30, 2021 at 03:09:12PM +0800, Qu Wenruo wrote: > > > On 2021/7/30 下午2:53, NeilBrown wrote: > > On Fri, 30 Jul 2021, Qu Wenruo wrote: > > > > > > > > You mean like "du -x"?? Yes. You would lose the misleading illusion > > > > that there are multiple filesystems. That is one user-expectation that > > > > would need to be addressed before people opt-in > > > > > > OK, forgot it's an opt-in feature, then it's less an impact. > > > > The hope would have to be that everyone would eventually opt-in once all > > issues were understood. > > > > > > > > Really not familiar with NFS/VFS, thus some ideas from me may sounds > > > super crazy. > > > > > > Is it possible that, for nfsd to detect such "subvolume" concept by its > > > own, like checking st_dev and the fsid returned from statfs(). > > > > > > Then if nfsd find some boundary which has different st_dev, but the same > > > fsid as its parent, then it knows it's a "subvolume"-like concept. > > > > > > Then do some local inode number mapping inside nfsd? > > > Like use the highest 20 bits for different subvolumes, while the > > > remaining 44 bits for real inode numbers. > > > > > > Of-course, this is still a workaround... > > > > Yes, it would certainly be possible to add some hacks to nfsd to fix the > > immediate problem, and we could probably even created some well-defined > > interfaces into btrfs to extract the required information so that it > > wasn't too hackish. > > > > Maybe that is what we will have to do. But I'd rather not hack NFSD > > while there is any chance that a more complete solution will be found. > > > > I'm not quite ready to give up on the idea of squeezing all btrfs inodes > > into a 64bit number space. 24bits of subvol and 40 bits of inode? > > Make the split a mkfs or mount option? > > Btrfs used to have a subvolume number limit in the past, for different > reasons. > > In that case, subvolume number is limited to 48 bits, which is still too > large to avoid conflicts. > > For inode number there is really no limit except the 256 ~ (U64)-256 limit. > > Considering all these numbers are almost U64, conflicts would be > unavoidable AFAIK. > > > Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol > > (which has ever been accessed) has a mapping from the top 32 bits of the > > objectid to the top 32 bits of the inode number. > > > > We don't need something that is theoretically perfect (that's not > > possible anyway as we don't have 64bits of device numbers). We just > > need something that is practical and scales adequately. If you have > > petabytes of storage, it is reasonable to spend a gigabyte of memory on > > a lookup table(?). > > Can such squishing-all-inodes-into-one-namespace work to be done in a > more generic way? e.g, let each fs with "subvolume"-like feature to > provide the interface to do that. If you know the highest subvol ID number, you can pack two integers into one larger integer by reversing the bits of the subvol number and ORing them with the inode number, i.e. 0x0080000000000300 is subvol 256 inode 768. The subvol ID's grow left to right while the inode numbers grow right to left. You can have billions of inodes in a few subvols, or billions of subvols with a few inodes each, and neither will collide with the other until there are billions of both. If the filesystem tracks the number of bits in the highest subvol ID and the highest inode number, then the inode numbers can be decoded, and collisions can be detected. e.g. 
if the maximum subvol ID on the filesystem is below 131072, it will fit in 17 bits, then we know bits 63-47 are the subvol ID and bits 46-0 are the inode.. When subvol 131072 is created, the number of subvol bits increases to 18, but if every inode fits in less than 46 bits, we know that every existing inode has a 0 in the 18th subvol ID bit of the inode number, so there is no ambiguity. If you don't know the maximum subvol ID, you can guess based on the position of the large run of zero bits in the middle of the integer--not reliable, but good enough for a guess if you were looking at 'ls -li' output (and wrote the inode numbers in hex). In the pathological case (the maximum subvol ID and maximum inode number require more than 64 total bits) we return ENOSPC. This can all be done when btrfs fills in an inode struct. There's no need to change the on-disk format, other than to track the highest inode and subvol number. btrfs can compute the maxima in reasonable but non-zero time by searching trees on mount, so an incompatible disk format change would only be needed to avoid making mount slower. > Despite that I still hope to have a way to distinguish the "subvolume" > boundary. Packing the bits into a single uint64 doesn't help with this--it does the opposite. Subvol boundaries become harder to see without deliberate checking (i.e. not the traditional parent.st_dev != child.st_dev test). Judging from previous btrfs-related complaints, some users do want "stealth" subvols whose boundaries are not accidentally visible, so the new behavior could be a feature for someone. > If completely inside btrfs, it's pretty simple to locate a subvolume > boundary. > All subvolume have the same inode number 256. > > Maybe we could reserve some special "squished" inode number to indicate > boundary inside a filesystem. > > E.g. reserve (u64)-1 as a special indicator for subvolume boundaries. > As most fs would have reserved super high inode numbers anyway. > > > > > > If we can make inode numbers unique, we can possibly leave the st_dev > > changing at subvols so that "du -x" works as currently expected. > > > > One thought I had was to use a strong hash to combine the subvol object > > id and the inode object id into a 64bit number. What is the chance of > > a collision in practice :-) > > But with just 64bits, conflicts will happen anyway... The collision rate might be low enough that we could just skip over the colliding numbers, but we'd have to have some kind of in-memory collision map to avoid slowing down inode creation (currently the next inode number is more or less "++last_inode_number", and looking up inodes to see if they exist first would slow down new file creation a lot). > Thanks, > Qu > > > > Thanks, > > NeilBrown > >
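A userspace sketch of the bit-reversal packing described above (the names are mine; the per-pair overlap test stands in for the collision detection discussed in the mail):

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the bits of a 64-bit value, so subvol ids grow downward
     * from bit 63 while inode numbers grow upward from bit 0.
     */
    static uint64_t rev64(uint64_t x)
    {
            x = ((x & 0x5555555555555555ULL) << 1)  | ((x >> 1)  & 0x5555555555555555ULL);
            x = ((x & 0x3333333333333333ULL) << 2)  | ((x >> 2)  & 0x3333333333333333ULL);
            x = ((x & 0x0f0f0f0f0f0f0f0fULL) << 4)  | ((x >> 4)  & 0x0f0f0f0f0f0f0f0fULL);
            x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
            x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
            return (x << 32) | (x >> 32);
    }

    static uint64_t pack(uint64_t subvol, uint64_t ino)
    {
            uint64_t hi = rev64(subvol);

            if (hi & ino)
                    return 0;       /* fields overlap: the ENOSPC case */
            return hi | ino;
    }

    int main(void)
    {
            /* subvol 256, inode 768 -> 0x0080000000000300, as in the mail */
            printf("%#018llx\n", (unsigned long long)pack(256, 768));
            return 0;
    }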