[PATCH/RFC,00/11] expose btrfs subvols in mount table correctly

Message ID 162742539595.32498.13687924366155737575.stgit@noble.brown

Message

NeilBrown July 27, 2021, 10:37 p.m. UTC
There are long-standing problems with btrfs subvols, particularly in
relation to whether and how they are exposed in the mount table.

 - /proc/self/mountinfo reports the major:minor device number for each
    filesystem, and when a btrfs subvol is explicitly mounted, the number
    reported is wrong - it does not match what stat() reports for the
    mountpoint.

 - when subvols are not explicitly mounted, they don't appear in
   mountinfo at all.

Consequences include that a tool which uses stat() to find the dev of the
filesystem, then searches mountinfo for that filesystem, will not find
it.
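
To see the first problem directly, compare the two sources of the dev
number (a minimal sketch, not part of the series; "/mnt/test/sub1" is a
placeholder subvol mountpoint):

	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/sysmacros.h>

	int main(void)
	{
		const char *mp = "/mnt/test/sub1"; /* hypothetical subvol mountpoint */
		struct stat st;

		if (stat(mp, &st) != 0) {
			perror("stat");
			return 1;
		}
		/* On an affected kernel, this major:minor does not match
		 * field 3 of the corresponding /proc/self/mountinfo line,
		 * which is what the first patch fixes. */
		printf("stat() reports %u:%u for %s\n",
		       major(st.st_dev), minor(st.st_dev), mp);
		return 0;
	}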

Some tools (e.g. findmnt) appear to have been enhanced to cope with this
strangeness, but it would be best to make btrfs behave more normally.

  - nfsd cannot currently see the transition to a subvol, so reports the
    main volume and all subvols to the client as being in the same
    filesystem.  As inode numbers are not unique across all subvols,
    this can confuse clients.  In particular, 'find' is likely to report a
    loop.

Subvols can be made to appear in mountinfo using automounts.  However,
nfsd does not cope well with automounts: it assumes all filesystems to
be exported are already mounted.  So adding automounts to btrfs would
break nfsd.

We can enhance nfsd to understand that some automounts can be managed.
"Internal mounts", where a filesystem provides an automount point and
mounts its own directories, can be handled differently by nfsd.

This series addresses all these issues.  After a few enhancements to the
VFS to provide the needed support, the patches enhance exportfs and nfsd
to cope with the concept of internal mounts, and then enhance btrfs to
provide them.
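
(For context: the filehandle-decode path that exportfs implements, and
which this series extends, is the same one user space can exercise via
name_to_handle_at() and open_by_handle_at(); fs/fhandle.c in the
diffstat is that entry point.  A minimal sketch, not from the series,
with argv checks and most error handling trimmed:)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		/* argv[1]: file to get a handle for; argv[2]: mount root */
		struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
		int mount_id, mount_fd, fd;

		fh->handle_bytes = MAX_HANDLE_SZ;
		if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0)) {
			perror("name_to_handle_at");
			return 1;
		}

		/* An NFS server effectively starts from this point: it has
		 * a handle and a mounted filesystem, and must find the file. */
		mount_fd = open(argv[2], O_RDONLY | O_DIRECTORY);
		fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
		if (fd < 0) {	/* needs CAP_DAC_READ_SEARCH */
			perror("open_by_handle_at");
			return 1;
		}
		printf("reopened %s by handle\n", argv[1]);
		return 0;
	}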

The NFSv3 support is incomplete.  I'm not sure we can make it work
"perfectly".  A normal nfsv3 mount seem to work well enough, but if
mounted with '-o noac', it loses track of the mounted-on inode number
and complains about inode numbers changing.

My basic test for these is to mount a btrfs filesystem which contains
subvols, nfs-export it and mount it with nfsv3 and nfsv4, then run
'find' in each of the filesystems and check the contents of
/proc/self/mountinfo.

The first patch simply fixes the dev number in mountinfo and could
possibly be tagged for -stable.

NeilBrown

---

NeilBrown (11):
      VFS: show correct dev num in mountinfo
      VFS: allow d_automount to create in-place bind-mount.
      VFS: pass lookup_flags into follow_down()
      VFS: export lookup_mnt()
      VFS: new function: mount_is_internal()
      nfsd: include a vfsmount in struct svc_fh
      exportfs: Allow filehandle lookup to cross internal mount points.
      nfsd: change get_parent_attributes() to nfsd_get_mounted_on()
      nfsd: Allow filehandle lookup to cross internal mount points.
      btrfs: introduce mapping function from location to inum
      btrfs: use automount to bind-mount all subvol roots.


 fs/btrfs/btrfs_inode.h   |  12 +++
 fs/btrfs/inode.c         | 111 ++++++++++++++++++++++++++-
 fs/btrfs/super.c         |   1 +
 fs/exportfs/expfs.c      | 100 ++++++++++++++++++++----
 fs/fhandle.c             |   2 +-
 fs/internal.h            |   1 -
 fs/namei.c               |   6 +-
 fs/namespace.c           |  32 +++++++-
 fs/nfsd/export.c         |   4 +-
 fs/nfsd/nfs3xdr.c        |  40 +++++++---
 fs/nfsd/nfs4proc.c       |   9 ++-
 fs/nfsd/nfs4xdr.c        | 106 ++++++++++++-------------
 fs/nfsd/nfsfh.c          |  44 +++++++----
 fs/nfsd/nfsfh.h          |   3 +-
 fs/nfsd/nfsproc.c        |   5 +-
 fs/nfsd/vfs.c            | 162 +++++++++++++++++++++++----------------
 fs/nfsd/vfs.h            |  12 +--
 fs/nfsd/xdr4.h           |   2 +-
 fs/overlayfs/namei.c     |   5 +-
 fs/xfs/xfs_ioctl.c       |  12 ++-
 include/linux/exportfs.h |   4 +-
 include/linux/mount.h    |   4 +
 include/linux/namei.h    |   2 +-
 23 files changed, 490 insertions(+), 189 deletions(-)

--
Signature

Comments

Al Viro July 28, 2021, 2:19 a.m. UTC | #1
On Wed, Jul 28, 2021 at 08:37:45AM +1000, NeilBrown wrote:

> We can enhance nfsd to understand that some automounts can be managed.
> "internal mounts" where a filesystem provides an automount point and
> mounts its own directories, can be handled differently by nfsd.
> 
> This series addresses all these issues.  After a few enhancements to the
> VFS to provide the needed support, the patches enhance exportfs and nfsd
> to cope with the concept of internal mounts, and then enhance btrfs to
> provide them.

I'm half asleep right now; will review and reply in detail tomorrow.
The first impression is that it's going to be brittle; properties of
mount really should not depend upon the nature of mountpoint - too many
ways for that to go wrong.

Anyway, tomorrow...
Wang Yugui July 28, 2021, 4:58 a.m. UTC | #2
Hi,

Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in this
patch series?

I tried to backport it to 5.10.x, but it failed to work.
The 5.10.x backport needed no big modifications; all modified patches
are attached.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28

> There are long-standing problems with btrfs subvols, particularly in
> relation to whether and how they are exposed in the mount table.
> 
> [...]
Wang Yugui July 28, 2021, 6:04 a.m. UTC | #3
Hi,

This patchset works well in 5.14-rc3.

1. The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed to
a dynamic dummy inode (18446744073709551358, or 18446744073709551359, ...).

2. btrfs subvol mount info is shown in /proc/mounts even when nfsd/nfs is
not used.
/dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test
/dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub1
/dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub2

This is a visible feature change for btrfs users.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28

> Hi,
> 
> Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in this
> patch series?
> 
> I tried to backport it to 5.10.x, but it failed to work.
> The 5.10.x backport needed no big modifications; all modified patches
> are attached.
> 
> [...]
NeilBrown July 28, 2021, 7:01 a.m. UTC | #4
On Wed, 28 Jul 2021, Wang Yugui wrote:
> Hi,
> 
> This patchset works well in 5.14-rc3.

Thanks for testing.

> 
> 1. The fixed dummy inode (255, BTRFS_FIRST_FREE_OBJECTID - 1) is changed to
> a dynamic dummy inode (18446744073709551358, or 18446744073709551359, ...).

The BTRFS_FIRST_FREE_OBJECTID-1 was just a hack; I never wanted it to
be permanent.
The new number is ULONG_MAX - subvol_id (where subvol_id starts at 257 I
think).
This is a bit less of a hack.  It is an easily available number that is
fairly unique.
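
A sketch of that mapping (illustrative only, not the actual patch code):

	#include <limits.h>

	/* Counting down from the top of the inode-number space keeps the
	 * dummy mounted-on inode numbers clear of real btrfs inode
	 * numbers, which grow upward from BTRFS_FIRST_FREE_OBJECTID. */
	static unsigned long long subvol_mounted_on_ino(unsigned long long subvol_id)
	{
		/* subvol ids start at 256/257, so for example
		 * ULLONG_MAX - 257 == 18446744073709551358, matching the
		 * numbers reported in testing above. */
		return ULLONG_MAX - subvol_id;
	}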

> 
> 2. btrfs subvol mount info is shown in /proc/mounts even when nfsd/nfs is
> not used.
> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test
> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub1
> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub2
> 
> This is a visible feature change for btrfs users.

Hopefully it is an improvement.  But it is certainly a change that needs
to be carefully considered.

Thanks,
NeilBrown
NeilBrown July 28, 2021, 7:06 a.m. UTC | #5
On Wed, 28 Jul 2021, Wang Yugui wrote:
> Hi,
> 
> Do we no longer need the dummy inode (BTRFS_FIRST_FREE_OBJECTID - 1) in this
> patch series?

No.

> 
> I tried to backport it to 5.10.x, but it failed to work.
> The 5.10.x backport needed no big modifications; all modified patches
> are attached.

I'm not surprised, but I doubt there would be a major barrier to making
it work.  I'm unlikely to try until I have more positive reviews.

Thanks,
NeilBrown
Wang Yugui July 28, 2021, 9:36 a.m. UTC | #6
Hi,

> > I tried to backport it to 5.10.x, but it failed to work.
> > The 5.10.x backport needed no big modifications; all modified patches
> > are attached.
> 
> I'm not surprised, but I doubt there would be a major barrier to making
> it work.  I'm unlikely to try until I have more positive reviews.

With two patches as dependencies,
d045465fc6cbfa4acfb5a7d817a7c1a57a078109
	0001-exportfs-Add-a-function-to-return-the-raw-output-fro.patch
2e19d10c1438241de32467637a2a411971547991
	0002-nfsd-Fix-up-nfsd-to-ensure-that-timeout-errors-don-t.patch

the 5.10.x backport becomes simpler and works well now.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/07/28
Neal Gompa July 28, 2021, 12:26 p.m. UTC | #7
On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 28 Jul 2021, Wang Yugui wrote:
> > [...]
> >
> > 2. btrfs subvol mount info is shown in /proc/mounts even when nfsd/nfs is
> > not used.
> > /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test
> > /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub1
> > /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub2
> >
> > This is a visible feature change for btrfs users.
>
> Hopefully it is an improvement.  But it is certainly a change that needs
> to be carefully considered.

I think this is behavior people generally expect, but I wonder what
the consequences of this would be with huge numbers of subvolumes. If
there are hundreds or thousands of them (which is quite possible on
SUSE systems, for example, with its auto-snapshotting regime), this
would be a mess, wouldn't it?

Or can we add a way to mark these things to not show up there or is
there some kind of behavioral change we can make to snapper or other
tools to make them not show up here?
Graham Cobb July 28, 2021, 1:43 p.m. UTC | #8
On 28/07/2021 08:01, NeilBrown wrote:
> On Wed, 28 Jul 2021, Wang Yugui wrote:
>> [...]
>>
>> 2. btrfs subvol mount info is shown in /proc/mounts even when nfsd/nfs is
>> not used.
>> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test
>> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub1
>> /dev/sdc                btrfs   94G  3.5M   93G   1% /mnt/test/sub2
>>
>> This is a visible feature change for btrfs users.
> 
> Hopefully it is an improvement.  But it is certainly a change that needs
> to be carefully considered.

Would this change the behaviour of findmnt? I have several scripts that
depend on findmnt to select btrfs filesystems. Just to take a couple of
examples (using the example shown above): my scripts would depend on
'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
subvolume; and another script would depend on 'findmnt -t btrfs
--mountpoint /mnt/test/sub1' providing no output as the directory is not
an /etc/fstab mount point for a btrfs filesystem.

Maybe findmnt isn't affected? Or maybe the change is worth making
anyway? But it needs to be carefully considered if it breaks existing
user interfaces.
J. Bruce Fields July 28, 2021, 7:14 p.m. UTC | #9
On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> I think this is behavior people generally expect, but I wonder what
> the consequences of this would be with huge numbers of subvolumes. If
> there are hundreds or thousands of them (which is quite possible on
> SUSE systems, for example, with its auto-snapshotting regime), this
> would be a mess, wouldn't it?

I'm surprised that btrfs is special here.  Doesn't anyone have thousands
of lvm snapshots?  Or is it that they do but they're not normally
mounted?

--b.
J. Bruce Fields July 28, 2021, 7:35 p.m. UTC | #10
I'm still stuck trying to understand why subvolumes can't get their own
superblocks:

	- Why are the performance issues Josef raises insurmountable?
	  And why are they unique to btrfs?  (Surely there other cases
	  where people need hundreds or thousands of superblocks?)

	- If filehandle decoding can return a different vfs mount than the
	  one it's passed, why can't it return a different superblock?

--b.

On Wed, Jul 28, 2021 at 08:37:45AM +1000, NeilBrown wrote:
> There are long-standing problems with btrfs subvols, particularly in
> relation to whether and how they are exposed in the mount table.
> 
> [...]
Josef Bacik July 28, 2021, 9:30 p.m. UTC | #11
On 7/28/21 3:35 PM, J. Bruce Fields wrote:
> I'm still stuck trying to understand why subvolumes can't get their own
> superblocks:
> 
> 	- Why are the performance issues Josef raises unsurmountable?
> 	  And why are they unique to btrfs?  (Surely there other cases
> 	  where people need hundreds or thousands of superblocks?)
> 

I don't think anybody has that many file systems.  For btrfs it's a single file 
system.  Think of syncfs, it's going to walk through all of the super blocks on 
the system calling ->sync_fs on each subvol superblock.  Now this isn't a huge 
deal, we could just have some flag that says "I'm not real" or even just have 
anonymous superblocks that don't get added to the global super_blocks list, and 
that would address my main pain points.

The second part is inode reclaim.  Again this particular problem could be 
avoided if we had an anonymous superblock that wasn't actually used, but the 
inode lru is per superblock.  Now with reclaim instead of walking all the 
inodes, you're walking a bunch of super blocks and then walking the list of 
inodes within those super blocks.  You're burning CPU cycles because now instead 
of getting big chunks of inodes to dispose, it's spread out across many super 
blocks.

The other weird thing is the way we apply pressure to shrinker systems.  We 
essentially say "try to evict X objects from your list", which means in this 
case with lots of subvolumes we'd be evicting waaaaay more inodes than you were 
before, likely impacting performance where you have workloads that have lots of 
files open across many subvolumes (which is what FB does with its containers).

If we want an anonymous superblock per subvolume then the only way it'll work is 
if it's not actually tied into anything, and we still use the primary super 
block for the whole file system.  And if that's what we're going to do, what's 
the point of the super block exactly?  This approach that Neil's come up with 
seems like a reasonable solution to me.  Christoph gets his separation and 
/proc/self/mountinfo, and we avoid the scalability headache of a billion super 
blocks.  Thanks,

Josef
NeilBrown July 28, 2021, 10:50 p.m. UTC | #12
On Wed, 28 Jul 2021, Neal Gompa wrote:
> On Wed, Jul 28, 2021 at 3:02 AM NeilBrown <neilb@suse.de> wrote:
> >
> > [...]
> >
> > Hopefully it is an improvement.  But it is certainly a change that needs
> > to be carefully considered.
> 
> I think this is behavior people generally expect, but I wonder what
> the consequences of this would be with huge numbers of subvolumes. If
> there are hundreds or thousands of them (which is quite possible on
> SUSE systems, for example, with its auto-snapshotting regime), this
> would be a mess, wouldn't it?

Would there be hundreds or thousands of subvols concurrently being
accessed? The auto-mounted subvols only appear in the mount table while
they are being accessed, and for about 15 minutes after the last access.
I suspect that most subvols are "backup" snapshots which are not being
accessed and so would not appear.

> 
> Or can we add a way to mark these things to not show up there or is
> there some kind of behavioral change we can make to snapper or other
> tools to make them not show up here?

Certainly it might make sense to flag these in some way so that tools
can choose to ignore them or handle them specially, just as nfsd needs
to handle them specially.  I was considering a "local" mount flag.

NeilBrown

Zygo Blaxell July 29, 2021, 1:29 a.m. UTC | #13
On Wed, Jul 28, 2021 at 03:14:31PM -0400, J. Bruce Fields wrote:
> On Wed, Jul 28, 2021 at 08:26:12AM -0400, Neal Gompa wrote:
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
> 
> I'm surprised that btrfs is special here.  Doesn't anyone have thousands
> of lvm snapshots?  Or is it that they do but they're not normally
> mounted?

Unprivileged users can't create lvm snapshots as easily or quickly as
using mkdir (well, ok, mkdir and fssync).  lvm doesn't scale very well
past more than a few dozen snapshots of the same original volume, and
performance degrades linearly in the number of snapshots if the original
LV is modified.  btrfs is the opposite:  users can create and delete
as many snapshots as they like, at a cost more expensive than mkdir but
less expensive than 'cp -a', and users only pay IO costs for writes to
the subvols they modify.  So some btrfs users use snapshots in places
where more traditional tools like 'cp -a' or 'git checkout' are used on
other filesystems.

e.g. a build system might make a snapshot of a git working tree containing
a checked out and built baseline revision, and then it might do a loop
where it makes a snapshot, applies one patch from an integration branch
in the snapshot directory, and incrementally builds there.  The next
revision makes a snapshot of its parent revision's subvol and builds
the next patch.  If there are merges in the integration branch, then
the builder can go back to parent revisions, create a new snapshot,
apply the patch, and build in a snapshot on both sides of the merge.
After testing picks a winner, the builder can simply delete all the
snapshots except the one for the version that won testing (there is no
requirement to commit the snapshot to the origin LV as in lvm, either
can be destroyed without requiring action to preserve the other).

You can do a similar thing with overlayfs, but it runs into problems
with all the mount points.  In btrfs, the mount points are persistent
because they're built into the filesystem.  With overlayfs, you have
to save and restore them so they persist across reboots (unless that
feature has been added since I last looked).

I'm looking at a few machines here, and if all the subvols are visible to
'df', its output would be somewhere around 3-5 MB.  That's too much--we'd
have to hack up df to not show the same btrfs twice...as well as every
monitoring tool that reports free space...which sounds similar to the
problems we're trying to avoid.

Ideally there would be a way to turn this on or off.  It is creating a
set of new problems that is the complement of the set we're trying to
fix in this change.

NeilBrown July 29, 2021, 1:39 a.m. UTC | #14
On Wed, 28 Jul 2021, g.btrfs@cobb.uk.net wrote:
> On 28/07/2021 08:01, NeilBrown wrote:
> > [...]
> 
> Would this change the behaviour of findmnt? I have several scripts that
> depend on findmnt to select btrfs filesystems. Just to take a couple of
> examples (using the example shown above): my scripts would depend on
> 'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
> subvolume; and another script would depend on 'findmnt -t btrfs
> --mountpoint /mnt/test/sub1' providing no output as the directory is not
> an /etc/fstab mount point for a btrfs filesystem.

Yes, I think it does change the behaviour of findmnt.
If the sub1 automount has not been triggered,
  findmnt --target /mnt/test/sub1 -o target
will report "/mnt/test".
After it has been triggered, it will report "/mnt/test/sub1"

Similarly "findmnt -t btrfs --mountpoint /mnt/test/sub1" will report
nothing if the automount hasn't been triggered, and will report full
details of /mnt/test/sub1 if it has.

> 
> Maybe findmnt isn't affected? Or maybe the change is worth making
> anyway? But it needs to be carefully considered if it breaks existing
> user interfaces.
> 
I hope the change is worth making anyway, but breaking findmnt would not
be a popular move.
This is unfortunate....  btrfs is "broken" and people/code have adjusted
to that breakage so that "fixing" it will be traumatic.

The only way I can find to get findmnt to ignore the new entries in
/proc/self/mountinfo is to trigger a parse error such as by replacing the 
" - " with " -- "
but that causes a parse error message to be generated, and will likely
break other tools.
(...  or I could check if current->comm is "findmnt", and suppress the
extra entries, but that is even more horrible!!)

A possible option is to change findmnt to explicitly ignore the new
"internal" mounts (unless some option is given) and then delay the
kernel update until that can be rolled out.

Or we could make the new internal mounts invisible in /proc without some
kernel setting enabled.  Then distros can enable it once all important
tools can cope, and they can easily test correctness without rebooting.

I wonder if correcting the device-number reported for explicit subvol
mounts will break anything....  findmnt seems happy with it in my
limited testing.  There seems to be a testsuite with util-linux.  Maybe
I should try that.

Thanks,
NeilBrown
NeilBrown July 29, 2021, 1:43 a.m. UTC | #15
On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> 
> I'm looking at a few machines here, and if all the subvols are visible to
> 'df', its output would be somewhere around 3-5 MB.  That's too much--we'd
> have to hack up df to not show the same btrfs twice...as well as every
> monitoring tool that reports free space...which sounds similar to the
> problems we're trying to avoid.

Thanks for providing hard data!! How many of these subvols are actively
used (have a file open) at the same time? Most? About half? Just a few??

Thanks,
NeilBrown
Zygo Blaxell July 29, 2021, 2:37 a.m. UTC | #16
On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> On Wed, 28 Jul 2021, Neal Gompa wrote:
> > > [...]
> > 
> > I think this is behavior people generally expect, but I wonder what
> > the consequences of this would be with huge numbers of subvolumes. If
> > there are hundreds or thousands of them (which is quite possible on
> > SUSE systems, for example, with its auto-snapshotting regime), this
> > would be a mess, wouldn't it?
> 
> Would there be hundreds or thousands of subvols concurrently being
> accessed? The auto-mounted subvols only appear in the mount table while
> they are being accessed, and for about 15 minutes after the last access.
> I suspect that most subvols are "backup" snapshots which are not being
> accessed and so would not appear.

bees dedupes across subvols and polls every few minutes for new data
to dedupe.  bees doesn't particularly care where the "src" in the dedupe
call comes from, so it will pick a subvol that has a reference to the
data at random (whichever one comes up first in backref search) for each
dedupe call.  There is a cache of open fds on each subvol root so that it
can access files within that subvol using openat().  The cache quickly
populates fully, i.e. it holds a fd to every subvol on the filesystem.
The cache has a 15 minute timeout too, so bees would likely keep the
mount table fully populated at all times.

plocate also uses openat() and it can also be active on many subvols
simultaneously, though it only runs once a day, and it's reasonable to
exclude all snapshots from plocate for performance reasons.

My bigger concern here is that users on btrfs can currently have private
subvols with secret names.  e.g.

	user$ mkdir -m 700 private
	user$ btrfs sub create private/secret
	user$ cd private/secret
	user$ ...do stuff...

Would "secret" now be visible in the very public /proc/mounts every time
the user is doing stuff?

> > Or can we add a way to mark these things to not show up there or is
> > there some kind of behavioral change we can make to snapper or other
> > tools to make them not show up here?
> 
> Certainly it might make sense to flag these in some way so that tools
> can choose to ignore them or handle them specially, just as nfsd needs
> to handle them specially.  I was considering a "local" mount flag.

I would definitely want an 'off' switch for this thing until the impact
is better understood.

NeilBrown July 29, 2021, 3:36 a.m. UTC | #17
On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> > On Wed, 28 Jul 2021, Neal Gompa wrote:
> > > > [...]
> > > 
> > > I think this is behavior people generally expect, but I wonder what
> > > the consequences of this would be with huge numbers of subvolumes. If
> > > there are hundreds or thousands of them (which is quite possible on
> > > SUSE systems, for example, with its auto-snapshotting regime), this
> > > would be a mess, wouldn't it?
> > 
> > Would there be hundreds or thousands of subvols concurrently being
> > accessed? The auto-mounted subvols only appear in the mount table while
> > they are being accessed, and for about 15 minutes after the last access.
> > I suspect that most subvols are "backup" snapshots which are not being
> > accessed and so would not appear.
> 
> bees dedupes across subvols and polls every few minutes for new data
> to dedupe.  bees doesn't particularly care where the "src" in the dedupe
> call comes from, so it will pick a subvol that has a reference to the
> data at random (whichever one comes up first in backref search) for each
> dedupe call.  There is a cache of open fds on each subvol root so that it
> can access files within that subvol using openat().  The cache quickly
> populates fully, i.e. it holds a fd to every subvol on the filesystem.
> The cache has a 15 minute timeout too, so bees would likely keep the
> mount table fully populated at all times.

OK ... that is very interesting and potentially helpful - thanks.

Localizing these daemons in a separate namespace would stop them from
polluting the public namespace, but I don't know how easy that would
be...

Do you know how bees opens these files?  Does it use path-names from the
root, or some special btrfs ioctl, or ???
If path-names are not used, it might be possible to suppress the
automount. 

> 
> plocate also uses openat() and it can also be active on many subvols
> simultaneously, though it only runs once a day, and it's reasonable to
> exclude all snapshots from plocate for performance reasons.
> 
> My bigger concern here is that users on btrfs can currently have private
> subvols with secret names.  e.g.
> 
> 	user$ mkdir -m 700 private
> 	user$ btrfs sub create private/secret
> 	user$ cd private/secret
> 	user$ ...do stuff...
> 
> Would "secret" now be visible in the very public /proc/mounts every time
> the user is doing stuff?

Yes, the secret would be publicly visible.  Unless we hid it.

It is conceivable that the content of /proc/mounts could be limited to
mountpoints where the process reading had 'x' access to the mountpoint. 
However to be really safe we would want to require 'x' access to all
ancestors too, and possibly some 'r' access.  That would get
prohibitively expensive.

We could go with "owned by root, or owned by user" maybe.

Thanks,
NeilBrown


Graham Cobb July 29, 2021, 9:28 a.m. UTC | #18
On 29/07/2021 02:39, NeilBrown wrote:
> On Wed, 28 Jul 2021, g.btrfs@cobb.uk.net wrote:
>>> [...]
>>
>> Would this change the behaviour of findmnt? I have several scripts that
>> depend on findmnt to select btrfs filesystems. Just to take a couple of
>> examples (using the example shown above): my scripts would depend on
>> 'findmnt --target /mnt/test/sub1 -o target' providing /mnt/test, not the
>> subvolume; and another script would depend on 'findmnt -t btrfs
>> --mountpoint /mnt/test/sub1' providing no output as the directory is not
>> an /etc/fstab mount point for a btrfs filesystem.
> 
> Yes, I think it does change the behaviour of findmnt.
> If the sub1 automount has not been triggered,
>   findmnt --target /mnt/test/sub1 -o target
> will report "/mnt/test".
> After it has been triggered, it will report "/mnt/test/sub1"
> 
> Similarly "findmnt -t btrfs --mountpoint /mnt/test/sub1" will report
> nothing if the automount hasn't been triggered, and will report full
> details of /mnt/test/sub1 if it has.
> 
>>
>> Maybe findmnt isn't affected? Or maybe the change is worth making
>> anyway? But it needs to be carefully considered if it breaks existing
>> user interfaces.
>>
> I hope the change is worth making anyway, but breaking findmnt would not
> be a popular move.

I agree. I use findmnt, but I also use NFS-mounted btrfs disks, so I am
keen to see this deployed. But people who don't maintain their own
scripts and need a third party to change them might disagree!

> This is unfortunate....  btrfs is "broken" and people/code have adjusted
> to that breakage so that "fixing" it will be traumatic.
> 
> The only way I can find to get findmnt to ignore the new entries in
> /proc/self/mountinfo is to trigger a parse error such as by replacing the 
> " - " with " -- "
> but that causes a parse error message to be generated, and will likely
> break other tools.
> (...  or I could check if current->comm is "findmnt", and suppress the
> extra entries, but that is even more horrible!!)
> 
> A possible option is to change findmnt to explicitly ignore the new
> "internal" mounts (unless some option is given) and then delay the
> kernel update until that can be rolled out.

That sounds good as a permanent fix for findmnt. Some sort of
'--include-subvols' option. Particularly if it were possible to default
it using an environment variable so a script can be written to work with
both the old and the new versions of findmnt.

Unfortunately it won't help any other program which does similar
searches through /proc/self/mountinfo.

How about creating two different files? Say, /proc/self/mountinfo and
/proc/self/mountinfo.internal (better filenames may be available!). The
.internal file could be just the additional internal mounts, or it could
be the complete list. Or something like
/proc/self/mountinfo.without-subvols and
/proc/self/mountinfo.with-subvols and a sysctl setting to choose which
is made visible as /proc/self/mountinfo.

Graham
Zygo Blaxell July 29, 2021, 11:20 p.m. UTC | #19
On Thu, Jul 29, 2021 at 01:36:06PM +1000, NeilBrown wrote:
> On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> > On Thu, Jul 29, 2021 at 08:50:50AM +1000, NeilBrown wrote:
> > > On Wed, 28 Jul 2021, Neal Gompa wrote:
> > > > > [...]
> > > > 
> > > > I think this is behavior people generally expect, but I wonder what
> > > > the consequences of this would be with huge numbers of subvolumes. If
> > > > there are hundreds or thousands of them (which is quite possible on
> > > > SUSE systems, for example, with its auto-snapshotting regime), this
> > > > would be a mess, wouldn't it?
> > > 
> > > Would there be hundreds or thousands of subvols concurrently being
> > > accessed? The auto-mounted subvols only appear in the mount table while
> > > they are being accessed, and for about 15 minutes after the last access.
> > > I suspect that most subvols are "backup" snapshots which are not being
> > > accessed and so would not appear.
> > 
> > bees dedupes across subvols and polls every few minutes for new data
> > to dedupe.  bees doesn't particularly care where the "src" in the dedupe
> > call comes from, so it will pick a subvol that has a reference to the
> > data at random (whichever one comes up first in backref search) for each
> > dedupe call.  There is a cache of open fds on each subvol root so that it
> > can access files within that subvol using openat().  The cache quickly
> > populates fully, i.e. it holds a fd to every subvol on the filesystem.
> > The cache has a 15 minute timeout too, so bees would likely keep the
> > mount table fully populated at all times.
> 
> OK ... that is very interesting and potentially helpful - thanks.
> 
> Localizing these daemons in a separate namespace would stop them from
> polluting the public namespace, but I don't know how easy that would
> be..
> 
> Do you know how bees opens these files?  Does it use path-names from the
> root, or some special btrfs ioctl, or ???

There's a function in bees that opens a subvol root directory by subvol
id.  It walks up the btrfs subvol tree with btrfs ioctls to construct a
path to the root, then down the filesystem tree with other btrfs ioctls
to get filenames for each subvol.  The filenames are fed to openat()
with the parent subvol's fd to get a fd for each child subvol's root
directory along the path.  This is recursive and expensive (the fd
has to be checked to see if it matches the subvol, in case some other
process renamed it) and called every time bees wants to open a file,
so the fd goes into a cache for future open-subvol-by-id calls.

For files, bees calls subvol root open to get a fd for the subvol's root,
then calls the btrfs inode-to-path ioctl on that fd to get a list of
names for the inode, then openat(subvol_fd, inode_to_path(inum), ...) on
each name until a fd matching the target subvol and inode is obtained.
File access is driven by data content, so bees cannot easily predict
which files will need to be accessed again in the near future and which
can be closed.  The fd cache is a brute-force way to reduce the number
of inode-to-path and open calls.

Upper layers of bees use (subvol, inode) pairs to identify files
and request file descriptors.  The lower layers use filenames as an
implementation detail for compatibility with kernel API.
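
(A compressed sketch of that inode-to-path plus openat() step, assuming
a fd on the subvol's root directory is already in the cache; this is
illustrative, not actual bees code, and error handling is trimmed:)

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>

	static int open_inum(int subvol_fd, uint64_t inum)
	{
		uint64_t buf[512];	/* container needs 8-byte alignment */
		struct btrfs_data_container *paths = (void *)buf;
		struct btrfs_ioctl_ino_path_args args = {
			.inum   = inum,
			.size   = sizeof(buf),
			.fspath = (uintptr_t)buf,
		};

		/* Ask btrfs for every name the inode has, relative to the
		 * subvol root that subvol_fd refers to. */
		if (ioctl(subvol_fd, BTRFS_IOC_INO_PATHS, &args) < 0)
			return -1;

		/* val[] holds offsets (from the start of val[]) to each
		 * NUL-terminated name. */
		for (uint32_t i = 0; i < paths->elem_cnt; i++) {
			const char *name = (char *)paths->val + paths->val[i];
			int fd = openat(subvol_fd, name, O_RDONLY);

			if (fd >= 0)
				return fd; /* caller re-checks subvol+inode */
		}
		return -1;
	}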

> If path-names are not used, it might be possible to suppress the
> automount. 

A userspace interface to read and dedupe that doesn't use pathnames or
file descriptors (other than one fd to bind the interface to a filesystem)
would be nice!  About half of the bees code is devoted to emulating
that interface using the existing kernel API.

Ideally a dedupe agent would be able to pass two physical offsets and
a length of identical data to the filesystem without ever opening a file.
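
(For contrast, a minimal sketch of the existing fd-based dedupe
interface, FIDEDUPERANGE, that such an offset-based call would replace;
again illustrative, not bees code:)

	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* Dedupe one extent: both sides must already be open fds, which
	 * is why bees needs the subvol-root fd cache described above. */
	static int dedupe_one(int src_fd, __u64 src_off,
			      int dst_fd, __u64 dst_off, __u64 len)
	{
		struct file_dedupe_range *r =
			calloc(1, sizeof(*r) + sizeof(r->info[0]));
		int ret = -1;

		r->src_offset = src_off;
		r->src_length = len;
		r->dest_count = 1;
		r->info[0].dest_fd = dst_fd;
		r->info[0].dest_offset = dst_off;

		if (ioctl(src_fd, FIDEDUPERANGE, r) == 0 &&
		    r->info[0].status == FILE_DEDUPE_RANGE_SAME)
			ret = 0; /* r->info[0].bytes_deduped bytes now shared */
		free(r);
		return ret;
	}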

While I'm in favor of making bees smaller, this seems like an expensive
way to suppress automounts.

> > plocate also uses openat() and it can also be active on many subvols
> > simultaneously, though it only runs once a day, and it's reasonable to
> > exclude all snapshots from plocate for performance reasons.
> > 
> > My bigger concern here is that users on btrfs can currently have private
> > subvols with secret names.  e.g.
> > 
> > 	user$ mkdir -m 700 private
> > 	user$ btrfs sub create private/secret
> > 	user$ cd private/secret
> > 	user$ ...do stuff...
> > 
> > Would "secret" now be visible in the very public /proc/mounts every time
> > the user is doing stuff?
> 
> Yes, the secret would be publicly visible.  Unless we hid it.
> 
> It is conceivable that the content of /proc/mounts could be limited to
> mountpoints where the process reading had 'x' access to the mountpoint. 
> However to be really safe we would want to require 'x' access to all
> ancestors too, and possibly some 'r' access.  That would get
> prohibitively expensive.

And inconsistent, since we don't do that with other mount points, i.e.
outside of btrfs.  But we do hide parts of the path names when the
process's filesystem root or namespace changes, so maybe this is more
of that.

Zygo Blaxell July 29, 2021, 11:20 p.m. UTC | #20
On Thu, Jul 29, 2021 at 11:43:21AM +1000, NeilBrown wrote:
> On Thu, 29 Jul 2021, Zygo Blaxell wrote:
> > 
> > I'm looking at a few machines here, and if all the subvols are visible to
> > 'df', its output would be somewhere around 3-5 MB.  That's too much--we'd
> > have to hack up df to not show the same btrfs twice...as well as every
> > monitoring tool that reports free space...which sounds similar to the
> > problems we're trying to avoid.
> 
> Thanks for providing hard data!! How many of these subvols are actively
used (have a file open) at the same time? Most? About half? Just a few??

Between 1% and 10% of the subvols have open fds at any time (not counting
bees, which holds an open fd on every subvol most of the time).  The ratio
is higher (more active) when the machine has less storage or more CPU/RAM:
we keep idle subvols around longer if we have lots of free space, or we
make more subvols active at the same time if we have lots of RAM and CPU.

I don't recall if stat() triggers automount, but most of the subvols are
located in a handful of directories.  Could a single 'ls -l' command
activate all of their automounts?  If so, then we'd be hitting those
at least once every 15 minutes--these directories are browseable, and
an entry point for users.  Certainly anything that looks inside those
directories (like certain file-browsing user agents that look for icons
one level down) will trigger automount as they search children of the
subvol root.
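
(For what it's worth, a tool that only wants to look, not mount, can at
least be explicit about it; AT_NO_AUTOMOUNT is the real flag, the helper
is just a sketch:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* stat the covered directory itself, asking the kernel not to
 * trigger and cross an automount on the final path component */
static int stat_no_automount(const char *path, struct statx *stx)
{
	return statx(AT_FDCWD, path,
		     AT_NO_AUTOMOUNT | AT_SYMLINK_NOFOLLOW,
		     STATX_INO | STATX_TYPE, stx);
}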

Some of this might be fixable, like I could probably make bees
more parsimonious with subvol root fds, and I could probably rework
reporting scripts to avoid touching anything inside subdirectories,
and I could probably rework our subvolume directory layout to avoid
accidentally triggering thousands of automounts.  But I'd rather not.
I'd rather have working NFS and a 15-20 line df output with no new
privacy or scalability concerns.

> Thanks,
> NeilBrown
Al Viro July 30, 2021, 12:13 a.m. UTC | #21
On Wed, Jul 28, 2021 at 05:30:04PM -0400, Josef Bacik wrote:

> I don't think anybody has that many file systems.  For btrfs it's a single
> file system.  Think of syncfs, it's going to walk through all of the super
> blocks on the system calling ->sync_fs on each subvol superblock.  Now this
> isn't a huge deal, we could just have some flag that says "I'm not real" or
> even just have anonymous superblocks that don't get added to the global
> super_blocks list, and that would address my main pain points.

Umm...  Aren't the snapshots read-only by definition?

> The second part is inode reclaim.  Again this particular problem could be
> avoided if we had an anonymous superblock that wasn't actually used, but the
> inode lru is per superblock.  Now with reclaim instead of walking all the
> inodes, you're walking a bunch of super blocks and then walking the list of
> inodes within those super blocks.  You're burning CPU cycles because now
> instead of getting big chunks of inodes to dispose, it's spread out across
> many super blocks.
> 
> The other weird thing is the way we apply pressure to shrinker systems.  We
> essentially say "try to evict X objects from your list", which means in this
> case with lots of subvolumes we'd be evicting waaaaay more inodes than you
> were before, likely impacting performance where you have workloads that have
> lots of files open across many subvolumes (which is what FB does with its
> containers).
> 
> If we want an anonymous superblock per subvolume then the only way it'll work
> is if it's not actually tied into anything, and we still use the primary
> super block for the whole file system.  And if that's what we're going to do
> what's the point of the super block exactly?  This approach that Neil's come
> up with seems like a reasonable solution to me.  Christoph gets his
> separation and /proc/self/mountinfo, and we avoid the scalability headache
> of a billion super blocks.  Thanks,

AFAICS, we also get arseloads of weird corner cases - in particular, Neil's
suggestions re visibility in /proc/mounts look rather arbitrary.

Al, really disliking the entire series...
NeilBrown July 30, 2021, 2:36 a.m. UTC | #22
I've been pondering all the excellent feedback, and what I have learnt
from examining the code in btrfs, and I have developed a different
perspective.

Maybe "subvol" is a poor choice of name because it conjures up
connections with the Volumes in LVM, and btrfs subvols are very different
things.  Btrfs subvols are really just subtrees that can be treated as a
unit for operations like "clone" or "destroy".

As such, they don't really deserve separate st_dev numbers.

Maybe the different st_dev numbers were introduced as a "cheap" way to
extend the size of the inode-number space.  Like many "cheap" things, it
has hidden costs.

Maybe objects in different subvols should still be given different inode
numbers.  This would be problematic on 32bit systems, but much less so on
64bit systems.

The patch below, which is just a proof-of-concept, changes btrfs to
report a uniform st_dev, and different (64bit) st_ino in different subvols.

It has problems:
 - it will break any 32bit readdir and 32bit stat.  I don't know how big
   a problem that is these days (ino_t in the kernel is "unsigned long",
   not "unsigned long long). That surprised me).
 - It might break some user-space expectations.  One thing I have learnt
   is not to make any assumption about what other people might expect.

However, it would be quite easy to make this opt-in (or opt-out) with a
mount option, so that people who need the current inode numbers and will
accept the current breakage can keep working.

I think this approach would be a net-win for NFS export, whether BTRFS
supports it directly or not.  I might post a patch which modifies NFS to
intuit improved inode numbers for btrfs exports....

So: how would this break your use-case??

Thanks,
NeilBrown

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0117d867ecf8..8dc58c848502 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6020,6 +6020,37 @@ static int btrfs_opendir(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static u64 btrfs_make_inum(struct btrfs_key *root, struct btrfs_key *ino)
+{
+	u64 rootid = root->objectid;
+	u64 inoid = ino->objectid;
+	int shift = 64-8;
+
+	if (ino->type == BTRFS_ROOT_ITEM_KEY) {
+		/* This is a subvol root found during readdir. */
+		rootid = inoid;
+		inoid = BTRFS_FIRST_FREE_OBJECTID;
+	}
+	if (rootid == BTRFS_FS_TREE_OBJECTID)
+		/* this is main vol, not subvol (I think) */
+		return inoid;
+	/* store the rootid in the high bits of the inum.  This
+	 * will break if 32bit inums are required - we cannot know
+	 */
+	while (rootid) {
+		inoid ^= (rootid & 0xff) << shift;
+		rootid >>= 8;
+		shift -= 8;
+	}
+	return inoid;
+}
+
+static inline u64 btrfs_ino_to_inum(struct inode *inode)
+{
+	return btrfs_make_inum(&BTRFS_I(inode)->root->root_key,
+			       &BTRFS_I(inode)->location);
+}
+
 struct dir_entry {
 	u64 ino;
 	u64 offset;
@@ -6045,6 +6076,49 @@ static int btrfs_filldir(void *addr, int entries, struct dir_context *ctx)
 	return 0;
 }
 
+static inline bool btrfs_dir_emit_dot(struct file *file,
+				      struct dir_context *ctx)
+{
+	return ctx->actor(ctx, ".", 1, ctx->pos,
+			  btrfs_ino_to_inum(file->f_path.dentry->d_inode),
+			  DT_DIR) == 0;
+}
+
+static inline ino_t btrfs_parent_ino(struct dentry *dentry)
+{
+	ino_t res;
+
+	/*
+	 * Don't strictly need d_lock here? If the parent ino could change
+	 * then surely we'd have a deeper race in the caller?
+	 */
+	spin_lock(&dentry->d_lock);
+	res = btrfs_ino_to_inum(dentry->d_parent->d_inode);
+	spin_unlock(&dentry->d_lock);
+	return res;
+}
+
+static inline bool btrfs_dir_emit_dotdot(struct file *file,
+					 struct dir_context *ctx)
+{
+	return ctx->actor(ctx, "..", 2, ctx->pos,
+			  btrfs_parent_ino(file->f_path.dentry), DT_DIR) == 0;
+}
+static inline bool btrfs_dir_emit_dots(struct file *file,
+				       struct dir_context *ctx)
+{
+	if (ctx->pos == 0) {
+		if (!btrfs_dir_emit_dot(file, ctx))
+			return false;
+		ctx->pos = 1;
+	}
+	if (ctx->pos == 1) {
+		if (!btrfs_dir_emit_dotdot(file, ctx))
+			return false;
+		ctx->pos = 2;
+	}
+	return true;
+}
 static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
@@ -6067,7 +6141,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 	bool put = false;
 	struct btrfs_key location;
 
-	if (!dir_emit_dots(file, ctx))
+	if (!btrfs_dir_emit_dots(file, ctx))
 		return 0;
 
 	path = btrfs_alloc_path();
@@ -6136,7 +6210,8 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 		put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
 				&entry->type);
 		btrfs_dir_item_key_to_cpu(leaf, di, &location);
-		put_unaligned(location.objectid, &entry->ino);
+		put_unaligned(btrfs_make_inum(&root->root_key, &location),
+			      &entry->ino);
 		put_unaligned(found_key.offset, &entry->offset);
 		entries++;
 		addr += sizeof(struct dir_entry) + name_len;
@@ -9193,7 +9268,7 @@ static int btrfs_getattr(struct user_namespace *mnt_userns,
 				  STATX_ATTR_NODUMP);
 
 	generic_fillattr(&init_user_ns, inode, stat);
-	stat->dev = BTRFS_I(inode)->root->anon_dev;
+	stat->ino = btrfs_ino_to_inum(inode);
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes;
Qu Wenruo July 30, 2021, 5:25 a.m. UTC | #23
On 2021/7/30 上午10:36, NeilBrown wrote:
>
> I've been pondering all the excellent feedback, and what I have learnt
> from examining the code in btrfs, and I have developed a different
> perspective.

Great! Some new developers into the btrfs realm!

>
> Maybe "subvol" is a poor choice of name because it conjures up
> connections with the Volumes in LVM, and btrfs subvols are very different
> things.  Btrfs subvols are really just subtrees that can be treated as a
> unit for operations like "clone" or "destroy".
>
> As such, they don't really deserve separate st_dev numbers.
>
> Maybe the different st_dev numbers were introduced as a "cheap" way to
> extend the size of the inode-number space.  Like many "cheap" things, it
> has hidden costs.
>
> Maybe objects in different subvols should still be given different inode
> numbers.  This would be problematic on 32bit systems, but much less so on
> 64bit systems.
>
> The patch below, which is just a proof-of-concept, changes btrfs to
> report a uniform st_dev, and different (64bit) st_ino in different subvols.
>
> It has problems:
>   - it will break any 32bit readdir and 32bit stat.  I don't know how big
>     a problem that is these days (ino_t in the kernel is "unsigned long",
>     not "unsigned long long). That surprised me).
>   - It might break some user-space expectations.  One thing I have learnt
>     is not to make any assumption about what other people might expect.

Wouldn't any filesystem boundary check fail to stop at a subvolume boundary?

Then it will go through the full btrfs subvolumes/snapshots, which can
be super slow.

>
> However, it would be quite easy to make this opt-in (or opt-out) with a
> mount option, so that people who need the current inode numbers and will
> accept the current breakage can keep working.
>
> I think this approach would be a net-win for NFS export, whether BTRFS
> supports it directly or not.  I might post a patch which modifies NFS to
> intuit improved inode numbers for btrfs exports....

Some extra ideas, though I'm not familiar enough with the VFS to be sure.

Can we generate "fake" superblock for each subvolume?
Like using the subolume UUID to replace the FSID of each subvolume.
Could that migrate the problem?

Thanks,
Qu

>
> So: how would this break your use-case??
>
> Thanks,
> NeilBrown
>
> [proof-of-concept patch quoted in full - snipped]
Amir Goldstein July 30, 2021, 5:28 a.m. UTC | #24
On Fri, Jul 30, 2021 at 5:41 AM NeilBrown <neilb@suse.de> wrote:
>
>
> I've been pondering all the excellent feedback, and what I have learnt
> from examining the code in btrfs, and I have developed a different
> perspective.
>
> Maybe "subvol" is a poor choice of name because it conjures up
> connections with the Volumes in LVM, and btrfs subvols are very different
> things.  Btrfs subvols are really just subtrees that can be treated as a
> unit for operations like "clone" or "destroy".
>
> As such, they don't really deserve separate st_dev numbers.
>
> Maybe the different st_dev numbers were introduced as a "cheap" way to
> extend the size of the inode-number space.  Like many "cheap" things, it
> has hidden costs.
>
> Maybe objects in different subvols should still be given different inode
> numbers.  This would be problematic on 32bit systems, but much less so on
> 64bit systems.
>
> The patch below, which is just a proof-of-concept, changes btrfs to
> report a uniform st_dev, and different (64bit) st_ino in different subvols.
>
> It has problems:
>  - it will break any 32bit readdir and 32bit stat.  I don't know how big
>    a problem that is these days (ino_t in the kernel is "unsigned long",
>    not "unsigned long long). That surprised me).
>  - It might break some user-space expectations.  One thing I have learnt
>    is not to make any assumption about what other people might expect.
>
> However, it would be quite easy to make this opt-in (or opt-out) with a
> mount option, so that people who need the current inode numbers and will
> accept the current breakage can keep working.
>
> I think this approach would be a net-win for NFS export, whether BTRFS
> supports it directly or not.  I might post a patch which modifies NFS to
> intuit improved inode numbers for btrfs exports....
>
> So: how would this break your use-case??

The simple cases are find -xdev and du -x which expect the st_dev
change, but that can be excused if opting in to a unified st_dev namespace.

The harder problem is <st_dev;st_ino> collisions which are not even
that hard to hit with an unlimited number of snapshots.
The 'diff' tool demonstrates the implications of <st_dev;st_ino>
collisions for different objects on userspace.
See xfstest overlay/049 for a demonstration.

The overlayfs xino feature made a similar change to overlayfs
<st_dev;st_ino> with one big difference - applications expect that
all objects in an overlayfs mount will have the same st_dev.

Also, overlayfs has prior knowledge on the number of layers
so it is easier to parcel the ino namespace and avoid collisions.

Thanks,
Amir.
Qu Wenruo July 30, 2021, 5:31 a.m. UTC | #25
On 2021/7/30 下午1:25, Qu Wenruo wrote:
>
>
> On 2021/7/30 上午10:36, NeilBrown wrote:
>>
>> I've been pondering all the excellent feedback, and what I have learnt
>> from examining the code in btrfs, and I have developed a different
>> perspective.
>
> Great! Some new developers into the btrfs realm!
>
>>
>> Maybe "subvol" is a poor choice of name because it conjures up
>> connections with the Volumes in LVM, and btrfs subvols are very different
>> things.  Btrfs subvols are really just subtrees that can be treated as a
>> unit for operations like "clone" or "destroy".
>>
>> As such, they don't really deserve separate st_dev numbers.
>>
>> Maybe the different st_dev numbers were introduced as a "cheap" way to
>> extend the size of the inode-number space.  Like many "cheap" things, it
>> has hidden costs.

Forgot another problem already caused by this st_dev method.

Since btrfs uses st_dev to distinguish its inode name spaces, and
st_dev is allocated from the anonymous bdev pool, and that pool has
limited size (much smaller than the btrfs subvolume id name space), it
is already causing problems: we can't allocate an anonymous bdev for
each subvolume, and subvolume/snapshot creation fails.
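
For reference, my understanding of where the limit comes from
(fs/super.c and include/linux/kdev_t.h in current kernels):

/* dev_t is a 32-bit value, split 12 bits major / 20 bits minor, and
 * all anonymous bdevs share major 0, so get_anon_bdev() can hand out
 * at most 1 << MINORBITS minors - about a million, nothing close to
 * the 64-bit btrfs subvolume id space. */
#define MINORBITS	20
#define ANON_BDEV_MAX	(1 << MINORBITS)	/* 1048576 */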

Thus it's really time to re-consider how we should export this info to
user space.

Thanks,
Qu

>> [remainder of quoted message and patch snipped]
Amir Goldstein July 30, 2021, 5:53 a.m. UTC | #26
On Fri, Jul 30, 2021 at 8:33 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/7/30 下午1:25, Qu Wenruo wrote:
> >
> >
> > On 2021/7/30 上午10:36, NeilBrown wrote:
> >>
> >> I've been pondering all the excellent feedback, and what I have learnt
> >> from examining the code in btrfs, and I have developed a different
> >> perspective.
> >
> > Great! Some new developers into the btrfs realm!
> >
> >>
> >> Maybe "subvol" is a poor choice of name because it conjures up
> >> connections with the Volumes in LVM, and btrfs subvols are very different
> >> things.  Btrfs subvols are really just subtrees that can be treated as a
> >> unit for operations like "clone" or "destroy".
> >>
> >> As such, they don't really deserve separate st_dev numbers.
> >>
> >> Maybe the different st_dev numbers were introduced as a "cheap" way to
> >> extend the size of the inode-number space.  Like many "cheap" things, it
> >> has hidden costs.
>
> Forgot another problem already caused by this st_dev method.
>
> Since btrfs uses st_dev to distinguish its inode name spaces, and
> st_dev is allocated from the anonymous bdev pool, and that pool has
> limited size (much smaller than the btrfs subvolume id name space), it
> is already causing problems: we can't allocate an anonymous bdev for
> each subvolume, and subvolume/snapshot creation fails.
>

How about creating a dedicated major dev for btrfs subvolumes, to start
with?  Then at least there is a possibility of administrative reservation
of st_dev values for subvols that need a persistent <st_dev;st_ino>.

By default subvols get assigned a minor dynamically, as today,
and with opt-in (e.g. for small, short-lived btrfs filesystems) the
unified st_dev approach can be used, possibly by providing
an upper limit on the inode numbers in the filesystem, similar to
the xfs -o inode32 mount option.

Thanks,
Amir.
NeilBrown July 30, 2021, 5:58 a.m. UTC | #27
On Fri, 30 Jul 2021, Qu Wenruo wrote:
> 
> On 2021/7/30 上午10:36, NeilBrown wrote:
> >
> > I've been pondering all the excellent feedback, and what I have learnt
> > from examining the code in btrfs, and I have developed a different
> > perspective.
> 
> Great! Some new developers into the btrfs realm!

:-)

> 
> >
> > Maybe "subvol" is a poor choice of name because it conjures up
> > connections with the Volumes in LVM, and btrfs subvols are very different
> > things.  Btrfs subvols are really just subtrees that can be treated as a
> > unit for operations like "clone" or "destroy".
> >
> > As such, they don't really deserve separate st_dev numbers.
> >
> > Maybe the different st_dev numbers were introduced as a "cheap" way to
> > extend the size of the inode-number space.  Like many "cheap" things, it
> > has hidden costs.
> >
> > Maybe objects in different subvols should still be given different inode
> > numbers.  This would be problematic on 32bit systems, but much less so on
> > 64bit systems.
> >
> > The patch below, which is just a proof-of-concept, changes btrfs to
> > report a uniform st_dev, and different (64bit) st_ino in different subvols.
> >
> > It has problems:
> >   - it will break any 32bit readdir and 32bit stat.  I don't know how big
> >     a problem that is these days (ino_t in the kernel is "unsigned long",
> >     not "unsigned long long). That surprised me).
> >   - It might break some user-space expectations.  One thing I have learnt
> >     is not to make any assumption about what other people might expect.
> 
> Wouldn't any filesystem boundary check fail to stop at a subvolume boundary?

You mean like "du -x"?? Yes.  You would lose the misleading illusion
that there are multiple filesystems.  That is one user-expectation that
would need to be addressed before people opt in.

> 
> Then it will go through the full btrfs subvolumes/snapshots, which can
> be super slow.
> 
> >
> > However, it would be quite easy to make this opt-in (or opt-out) with a
> > mount option, so that people who need the current inode numbers and will
> > accept the current breakage can keep working.
> >
> > I think this approach would be a net-win for NFS export, whether BTRFS
> > supports it directly or not.  I might post a patch which modifies NFS to
> > intuit improved inode numbers for btrfs exports....
> 
> Some extra ideas, though I'm not familiar enough with the VFS to be sure.
> 
> Can we generate "fake" superblock for each subvolume?

I don't see how that would help.  Either subvols are like filesystems
and appear in /proc/mounts, or they aren't like filesystems and don't
get different st_dev.  Either of these outcomes can be achieved without
fake superblocks.  If you really need BTRFS subvols to have some
properties of filesystems but not all, then you are in for a whole world
of pain.

Maybe btrfs subvols should be treated more like XFS "managed trees".  At
least there you have precedent and someone else to share the pain.
Maybe we should train people to use "quota" to check the usage of a
subvol, rather than "du" (which will stop working with my patch if it
contains refs to other subvols) or "df" (which already doesn't work), or
"btrs df"

> Like using the subvolume UUID to replace the FSID of each subvolume.
> Could that mitigate the problem?

Which problem, exactly?  My first approach to making subvols work on NFS
took essentially that approach.  It was seen (quite reasonably) as a
hack to work around poor behaviour in btrfs.

Given that NFS has always seen all of a btrfs filesystem as having a
uniform fsid, I'm now of the opinion that we don't want to change that,
but should just fix the duplicate-inode-number problem.

If I could think of some way for NFSD to see different inode numbers
than the VFS, I would push hard for fixing nfsd by giving it more sane inode
numbers.

Thanks,
NeilBrown
NeilBrown July 30, 2021, 6 a.m. UTC | #28
On Fri, 30 Jul 2021, Qu Wenruo wrote:
> 
> On 2021/7/30 下午1:25, Qu Wenruo wrote:
> >
> >
> > On 2021/7/30 上午10:36, NeilBrown wrote:
> >>
> >> I've been pondering all the excellent feedback, and what I have learnt
> >> from examining the code in btrfs, and I have developed a different
> >> perspective.
> >
> > Great! Some new developers into the btrfs realm!
> >
> >>
> >> Maybe "subvol" is a poor choice of name because it conjures up
> >> connections with the Volumes in LVM, and btrfs subvols are very different
> >> things.  Btrfs subvols are really just subtrees that can be treated as a
> >> unit for operations like "clone" or "destroy".
> >>
> >> As such, they don't really deserve separate st_dev numbers.
> >>
> >> Maybe the different st_dev numbers were introduced as a "cheap" way to
> >> extend the size of the inode-number space.  Like many "cheap" things, it
> >> has hidden costs.
> 
> Forgot another problem already caused by this st_dev method.
> 
> Since btrfs uses st_dev to distinguish its inode name spaces, and
> st_dev is allocated from the anonymous bdev pool, and that pool has
> limited size (much smaller than the btrfs subvolume id name space), it
> is already causing problems: we can't allocate an anonymous bdev for
> each subvolume, and subvolume/snapshot creation fails.

What sort of numbers do you see in practice? How many subvolumes and how
many inodes per subvolume?
If we allocated some number of bits to each, with over-allocation to
allow for growth, could we fit both into 64 bits?

NeilBrown


> 
> Thus it's really time to re-consider how we should export this info to
> user space.
> 
> Thanks,
> Qu
>
NeilBrown July 30, 2021, 6:08 a.m. UTC | #29
On Fri, 30 Jul 2021, Al Viro wrote:
> On Wed, Jul 28, 2021 at 05:30:04PM -0400, Josef Bacik wrote:
> 
> > I don't think anybody has that many file systems.  For btrfs it's a single
> > file system.  Think of syncfs, it's going to walk through all of the super
> > blocks on the system calling ->sync_fs on each subvol superblock.  Now this
> > isn't a huge deal, we could just have some flag that says "I'm not real" or
> > even just have anonymous superblocks that don't get added to the global
> > super_blocks list, and that would address my main pain points.
> 
> Umm...  Aren't the snapshots read-only by definition?

No, though they can be.
subvols can be created empty, or duplicated from an existing subvol.
Any subvol can be written, using copy-on-write of course.

NeilBrown
Qu Wenruo July 30, 2021, 6:09 a.m. UTC | #30
On 2021/7/30 下午2:00, NeilBrown wrote:
> On Fri, 30 Jul 2021, Qu Wenruo wrote:
>>
>> On 2021/7/30 下午1:25, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/7/30 上午10:36, NeilBrown wrote:
>>>>
>>>> I've been pondering all the excellent feedback, and what I have learnt
>>>> from examining the code in btrfs, and I have developed a different
>>>> perspective.
>>>
>>> Great! Some new developers into the btrfs realm!
>>>
>>>>
>>>> Maybe "subvol" is a poor choice of name because it conjures up
>>>> connections with the Volumes in LVM, and btrfs subvols are very different
>>>> things.  Btrfs subvols are really just subtrees that can be treated as a
>>>> unit for operations like "clone" or "destroy".
>>>>
>>>> As such, they don't really deserve separate st_dev numbers.
>>>>
>>>> Maybe the different st_dev numbers were introduced as a "cheap" way to
>>>> extend the size of the inode-number space.  Like many "cheap" things, it
>>>> has hidden costs.
>>
>> Forgot another problem already caused by this st_dev method.
>>
>> Since btrfs uses st_dev to distinguish its inode name spaces, and
>> st_dev is allocated from the anonymous bdev pool, and that pool has
>> limited size (much smaller than the btrfs subvolume id name space), it
>> is already causing problems: we can't allocate an anonymous bdev for
>> each subvolume, and subvolume/snapshot creation fails.
>
> What sort of numbers do you see in practice? How many subvolumes and how
> many inodes per subvolume?

Normally the "live"(*) subvolume numbers are below the minor dev number
range (1<<20), thus not a big deal.

*: Live here means the subvolume has been accessed at least once.  A
subvolume that exists but is never accessed doesn't get an anonymous
bdev number allocated.

But (1<<20) is really small compared to some real-world uses.
Thus we had some reports of this problem, and changed the timing of
the anonymous bdev number allocation.

> If we allocated some number of bits to each, with over-allocation to
> allow for growth, could we fit both into 64 bits?

I don't think it's even possible, as we currently use a u32 for dev_t,
which is already way below the theoretical subvolume limit (U64_MAX - 512).

Thus AFAIK there is no real way to solve it right now.

Thanks,
Qu
>
> NeilBrown
>
>
>>
>> Thus it's really time to re-consider how we should export this info to
>> user space.
>>
>> Thanks,
>> Qu
>>
Qu Wenruo July 30, 2021, 6:23 a.m. UTC | #31
On 2021/7/30 下午1:58, NeilBrown wrote:
> On Fri, 30 Jul 2021, Qu Wenruo wrote:
>>
>> On 2021/7/30 上午10:36, NeilBrown wrote:
>>>
>>> I've been pondering all the excellent feedback, and what I have learnt
>>> from examining the code in btrfs, and I have developed a different
>>> perspective.
>>
>> Great! Some new developers into the btrfs realm!
>
> :-)
>
>>
>>>
>>> Maybe "subvol" is a poor choice of name because it conjures up
>>> connections with the Volumes in LVM, and btrfs subvols are very different
>>> things.  Btrfs subvols are really just subtrees that can be treated as a
>>> unit for operations like "clone" or "destroy".
>>>
>>> As such, they don't really deserve separate st_dev numbers.
>>>
>>> Maybe the different st_dev numbers were introduced as a "cheap" way to
>>> extend the size of the inode-number space.  Like many "cheap" things, it
>>> has hidden costs.
>>>
>>> Maybe objects in different subvols should still be given different inode
>>> numbers.  This would be problematic on 32bit systems, but much less so on
>>> 64bit systems.
>>>
>>> The patch below, which is just a proof-of-concept, changes btrfs to
>>> report a uniform st_dev, and different (64bit) st_ino in different subvols.
>>>
>>> It has problems:
>>>    - it will break any 32bit readdir and 32bit stat.  I don't know how big
>>>      a problem that is these days (ino_t in the kernel is "unsigned long",
>>>      not "unsigned long long). That surprised me).
>>>    - It might break some user-space expectations.  One thing I have learnt
>>>      is not to make any assumption about what other people might expect.
>>
>> Wouldn't any filesystem boundary check fail to stop at a subvolume boundary?
>
> You mean like "du -x"?? Yes.  You would lose the misleading illusion
> that there are multiple filesystems.  That is one user-expectation that
> would need to be addressed before people opt-in

OK, forgot it's an opt-in feature, then it's less of an impact.

But it can still sometimes be problematic.

E.g. if the user wants to put some git code into one subvolume, while
exporting another subvolume through NFS.

Then the user has to opt in, and the git subvolume loses the
ability to determine subvolume boundaries, right?

This is more concerning as most btrfs users won't want to explicitly
prepare a separate btrfs filesystem.

>
>>
>> Then it will go through the full btrfs subvolumes/snapshots, which can
>> be super slow.
>>
>>>
>>> However, it would be quite easy to make this opt-in (or opt-out) with a
>>> mount option, so that people who need the current inode numbers and will
>>> accept the current breakage can keep working.
>>>
>>> I think this approach would be a net-win for NFS export, whether BTRFS
>>> supports it directly or not.  I might post a patch which modifies NFS to
>>> intuit improved inode numbers for btrfs exports....
>>
>> Some extra ideas, though I'm not familiar enough with the VFS to be sure.
>>
>> Can we generate "fake" superblock for each subvolume?
>
> I don't see how that would help.  Either subvols are like filesystems
> and appear in /proc/mounts, or they aren't like filesystems and don't
> get different st_dev.  Either of these outcomes can be achieved without
> fake superblocks.  If you really need BTRFS subvols to have some
> properties of filesystems but not all, then you are in for a whole world
> of pain.

I guess it's time we pay for the hacks...

>
> Maybe btrfs subvols should be treated more like XFS "managed trees".  At
> least there you have precedent and someone else to share the pain.
> Maybe we should train people to use "quota" to check the usage of a
> subvol,

Well, btrfs quota has its own pain...

> rather than "du" (which will stop working with my patch if it
> contains refs to other subvols) or "df" (which already doesn't work), or
> "btrs df"


BTW, since XFS has a similar feature (though I'm not sure of the details),
I guess in the long run it may be worthwhile to give the VFS some way to
treat this concept of trees that are not full volumes but are still distinct.

>
>> Like using the subvolume UUID to replace the FSID of each subvolume.
>> Could that mitigate the problem?
>
> Which problem, exactly?  My first approach to making subvols work on NFS
> took essentially that approach.  It was seen (quite reasonably) as a
> hack to work around poor behaviour in btrfs.
>
> Given that NFS has always seen all of a btrfs filesystem as have a
> uniform fsid, I'm now of the opinion that we don't want to change that,
> but should just fix the duplicate-inode-number problem.
>
> If I could think of some way for NFSD to see different inode numbers
> than VFS, I would push hard for fixs nfsd by giving it more sane inode
> numbers.

I'm really not familiar with NFS/VFS, thus some ideas from me may sound
super crazy.

Is it possible for nfsd to detect such a "subvolume" concept on its
own, e.g. by checking st_dev and the fsid returned from statfs()?

Then if nfsd finds some boundary which has a different st_dev but the same
fsid as its parent, it knows it's a "subvolume"-like concept.

Then do some local inode number mapping inside nfsd?
Like use the highest 20 bits for different subvolumes, while the
remaining 44 bits for real inode numbers.

Of-course, this is still a workaround...

Thanks,
Qu

>
> Thanks,
> NeilBrown
>
NeilBrown July 30, 2021, 6:53 a.m. UTC | #32
On Fri, 30 Jul 2021, Qu Wenruo wrote:
> >
> > You mean like "du -x"?? Yes.  You would lose the misleading illusion
> > that there are multiple filesystems.  That is one user-expectation that
> > would need to be addressed before people opt-in
> 
> OK, forgot it's an opt-in feature, then it's less of an impact.

The hope would have to be that everyone would eventually opt-in once all
issues were understood.

> 
> I'm really not familiar with NFS/VFS, thus some ideas from me may sound
> super crazy.
> 
> Is it possible for nfsd to detect such a "subvolume" concept on its
> own, e.g. by checking st_dev and the fsid returned from statfs()?
> 
> Then if nfsd finds some boundary which has a different st_dev but the same
> fsid as its parent, it knows it's a "subvolume"-like concept.
> 
> Then do some local inode number mapping inside nfsd?
> Like use the highest 20 bits for different subvolumes, while the
> remaining 44 bits for real inode numbers.
> 
> Of-course, this is still a workaround...

Yes, it would certainly be possible to add some hacks to nfsd to fix the
immediate problem, and we could probably even create some well-defined
interfaces into btrfs to extract the required information so that it
wasn't too hackish.

Maybe that is what we will have to do.  But I'd rather not hack NFSD
while there is any chance that a more complete solution will be found.

I'm not quite ready to give up on the idea of squeezing all btrfs inodes
into a 64bit number space.  24bits of subvol and 40 bits of inode?
Make the split a mkfs or mount option?
Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol
(which has ever been accessed) has a mapping from the top 32 bits of the
objectid to the top 32 bits of the inode number.
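
A kernel-style sketch of that split, just to make the arithmetic
concrete (the names and the exact 24/40 division are illustrative, not
an existing interface):

#define FLAT_INO_BITS	40	/* low bits: per-subvol inode number */

static inline u64 btrfs_flat_ino(u64 subvolid, u64 objectid)
{
	/* subvol creation / inode allocation would have to fail once
	 * either field outgrows its share of the 64 bits */
	return (subvolid << FLAT_INO_BITS) |
	       (objectid & ((1ULL << FLAT_INO_BITS) - 1));
}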

We don't need something that is theoretically perfect (that's not
possible anyway as we don't have 64bits of device numbers).  We just
need something that is practical and scales adequately.  If you have
petabytes of storage, it is reasonable to spend a gigabyte of memory on
a lookup table(?).

If we can make inode numbers unique, we can possibly leave the st_dev
changing at subvols so that "du -x" works as currently expected.

One thought I had was to use a strong hash to combine the subvol object
id and the inode object id into a 64bit number.  What is the chance of
a collision in practice :-)
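
(Back-of-envelope birthday arithmetic: hashing n live (subvol, inode)
pairs into 64 bits collides with probability roughly n^2 / 2^65, so
around 2^32 inodes gives even odds of at least one collision, and even
a modest 50 million inodes gives about one chance in 15000.  Probably
not good enough for something that must never alias.)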

Thanks,
NeilBrown
Qu Wenruo July 30, 2021, 7:09 a.m. UTC | #33
On 2021/7/30 下午2:53, NeilBrown wrote:
> On Fri, 30 Jul 2021, Qu Wenruo wrote:
>>>
>>> You mean like "du -x"?? Yes.  You would lose the misleading illusion
>>> that there are multiple filesystems.  That is one user-expectation that
>>> would need to be addressed before people opt-in
>>
>> OK, forgot it's an opt-in feature, then it's less an impact.
>
> The hope would have to be that everyone would eventually opt-in once all
> issues were understood.
>
>>
>> Really not familiar with NFS/VFS, thus some ideas from me may sounds
>> super crazy.
>>
>> Is it possible that, for nfsd to detect such "subvolume" concept by its
>> own, like checking st_dev and the fsid returned from statfs().
>>
>> Then if nfsd find some boundary which has different st_dev, but the same
>> fsid as its parent, then it knows it's a "subvolume"-like concept.
>>
>> Then do some local inode number mapping inside nfsd?
>> Like use the highest 20 bits for different subvolumes, while the
>> remaining 44 bits for real inode numbers.
>>
>> Of-course, this is still a workaround...
>
> Yes, it would certainly be possible to add some hacks to nfsd to fix the
> immediate problem, and we could probably even created some well-defined
> interfaces into btrfs to extract the required information so that it
> wasn't too hackish.
>
> Maybe that is what we will have to do.  But I'd rather not hack NFSD
> while there is any chance that a more complete solution will be found.
>
> I'm not quite ready to give up on the idea of squeezing all btrfs inodes
> into a 64bit number space.  24bits of subvol and 40 bits of inode?
> Make the split a mkfs or mount option?

Btrfs used to have a subvolume number limit in the past, for different
reasons.

In that case, the subvolume number was limited to 48 bits, which is still
too large to avoid conflicts.

For inode numbers there is really no limit except the 256 ~ (u64)-256 range.

Considering all these numbers are almost full u64s, conflicts would be
unavoidable AFAIK.

> Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol
> (which has ever been accessed) has a mapping from the top 32 bits of the
> objectid to the top 32 bits of the inode number.
>
> We don't need something that is theoretically perfect (that's not
> possible anyway as we don't have 64bits of device numbers).  We just
> need something that is practical and scales adequately.  If you have
> petabytes of storage, it is reasonable to spend a gigabyte of memory on
> a lookup table(?).

Can such squishing-all-inodes-into-one-namespace work be done in a
more generic way?  E.g. let each fs with a "subvolume"-like feature
provide an interface to do that.


Despite that, I still hope to have a way to distinguish the "subvolume"
boundary.

Working entirely inside btrfs, it's pretty simple to locate a subvolume
boundary: all subvolume roots have the same inode number, 256.

Maybe we could reserve some special "squished" inode number to indicate
a boundary inside a filesystem.

E.g. reserve (u64)-1 as a special indicator for subvolume boundaries,
as most filesystems reserve such super-high inode numbers anyway.


>
> If we can make inode numbers unique, we can possibly leave the st_dev
> changing at subvols so that "du -x" works as currently expected.
>
> One thought I had was to use a strong hash to combine the subvol object
> id and the inode object id into a 64bit number.  What is the chance of
> a collision in practice :-)

But with just 64bits, conflicts will happen anyway...

Thanks,
Qu
>
> Thanks,
> NeilBrown
>
J. Bruce Fields July 30, 2021, 3:17 p.m. UTC | #34
On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote:
> OK, forgot it's an opt-in feature, then it's less of an impact.
> 
> But it can still sometimes be problematic.
> 
> E.g. if the user wants to put some git code into one subvolume, while
> exporting another subvolume through NFS.
> 
> Then the user has to opt in, and the git subvolume loses the
> ability to determine subvolume boundaries, right?

Totally naive question: is it possible to treat different subvolumes
differently, and give the user some choice at subvolume creation time
how this new boundary should behave?

It seems like there are some conflicting priorities that can only be
resolved by someone who knows the intended use case.

--b.
Josef Bacik July 30, 2021, 3:48 p.m. UTC | #35
On 7/30/21 11:17 AM, J. Bruce Fields wrote:
> On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote:
>> OK, forgot it's an opt-in feature, then it's less of an impact.
>>
>> But it can still sometimes be problematic.
>>
>> E.g. if the user wants to put some git code into one subvolume, while
>> exporting another subvolume through NFS.
>>
>> Then the user has to opt in, and the git subvolume loses the
>> ability to determine subvolume boundaries, right?
> 
> Totally naive question: is it possible to treat different subvolumes
> differently, and give the user some choice at subvolume creation time
> how this new boundary should behave?
> 
> It seems like there are some conflicting priorities that can only be
> resolved by someone who knows the intended use case.
> 

This is the crux of the problem.  We have no real interfaces or anything to deal 
with this sort of paradigm.  We do the st_dev thing because that's the most 
common way that tools like find or rsync use to determine they've wandered into 
a "different" volume.  This exists specifically because of usescases like 
Zygo's, where he's taking thousands of snapshots and manually excluding them 
from find/rsync is just not reasonable.

We have no good way to give the user information about what's going on, we just 
have these old shitty interfaces.  I asked our guys about filling up 
/proc/self/mountinfo with our subvolumes and they had a heart attack because we 
have around 2-4k subvolumes on machines, and with monitoring stuff in place we 
regularly read /proc/self/mountinfo to determine what's mounted and such.

And then there's NFS which needs to know that it's walked into a new inode space.

This is all super shitty, and mostly exists because we don't have a good way to 
expose to the user wtf is going on.

Personally I would be ok with simply disallowing NFS to wander into subvolumes 
from an exported fs.  If you want to export subvolumes then export them 
individually, otherwise if you walk into a subvolume from NFS you simply get an 
empty directory.

This doesn't solve the mountinfo problem where a user may want to figure out 
which subvol they're in, but this is where I think we could address the issue 
with better interfaces.  Or perhaps Neil's idea to have a common major number 
with a different minor number for every subvol.

Either way this isn't as simple as shoehorning it into automount and being done 
with it, we need to take a step back and think about how should this actually 
look, taking into account we've got 12 years of having Btrfs deployed with 
existing usecases that expect a certain behavior.  Thanks,

Josef
Forza July 30, 2021, 4:25 p.m. UTC | #36
On 2021-07-30 17:48, Josef Bacik wrote:
> On 7/30/21 11:17 AM, J. Bruce Fields wrote:
>> On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote:
>>> OK, forgot it's an opt-in feature, then it's less of an impact.
>>>
>>> But it can still sometimes be problematic.
>>>
>>> E.g. if the user wants to put some git code into one subvolume, while
>>> exporting another subvolume through NFS.
>>>
>>> Then the user has to opt in, and the git subvolume loses the
>>> ability to determine subvolume boundaries, right?
>>
>> Totally naive question: is it possible to treat different subvolumes
>> differently, and give the user some choice at subvolume creation time
>> how this new boundary should behave?
>>
>> It seems like there are some conflicting priorities that can only be
>> resolved by someone who knows the intended use case.
>>
> 
> This is the crux of the problem.  We have no real interfaces or anything 
> to deal with this sort of paradigm.  We do the st_dev thing because 
> that's the most common way that tools like find or rsync use to 
> determine they've wandered into a "different" volume.  This exists 
> specifically because of use cases like Zygo's, where he's taking 
> thousands of snapshots and manually excluding them from find/rsync is 
> just not reasonable.
> 
> We have no good way to give the user information about what's going on, 
> we just have these old shitty interfaces.  I asked our guys about 
> filling up /proc/self/mountinfo with our subvolumes and they had a heart 
> attack because we have around 2-4k subvolumes on machines, and with 
> monitoring stuff in place we regularly read /proc/self/mountinfo to 
> determine what's mounted and such.
> 
> And then there's NFS which needs to know that it's walked into a new 
> inode space.
> 
> This is all super shitty, and mostly exists because we don't have a good 
> way to expose to the user wtf is going on.
> 
> Personally I would be ok with simply disallowing NFS to wander into 
> subvolumes from an exported fs.  If you want to export subvolumes then 
> export them individually, otherwise if you walk into a subvolume from 
> NFS you simply get an empty directory.
> 
> This doesn't solve the mountinfo problem where a user may want to figure 
> out which subvol they're in, but this is where I think we could address 
> the issue with better interfaces.  Or perhaps Neil's idea to have a 
> common major number with a different minor number for every subvol.
> 
> Either way this isn't as simple as shoehorning it into automount and 
> being done with it, we need to take a step back and think about how 
> should this actually look, taking into account we've got 12 years of 
> having Btrfs deployed with existing usecases that expect a certain 
> behavior.  Thanks,
> 
> Josef


As a user and sysadmin I really appreciate the way Btrfs currently works.

We use hourly snapshots which are exposed over Samba as "Previous 
Versions" to Windows users. This amounts to thousands of snapshots, all 
user serviceable. A great feature!

In Samba world we have a mount option[1] called "noserverino" which lets 
the client generate unique inode numbers, rather than using the server 
provided inode numbers. This allows Linux clients to work well against 
servers exposing subvolumes and snapshots.
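
For example (server, share and mount point are illustrative):

	mount -t cifs //server/share /mnt/share -o noserverino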

NFS has really old roots and had to make choices that we don't really 
have to make today. Can we not provide something similar to mount.cifs 
that generates unique inode numbers for the clients? This could be either 
an nfsd export option (such as /mnt/foo *(rw,uniq_inodes)) or a mount 
option on the clients.

One worry I have with making subvolumes automount points is that it might 
break the ability to cp --reflink across that boundary.



[1] https://www.samba.org/~ab/output/htmldocs/manpages-3/mount.cifs.8.html
Zygo Blaxell July 30, 2021, 5:43 p.m. UTC | #37
On Fri, Jul 30, 2021 at 11:48:15AM -0400, Josef Bacik wrote:
> On 7/30/21 11:17 AM, J. Bruce Fields wrote:
> > On Fri, Jul 30, 2021 at 02:23:44PM +0800, Qu Wenruo wrote:
> > > OK, forgot it's an opt-in feature, then it's less of an impact.
> > > 
> > > But it can still sometimes be problematic.
> > > 
> > > E.g. if the user wants to put some git code into one subvolume, while
> > > exporting another subvolume through NFS.
> > > 
> > > Then the user has to opt in, and the git subvolume loses the
> > > ability to determine subvolume boundaries, right?
> > 
> > Totally naive question: is it possible to treat different subvolumes
> > differently, and give the user some choice at subvolume creation time
> > how this new boundary should behave?
> > 
> > It seems like there are some conflicting priorities that can only be
> > resolved by someone who knows the intended use case.
> > 
> 
> This is the crux of the problem.  We have no real interfaces or anything to
> deal with this sort of paradigm.  We do the st_dev thing because that's the
> most common way that tools like find or rsync use to determine they've
> wandered into a "different" volume.  This exists specifically because of
> usescases like Zygo's, where he's taking thousands of snapshots and manually
> excluding them from find/rsync is just not reasonable.
> 
> We have no good way to give the user information about what's going on, we
> just have these old shitty interfaces.  I asked our guys about filling up
> /proc/self/mountinfo with our subvolumes and they had a heart attack because
> we have around 2-4k subvolumes on machines, and with monitoring stuff in
> place we regularly read /proc/self/mountinfo to determine what's mounted and
> such.
> 
> And then there's NFS which needs to know that it's walked into a new inode space.

NFS somehow works surprisingly well without knowing that.  I didn't know
there was a problem with NFS, despite exporting thousands of btrfs subvols
from a single export point for 7 years.  Maybe I have some non-default
setting in /etc/exports which works around the problems, or maybe I got
lucky, and all my use cases are weirdly specific and evade all the bugs
by accident?

> This is all super shitty, and mostly exists because we don't have a good way
> to expose to the user wtf is going on.
> 
> Personally I would be ok with simply disallowing NFS to wander into
> subvolumes from an exported fs.  If you want to export subvolumes then
> export them individually, otherwise if you walk into a subvolume from NFS
> you simply get an empty directory.

As a present exporter of thousands of btrfs subvols over NFS from single
export points, I'm not a fan of this idea.

> This doesn't solve the mountinfo problem where a user may want to figure out
> which subvol they're in, but this is where I think we could address the
> issue with better interfaces.  Or perhaps Neil's idea to have a common major
> number with a different minor number for every subvol.

It's not hard to figure out what subvol you're in.  There's an ioctl
which tells the subvol ID, and another that tells the name.  The problem
is that it's btrfs-specific, and no existing software knows how and when
to use it (and also it's privileged, but that's easy to fix compared to
the other issues).
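
For anyone following along, the subvol-ID half of that is the same call 
btrfs-progs makes under the hood; a minimal sketch (error handling kept 
to a bare minimum, and it may require privilege on older kernels):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

#ifndef BTRFS_FIRST_FREE_OBJECTID
#define BTRFS_FIRST_FREE_OBJECTID 256ULL	/* first regular inode in a subvol */
#endif

int main(int argc, char **argv)
{
	struct btrfs_ioctl_ino_lookup_args args;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	memset(&args, 0, sizeof(args));
	/* treeid == 0 means "the subvol containing this fd" */
	args.objectid = BTRFS_FIRST_FREE_OBJECTID;
	if (fd < 0 || ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args) < 0) {
		perror("ino_lookup");
		return 1;
	}
	printf("subvol id: %llu\n", (unsigned long long)args.treeid);
	return 0;
}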

> Either way this isn't as simple as shoehorning it into automount and being
> done with it, we need to take a step back and think about how this should
> actually look, taking into account we've got 12 years of having Btrfs
> deployed with existing use cases that expect a certain behavior.  Thanks,

I think if we got into a time machine, went back 12 years, changed
the btrfs behavior, and then returned to the present, in the alternate
history, we would all be here today talking about how mountinfo doesn't
scale up to what btrfs throws at it, and whether btrfs could opt out of
it somehow.

Maybe we could have a system call for mount point discovery?  Right now,
the kernel throws a trail of breadcrumbs into /proc/self/mountinfo,
and users use userspace libraries to translate that text blob into
actionable information.  We could solve problems with scalability and
visibility in mountinfo if we only had to provide the information in
the context of a single inode (i.e. the inode's parent or child mount
points accessible to the caller).

So you'd have a call for "get paths for all the mount points below inode
X" and another for "get paths for all mount points above inode X", and
calls that tell you details about mount points (like what they're mounted
on, which filesystem they are part of, what the mount flags are, etc).
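
Just to sketch the shape of that (nothing below exists; every name here
is invented for illustration):

#include <linux/types.h>
#include <limits.h>

/* hypothetical per-mount record, filled in by the kernel */
struct mount_rec {
	__u64 mnt_id;		/* unique mount identifier */
	__u64 parent_mnt_id;	/* the mount this one is stacked on */
	__u64 mnt_flags;	/* MS_RDONLY, MS_NOSUID, ... */
	char  fstype[16];	/* "btrfs", "nfs4", ... */
	char  mountpoint[PATH_MAX];
};

/* hypothetical: mount points directly below the inode at dirfd+path,
 * limited to those accessible to the caller */
int get_mounts_below(int dirfd, const char *path,
		     struct mount_rec *recs, unsigned int nr);

/* hypothetical: the chain of mounts from that inode up to the
 * namespace root */
int get_mounts_above(int dirfd, const char *path,
		     struct mount_rec *recs, unsigned int nr);

The details (pagination, filtering, permissions) would matter a lot at
2-4k subvols, but that's the general idea.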

> Josef
Zygo Blaxell July 30, 2021, 6:15 p.m. UTC | #38
On Fri, Jul 30, 2021 at 03:09:12PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/7/30 2:53 PM, NeilBrown wrote:
> > On Fri, 30 Jul 2021, Qu Wenruo wrote:
> > > > 
> > > > You mean like "du -x"?? Yes.  You would lose the misleading illusion
> > > > that there are multiple filesystems.  That is one user expectation that
> > > > would need to be addressed before people opt in.
> > > 
> > > OK, I forgot it's an opt-in feature, so the impact is smaller.
> > 
> > The hope would have to be that everyone would eventually opt in once all
> > issues were understood.
> > 
> > > 
> > > I'm really not familiar with NFS/VFS, so some of my ideas may sound
> > > super crazy.
> > > 
> > > Is it possible for nfsd to detect such a "subvolume" concept on its
> > > own, e.g. by checking st_dev and the fsid returned from statfs()?
> > > 
> > > Then if nfsd finds some boundary which has a different st_dev but the
> > > same fsid as its parent, it knows it's a "subvolume"-like concept.
> > > 
> > > Then it could do some local inode number mapping inside nfsd?
> > > E.g. use the highest 20 bits for different subvolumes, and the
> > > remaining 44 bits for real inode numbers.
> > > 
> > > Of course, this is still a workaround...
> > 
> > Yes, it would certainly be possible to add some hacks to nfsd to fix the
> > immediate problem, and we could probably even create some well-defined
> > interfaces into btrfs to extract the required information so that it
> > wasn't too hackish.
> > 
> > Maybe that is what we will have to do.  But I'd rather not hack NFSD
> > while there is any chance that a more complete solution will be found.
> > 
> > I'm not quite ready to give up on the idea of squeezing all btrfs inodes
> > into a 64-bit number space.  24 bits of subvol and 40 bits of inode?
> > Make the split a mkfs or mount option?
> 
> Btrfs used to have a subvolume number limit in the past, for different
> reasons.
> 
> In that case, the subvolume number was limited to 48 bits, which is still too
> large to avoid conflicts.
> 
> For inode numbers there is really no limit except the 256 ~ (U64)-256 range.
> 
> Considering all these numbers span almost the full u64 range, conflicts 
> would be unavoidable AFAIK.
> 
> > Maybe hand out inode numbers to subvols in 2^32 chunks so each subvol
> > (which has ever been accessed) has a mapping from the top 32 bits of the
> > objectid to the top 32 bits of the inode number.
> > 
> > We don't need something that is theoretically perfect (that's not
> > possible anyway as we don't have 64 bits of device numbers).  We just
> > need something that is practical and scales adequately.  If you have
> > petabytes of storage, it is reasonable to spend a gigabyte of memory on
> > a lookup table(?).
> 
> Can such squishing-all-inodes-into-one-namespace work be done in a
> more generic way? E.g. let each fs with a "subvolume"-like feature
> provide an interface to do that.

If you know the highest subvol ID number, you can pack two integers into
one larger integer by reversing the bits of the subvol number and ORing
them with the inode number, e.g. 0x0080000000000300 is subvol 256, 
inode 768.

The subvol IDs grow left to right while the inode numbers grow right 
to left.  You can have billions of inodes in a few subvols, or billions of
subvols with a few inodes each, and neither will collide with the other
until there are billions of both.

If the filesystem tracks the number of bits in the highest subvol ID
and the highest inode number, then the inode numbers can be decoded,
and collisions can be detected.  e.g. if the maximum subvol ID on the
filesystem is below 131072, it will fit in 17 bits, then we know bits
63-47 are the subvol ID and bits 46-0 are the inode.  When subvol 131072
is created, the number of subvol bits increases to 18, but if every inode
fits in less than 46 bits, we know that every existing inode has a 0 in
the 18th subvol ID bit of the inode number, so there is no ambiguity.
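
A rough userspace sketch of that packing and unpacking (function names
are made up, and the slow bit-reversal loop is just for illustration):

#include <stdint.h>

/* reverse the bits of a 64-bit value (slow but obvious) */
static uint64_t rev64(uint64_t x)
{
	uint64_t r = 0;

	for (int i = 0; i < 64; i++) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}
	return r;
}

/* subvol grows from the high end, inode from the low end;
 * e.g. pack_ino(256, 768) == 0x0080000000000300 */
static uint64_t pack_ino(uint64_t subvol, uint64_t ino)
{
	/* toy stand-in for the ENOSPC case: the per-number fields
	 * overlap (the real check would compare filesystem-wide
	 * maxima, as described above) */
	if (rev64(subvol) & ino)
		return (uint64_t)-1;
	return rev64(subvol) | ino;
}

/* decode, given the tracked number of subvol ID bits (1..63) */
static void unpack_ino(uint64_t packed, int subvol_bits,
		       uint64_t *subvol, uint64_t *ino)
{
	uint64_t hi = ~0ULL << (64 - subvol_bits); /* top subvol_bits bits */

	*subvol = rev64(packed & hi);	/* rev64 is its own inverse */
	*ino = packed & ~hi;
}

With subvol_bits == 17, unpack_ino(0x0080000000000300, 17, ...) gives
back subvol 256 and inode 768, matching the example above.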

If you don't know the maximum subvol ID, you can guess based on the
position of the large run of zero bits in the middle of the integer--not
reliable, but good enough for a guess if you were looking at 'ls -li'
output (and wrote the inode numbers in hex).

In the pathological case (the maximum subvol ID and maximum inode number
require more than 64 total bits) we return ENOSPC.

This can all be done when btrfs fills in an inode struct.  There's no need
to change the on-disk format, other than to track the highest inode and
subvol number.  btrfs can compute the maxima in reasonable but non-zero
time by searching trees on mount, so an incompatible disk format change
would only be needed to avoid making mount slower.

> Despite that I still hope to have a way to distinguish the "subvolume"
> boundary.

Packing the bits into a single uint64 doesn't help with this--it does
the opposite.  Subvol boundaries become harder to see without deliberate
checking (i.e. not the traditional parent.st_dev != child.st_dev test).

Judging from previous btrfs-related complaints, some users do want
"stealth" subvols whose boundaries are not accidentally visible, so the
new behavior could be a feature for someone.

> Working completely inside btrfs, it's pretty simple to locate a subvolume
> boundary.
> All subvolumes have the same inode number, 256.
> 
> Maybe we could reserve some special "squished" inode number to indicate
> a boundary inside a filesystem.
> 
> E.g. reserve (u64)-1 as a special indicator for subvolume boundaries,
> as most filesystems reserve super-high inode numbers anyway.
> 
> 
> > 
> > If we can make inode numbers unique, we can possibly leave the st_dev
> > changing at subvols so that "du -x" works as currently expected.
> > 
> > One thought I had was to use a strong hash to combine the subvol object
> > id and the inode object id into a 64-bit number.  What is the chance of
> > a collision in practice :-)
> 
> But with just 64 bits, conflicts will happen anyway...

The collision rate might be low enough that we could just skip over the
colliding numbers, but we'd have to have some kind of in-memory collision
map to avoid slowing down inode creation (currently the next inode number
is more or less "++last_inode_number", and looking up inodes to see if
they exist first would slow down new file creation a lot).
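
To make the skip-over-collisions idea concrete, a toy userspace sketch
(the mixer and the linear-scan "collision map" are placeholders; a real
implementation would need a strong hash and a scalable in-memory
structure):

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* toy 64-bit mixer (the splitmix64 finalizer) */
static uint64_t mix64(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* toy in-memory collision map: a linear scan over handed-out numbers */
static uint64_t used[4096];
static size_t nused;

static bool ino_in_use(uint64_t ino)
{
	for (size_t i = 0; i < nused; i++)
		if (used[i] == ino)
			return true;
	return false;
}

/* combine subvol ID and object ID; skip over colliding numbers */
static uint64_t alloc_ino(uint64_t subvol, uint64_t objectid)
{
	uint64_t ino = mix64(objectid ^ mix64(subvol));

	while (ino_in_use(ino))
		ino++;			/* just skip, as suggested above */
	used[nused++] = ino;		/* toy: no bounds check */
	return ino;
}

Note this makes the numbers stable only while the map is remembered,
which is exactly why it would need on-disk or reconstructible state.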

> Thanks,
> Qu
> > 
> > Thanks,
> > NeilBrown
> >