
[0/3] fanotify support for btrfs sub-volumes

Message ID 20231025135048.36153-1-amir73il@gmail.com (mailing list archive)

Message

Amir Goldstein Oct. 25, 2023, 1:50 p.m. UTC
Jan,

This patch set implements your suggestion [1] for handling fanotify
events for filesystems with non-uniform f_fsid.

With these changes, events report the fsid as it would be reported
by statfs(2) on the same object, i.e. the sub-volume's fsid for an inode
in a sub-volume.

This creates a small challenge for the watching program, which needs to map
from the fsid in an event to a stored mount_fd to use with open_by_handle_at(2).
Luckily, for btrfs, fsid[0] is uniform and fsid[1] is per sub-volume.

I have adapted the fsnotifywatch tool [2] to be able to watch a btrfs sb.
The adapted tool detects the special case of btrfs (a bit hacky) and
indexes the mount_fd to be used for open_by_handle_at(2) by fsid[0].

Note that this hackery is not needed when the tool is watching a
single filesystem (no need for a mount_fd lookup table), because btrfs
correctly decodes file handles from any sub-volume with a mount_fd from
any other sub-volume.

Christian is targeting the vfs.f_fsid patches at the second part
of the 6.7 merge window.

If I get an ACK from you and from btrfs developers, perhaps these
patches could also make it to 6.7.

The btrfs patch could go independently via the btrfs tree after the vfs
patch is merged, as could the fanotify patch, but my preference is
to get ACKs and queue them all in the same vfs PR.

Thanks,
Amir.

[1] https://lore.kernel.org/r/20230920110429.f4wkfuls73pd55pv@quack3/
[2] https://github.com/amir73il/inotify-tools/commits/exportfs_encode_fid

Amir Goldstein (3):
  fs: define a new super operation to get fsid
  btrfs: implement super operation to get fsid
  fanotify: support reporting events with fid on btrfs sub-volumes

 Documentation/filesystems/locking.rst |  2 ++
 Documentation/filesystems/vfs.rst     |  4 ++++
 fs/btrfs/super.c                      | 33 ++++++++++++++++++---------
 fs/notify/fanotify/fanotify.c         | 28 +++++++++++++++++------
 fs/notify/fanotify/fanotify_user.c    | 14 ++++++++----
 fs/statfs.c                           | 14 ++++++++++++
 include/linux/fs.h                    |  1 +
 include/linux/statfs.h                |  2 ++
 8 files changed, 76 insertions(+), 22 deletions(-)

Comments

'Christoph Hellwig' Oct. 25, 2023, 3:34 p.m. UTC | #1
On Wed, Oct 25, 2023 at 04:50:45PM +0300, Amir Goldstein wrote:
> Jan,
> 
> This patch set implements your suggestion [1] for handling fanotify
> events for filesystems with non-uniform f_fsid.

File systems must never report non-uniform fsids (or st_dev, for that
matter).  btrfs is simply broken here and needs to be fixed.
Jan Kara Oct. 25, 2023, 5:04 p.m. UTC | #2
On Wed 25-10-23 08:34:21, Christoph Hellwig wrote:
> On Wed, Oct 25, 2023 at 04:50:45PM +0300, Amir Goldstein wrote:
> > Jan,
> > 
> > This patch set implements your suggestion [1] for handling fanotify
> > events for filesystems with non-uniform f_fsid.
> 
> File systems must never report non-uniform fsids (or st_dev, for that
> matter).  btrfs is simply broken here and needs to be fixed.

Well, this is the discussion about how btrfs should be presenting its
subvolumes to the VFS / userspace, isn't it? I never dived into that too
closely, but as far as I remember it was discussed to death without finding a
solution acceptable to all parties. I guess having a different fsid per
subvolume makes sense (and we can't change that, given it has been like that
forever, even if we wanted to). Having different subvolumes share one
superblock is more disputable, but there were reasons for that as well. So I'm
not sure how you imagine resolving this...

								Honza
Amir Goldstein Oct. 25, 2023, 5:17 p.m. UTC | #3
On Wed, Oct 25, 2023 at 4:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Jan,
>
> This patch set implements your suggestion [1] for handling fanotify
> events for filesystems with non-uniform f_fsid.
>
> With these changes, events report the fsid as it would be reported
> by statfs(2) on the same object, i.e. the sub-volume's fsid for an inode
> in sub-volume.
>
> This creates a small challenge to watching program, which needs to map
> from fsid in event to a stored mount_fd to use with open_by_handle_at(2).
> Luckily, for btrfs, fsid[0] is uniform and fsid[1] is per sub-volume.
>
> I have adapted fsnotifywatch tool [2] to be able to watch btrfs sb.
> The adapted tool detects the special case of btrfs (a bit hacky) and
> indexes the mount_fd to be used for open_by_handle_at(2) by fsid[0].
>
> Note that this hackery is not needed when the tool is watching a
> single filesystem (no need for mount_fd lookup table), because btrfs
> correctly decodes file handles from any sub-volume with mount_fd from
> any other sub-volume.

Jan,

Now that I've implemented the userspace part of the btrfs sb watch,
I realize that if userspace has to be aware of the fsid oddity of btrfs
anyway, maybe reporting the accurate fsid of the object in the event is
not that important after all.

Facts:
1. file_handle is unique across all sub-volumes and can be resolved
    from any fd on any sub-volume
2. fsid[0] can be compared to match an event to a btrfs sb, where any
    fd can be used to resolve file_handle
3. userspace needs to be aware of this fsid[0] fact if it watches more
    than a single sb, and need not care about the value of fsid in the
    event at all when watching a single sb
4. even though fanotify never allowed setting sb mark on a path inside
    btrfs sub-volume, it always reported events on inodes in sub-volumes
    to btrfs sb watch - those events always carried the "wrong" fsid (i.e.
    the btrfs root volume fsid)
5. we already agreed that setting up inode marks on inodes inside a
    sub-volume should be a no-brainer

If we allow reporting either the sub-vol fsid or the root-vol fsid (exactly
as we do for inodes in sub-vols in current upstream), because we assume
that userspace knows about the fsid[0] trick, then we can just
remove the -EXDEV error and be done with it.

Thoughts?

Thanks,
Amir.
Amir Goldstein Oct. 25, 2023, 6:02 p.m. UTC | #4
On Wed, Oct 25, 2023 at 8:17 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Oct 25, 2023 at 4:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Jan,
> >
> > This patch set implements your suggestion [1] for handling fanotify
> > events for filesystems with non-uniform f_fsid.
> >
> > With these changes, events report the fsid as it would be reported
> > by statfs(2) on the same object, i.e. the sub-volume's fsid for an inode
> > in sub-volume.
> >
> > This creates a small challenge to watching program, which needs to map
> > from fsid in event to a stored mount_fd to use with open_by_handle_at(2).
> > Luckily, for btrfs, fsid[0] is uniform and fsid[1] is per sub-volume.
> >
> > I have adapted fsnotifywatch tool [2] to be able to watch btrfs sb.
> > The adapted tool detects the special case of btrfs (a bit hacky) and
> > indexes the mount_fd to be used for open_by_handle_at(2) by fsid[0].
> >
> > Note that this hackery is not needed when the tool is watching a
> > single filesystem (no need for mount_fd lookup table), because btrfs
> > correctly decodes file handles from any sub-volume with mount_fd from
> > any other sub-volume.
>
> Jan,
>
> Now that I've implemented the userspace part of btrfs sb watch,
> I realize that if userspace has to be aware of the fsid oddity of btrfs
> anyway, maybe reporting the accurate fsid of the object in event is
> not that important at all.
>
> Facts:
> 1. file_handle is unique across all sub-volumes and can be resolved
>     from any fd on any sub-volume
> 2. fsid[0] can be compared to match an event to a btrfs sb, where any
>     fd can be used to resolve file_handle
> 3. userspace needs to be aware of this fsid[0] fact if it watches more
>     than a single sb and userspace needs not care about the value of
>     fsid in event at all when watching a single sb
> 4. even though fanotify never allowed setting sb mark on a path inside
>     btrfs sub-volume, it always reported events on inodes in sub-volumes
>     to btrfs sb watch - those events always carried the "wrong" fsid (i.e.
>     the btrfs root volume fsid)
> 5. we already agreed that setting up inode marks on inodes inside
>     sub-volume should be a no brainer
>
> If we allow reporting either sub-vol fsid or root-vol fsid (exactly as
> we do for inodes in sub-vol in current upstream),

Another way to put it is that the fsid in an event describes the object
that was used to set up the mark, not the target object.

If an event is received via an inode/sb/mount mark, the fsid would
always describe the fsid of the inode that was used to set up the mark,
and that is always the fsid that userspace would get from statfs(2)
at the time of the fanotify_mark(2) call.

Maybe it is non-trivial to document, but for a library that returns
an opaque "watch descriptor", the "watch descriptor" can always
be deduced from the event.

Does this make sense?

Thanks,
Amir.
Josef Bacik Oct. 25, 2023, 9:06 p.m. UTC | #5
On Wed, Oct 25, 2023 at 08:34:21AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 25, 2023 at 04:50:45PM +0300, Amir Goldstein wrote:
> > Jan,
> > 
> > This patch set implements your suggestion [1] for handling fanotify
> > events for filesystems with non-uniform f_fsid.
> 
> > File systems must never report non-uniform fsids (or st_dev, for that
> > matter).  btrfs is simply broken here and needs to be fixed.

We keep going around and around on this so I'd like to get a set of steps laid
out for us to work towards to resolve this once and for all.

HYSTERICAL RAISINS (why we do st_dev)
-------------------------------------

Chris made this decision forever ago because things like rsync would screw up
with snapshots and end up backing up the same thing over and over again.  We saw
it was using st_dev (as were a few other standard tools) to distinguish between
file systems, so we abused this to make userspace happy.

The other nice thing this provided was a solution for the fact that we re-use
inode numbers in the file system, as they're unique only within a subvolume.

PROBLEMS WE WANT TO SOLVE
-------------------------

1) Stop abusing st_dev.  We actually want this as btrfs developers because it's
   kind of annoying to figure out which device is mounted when st_dev doesn't
   map to any of the devices in /proc/mounts.

2) Give user space a way to tell it's on a subvolume, so it can not be confused
   by the repeating inode numbers.

POSSIBLE SOLUTIONS
------------------

1) A statx field for subvolume id.  The subvolume id's are unique to the file
   system, so subvolume id + inode number is unique to the file system.  This is
   a u64, so is nice and easy to export through statx.
2) A statx field for the uuid/fsid of the file system.  I'd like this because
   again, being able to easily stat a couple of files and tell they're on the
   same file system is a valuable thing.  We have a per-fs uuid that we can
   export here.
3) A statx field for the uuid of the subvolume.  Our subvolumes have their own
   unique uuid.  This could be an alternative for the subvolume id option, or an
   addition.

Either 1 or 3 is necessary to give userspace a way to tell it has wandered
into a different subvolume.  I'd like to have all 3, but I recognize that may be
wishful thinking.  2 isn't necessary, but if we're going to go about messing
with statx then I'd like to do it all at once, and I want this for the reasons
stated above.

SEQUENCE OF EVENTS
------------------

We do one of the statx changes, that rolls into a real kernel.  We run around
and submit patches for rsync and anything else we can think of to take advantage
of the statx feature.

Then we wait, call it 2 kernel releases after the initial release.  Then we go
and rip out the dev_t hack.

Does this sound like a reasonable path forward to resolve everybody's concerns?
I feel like I'm missing some other argument here, but I'm currently on vacation
and can't think of what it is nor have the energy to go look it up at the
moment.  Thanks,

Josef
Qu Wenruo Oct. 25, 2023, 11:02 p.m. UTC | #6
On 2023/10/26 07:36, Josef Bacik wrote:
> On Wed, Oct 25, 2023 at 08:34:21AM -0700, Christoph Hellwig wrote:
>> On Wed, Oct 25, 2023 at 04:50:45PM +0300, Amir Goldstein wrote:
>>> Jan,
>>>
>>> This patch set implements your suggestion [1] for handling fanotify
>>> events for filesystems with non-uniform f_fsid.
>>
>> File systems must never report non-uniform fsids (or st_dev, for that
>> matter).  btrfs is simply broken here and needs to be fixed.
>
> We keep going around and around on this so I'd like to get a set of steps laid
> out for us to work towards to resolve this once and for all.
>
> HYSTERICAL RAISINS (why we do st_dev)
> -------------------------------------
>
> Chris made this decision forever ago because things like rsync would screw up
> with snapshots and end up backing up the same thing over and over again.  We saw
> it was using st_dev (as were a few other standard tools) to distinguish between
> file systems, so we abused this to make userspace happy.
>
> The other nice thing this provided was a solution for the fact that we re-use
> inode numbers in the file system, as they're unique for the subvolume only.
>
> PROBLEMS WE WANT TO SOLVE
> -------------------------
>
> 1) Stop abusing st_dev.  We actually want this as btrfs developers because it's
>     kind of annoying to figure out which device is mounted when st_dev doesn't
>     map to any of the devices in /proc/mounts.
>
> 2) Give user space a way to tell it's on a subvolume, so it can not be confused
>     by the repeating inode numbers.
>
> POSSIBLE SOLUTIONS
> ------------------
>
> 1) A statx field for subvolume id.  The subvolume id's are unique to the file
>     system, so subvolume id + inode number is unique to the file system.  This is
>     a u64, so is nice and easy to export through statx.
> 2) A statx field for the uuid/fsid of the file system.  I'd like this because
>     again, being able to easily stat a couple of files and tell they're on the
>     same file system is a valuable thing.  We have a per-fs uuid that we can
>     export here.
> 3) A statx field for the uuid of the subvolume.  Our subvolumes have their own
>     unique uuid.  This could be an alternative for the subvolume id option, or an
>     addition.

No need for a full UUID; just a u64 is good enough.

Although a full UUID for the subvolumes won't hurt, and would reduce the
need to call the btrfs-specific ioctl just to retrieve the UUID.

My concern is: since such new members would not be utilized by any other fs,
would they cause any compatibility problems?

>
> Either 1 or 3 are necessary to give userspace a way to tell they've wandered
> into a different subvolume.  I'd like to have all 3, but I recognize that may be
> wishful thinking.  2 isn't necessary, but if we're going to go about messing
> with statx then I'd like to do it all at once, and I want this for the reasons
> stated above.
>
> SEQUENCE OF EVENTS
> ------------------
>
> We do one of the statx changes, that rolls into a real kernel.  We run around
> and submit patches for rsync and anything else we can think of to take advantage
> of the statx feature.

My main concern is: how would older programs handle this? Like programs
using stat() only, which for whatever reason don't bother to add
statx() support.
(Reasons can vary from lack of maintenance to weird compatibility constraints.)

Thus we still need the st_dev hack until there are no real-world
programs using vanilla stat() only.
(Which everyone knows is impossible.)

Thanks,
Qu
>
> Then we wait, call it 2 kernel releases after the initial release.  Then we go
> and rip out the dev_t hack.
>
> Does this sound like a reasonable path forward to resolve everybody's concerns?
> I feel like I'm missing some other argument here, but I'm currently on vacation
> and can't think of what it is nor have the energy to go look it up at the
> moment.  Thanks,
>
> Josef
Amir Goldstein Oct. 26, 2023, 5:49 a.m. UTC | #7
On Thu, Oct 26, 2023 at 2:02 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2023/10/26 07:36, Josef Bacik wrote:
> > On Wed, Oct 25, 2023 at 08:34:21AM -0700, Christoph Hellwig wrote:
> >> On Wed, Oct 25, 2023 at 04:50:45PM +0300, Amir Goldstein wrote:
> >>> Jan,
> >>>
> >>> This patch set implements your suggestion [1] for handling fanotify
> >>> events for filesystems with non-uniform f_fsid.
> >>
> >> File systems must never report non-uniform fsids (or st_dev, for that
> >> matter).  btrfs is simply broken here and needs to be fixed.
> >
> > We keep going around and around on this so I'd like to get a set of steps laid
> > out for us to work towards to resolve this once and for all.
> >
> > HYSTERICAL RAISINS (why we do st_dev)
> > -------------------------------------
> >
> > Chris made this decision forever ago because things like rsync would screw up
> > with snapshots and end up backing up the same thing over and over again.  We saw
> > it was using st_dev (as were a few other standard tools) to distinguish between
> > file systems, so we abused this to make userspace happy.
> >
> > The other nice thing this provided was a solution for the fact that we re-use
> > inode numbers in the file system, as they're unique for the subvolume only.
> >
> > PROBLEMS WE WANT TO SOLVE
> > -------------------------
> >
> > 1) Stop abusing st_dev.  We actually want this as btrfs developers because it's
> >     kind of annoying to figure out which device is mounted when st_dev doesn't
> >     map to any of the devices in /proc/mounts.
> >
> > 2) Give user space a way to tell it's on a subvolume, so it can not be confused
> >     by the repeating inode numbers.
> >
> > POSSIBLE SOLUTIONS
> > ------------------
> >
> > 1) A statx field for subvolume id.  The subvolume id's are unique to the file
> >     system, so subvolume id + inode number is unique to the file system.  This is
> >     a u64, so is nice and easy to export through statx.
> > 2) A statx field for the uuid/fsid of the file system.  I'd like this because
> >     again, being able to easily stat a couple of files and tell they're on the
> >     same file system is a valuable thing.  We have a per-fs uuid that we can
> >     export here.
> > 3) A statx field for the uuid of the subvolume.  Our subvolumes have their own
> >     unique uuid.  This could be an alternative for the subvolume id option, or an
> >     addition.
>
> No need for a full UUID, just a u64 is good enough.
>
> Although a full UUID for the subvolumes won't hurt and can reduce the
> need to call the btrfs specific ioctl just to receive the UUID.
>
>
> My concern is, such new members would not be utilized by any other fs,
> would it cause some compatibility problem?
>
> >
> > Either 1 or 3 are necessary to give userspace a way to tell they've wandered
> > into a different subvolume.  I'd like to have all 3, but I recognize that may be
> > wishful thinking.  2 isn't necessary, but if we're going to go about messing
> > with statx then I'd like to do it all at once, and I want this for the reasons
> > stated above.
> >
> > SEQUENCE OF EVENTS
> > ------------------
> >
> > We do one of the statx changes, that rolls into a real kernel.  We run around
> > and submit patches for rsync and anything else we can think of to take advantage
> > of the statx feature.
>
> My main concern is, how older programs could handle this? Like programs
> utilizing stat() only, and for whatever reasons they don't bother to add
> statx() support.
> (Can vary from lack of maintenance to weird compatibility reasons)
>
> Thus we still need such st_dev hack, until there is no real world
> programs utilizing vanilla stat() only.
> (Which everyone knows it's impossible)
>

I agree it does not sound possible to teach the world that the same
st_dev,st_ino pair could belong to different objects.

One such program, btw, is diff - it will skip the comparison if both
objects have the same st_dev,st_ino, even if they are actually
different objects with different data (i.e. a file and its old snapshot).

Thanks,
Amir.
Jan Kara Oct. 26, 2023, 12:17 p.m. UTC | #8
On Wed 25-10-23 21:02:45, Amir Goldstein wrote:
> On Wed, Oct 25, 2023 at 8:17 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Oct 25, 2023 at 4:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Jan,
> > >
> > > This patch set implements your suggestion [1] for handling fanotify
> > > events for filesystems with non-uniform f_fsid.
> > >
> > > With these changes, events report the fsid as it would be reported
> > > by statfs(2) on the same object, i.e. the sub-volume's fsid for an inode
> > > in sub-volume.
> > >
> > > This creates a small challenge to watching program, which needs to map
> > > from fsid in event to a stored mount_fd to use with open_by_handle_at(2).
> > > Luckily, for btrfs, fsid[0] is uniform and fsid[1] is per sub-volume.
> > >
> > > I have adapted fsnotifywatch tool [2] to be able to watch btrfs sb.
> > > The adapted tool detects the special case of btrfs (a bit hacky) and
> > > indexes the mount_fd to be used for open_by_handle_at(2) by fsid[0].
> > >
> > > Note that this hackery is not needed when the tool is watching a
> > > single filesystem (no need for mount_fd lookup table), because btrfs
> > > correctly decodes file handles from any sub-volume with mount_fd from
> > > any other sub-volume.
> >
> > Jan,
> >
> > Now that I've implemented the userspace part of btrfs sb watch,
> > I realize that if userspace has to be aware of the fsid oddity of btrfs
> > anyway, maybe reporting the accurate fsid of the object in event is
> > not that important at all.
> >
> > Facts:
> > 1. file_handle is unique across all sub-volumes and can be resolved
> >     from any fd on any sub-volume
> > 2. fsid[0] can be compared to match an event to a btrfs sb, where any
> >     fd can be used to resolve file_handle
> > 3. userspace needs to be aware of this fsid[0] fact if it watches more
> >     than a single sb and userspace needs not care about the value of
> >     fsid in event at all when watching a single sb
> > 4. even though fanotify never allowed setting sb mark on a path inside
> >     btrfs sub-volume, it always reported events on inodes in sub-volumes
> >     to btrfs sb watch - those events always carried the "wrong" fsid (i.e.
> >     the btrfs root volume fsid)
> > 5. we already agreed that setting up inode marks on inodes inside
> >     sub-volume should be a no brainer
> >
> > If we allow reporting either sub-vol fsid or root-vol fsid (exactly as
> > we do for inodes in sub-vol in current upstream),
> 
> Another way to put it is that fsid in event describes the object
> that was used to setup the mark not the target object.
> 
> If an event is received via an inode/sb/mount mark, the fsid
> would always describe the fsid of the inode that was used to setup
> the mark and that is always the fsid that userspace would query
> statfs(2) at the time of calling the fanotify_mark(2) call.
> 
> Maybe it is non trivial to document, but for a library that returns
> an opaque "watch descriptor", the "watch descriptor" can always
> be deduced from the event.
> 
> Does this make sense?

Yes, it makes sense if we always reported the event with the fsid of the
object used for placing the mark. For filesystems with a homogeneous fsid
there's no difference; for btrfs it looks like the least surprising choice and
works well for inode marks as well. The only catch is in the internal fsnotify
implementation AFAICT - if we have multiple marks for the same btrfs
superblock, each mark on a different subvolume, then we should be reporting
one event with different fsids for different marks. So we need to cache the
fsid in the mark and not in the connector. But that should be easy to do.

								Honza
Amir Goldstein Oct. 26, 2023, 12:36 p.m. UTC | #9
On Thu, Oct 26, 2023 at 3:17 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 25-10-23 21:02:45, Amir Goldstein wrote:
> > On Wed, Oct 25, 2023 at 8:17 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Wed, Oct 25, 2023 at 4:50 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > Jan,
> > > >
> > > > This patch set implements your suggestion [1] for handling fanotify
> > > > events for filesystems with non-uniform f_fsid.
> > > >
> > > > With these changes, events report the fsid as it would be reported
> > > > by statfs(2) on the same object, i.e. the sub-volume's fsid for an inode
> > > > in sub-volume.
> > > >
> > > > This creates a small challenge to watching program, which needs to map
> > > > from fsid in event to a stored mount_fd to use with open_by_handle_at(2).
> > > > Luckily, for btrfs, fsid[0] is uniform and fsid[1] is per sub-volume.
> > > >
> > > > I have adapted fsnotifywatch tool [2] to be able to watch btrfs sb.
> > > > The adapted tool detects the special case of btrfs (a bit hacky) and
> > > > indexes the mount_fd to be used for open_by_handle_at(2) by fsid[0].
> > > >
> > > > Note that this hackery is not needed when the tool is watching a
> > > > single filesystem (no need for mount_fd lookup table), because btrfs
> > > > correctly decodes file handles from any sub-volume with mount_fd from
> > > > any other sub-volume.
> > >
> > > Jan,
> > >
> > > Now that I've implemented the userspace part of btrfs sb watch,
> > > I realize that if userspace has to be aware of the fsid oddity of btrfs
> > > anyway, maybe reporting the accurate fsid of the object in event is
> > > not that important at all.
> > >
> > > Facts:
> > > 1. file_handle is unique across all sub-volumes and can be resolved
> > >     from any fd on any sub-volume
> > > 2. fsid[0] can be compared to match an event to a btrfs sb, where any
> > >     fd can be used to resolve file_handle
> > > 3. userspace needs to be aware of this fsid[0] fact if it watches more
> > >     than a single sb and userspace needs not care about the value of
> > >     fsid in event at all when watching a single sb
> > > 4. even though fanotify never allowed setting sb mark on a path inside
> > >     btrfs sub-volume, it always reported events on inodes in sub-volumes
> > >     to btrfs sb watch - those events always carried the "wrong" fsid (i.e.
> > >     the btrfs root volume fsid)
> > > 5. we already agreed that setting up inode marks on inodes inside
> > >     sub-volume should be a no brainer
> > >
> > > If we allow reporting either sub-vol fsid or root-vol fsid (exactly as
> > > we do for inodes in sub-vol in current upstream),
> >
> > Another way to put it is that fsid in event describes the object
> > that was used to setup the mark not the target object.
> >
> > If an event is received via an inode/sb/mount mark, the fsid
> > would always describe the fsid of the inode that was used to setup
> > the mark and that is always the fsid that userspace would query
> > statfs(2) at the time of calling the fanotify_mark(2) call.
> >
> > Maybe it is non trivial to document, but for a library that returns
> > an opaque "watch descriptor", the "watch descriptor" can always
> > be deduced from the event.
> >
> > Does this make sense?
>
> Yes, it makes sense if we always reported event with fsid of the object
> used for placing the mark. For filesystems with homogeneous fsid there's no
> difference, for btrfs it looks like the least surprising choice and works
> well for inode marks as well. The only catch is in the internal fsnotify
> implementation AFAICT - if we have multiple marks for the same btrfs
> superblock, each mark on different subvolume, then we should be reporting
> one event with different fsids for different marks. So we need to cache the
> fsid in the mark and not in the connector. But that should be easy to do.

True. I thought about this as well.

I have prepared the would-be man page (fanotify.7) update:

fsid   This is a unique identifier of the filesystem containing the object
       associated with the event.  It is a structure of type __kernel_fsid_t
       and contains the same value reported in f_fsid when calling statfs(2)
       with the same pathname argument that was used for fanotify_mark(2).
       Note that some filesystems (e.g., btrfs(5)) report non-uniform values
       of f_fsid on different objects of the same filesystem.  In these
       cases, if fanotify_mark(2) is called several times with different
       pathname values, the fsid value reported in events will match the
       f_fsid associated with at least one of those pathname values.

FWIW, I removed the -EXDEV subvolume change and ran the upstream
fsnotifywatch tool (without the btrfs magic) on a btrfs sub-volume, and it
correctly resolves paths for all events in the filesystem:

root@kvm-xfstests:~# fsnotifywatch --filesystem /vdf/sub/vol/ &
[1] 1703
root@kvm-xfstests:~# Establishing watches...
Setting up filesystem watch on /vdf/sub/vol/
Finished establishing watches, now collecting statistics.

root@kvm-xfstests:~# touch /vdf/x /vdf/sub/vol/y
root@kvm-xfstests:~# fg
fsnotifywatch --filesystem /vdf/sub/vol/
^Ctotal  attrib  close_write  open  create  filename
3      1       1            1     1       /vdf/x
3      1       1            1     1       /vdf/sub/vol/y

I will prepare the patch.

Thanks,
Amir.
'Christoph Hellwig' Oct. 27, 2023, 5:44 a.m. UTC | #10
On Wed, Oct 25, 2023 at 07:04:45PM +0200, Jan Kara wrote:
> Well, this is the discussion how btrfs should be presenting its subvolumes
> to VFS / userspace, isn't it?

Yes.  We've been pressing to resolve this forever, but it's been ignored.

> I never dived into that too closely but as
> far as I remember it was discussed to death without finding an acceptable
> (to all parties) solution? I guess having a different fsid per subvolume
> makes sense (and we can't change that given it is like that forever even if
> we wanted). Having different subvolumes share one superblock is more
> disputable but there were reasons for that as well. So I'm not sure how you
> imagine to resolve this...

We need to solve this kernel-wide, and right now the kernel doesn't
support different dev_t / fsids inside a single file system at all.
SuSE hacks around that badly for limited user interfaces with the
horrible get_inode_dev method they've added, but this has been rejected
upstream for good reason.  What this series does is to add another
limited version of this through the backdoor.
'Christoph Hellwig' Oct. 27, 2023, 5:46 a.m. UTC | #11
I think you're missing the point.  A bunch of statx fields might be
useful, but they are not solving the problem.  What you need is
a separate vfsmount per subvolume so that userspace sees when it
is crossing into one.  We probably can't force this onto existing
users, so it needs a mount option, or even better an on-disk option,
but without that we're not getting anywhere.
Jan Kara Oct. 27, 2023, 10:58 a.m. UTC | #12
On Thu 26-10-23 22:44:26, Christoph Hellwig wrote:
> On Wed, Oct 25, 2023 at 07:04:45PM +0200, Jan Kara wrote:
> > Well, this is the discussion how btrfs should be presenting its subvolumes
> > to VFS / userspace, isn't it?
> 
> Yes.  Which we've pressured to resolve forever, but it's been ignored.
> 
> > I never dived into that too closely but as
> > far as I remember it was discussed to death without finding an acceptable
> > (to all parties) solution? I guess having a different fsid per subvolume
> > makes sense (and we can't change that given it is like that forever even if
> > we wanted). Having different subvolumes share one superblock is more
> > disputable but there were reasons for that as well. So I'm not sure how you
> > imagine to resolve this...
> 
> We need to solve this kernel-wide, and right now the kernel doesn't
> support different dev_t / fsids inside a single file system at all.
> SuSE hacks around that badly for limited user interfaces with the
> horrible get_inode_dev method they've added, but this has been rejected
> upstream for good reason.  What this series does is to add another
> limited version of this through the backdoor.

OK, I see. I agree adding ->get_fsid is just piling on top of the problems
so I can see why you don't like it. Band aids are double-edged sword ;)

								Honza
Amir Goldstein Oct. 28, 2023, 5:57 a.m. UTC | #13
On Fri, Oct 27, 2023 at 4:17 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Thu, Oct 26, 2023 at 10:46:01PM -0700, Christoph Hellwig wrote:
> > I think you're missing the point.  A bunch of statx fields might be
> > useful, but they are not solving the problem.  What you need is
> > a separate vfsmount per subvolume so that userspace sees when it
> > is crossing into it.  We probably can't force this onto existing
> > users, so it needs a mount, or even better on-disk option but without
> > that we're not getting anywhere.
> >
>
> We have this same discussion every time, and every time you stop responding
> after I point out the problems with it.
>
> A per-subvolume vfsmount means that /proc/mounts /proc/$PID/mountinfo becomes
> insanely dumb.  I've got millions of machines in this fleet with thousands of
> subvolumes.  One of our workloads fires up several containers per task and runs
> multiple tasks per machine, so on the order of 10-20k subvolumes.
>

I think it is probably just as common to see similar workloads using
overlayfs for containers, especially since the more you scale the
number of containers, the more you need inode page cache sharing
between them.

Overlayfs has an sb/vfsmount per instance, so any users having problems
with a huge number of mounts would have already complained about it,
and maybe they have, because...

> So now I've got thousands of entries in /proc/mounts, and literally every system
> related tool parses /proc/mounts every 4 nanoseconds, now I'm significantly
> contributing to global warming from the massive amount of CPU usage that is
> burned parsing this stupid file.
>

...after Miklos sorts out the new listmount()/statmount() syscalls and
mount tree change notifications, maybe vfsmount per btrfs subvol could
be reconsidered? ;)

> Additionally, now you're ending up with potentially sensitive information being
> leaked through /proc/mounts that you didn't expect to be leaked before.  I've
> got users complaining to be me because "/home/john/twilight_fanfic" showed up in
> their /proc/mounts.
>

This makes me wonder.
I understand why using diverse st_dev is needed for btrfs snapshots
where the same st_ino can have different revisions.
I am not sure I understand why diverse st_dev is needed for subvols
that are created for containerisation reasons.
Don't files in sub-vols have unique st_ino anyway?
Is the st_dev mitigation for sub-vol a must or just an implementation
convenience?

> And then there's the expiry thing.  Now they're just directories, reclaim works
> like it works for anything else.  With auto mounts they have to expire at some
> point, which makes them so much more heavier weight than we want to sign up for.
> Who knows what sort of scalability issues we'll run into.
>

I agree that this aspect of auto mount is unfortunate, but I think it would
benefit other fs that support auto mount to improve reclaiming of auto mounts.

In the end, I think that we all understand that the legacy btrfs behavior
is not going away without an opt-in, but I think it would be a good outcome
if users could choose the tradeoff between efficiency of single mount vs.
working well with features like nfs export and fanotify subvol watch.

An incentive to migrate to the "multi-sb" btrfs mode would create
pressure from end users on distros, and from there on project leaders,
to fix the issues that you mentioned related to the huge number of
mounts and auto mount reclaim.

Thanks,
Amir.
'Christoph Hellwig' Oct. 30, 2023, 1:25 p.m. UTC | #14
On Fri, Oct 27, 2023 at 09:17:26AM -0400, Josef Bacik wrote:
> We have this same discussion every time, and every time you stop responding
> after I point out the problems with it.

I'm not sure why you think I stop responding when the action item is not
for me.

> A per-subvolume vfsmount means that /proc/mounts /proc/$PID/mountinfo becomes
> insanely dumb.  I've got millions of machines in this fleet with thousands of
> subvolumes.  One of our workloads fires up several containers per task and runs
> multiple tasks per machine, so on the order of 10-20k subvolumes.

If you do not expose them in /proc/mounts and friends, you don't have
any way to show where the mount point (or whatever barrier you want
between different st_dev/fsid/etc.) is.  And you need that.

I've not seen any good alternative that keeps the invariant and doesn't
break things.  If you think there is one please explain it, but no,
adding new fields so that new software can see how we've broken all
the old software is not the answer.

A (not really) great partial answer would be to stop parsing the
text mount information so much, at least for your workloads, which
definitely are a little on the extreme side.  I think Miklos has been
working on an interface that could be useful for that.
Christian Brauner Oct. 31, 2023, 12:14 p.m. UTC | #15
> > A per-subvolume vfsmount means that /proc/mounts /proc/$PID/mountinfo becomes

So that part confuses me and I'd like to understand this a bit more.

So every time you create a subvolume, what you're doing today is that you
give it an anonymous device number stored in ->anon_dev, which presumably
is also stored on disk?

Say I have a btrfs filesystem with 2 subvolumes on /dev/sda:

/mnt/subvol1
/mnt/subvol2

What happens in the kernel right now is what I mentioned in the mount API
conversion patch for btrfs I sent out in June at [1], because I tweaked
that behavior. Say I mount both subvolumes:

mount /dev/sda -o subvol=subvol1 /vol1 # sb1@vfsmount1
mount /dev/sda -o subvol=subvol2 /vol2 # sb1@vfsmount2

It creates a superblock for /dev/sda. It then creates two vfsmounts: one
for subvol1 and one for subvol2. So you end up with two subvolumes on
the same superblock.

So if you mount a subvolume today then you already get separate
vfsmounts. To put it another way. If you start 10,000 containers each
using a separate btrfs subvolume then you get 10,000 vfsmounts.

So I don't yet understand the scaling argument if each subvolume has a
vfsmount anyway because afaict that's already the case.

Or is it that you want a separate superblock per subvolume? Because only
if you allocate a new superblock you'll get clean device number
handling, no? Or am I misunderstanding this?

mount /dev/sda -o subvol=subvol1 /vol1 # sget_fc() -> sb1@vfsmount1
mount /dev/sda -o subvol=subvol2 /vol2 # sget_fc() -> sb2@vfsmount2

and mounting the same subvolume again somewhere else gives you the same
superblock but on a different vfsmount:

mount /dev/sda -o subvol=subvol1 /vol1 # sget_fc() -> sb1@vfsmount3

Is that the proposal?

[1]: https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-2-045e9735a00b@kernel.org
'Christoph Hellwig' Oct. 31, 2023, 12:22 p.m. UTC | #16
On Tue, Oct 31, 2023 at 01:14:42PM +0100, Christian Brauner wrote:
> What happens in the kernel right now I've mentiond in the mount api
> conversion patch for btrfs I sent out in June at [1] because I tweaked
> that behavior. Say I mount both subvolumes:
> 
> mount /dev/sda -o subvol=subvol1 /vol1 # sb1@vfsmount1
> mount /dev/sda -o subvol=subvol2 /vol2 # sb1@vfsmount2
> 
> It creates a superblock for /dev/sda. It then creates two vfsmounts: one
> for subvol1 and one for subvol2. So you end up with two subvolumes on
> the same superblock.
> 
> So if you mount a subvolume today then you already get separate
> vfsmounts. To put it another way. If you start 10,000 containers each
> using a separate btrfs subvolume then you get 10,000 vfsmounts.

But only if you mount them explicitly, which you don't have to.

> Or is it that you want a separate superblock per subvolume?

Does "you" refer to me here?  No, I don't.

> Because only
> if you allocate a new superblock you'll get clean device number
> handling, no? Or am I misunderstanding this?

If you allocate a super block you get it for free.  If you don't,
you have to manually allocate it and report it in ->getattr.
Christian Brauner Oct. 31, 2023, 12:50 p.m. UTC | #17
On Tue, Oct 31, 2023 at 05:22:46AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 31, 2023 at 01:14:42PM +0100, Christian Brauner wrote:
> > What happens in the kernel right now I've mentiond in the mount api
> > conversion patch for btrfs I sent out in June at [1] because I tweaked
> > that behavior. Say I mount both subvolumes:
> > 
> > mount /dev/sda -o subvol=subvol1 /vol1 # sb1@vfsmount1
> > mount /dev/sda -o subvol=subvol2 /vol2 # sb1@vfsmount2
> > 
> > It creates a superblock for /dev/sda. It then creates two vfsmounts: one
> > for subvol1 and one for subvol2. So you end up with two subvolumes on
> > the same superblock.
> > 
> > So if you mount a subvolume today then you already get separate
> > vfsmounts. To put it another way. If you start 10,000 containers each
> > using a separate btrfs subvolume then you get 10,000 vfsmounts.
> 
> But only if you mount them explicitly, which you don't have to.

Yep, I'm aware.

> 
> > Or is it that you want a separate superblock per subvolume?
> 
> Does "you" refer to me here?  No, I don't.
> 
> > Because only
> > if you allocate a new superblock you'll get clean device number
> > handling, no? Or am I misunderstanding this?
> 
> If you allocate a super block you get it for free.  If you don't
> you have to manually allocate it report it in ->getattr.

So this is effectively a request for:

btrfs subvolume create /mnt/subvol1

to create vfsmounts? IOW,

mkfs.btrfs /dev/sda
mount /dev/sda /mnt
btrfs subvolume create /mnt/subvol1
btrfs subvolume create /mnt/subvol2

would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
afterwards?

That might be odd. Because these vfsmounts aren't really mounted, no?
And so you'd be showing potentially hundreds of mounts in
/proc/<pid>/mountinfo that you can't unmount?

And even if you treat them as mounted, what would unmounting mean? I'm
not saying that it's a show stopper, but we would need a clear
understanding of what semantics we're after.

My knee-jerk reaction is that if we wanted each btrfs subvolume to be a
vfsmount then we don't want to have them show up in
/proc/<pid>/mountinfo _unless_ they're actually mounted.
'Christoph Hellwig' Oct. 31, 2023, 5:06 p.m. UTC | #18
On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> So this is effectively a request for:
> 
> btrfs subvolume create /mnt/subvol1
> 
> to create vfsmounts? IOW,
> 
> mkfs.btrfs /dev/sda
> mount /dev/sda /mnt
> btrfs subvolume create /mnt/subvol1
> btrfs subvolume create /mnt/subvol2
> 
> would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> afterwards?

Yes.

> That might be odd. Because these vfsmounts aren't really mounted, no?

Why aren't they?

> And so you'd be showing potentially hundreds of mounts in
> /proc/<pid>/mountinfo that you can't unmount?

Why would you not allow them to be unmounted?

> And even if you treat them as mounted what would unmounting mean?

The code in btrfs_lookup_dentry that does a hand crafted version
of the file system / subvolume crossing (the location.type !=
BTRFS_INODE_ITEM_KEY one) would not be executed.
Qu Wenruo Nov. 1, 2023, 12:03 a.m. UTC | #19
On 2023/11/1 03:36, Christoph Hellwig wrote:
> On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
>> So this is effectively a request for:
>>
>> btrfs subvolume create /mnt/subvol1
>>
>> to create vfsmounts? IOW,
>>
>> mkfs.btrfs /dev/sda
>> mount /dev/sda /mnt
>> btrfs subvolume create /mnt/subvol1
>> btrfs subvolume create /mnt/subvol2
>>
>> would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
>> afterwards?
>
> Yes.
>
>> That might be odd. Because these vfsmounts aren't really mounted, no?
>
> Why aren't they?

So did you mean that, if we have a btrfs with two subvolumes under the
fs tree:

  /subv1
  /subv2

then when we mount the fs root, subv1 and subv2 should both show up
in mountinfo?

Can we make this more dynamic? Like only initializing the vfsmount
when the subvolume tree gets its first read?

>
>> And so you'd be showing potentially hundreds of mounts in
>> /proc/<pid>/mountinfo that you can't unmount?
>
> Why would you not allow them to be unmounted?

This unmount may not make much sense, as:

- It breaks the assumption that all subvolumes can be accessed from the
  fs tree

- We may need to re-initialize the vfsmount every time we read that
  subvolume

But otherwise the vfsmount-per-subvolume solution looks at least worth
a try to me.

Thanks,
Qu
>
>> And even if you treat them as mounted what would unmounting mean?
>
> The code in btrfs_lookup_dentry that does a hand crafted version
> of the file system / subvolume crossing (the location.type !=
> BTRFS_INODE_ITEM_KEY one) would not be executed.
>
Christian Brauner Nov. 1, 2023, 8:16 a.m. UTC | #20
On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > So this is effectively a request for:
> > 
> > btrfs subvolume create /mnt/subvol1
> > 
> > to create vfsmounts? IOW,
> > 
> > mkfs.btrfs /dev/sda
> > mount /dev/sda /mnt
> > btrfs subvolume create /mnt/subvol1
> > btrfs subvolume create /mnt/subvol2
> > 
> > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > afterwards?
> 
> Yes.
> 
> > That might be odd. Because these vfsmounts aren't really mounted, no?
> 
> Why aren't they?
> 
> > And so you'd be showing potentially hundreds of mounts in
> > /proc/<pid>/mountinfo that you can't unmount?
> 
> Why would you not allow them to be unmounted?
> 
> > And even if you treat them as mounted what would unmounting mean?
> 
> The code in btrfs_lookup_dentry that does a hand crafted version
> of the file system / subvolume crossing (the location.type !=
> BTRFS_INODE_ITEM_KEY one) would not be executed.

So today, when we do:

mkfs.btrfs -f /dev/sda
mount -t btrfs /dev/sda /mnt
btrfs subvolume create /mnt/subvol1
btrfs subvolume create /mnt/subvol2

Then all subvolumes are always visible under /mnt.
IOW, you can't hide them other than by overmounting or destroying them.

If we make subvolumes vfsmounts then we very likely alter this behavior
and I see two obvious options:

(1) They are fake vfsmounts that can't be unmounted:

    umount /mnt/subvol1 # returns -EINVAL

    This retains the invariant that every subvolume is always visible
    from the filesystem's root, i.e., /mnt will include /mnt/subvol{1,2}

(2) They are proper vfsmounts:

    umount /mnt/subvol1 # succeeds

    This retains standard semantics for userspace about anything that
    shows up in /proc/<pid>/mountinfo but means that after
    umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
    the filesystem root /mnt anymore.

Both options can be made to work from a purely technical perspective,
I'm asking which one it has to be because it isn't clear just from the
snippets in this thread.

One should also point out that if each subvolume is a vfsmount, then say
a btrfs filesystem with 1000 subvolumes which is mounted from the root:

mount -t btrfs /dev/sda /mnt

could be exploded into 1000 individual mounts. Which many users might not want.

So I would expect that we would need to default to mounting without
subvolumes accessible, and a mount option to mount with all subvolumes
mounted, idk:

mount -t btrfs -o tree /dev/sda /mnt

or something similar.

I agree that mapping subvolumes to vfsmounts sounds like the natural
thing to do.

But if we do e.g., (2) then this surely needs to be a Kconfig and/or a
mount option to avoid breaking userspace (And I'm pretty sure that btrfs
will end up supporting both modes almost indefinitely.).
Qu Wenruo Nov. 1, 2023, 8:41 a.m. UTC | #21
On 2023/11/1 18:46, Christian Brauner wrote:
> On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
>> On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
>>> So this is effectively a request for:
>>>
>>> btrfs subvolume create /mnt/subvol1
>>>
>>> to create vfsmounts? IOW,
>>>
>>> mkfs.btrfs /dev/sda
>>> mount /dev/sda /mnt
>>> btrfs subvolume create /mnt/subvol1
>>> btrfs subvolume create /mnt/subvol2
>>>
>>> would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
>>> afterwards?
>>
>> Yes.
>>
>>> That might be odd. Because these vfsmounts aren't really mounted, no?
>>
>> Why aren't they?
>>
>>> And so you'd be showing potentially hundreds of mounts in
>>> /proc/<pid>/mountinfo that you can't unmount?
>>
>> Why would you not allow them to be unmounted?
>>
>>> And even if you treat them as mounted what would unmounting mean?
>>
>> The code in btrfs_lookup_dentry that does a hand crafted version
>> of the file system / subvolume crossing (the location.type !=
>> BTRFS_INODE_ITEM_KEY one) would not be executed.
>
> So today, when we do:
>
> mkfs.btrfs -f /dev/sda
> mount -t btrfs /dev/sda /mnt
> btrfs subvolume create /mnt/subvol1
> btrfs subvolume create /mnt/subvol2
>
> Then all subvolumes are always visible under /mnt.
> IOW, you can't hide them other than by overmounting or destroying them.
>
> If we make subvolumes vfsmounts then we very likely alter this behavior
> and I see two obvious options:
>
> (1) They are fake vfsmounts that can't be unmounted:
>
>      umount /mnt/subvol1 # returns -EINVAL
>
>      This retains the invariant that every subvolume is always visible
>      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}

I'd like to go with this option. But I still have a question.

How do we properly unmount a btrfs?
Do we have some other way to record which subvolumes are really mounted
and which are just placeholders?


>
> (2) They are proper vfsmounts:
>
>      umount /mnt/subvol1 # succeeds
>
>      This retains standard semantics for userspace about anything that
>      shows up in /proc/<pid>/mountinfo but means that after
>      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
>      the filesystem root /mnt anymore.
>
> Both options can be made to work from a purely technical perspective,
> I'm asking which one it has to be because it isn't clear just from the
> snippets in this thread.
>
> One should also point out that if each subvolume is a vfsmount, then say
> a btrfs filesystems with 1000 subvolumes which is mounted from the root:
>
> mount -t btrfs /dev/sda /mnt
>
> could be exploded into 1000 individual mounts. Which many users might not want.

Can we make it dynamic? E.g., btrfs_insert_fs_root() would be the
perfect timing here.

That would greatly reduce the initial vfsmount explosion, but I'm not
sure if it's possible to add a vfsmount halfway.

Thanks,
Qu
>
> So I would expect that we would need to default to mounting without
> subvolumes accessible, and a mount option to mount with all subvolumes
> mounted, idk:
>
> mount -t btrfs -o tree /dev/sda /mnt
>
> or sm.
>
> I agree that mapping subvolumes to vfsmounts sounds like the natural
> thing to do.
>
> But if we do e.g., (2) then this surely needs to be a Kconfig and/or a
> mount option to avoid breaking userspace (And I'm pretty sure that btrfs
> will end up supporting both modes almost indefinitely.).
Christian Brauner Nov. 1, 2023, 9:52 a.m. UTC | #22
On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/1 18:46, Christian Brauner wrote:
> > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > So this is effectively a request for:
> > > > 
> > > > btrfs subvolume create /mnt/subvol1
> > > > 
> > > > to create vfsmounts? IOW,
> > > > 
> > > > mkfs.btrfs /dev/sda
> > > > mount /dev/sda /mnt
> > > > btrfs subvolume create /mnt/subvol1
> > > > btrfs subvolume create /mnt/subvol2
> > > > 
> > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > afterwards?
> > > 
> > > Yes.
> > > 
> > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > 
> > > Why aren't they?
> > > 
> > > > And so you'd be showing potentially hundreds of mounts in
> > > > /proc/<pid>/mountinfo that you can't unmount?
> > > 
> > > Why would you not allow them to be unmounted?
> > > 
> > > > And even if you treat them as mounted what would unmounting mean?
> > > 
> > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > of the file system / subvolume crossing (the location.type !=
> > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > 
> > So today, when we do:
> > 
> > mkfs.btrfs -f /dev/sda
> > mount -t btrfs /dev/sda /mnt
> > btrfs subvolume create /mnt/subvol1
> > btrfs subvolume create /mnt/subvol2
> > 
> > Then all subvolumes are always visible under /mnt.
> > IOW, you can't hide them other than by overmounting or destroying them.
> > 
> > If we make subvolumes vfsmounts then we very likely alter this behavior
> > and I see two obvious options:
> > 
> > (1) They are fake vfsmounts that can't be unmounted:
> > 
> >      umount /mnt/subvol1 # returns -EINVAL
> > 
> >      This retains the invariant that every subvolume is always visible
> >      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}
> 
> I'd like to go this option. But I still have a question.
> 
> How do we properly unmount a btrfs?
> Do we have some other way to record which subvolume is really mounted
> and which is just those place holder?

So the downside of this really is that this would be custom btrfs
semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
only happens in weird corner cases today:

* mounts inherited during unprivileged mount namespace creation
* locked mounts

Both of which are pretty inelegant and effectively only exist because of
user namespaces. So if we can avoid proliferating such semantics it
would be preferable.

I think it would also be rather confusing for userspace to be presented
with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
anything with.

> > (2) They are proper vfsmounts:
> > 
> >      umount /mnt/subvol1 # succeeds
> > 
> >      This retains standard semantics for userspace about anything that
> >      shows up in /proc/<pid>/mountinfo but means that after
> >      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> >      the filesystem root /mnt anymore.
> > 
> > Both options can be made to work from a purely technical perspective,
> > I'm asking which one it has to be because it isn't clear just from the
> > snippets in this thread.
> > 
> > One should also point out that if each subvolume is a vfsmount, then say
> > a btrfs filesystems with 1000 subvolumes which is mounted from the root:
> > 
> > mount -t btrfs /dev/sda /mnt
> > 
> > could be exploded into 1000 individual mounts. Which many users might not want.
> 
> Can we make it dynamic? AKA, the btrfs_insert_fs_root() is the perfect
> timing here.

Probably; it would be an automount. Though I would have to recheck that
code to see how exactly that would work, but roughly: when you add the
inode for the subvolume you raise S_AUTOMOUNT on it, and then you add
.d_automount for btrfs.
Josef Bacik Nov. 2, 2023, 5:13 a.m. UTC | #23
On Wed, Nov 01, 2023 at 10:52:18AM +0100, Christian Brauner wrote:
> On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > 
> > 
> > On 2023/11/1 18:46, Christian Brauner wrote:
> > > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > > So this is effectively a request for:
> > > > > 
> > > > > btrfs subvolume create /mnt/subvol1
> > > > > 
> > > > > to create vfsmounts? IOW,
> > > > > 
> > > > > mkfs.btrfs /dev/sda
> > > > > mount /dev/sda /mnt
> > > > > btrfs subvolume create /mnt/subvol1
> > > > > btrfs subvolume create /mnt/subvol2
> > > > > 
> > > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > > afterwards?
> > > > 
> > > > Yes.
> > > > 
> > > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > > 
> > > > Why aren't they?
> > > > 
> > > > > And so you'd be showing potentially hundreds of mounts in
> > > > > /proc/<pid>/mountinfo that you can't unmount?
> > > > 
> > > > Why would you not allow them to be unmounted?
> > > > 
> > > > > And even if you treat them as mounted what would unmounting mean?
> > > > 
> > > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > > of the file system / subvolume crossing (the location.type !=
> > > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > > 
> > > So today, when we do:
> > > 
> > > mkfs.btrfs -f /dev/sda
> > > mount -t btrfs /dev/sda /mnt
> > > btrfs subvolume create /mnt/subvol1
> > > btrfs subvolume create /mnt/subvol2
> > > 
> > > Then all subvolumes are always visible under /mnt.
> > > IOW, you can't hide them other than by overmounting or destroying them.
> > > 
> > > If we make subvolumes vfsmounts then we very likely alter this behavior
> > > and I see two obvious options:
> > > 
> > > (1) They are fake vfsmounts that can't be unmounted:
> > > 
> > >      umount /mnt/subvol1 # returns -EINVAL
> > > 
> > >      This retains the invariant that every subvolume is always visible
> > >      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}
> > 
> > I'd like to go this option. But I still have a question.
> > 
> > How do we properly unmount a btrfs?
> > Do we have some other way to record which subvolume is really mounted
> > and which is just those place holder?
> 
> So the downside of this really is that this would be custom btrfs
> semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
> only happens in weird corner cases today:
> 
> * mounts inherited during unprivileged mount namespace creation
> * locked mounts
> 
> Both of which are pretty inelegant and effectively only exist because of
> user namespaces. So if we can avoid proliferating such semantics it
> would be preferable.
> 
> I think it would also be rather confusing for userspace to be presented
> with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
> anything with.
> 
> > > (2) They are proper vfsmounts:
> > > 
> > >      umount /mnt/subvol1 # succeeds
> > > 
> > >      This retains standard semantics for userspace about anything that
> > >      shows up in /proc/<pid>/mountinfo but means that after
> > >      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> > >      the filesystem root /mnt anymore.
> > > 
> > > Both options can be made to work from a purely technical perspective,
> > > I'm asking which one it has to be because it isn't clear just from the
> > > snippets in this thread.
> > > 
> > > One should also point out that if each subvolume is a vfsmount, then say
> > > a btrfs filesystems with 1000 subvolumes which is mounted from the root:
> > > 
> > > mount -t btrfs /dev/sda /mnt
> > > 
> > > could be exploded into 1000 individual mounts. Which many users might not want.
> > 
> > Can we make it dynamic? AKA, the btrfs_insert_fs_root() is the perfect
> > timing here.
> 
> Probably, it would be an automount. Though I would have to recheck that
> code to see how exactly that would work but roughly, when you add the
> inode for the subvolume you raise S_AUTOMOUNT on it and then you add
> .d_automount for btrfs.

Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
it does.

However I ran into some weirdness where I need to support the new mount API, so
that's what I've been doing since I wandered away from this thread.  I should
have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing on top
of that.

But I have the same questions as you, Christian; I'm not entirely sure how this
is supposed to be better.  Even if they show up in /proc/mounts, it's not going
to do anything useful for the applications that don't check /proc/mounts to see
if they've wandered into a new mount.  I also don't quite understand how NFS
suddenly knows it's wandered into a new mount with a vfsmount.

At this point I'm tired of it being brought up in every conversation where we
try to expose more information to the users.  So I'll write the patches and as
long as they don't break anything we can merge it, but I don't think it'll make
a single bit of difference.

We'll be converted to the new mount API though, so I suppose that's something.
Thanks,

Josef
Amir Goldstein Nov. 2, 2023, 8:53 a.m. UTC | #24
On Thu, Nov 2, 2023 at 7:13 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Wed, Nov 01, 2023 at 10:52:18AM +0100, Christian Brauner wrote:
> > On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > >
> > >
> > > On 2023/11/1 18:46, Christian Brauner wrote:
> > > > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > > > So this is effectively a request for:
> > > > > >
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > >
> > > > > > to create vfsmounts? IOW,
> > > > > >
> > > > > > mkfs.btrfs /dev/sda
> > > > > > mount /dev/sda /mnt
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > > btrfs subvolume create /mnt/subvol2
> > > > > >
> > > > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > > > afterwards?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > > >
> > > > > Why aren't they?
> > > > >
> > > > > > And so you'd be showing potentially hundreds of mounts in
> > > > > > /proc/<pid>/mountinfo that you can't unmount?
> > > > >
> > > > > Why would you not allow them to be unmounted?
> > > > >
> > > > > > And even if you treat them as mounted what would unmounting mean?
> > > > >
> > > > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > > > of the file system / subvolume crossing (the location.type !=
> > > > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > > >
> > > > So today, when we do:
> > > >
> > > > mkfs.btrfs -f /dev/sda
> > > > mount -t btrfs /dev/sda /mnt
> > > > btrfs subvolume create /mnt/subvol1
> > > > btrfs subvolume create /mnt/subvol2
> > > >
> > > > Then all subvolumes are always visible under /mnt.
> > > > IOW, you can't hide them other than by overmounting or destroying them.
> > > >
> > > > If we make subvolumes vfsmounts then we very likely alter this behavior
> > > > and I see two obvious options:
> > > >
> > > > (1) They are fake vfsmounts that can't be unmounted:
> > > >
> > > >      umount /mnt/subvol1 # returns -EINVAL
> > > >
> > > >      This retains the invariant that every subvolume is always visible
> > > >      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}
> > >
> > > I'd like to go this option. But I still have a question.
> > >
> > > How do we properly unmount a btrfs?
> > > Do we have some other way to record which subvolume is really mounted
> > > and which is just those place holder?
> >
> > So the downside of this really is that this would be custom btrfs
> > semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
> > only happens in weird corner cases today:
> >
> > * mounts inherited during unprivileged mount namespace creation
> > * locked mounts
> >
> > Both of which are pretty inelegant and effectively only exist because of
> > user namespaces. So if we can avoid proliferating such semantics it
> > would be preferable.
> >
> > I think it would also be rather confusing for userspace to be presented
> > with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
> > anything with.
> >
> > > > (2) They are proper vfsmounts:
> > > >
> > > >      umount /mnt/subvol1 # succeeds
> > > >
> > > >      This retains standard semantics for userspace about anything that
> > > >      shows up in /proc/<pid>/mountinfo but means that after
> > > >      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> > > >      the filesystem root /mnt anymore.
> > > >
> > > > Both options can be made to work from a purely technical perspective,
> > > > I'm asking which one it has to be because it isn't clear just from the
> > > > snippets in this thread.
> > > >
> > > > One should also point out that if each subvolume is a vfsmount, then say
> > > > a btrfs filesystems with 1000 subvolumes which is mounted from the root:
> > > >
> > > > mount -t btrfs /dev/sda /mnt
> > > >
> > > > could be exploded into 1000 individual mounts. Which many users might not want.
> > >
> > > Can we make it dynamic? I.e., btrfs_insert_fs_root() would be the perfect
> > > timing here.
> >
> > Probably, it would be an automount. Though I would have to recheck that
> > code to see how exactly that would work but roughly, when you add the
> > inode for the subvolume you raise S_AUTOMOUNT on it and then you add
> > .d_automount for btrfs.
>
> Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
> it does.
>
> However I ran into some weirdness where I need to support the new mount API, so
> that's what I've been doing since I wandered away from this thread.  I should
> have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing ontop
> of that.
>
> But I have the same questions as you Christian, I'm not entirely sure how this
> is supposed to be better.  Even if they show up in /proc/mounts, it's not going
> to do anything useful for the applications that don't check /proc/mounts to see
> if they've wandered into a new mount.  I also don't quite understand how NFS
> suddenly knows it's wandered into a new mount with a vfsmount.
>

IIUC, the NFS server communicates same/different-filesystem boundaries to the
client by means of an FSID. This is not the same as f_fsid from statfs(); it's
the FSID configured for nfsd in /etc/exports, or guessed by nfsd for a
blockdev fs. IIRC, the NFS FSID is 128 bits (?).
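For reference, the export-side FSID is the knob an administrator can pin per
export in /etc/exports with the fsid= option (generic example values, not
taken from this thread):

```
# /etc/exports: pin a stable FSID instead of letting nfsd derive one
/           *(ro,fsid=0)      # fsid=0 marks the NFSv4 pseudo-root
/srv/data   *(rw,fsid=7331)   # arbitrary small integer, unique per export
```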

If we implemented the s_op->get_fsid() method that Jan has suggested [1],
we could make it mean -
"nfsd, please use this as the NFS FSID instead of assuming that all inodes on
 this sb should be presented to the client as a uniform FSID".

[1] https://lore.kernel.org/linux-fsdevel/20231025135048.36153-2-amir73il@gmail.com/
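To make the shape of the proposal concrete, here is a self-contained
userspace model of that fallback logic. Only the get_fsid method name comes
from the patch series; the struct layouts are stand-ins for the kernel types,
and subvol_get_fsid is a made-up example provider:

```c
#include <stddef.h>

/* stand-in for __kernel_fsid_t */
struct kfsid { int val[2]; };

struct super_ops {
	/* proposed: report the fsid statfs(2) would return for this object */
	int (*get_fsid)(const void *inode, struct kfsid *fsid);
};

struct superblock {
	struct kfsid uniform_fsid;	/* today's per-sb f_fsid */
	const struct super_ops *s_op;
};

/* caller-side logic: prefer the per-object fsid when the fs provides it */
static struct kfsid effective_fsid(const struct superblock *sb, const void *inode)
{
	struct kfsid fsid;

	if (sb->s_op && sb->s_op->get_fsid &&
	    sb->s_op->get_fsid(inode, &fsid) == 0)
		return fsid;		/* e.g. the sub-volume's fsid */
	return sb->uniform_fsid;	/* default: uniform f_fsid */
}

/* example provider, as a btrfs-like fs might implement it */
static int subvol_get_fsid(const void *inode, struct kfsid *fsid)
{
	(void)inode;
	fsid->val[0] = 42;	/* pretend per-subvolume identifier */
	fsid->val[1] = 7;
	return 0;
}

static const struct super_ops btrfs_like_ops = { .get_fsid = subvol_get_fsid };
static const struct superblock btrfs_like_sb = { { { 1, 1 } }, &btrfs_like_ops };
static const struct superblock plain_sb = { { { 1, 1 } }, NULL };
```

The point is the fallback: filesystems with a uniform f_fsid need no new
code, while btrfs-like filesystems can override per object.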

> At this point I'm tired of it being brought up in every conversation where we
> try to expose more information to the users.  So I'll write the patches and as
> long as they don't break anything we can merge it, but I don't think it'll make
> a single bit of difference.
>
> We'll be converted to the new mount API tho, so I suppose that's something.

Hooray!

I just noticed that since Monday we have a new fs on the block with
multiple subvolumes on the same sb - bcachefs.

I took a look at bch2_statfs(), bch2_encode_fh() and bch2_getattr() and
I can't help wondering, what's different from btrfs?

Kent,

Is inode->v.i_ino unique for all inodes in bcachefs sb
and inode->ei_inode.bi_inum unique within a subvol?
If so, how is a cloned inode number allocated?

BTW1: bch2_statfs() open codes uuid_to_fsid().
BTW2: Any reason why bcachefs uses bch_sb.uuid and not sb.s_uuid?

If you publish bch_sb.uuid to the vfs via sb.s_uuid, you can use
bcachefs as a layer in overlayfs for advanced features like
NFS export of overlayfs - without s_uuid, those features will not
be available for overlayfs over bcachefs.

Thanks,
Amir.
Christian Brauner Nov. 2, 2023, 9:48 a.m. UTC | #25
> We'll be converted to the new mount API tho, so I suppose that's something.
> Thanks,

Just in case you forgot about it. I did send a patch to convert btrfs to
the new mount api in June:

https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org

Can I ask you to please please copy just two things from that series:

(1) Please get rid of the second filesystem type.
(2) Please fix the silent remount behavior when mounting a subvolume.

You might need my first patch for that from that series for (2).

+static int btrfs_get_tree_common(struct fs_context *fc)
+{
+	struct vfsmount *root_mnt = NULL;
+	struct fs_context *root_fc;
+	struct dentry *root_dentry;
+	struct btrfs_fs_context *ctx = fc->fs_private;
+	int ret;
+
+	if (WARN_ON(ctx->phase != BTRFS_FS_CONTEXT_PREPARE))
+		return -EINVAL;
+
+	root_fc = vfs_dup_fs_context(fc);
+	if (IS_ERR(root_fc))
+		return PTR_ERR(root_fc);
+
+	/*
+	 * We've duplicated the security mount options above and we only
+	 * need them to be set when we really create a new superblock.
+	 * They're irrelevant when we mount the subvolume as the
+	 * superblock does already exist at that point. So free the
+	 * security blob here.
+	 */
+	security_free_mnt_opts(&fc->security);
+	fc->security = NULL;
+
+	/* Create the superblock so we can mount a subtree later. */
+	ctx->phase = BTRFS_FS_CONTEXT_SUPER;
+
+	root_mnt = fc_mount(root_fc);
+	if (PTR_ERR_OR_ZERO(root_mnt) == -EBUSY) {
+		bool ro2rw = !(root_fc->sb_flags & SB_RDONLY);
+
+		if (ro2rw)
+			root_fc->sb_flags |= SB_RDONLY;
+		else
+			root_fc->sb_flags &= ~SB_RDONLY;
+
+		root_mnt = fc_mount(root_fc);
+		if (IS_ERR(root_mnt)) {
+			put_fs_context(root_fc);
+			return PTR_ERR(root_mnt);
+		}
+		ctx->root_mnt = root_mnt;
+
+		/*
+		 * Ever since commit 0723a0473fb4 ("btrfs: allow
+		 * mounting btrfs subvolumes with different ro/rw
+		 * options") the following works:
+		 *
+		 *        (i) mount /dev/sda3 -o subvol=foo,ro /mnt/foo
+		 *       (ii) mount /dev/sda3 -o subvol=bar,rw /mnt/bar
+		 *
+		 * which looks nice and innocent but is actually pretty
+		 * intricate and deserves a long comment.
+		 *
+		 * On another filesystem a subvolume mount is close to
+		 * something like:
+		 *
+		 *	(iii) # create rw superblock + initial mount
+		 *	      mount -t xfs /dev/sdb /opt/
+		 *
+		 *	      # create ro bind mount
+		 *	      mount --bind -o ro /opt/foo /mnt/foo
+		 *
+		 *	      # unmount initial mount
+		 *	      umount /opt
+		 *
+		 * Of course, there's some special subvolume sauce and
+		 * there's the fact that the sb->s_root dentry is really
+		 * swapped after mount_subtree(). But conceptually it's
+		 * very close and will help us understand the issue.
+		 *
+		 * The old mount api didn't cleanly distinguish between
+		 * a mount being made ro and a superblock being made ro.
+		 * The only way to change the ro state of either object
+		 * was by passing MS_RDONLY. If a new mount was created
+		 * via mount(2) such as:
+		 *
+		 *      mount("/dev/sdb", "/mnt", "xfs", MS_RDONLY, NULL);
+		 *
+		 * the MS_RDONLY flag being specified had two effects:
+		 *
+		 * (1) MNT_READONLY was raised -> the resulting mount
+		 *     got @mnt->mnt_flags |= MNT_READONLY raised.
+		 *
+		 * (2) MS_RDONLY was passed to the filesystem's mount
+		 *     method and the filesystems made the superblock
+		 *     ro. Note, how SB_RDONLY has the same value as
+		 *     MS_RDONLY and is raised whenever MS_RDONLY is
+		 *     passed through mount(2).
+		 *
+		 * Creating a subtree mount via (iii) ends up leaving a
+		 * rw superblock with a subtree mounted ro.
+		 *
+		 * But consider the effect of the old mount api on btrfs
+		 * subvolume mounting, which combines the distinct steps
+		 * in (iii) into a single step.
+		 *
+		 * By issuing (i) both the mount and the superblock are
+		 * turned ro. Now when (ii) is issued the superblock is
+		 * ro and thus even if the mount created for (ii) is rw
+		 * it wouldn't help. Hence, btrfs needed to transition
+		 * the superblock from ro to rw for (ii) which it did
+		 * using an internal remount call (a bold choice...).
+		 *
+		 * IOW, subvolume mounting was inherently messy due to
+		 * the ambiguity of MS_RDONLY in mount(2). Note that,
+		 * due to this ambiguity, mount(8) always translates
+		 * "ro" to MS_RDONLY. IOW, in both (i) and (ii) "ro" becomes
+		 * MS_RDONLY when passed by mount(8) to mount(2).
+		 *
+		 * Enter the new mount api. The new mount api
+		 * disambiguates making a mount ro and making a
+		 * superblock ro.
+		 *
+		 * (3) To turn a mount ro the MOUNT_ATTR_RDONLY flag can
+		 *     be used with either fsmount() or mount_setattr().
+		 *     This is a pure VFS level change for a specific
+		 *     mount or mount tree that is never seen by the
+		 *     filesystem itself.
+		 *
+		 * (4) To turn a superblock ro the "ro" flag must be
+		 *     used with fsconfig(FSCONFIG_SET_FLAG, "ro"). This
+		 *     option is seen by the filesystem in fc->sb_flags.
+		 *
+		 * This disambiguation has rather positive consequences.
+		 * Mounting a subvolume ro will not also turn the
+		 * superblock ro. Only the mount for the subvolume will
+		 * become ro.
+		 *
+		 * So, if the superblock creation request comes from the
+		 * new mount api the caller must've explicitly done:
+		 *
+		 *      fsconfig(FSCONFIG_SET_FLAG, "ro")
+		 *      fsmount/mount_setattr(MOUNT_ATTR_RDONLY)
+		 *
+		 * IOW, at some point the caller must have explicitly
+		 * turned the whole superblock ro and we shouldn't just
+		 * undo it like we did for the old mount api. In any
+		 * case, it lets us avoid this nasty hack in the new
+		 * mount api.
+		 *
+		 * Consequently, the remounting hack must only be used
+		 * for requests originating from the old mount api and
+		 * should be marked for full deprecation so it can be
+		 * turned off in a couple of years.
+		 *
+		 * The new mount api has no reason to support this hack.
+		 */
+		if (root_fc->oldapi && ro2rw) {
+			/*
+			 * This magic internal remount is a pretty bold
+			 * move as the VFS reserves the right to protect
+			 * ro->rw transitions on the VFS layer similar
+			 * to how it protects rw->ro transitions.
+			 */
+			ret = btrfs_legacy_reconfigure(root_fc);
+			if (ret)
+				root_mnt = ERR_PTR(ret);
+		}
+	}
+	put_fs_context(root_fc);
+	if (IS_ERR(root_mnt))
+		return PTR_ERR(root_mnt);
+	ctx->root_mnt = root_mnt;
+
+	root_dentry = mount_subvol(fc);
+	if (IS_ERR(root_dentry))
+		return PTR_ERR(root_dentry);
+
+	fc->root = root_dentry;
+	return 0;
+}
Christian Brauner Nov. 2, 2023, 11:07 a.m. UTC | #26
> Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
> it does.
> 
> However I ran into some weirdness where I need to support the new mount API, so
> that's what I've been doing since I wandered away from this thread.  I should
> have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing ontop
> of that.
> 
> But I have the same questions as you Christian, I'm not entirely sure how this
> is supposed to be better.  Even if they show up in /proc/mounts, it's not going
> to do anything useful for the applications that don't check /proc/mounts to see
> if they've wandered into a new mount.  I also don't quite understand how NFS
> suddenly knows it's wandered into a new mount with a vfsmount.

So the subvolumes-as-vfsmount solution was implemented already a few
years ago. I looked at that patchset and the crucial point in the
solution was patch [1].

show_mountinfo() is called under namespace_sem (see fs/namespace.c).
That thing is crucial for all mount namespaces, mount propagation and so
on. We can't cause IO under that unless we want to allow to trivially
deadlock the whole system by tricking us into talking to an unresponsive
NFS server or similar. And all vfs_getattr*() flavours can legitimately
cause IO even with AT_STATX_DONT_SYNC.

So exposing this via /proc/<pid>/mountinfo doesn't work. But that means
even if you make it a separate vfsmount you need to expose the device
information through another interface.

But at that point we really need to ask if it makes sense to use
vfsmounts per subvolume in the first place:

(1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
(2) By calling ->getattr() from show_mountinfo() we open the whole
    system up to deadlocks.
(3) We change btrfs semantics drastically to the point where they need a
    new mount, module, or Kconfig option.
(4) We make (initial) lookup on btrfs subvolumes more heavyweight
    because you need to create a mount for the subvolume.

So right now, I don't see how we can make this work even if the concept
doesn't seem necessarily wrong.

Even if we were to go through with this and make each subvolume a
vfsmount but then don't expose the ->getattr() device numbers in
/proc/<pid>/mountinfo but instead add a separate retrieval method via
statx() we'd be creating even more confusion for userspace by showing
different device numbers in /proc/<pid>/mountinfo than in statx().

[1]:

Subject:        [PATCH 01/11] VFS: show correct dev num in mountinfo
Date:	 	Wed, 28 Jul 2021 08:37:45 +1000
Message-ID:	<162742546548.32498.10889023150565429936.stgit@noble.brown>

/proc/$PID/mountinfo contains a field for the device number of the
filesystem at each mount.

This is taken from the superblock ->s_dev field, which is correct for
every filesystem except btrfs.  A btrfs filesystem can contain multiple
subvols which each have a different device number.  If (a directory
within) one of these subvols is mounted, the device number reported in
mountinfo will be different from the device number reported by stat().

This confuses some libraries and tools such as, historically, findmnt.
Current findmnt seems to cope with the strangeness.

So instead of using ->s_dev, call vfs_getattr_nosec() and use the ->dev
provided.  As there is no STATX flag to ask for the device number, we
pass a request mask for zero, and also ask the filesystem to avoid
syncing with any remote service.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/proc_namespace.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 392ef5162655..f342a0231e9e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -138,10 +138,16 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 	struct mount *r = real_mount(mnt);
 	struct super_block *sb = mnt->mnt_sb;
 	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
+	struct kstat stat;
 	int err;
 
+	/* We only want ->dev, and there is no STATX flag for that,
+	 * so ask for nothing and assume we get ->dev
+	 */
+	vfs_getattr_nosec(&mnt_path, &stat, 0, AT_STATX_DONT_SYNC);
+
 	seq_printf(m, "%i %i %u:%u ", r->mnt_id, r->mnt_parent->mnt_id,
-		   MAJOR(sb->s_dev), MINOR(sb->s_dev));
+		   MAJOR(stat.dev), MINOR(stat.dev));
 	if (sb->s_op->show_path) {
 		err = sb->s_op->show_path(m, mnt->mnt_root);
 		if (err)
Josef Bacik Nov. 2, 2023, 12:34 p.m. UTC | #27
On Thu, Nov 02, 2023 at 10:48:35AM +0100, Christian Brauner wrote:
> > We'll be converted to the new mount API tho, so I suppose that's something.
> > Thanks,
> 
> Just in case you forgot about it. I did send a patch to convert btrfs to
> the new mount api in June:
> 
> https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org
> 

Yeah Daan told me about this after I had done the bulk of the work.  I
shamelessly stole the dup idea, I had been doing something uglier.

> Can I ask you to please please copy just two things from that series:
> 
> (1) Please get rid of the second filesystem type.
> (2) Please fix the silent remount behavior when mounting a subvolume.
>

Yeah I've gotten rid of the second file system type, the remount thing is odd,
I'm going to see if I can get away with not bringing that over.  I *think* it's
because the standard distro way of doing things is to do

mount -o ro,subvol=/my/root/vol /
mount -o rw,subvol=/my/home/vol /home
<boot some more>
mount -o remount,rw /

but I haven't messed with it yet to see if it breaks.  That's on the list to
investigate today.  Thanks,

Josef
David Sterba Nov. 2, 2023, 5:07 p.m. UTC | #28
On Thu, Nov 02, 2023 at 08:34:46AM -0400, Josef Bacik wrote:
> On Thu, Nov 02, 2023 at 10:48:35AM +0100, Christian Brauner wrote:
> > > We'll be converted to the new mount API tho, so I suppose that's something.
> > > Thanks,
> > 
> > Just in case you forgot about it. I did send a patch to convert btrfs to
> > the new mount api in June:
> > 
> > https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org
> > 
> 
> Yeah Daan told me about this after I had done the bulk of the work.  I
> shamelessly stole the dup idea, I had been doing something uglier.
> 
> > Can I ask you to please please copy just two things from that series:
> > 
> > (1) Please get rid of the second filesystem type.
> > (2) Please fix the silent remount behavior when mounting a subvolume.
> >
> 
> Yeah I've gotten rid of the second file system type, the remount thing is odd,
> I'm going to see if I can get away with not bringing that over.  I *think* it's
> because the standard distro way of doing things is to do
> 
> mount -o ro,subvol=/my/root/vol /
> mount -o rw,subvol=/my/home/vol /home
> <boot some more>
> mount -o remount,rw /
> 
> but I haven't messed with it yet to see if it breaks.  That's on the list to
> investigate today.  Thanks,

It's a use case for distros, see 0723a0473fb4 ("btrfs: allow mounting btrfs
subvolumes with different ro/rw options"); the functionality should
be preserved, or else it's a regression.
Josef Bacik Nov. 2, 2023, 8:32 p.m. UTC | #29
On Thu, Nov 02, 2023 at 06:07:45PM +0100, David Sterba wrote:
> On Thu, Nov 02, 2023 at 08:34:46AM -0400, Josef Bacik wrote:
> > On Thu, Nov 02, 2023 at 10:48:35AM +0100, Christian Brauner wrote:
> > > > We'll be converted to the new mount API tho, so I suppose that's something.
> > > > Thanks,
> > > 
> > > Just in case you forgot about it. I did send a patch to convert btrfs to
> > > the new mount api in June:
> > > 
> > > https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org
> > > 
> > 
> > Yeah Daan told me about this after I had done the bulk of the work.  I
> > shamelessly stole the dup idea, I had been doing something uglier.
> > 
> > > Can I ask you to please please copy just two things from that series:
> > > 
> > > (1) Please get rid of the second filesystem type.
> > > (2) Please fix the silent remount behavior when mounting a subvolume.
> > >
> > 
> > Yeah I've gotten rid of the second file system type, the remount thing is odd,
> > I'm going to see if I can get away with not bringing that over.  I *think* it's
> > because the standard distro way of doing things is to do
> > 
> > mount -o ro,subvol=/my/root/vol /
> > mount -o rw,subvol=/my/home/vol /home
> > <boot some more>
> > mount -o remount,rw /
> > 
> > but I haven't messed with it yet to see if it breaks.  That's on the list to
> > investigate today.  Thanks,
> 
> It's a use case for distros, 0723a0473fb4 ("btrfs: allow mounting btrfs
> subvolumes with different ro/rw options"), the functionality should
> be preserved else it's a regression.

I'll add an fstest for it then, I could have easily broken this if I didn't see
Christian's giant note about it.  Thanks,

Josef
Christian Brauner Nov. 3, 2023, 6:56 a.m. UTC | #30
On Thu, Nov 02, 2023 at 06:07:45PM +0100, David Sterba wrote:
> On Thu, Nov 02, 2023 at 08:34:46AM -0400, Josef Bacik wrote:
> > On Thu, Nov 02, 2023 at 10:48:35AM +0100, Christian Brauner wrote:
> > > > We'll be converted to the new mount API tho, so I suppose that's something.
> > > > Thanks,
> > > 
> > > Just in case you forgot about it. I did send a patch to convert btrfs to
> > > the new mount api in June:
> > > 
> > > https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org
> > > 
> > 
> > Yeah Daan told me about this after I had done the bulk of the work.  I
> > shamelessly stole the dup idea, I had been doing something uglier.
> > 
> > > Can I ask you to please please copy just two things from that series:
> > > 
> > > (1) Please get rid of the second filesystem type.
> > > (2) Please fix the silent remount behavior when mounting a subvolume.
> > >
> > 
> > Yeah I've gotten rid of the second file system type, the remount thing is odd,
> > I'm going to see if I can get away with not bringing that over.  I *think* it's
> > because the standard distro way of doing things is to do
> > 
> > mount -o ro,subvol=/my/root/vol /
> > mount -o rw,subvol=/my/home/vol /home
> > <boot some more>
> > mount -o remount,rw /
> > 
> > but I haven't messed with it yet to see if it breaks.  That's on the list to
> > investigate today.  Thanks,
> 
> It's a use case for distros, 0723a0473fb4 ("btrfs: allow mounting btrfs
> subvolumes with different ro/rw options"), the functionality should
> be preserved else it's a regression.

My series explicitly made sure that it _isn't_ broken. Which is pretty
obvious from the description I put in there where that example is
explained at length.

It just handles it _cleanly_ through the new mount api while retaining
the behavior through the old mount api. The details - as Josef noted -
I've explained extensively.
Josef Bacik Nov. 3, 2023, 1:52 p.m. UTC | #31
On Fri, Nov 03, 2023 at 07:56:39AM +0100, Christian Brauner wrote:
> On Thu, Nov 02, 2023 at 06:07:45PM +0100, David Sterba wrote:
> > On Thu, Nov 02, 2023 at 08:34:46AM -0400, Josef Bacik wrote:
> > > On Thu, Nov 02, 2023 at 10:48:35AM +0100, Christian Brauner wrote:
> > > > > We'll be converted to the new mount API tho, so I suppose that's something.
> > > > > Thanks,
> > > > 
> > > > Just in case you forgot about it. I did send a patch to convert btrfs to
> > > > the new mount api in June:
> > > > 
> > > > https://lore.kernel.org/all/20230626-fs-btrfs-mount-api-v1-0-045e9735a00b@kernel.org
> > > > 
> > > 
> > > Yeah Daan told me about this after I had done the bulk of the work.  I
> > > shamelessly stole the dup idea, I had been doing something uglier.
> > > 
> > > > Can I ask you to please please copy just two things from that series:
> > > > 
> > > > (1) Please get rid of the second filesystem type.
> > > > (2) Please fix the silent remount behavior when mounting a subvolume.
> > > >
> > > 
> > > Yeah I've gotten rid of the second file system type, the remount thing is odd,
> > > I'm going to see if I can get away with not bringing that over.  I *think* it's
> > > because the standard distro way of doing things is to do
> > > 
> > > mount -o ro,subvol=/my/root/vol /
> > > mount -o rw,subvol=/my/home/vol /home
> > > <boot some more>
> > > mount -o remount,rw /
> > > 
> > > but I haven't messed with it yet to see if it breaks.  That's on the list to
> > > investigate today.  Thanks,
> > 
> > It's a use case for distros, 0723a0473fb4 ("btrfs: allow mounting btrfs
> > subvolumes with different ro/rw options"), the functionality should
> > be preserved else it's a regression.
> 
> My series explicitly made sure that it _isn't_ broken. Which is pretty
> obvious from the description I put in there where that example is
> explained at length.
> 
> It just handles it _cleanly_ through the new mount api while retaining
> the behavior through the old mount api. The details - as Josef noted -
> I've explained extensively.

I think Dave's comments were directed at me, because I was considering just not
pulling it forward and waiting to see who complained.  I'll copy your approach
and your comment, and wire up a test to make sure we don't regress.  Thanks,

Josef
'Christoph Hellwig' Nov. 3, 2023, 2:21 p.m. UTC | #32
On Wed, Nov 01, 2023 at 10:33:41AM +1030, Qu Wenruo wrote:
> Can we make this more dynamic? Like only initializing the vfsmount when
> the subvolume tree gets its first read?

Yes, I would treat it like an automount.
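For reference, wiring a directory up as an automount point generally takes
the following shape. This is a C-style pseudocode sketch only;
btrfs_automount is a made-up name for illustration, not an existing function:

```c
/* 1. When instantiating the inode for a subvolume root: */
inode->i_flags |= S_AUTOMOUNT;

/* 2. Provide a d_automount method in the dentry operations: */
static struct vfsmount *btrfs_automount(struct path *path)
{
	/* Build an fs_context for the same superblock + this subvolume
	 * and return fc_mount(fc); the VFS attaches the resulting mount
	 * at the trigger point and can put it on an expiry list so that
	 * idle subvolume mounts go away again.
	 */
}

static const struct dentry_operations btrfs_dentry_ops = {
	.d_automount	= btrfs_automount,
};
```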
'Christoph Hellwig' Nov. 3, 2023, 2:22 p.m. UTC | #33
On Wed, Nov 01, 2023 at 09:16:50AM +0100, Christian Brauner wrote:
> mkfs.btrfs -f /dev/sda
> mount -t btrfs /dev/sda /mnt
> btrfs subvolume create /mnt/subvol1
> btrfs subvolume create /mnt/subvol2
> 
> Then all subvolumes are always visible under /mnt.
> IOW, you can't hide them other than by overmounting or destroying them.

Yes.

> If we make subvolumes vfsmounts then we very likely alter this behavior
> and I see two obvious options:
> 
> (1) They are fake vfsmounts that can't be unmounted:
> 
>     umount /mnt/subvol1 # returns -EINVAL
> 
>     This retains the invariant that every subvolume is always visible
>     from the filesystem's root, i.e., /mnt will include /mnt/subvol{1,2}

Why would we have to prevent them from being unmounted?  I'd expect
automount-like behavior where they are automatically mounted and then
expired or manually unmounted.

> But if we do e.g., (2) then this surely needs to be a Kconfig and/or a
> mount option to avoid breaking userspace (And I'm pretty sure that btrfs
> will end up supporting both modes almost indefinitely.).

It would definitively need to be an opt-in for existing systems.
'Christoph Hellwig' Nov. 3, 2023, 2:23 p.m. UTC | #34
On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > mount -t btrfs /dev/sda /mnt
> > 
> > could be exploded into 1000 individual mounts. Which many users might not want.
> 
> Can we make it dynamic? I.e., btrfs_insert_fs_root() would be the perfect
> timing here.
> 
> That would greatly reduce the initial vfsmount explosion, but I'm not sure
> if it's possible to add a vfsmount halfway.

Yes, that's what I had in mind as well.
'Christoph Hellwig' Nov. 3, 2023, 2:28 p.m. UTC | #35
On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> But at that point we really need to ask if it makes sense to use
> vfsmounts per subvolume in the first place:
> 
> (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> (2) By calling ->getattr() from show_mountinfo() we open the whole
>     system up to deadlocks.
> (3) We change btrfs semantics drastically to the point where they need a
>     new mount, module, or Kconfig option.
> (4) We make (initial) lookup on btrfs subvolumes more heavyweight
>     because you need to create a mount for the subvolume.
> 
> So right now, I don't see how we can make this work even if the concept
> doesn't seem necessarily wrong.

How else do you want to solve it?  Crossing a mount point is the
only legitimate boundary for changing st_dev and having a new inode
number space.  And we can't fix that retroactively.
Christian Brauner Nov. 3, 2023, 3:47 p.m. UTC | #36
On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > But at that point we really need to ask if it makes sense to use
> > vfsmounts per subvolume in the first place:
> > 
> > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > (2) By calling ->getattr() from show_mountinfo() we open the whole
> >     system up to deadlocks.
> > (3) We change btrfs semantics drastically to the point where they need a
> >     new mount, module, or Kconfig option.
> > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> >     because you need to create a mount for the subvolume.
> > 
> > So right now, I don't see how we can make this work even if the concept
> > doesn't seem necessarily wrong.
> 
> How else do you want to solve it?  Crossing a mount point is the
> only legitimate boundary for changing st_dev and having a new inode
> number space.  And we can't fix that retroactively.

I think the idea of using vfsmounts for this makes some sense if the
goal is to retroactively justify and accommodate the idea that a
subvolume is to be treated as equivalent to a separate device.

I question that premise though. I think marking them with separate
device numbers is bringing us nothing but pain at this point and this
solution is basically bending the vfs to make that work somehow.

And the worst thing is that I think that treating subvolumes like
vfsmounts will hurt vfsmounts more than it will hurt subvolumes.

Right now all that vfsmounts technically are is a topological
abstraction on top of filesystem objects such as files, directories,
sockets, even devices that are exposed as filesystems objects. None of
them get to muck with core properties of what a vfsmount is though.

Additionally, vfsmounts are tied to a superblock and share the device
number with the superblock they belong to.

If we make subvolumes and vfsmounts equivalent we break both properties.
And I think that's wrong or at least really ugly.

And I already see that the suggested workaround for (2) will somehow end
up stashing device numbers in struct mount or struct vfsmount so
we can show it in mountinfo and if that's the case I want to express a
proactive nak for that solution.

The way I see it, a subvolume is in the end nothing but a
subclass of directory - a special one, but whatever.

I would feel much more comfortable if the two filesystems that expose
these objects give us something like STATX_SUBVOLUME that userspace can
raise in the request mask of statx().

If userspace requests STATX_SUBVOLUME in the request mask, the two
filesystems raise STATX_SUBVOLUME in the statx result mask and then also
return the _real_ device number of the superblock and stop exposing that
made up device number.

This can be accompanied by a vfs ioctl that is identical for both btrfs
and bcachefs and returns $whatever unique property to mark the inode
space of the subvolume.

And then we leave innocent vfs objects alone and we also avoid
bringing in all that heavy vfsmount machinery on top of subvolumes.
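To make the proposal concrete, here is a self-contained userspace model of
that negotiation. STATX_SUBVOLUME_HYP and the struct are hypothetical
stand-ins for the suggested interface, not an existing kernel API:

```c
#include <stdint.h>

#define STATX_SUBVOLUME_HYP (1u << 30)	/* hypothetical request flag */

struct subvol_statx {
	uint32_t result_mask;	/* which fields the fs actually filled */
	uint64_t dev;		/* device number reported to userspace */
	uint64_t subvol_id;	/* per-subvolume identifier */
};

/* Model of the proposed fs-side behavior: if the caller asked for
 * STATX_SUBVOLUME, report the real superblock dev plus a subvolume id;
 * otherwise keep today's made-up per-subvolume dev for compatibility. */
static struct subvol_statx fill_statx(uint32_t request_mask,
				      uint64_t sb_dev, uint64_t anon_dev,
				      uint64_t subvol_id)
{
	struct subvol_statx st = { 0, anon_dev, 0 };

	if (request_mask & STATX_SUBVOLUME_HYP) {
		st.result_mask |= STATX_SUBVOLUME_HYP;
		st.dev = sb_dev;	/* the "real" device number */
		st.subvol_id = subvol_id;
	}
	return st;
}
```

Old callers that never raise the flag keep seeing the anonymous dev, so the
opt-in changes nothing for existing userspace.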
'Christoph Hellwig' Nov. 6, 2023, 7:53 a.m. UTC | #37
On Fri, Nov 03, 2023 at 04:47:02PM +0100, Christian Brauner wrote:
> I think the idea of using vfsmounts for this makes some sense if the
> goal is to retroactively justify and accommodate the idea that a
> subvolume is to be treated as equivalent to a separate device.

st_dev has historically only been about treating something as
a device.  For userspace the most important part is that it designates
a separate domain for inode numbers.  And that's something that's simply
broken in btrfs.

> I question that premise though. I think marking them with separate
> device numbers is bringing us nothing but pain at this point and this
> solution is basically bending the vfs to make that work somehow.

Well, the only other theoretical option would be to use a single
inode number space across subvolumes in btrfs, but I don't really
see how that could be retrofitted in any sensible way.

> I would feel much more comfortable if the two filesystems that expose
> these objects give us something like STATX_SUBVOLUME that userspace can
> raise in the request mask of statx().

Except that this doesn't fix any existing code.

> If userspace requests STATX_SUBVOLUME in the request mask, the two
> filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> return the _real_ device number of the superblock and stop exposing that
> made up device number.

What is a "real" device number?
Qu Wenruo Nov. 6, 2023, 8:18 a.m. UTC | #38
On 2023/11/6 18:23, Christoph Hellwig wrote:
> On Fri, Nov 03, 2023 at 04:47:02PM +0100, Christian Brauner wrote:
>> I think the idea of using vfsmounts for this makes some sense if the
>> goal is to retroactively justify and accommodate the idea that a
>> subvolume is to be treated as equivalent to a separate device.
> 
> st_dev has only been very historically about treating something as
> a device.  For userspace the most important part is that it designates
> a separate domain for inode numbers.  And that's something that's simply
> broken in btrfs.

In fact, I'm not sure the "treating something as a device" notion was
even correct long before btrfs.

Consider, for example, an ext4 fs with an external log device. Thankfully it's
still more or less obvious that we would use the device number of the main
fs, not the log device, but we already had such examples.


Another thing is that the st_dev situation has to be kept, as there are too
many legacy programs that rely on it to distinguish btrfs subvolume
boundaries. Unfortunately this can never be changed, even if we had some
better solution (like the proposed extra subvolid through statx).

> 
>> I question that premise though. I think marking them with separate
>> device numbers is bringing us nothing but pain at this point and this
>> solution is basically bending the vfs to make that work somehow.
> 
> Well, the only other theoretical option would be to use a simple
> inode number space across subvolumes in btrfs, but I don't really
> see how that could be retrofitted in any sensible way.
> 
>> I would feel much more comfortable if the two filesystems that expose
>> these objects give us something like STATX_SUBVOLUME that userspace can
>> raise in the request mask of statx().
> 
> Except that this doesn't fix any existing code.

To me, the biggest btrfs-specific problem is the number of btrfs
subvolumes versus the very limited anonymous device number pool.

As long as we don't expand the st_dev width, nor change the behavior of 
per-subvolume st_dev numbers, the only thing I can come up with is 
allowing a subvolume to be manually "unmounted" to reclaim its anonymous 
device number.

This is where I believe per-subvolume vfsmounts and automount behavior 
for subvolumes can help a lot.

> 
>> If userspace requests STATX_SUBVOLUME in the request mask, the two
>> filesystems raise STATX_SUBVOLUME in the statx result mask and then also
>> return the _real_ device number of the superblock and stop exposing that
>> made up device number.

Btrfs went with the anonymous device number pool because we don't have 
any better way to return a "real" device number.

There may be one device or any number of devices, versus a far larger 
number of subvolumes.

Thus we took the "natural" approach of using the anonymous device number 
pool, but as we can all see, the pool is not large enough for subvolumes.

> 
> What is a "real" device number?

I'm more interested in if we can allocate st_dev from other pools.

IIRC logical volumes (LVs from LVM) do not allocate from the anonymous dev 
number pool, so this may sound like a stupid question, but what's 
preventing us from using the device number pool of LVM?
Device number conflicts, or something else?

Thanks,
Qu
Jan Kara Nov. 6, 2023, 9:03 a.m. UTC | #39
On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > But at that point we really need to ask if it makes sense to use
> > > vfsmounts per subvolume in the first place:
> > > 
> > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > >     system up to deadlocks.
> > > (3) We change btrfs semantics drastically to the point where they need a
> > >     new mount, module, or Kconfig option.
> > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > >     because you need to create a mount for the subvolume.
> > > 
> > > So right now, I don't see how we can make this work even if the concept
> > > doesn't seem necessarily wrong.
> > 
> > How else do you want to solve it?  Crossing a mount point is the
> > only legitimate boundary for changing st_dev and having a new inode
> > number space.  And we can't fix that retroactively.
> 
> I think the idea of using vfsmounts for this makes some sense if the
> goal is to retroactively justify and accommodate the idea that a
> subvolume is to be treated as equivalent to a separate device.
> 
> I question that premise though. I think marking them with separate
> device numbers is bringing us nothing but pain at this point and this
> solution is basically bending the vfs to make that work somehow.
> 
> And the worst thing is that I think that treating subvolumes like
> vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> 
> Right now all that vfsmounts technically are is a topological
> abstraction on top of filesystem objects such as files, directories,
> sockets, even devices that are exposed as filesystems objects. None of
> them get to muck with core properties of what a vfsmount is though.
> 
> Additionally, vfsmount are tied to a superblock and share the device
> numbers with the superblock they belong to.
> 
> If we make subvolumes and vfsmounts equivalent we break both properties.
> And I think that's wrong or at least really ugly.
> 
> And I already see that the suggested workaround for (2) will somehow end
> up being stashing device numbers in struct mount or struct vfsmount so
> we can show it in mountinfo and if that's the case I want to express a
> proactive nak for that solution.
> 
> The way I see it is that a subvolume at the end is nothing but a
> subclass of directories a special one but whatever.

As far as I understand the problem, subvolumes indeed seem closer to
special directories than anything else. They slightly resemble what ext4 &
xfs implement with project quotas (where each inode can have an additional
recursively inherited "project id"). What breaks this "special directory"
kind of view for btrfs is that subvolumes have overlapping inode numbers.
Since we don't seem to have a way of getting out of the current situation
in a "seamless" way anyway, I wonder if implementing a btrfs feature to
provide unique inode numbers across all subvolumes would not be the
cleanest way out...

> I would feel much more comfortable if the two filesystems that expose
> these objects give us something like STATX_SUBVOLUME that userspace can
> raise in the request mask of statx().
> 
> If userspace requests STATX_SUBVOLUME in the request mask, the two
> filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> return the _real_ device number of the superblock and stop exposing that
> made up device number.
> 
> This can be accompanied by a vfs ioctl that is identical for both btrfs
> and bcachefs and returns $whatever unique property to mark the inode
> space of the subvolume.
> 
> And then we leave innocent vfs objects alone and we also avoid
> bringing in all that heavy vfsmount machinery on top of subvolumes.

Well, but this requires application knowledge of a new type of object - a
subvolume. So you'd have to teach every application that tries to identify
whether two "filenames" point to the same object or not about this, and that
seems like a never-ending story. Hence either we live with fake devices on
btrfs forever, or we need to find some other solution to the "inode numbers
overlap across subvolumes" problem within "standard unix" APIs.
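The "same object" test those applications perform is the classic (st_dev, st_ino) comparison; a minimal sketch of it (this is also what Python's own os.path.samefile does):

```python
import os


def same_object(path_a: str, path_b: str) -> bool:
    """Return True if the two paths refer to the same inode.

    This is the standard-Unix identity check that tools like diff and
    tar's hardlink detection rely on: two paths name the same object
    iff both st_dev and st_ino match.  If distinct btrfs subvolumes
    reported one shared st_dev while reusing inode numbers, this check
    would produce false positives for unrelated files.
    """
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```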

								Honza
Christian Brauner Nov. 6, 2023, 9:52 a.m. UTC | #40
On Mon, Nov 06, 2023 at 10:03:55AM +0100, Jan Kara wrote:
> On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> > On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > > But at that point we really need to ask if it makes sense to use
> > > > vfsmounts per subvolume in the first place:
> > > > 
> > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > > >     system up to deadlocks.
> > > > (3) We change btrfs semantics drastically to the point where they need a
> > > >     new mount, module, or Kconfig option.
> > > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > > >     because you need to create a mount for the subvolume.
> > > > 
> > > > So right now, I don't see how we can make this work even if the concept
> > > > doesn't seem necessarily wrong.
> > > 
> > > How else do you want to solve it?  Crossing a mount point is the
> > > only legitimate boundary for changing st_dev and having a new inode
> > > number space.  And we can't fix that retroactively.
> > 
> > I think the idea of using vfsmounts for this makes some sense if the
> > goal is to retroactively justify and accommodate the idea that a
> > subvolume is to be treated as equivalent to a separate device.
> > 
> > I question that premise though. I think marking them with separate
> > device numbers is bringing us nothing but pain at this point and this
> > solution is basically bending the vfs to make that work somehow.
> > 
> > And the worst thing is that I think that treating subvolumes like
> > vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> > 
> > Right now all that vfsmounts technically are is a topological
> > abstraction on top of filesystem objects such as files, directories,
> > sockets, even devices that are exposed as filesystems objects. None of
> > them get to muck with core properties of what a vfsmount is though.
> > 
> > Additionally, vfsmount are tied to a superblock and share the device
> > numbers with the superblock they belong to.
> > 
> > If we make subvolumes and vfsmounts equivalent we break both properties.
> > And I think that's wrong or at least really ugly.
> > 
> > And I already see that the suggested workaround for (2) will somehow end
> > up being stashing device numbers in struct mount or struct vfsmount so
> > we can show it in mountinfo and if that's the case I want to express a
> > proactive nak for that solution.
> > 
> > The way I see it is that a subvolume at the end is nothing but a
> > subclass of directories a special one but whatever.
> 
> As far as I understand the problem, subvolumes indeed seem closer to
> special directories than anything else. They slightly resemble what ext4 &
> xfs implement with project quotas (where each inode can have an additional
> recursively inherited "project id"). What breaks this "special directory"
> kind of view for btrfs is that subvolumes have overlapping inode numbers.
> Since we don't seem to have a way of getting out of the current situation
> in a "seamless" way anyway, I wonder if implementing a btrfs feature to
> provide unique inode numbers across all subvolumes would not be the
> cleanest way out...
> 
> > I would feel much more comfortable if the two filesystems that expose
> > these objects give us something like STATX_SUBVOLUME that userspace can
> > raise in the request mask of statx().
> > 
> > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > return the _real_ device number of the superblock and stop exposing that
> > made up device number.
> > 
> > This can be accompanied by a vfs ioctl that is identical for both btrfs
> > and bcachefs and returns $whatever unique property to mark the inode
> > space of the subvolume.
> > 
> > And then we leave innocent vfs objects alone and we also avoid
> > bringing in all that heavy vfsmount machinery on top of subvolumes.
> 
> Well, but this requires application knowledge of a new type of object - a
> subvolume. So you'd have to teach all applications that try to identify
> whether two "filenames" point to the same object or not about this and that
> seems like a neverending story. Hence either we will live with fake devices

But that is what's happening today already, no? All tools need to figure
out that they are on a btrfs subvolume somehow whenever they want to do
something meaningful to it. systemd code is full of special btrfs
handling code.

I don't understand why we're bending and breaking ourselves to somehow
make a filesystem specific, special object fit into standard apis when
it clearly breaks standard apis?
Christian Brauner Nov. 6, 2023, 9:56 a.m. UTC | #41
> Another thing is that the st_dev situation has to be kept, as there are too
> many legacy programs that rely on it to distinguish btrfs subvolume
> boundaries. Unfortunately this can never be changed, even if we had some
> better solution (like the proposed extra subvolid through statx).

It would retain backwards compatibility, as userspace would need to
explicitly query for STATX_SUBVOLUME; otherwise it gets the fake
st_dev.

> This is where I believe per-subvolume vfsmounts and automount behavior for
> subvolumes can help a lot.

Very much opposed to this at this point. I've seen the code for this in
the prior patchset and its implications; nothing about it makes me want
this.
Christian Brauner Nov. 6, 2023, 10:03 a.m. UTC | #42
> > I would feel much more comfortable if the two filesystems that expose
> > these objects give us something like STATX_SUBVOLUME that userspace can
> > raise in the request mask of statx().
> 
> Except that this doesn't fix any existing code.

But why do we care?
Current code already does need to know it is on a btrfs subvolume. They
all know that btrfs subvolumes are special. They will need to know that
btrfs subvolumes are special in the future even if they were vfsmounts.
They would likely end up with another kind of confusion because suddenly
vfsmounts have device numbers that aren't associated with the superblock
that vfsmount belongs to.

So nothing is really solved by vfsmounts either. The only thing that we
achieved is that we somehow accommodated that st_dev hack. And that I
consider nakable.

> 
> > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > return the _real_ device number of the superblock and stop exposing that
> > made up device number.
> 
> What is a "real" device number?

The device number of the superblock of the btrfs filesystem and not some
made-up device number.

I care about not making a btrfs specific problem the vfs's problem by
hoisting that whole problem space a level up by mapping subvolumes to
vfsmounts.
Qu Wenruo Nov. 6, 2023, 10:41 a.m. UTC | #43
On 2023/11/6 20:33, Christian Brauner wrote:
>>> I would feel much more comfortable if the two filesystems that expose
>>> these objects give us something like STATX_SUBVOLUME that userspace can
>>> raise in the request mask of statx().
>>
>> Except that this doesn't fix any existing code.
>
> But why do we care?
> Current code already does need to know it is on a btrfs subvolume.

Not really; the user space tools don't care whether it's btrfs or not.

They just check st_dev, and when they find a point where st_dev changes,
they know there is a boundary.
They don't care if it's a btrfs subvolume boundary or a regular
filesystem boundary.

Even if they switch to statx, they don't really care whether it's something
called subvolid or whatever; they just care how to distinguish a boundary.
Maybe it's a fsid/subvolid combination or whatever, they just want a way
to determine the boundary.

And st_dev is the perfect proxy. I don't think there is a better way to
distinguish the boundary, even if we have statx().

> They
> all know that btrfs subvolumes are special. They will need to know that
> btrfs subvolumes are special in the future even if they were vfsmounts.
> They would likely end up with another kind of confusion because suddenly
> vfsmounts have device numbers that aren't associated with the superblock
> that vfsmount belongs to.

This looks like you are asking user space programs (especially legacy
ones) to do special handling for btrfs, which I don't believe is the
standard way.

>
> So nothing is really solved by vfsmounts either. The only thing that we
> achieved is that we somehow accommodated that st_dev hack. And that I
> consider nakable.

I think this is the problem.

If we keep the existing behavior, at least old programs won't complain
and we're still POSIX compatible, but with a limited number of subvolumes
(which can be more or less worked around, and has been the case for a while).

If we change st_dev, then first of all, to what value? The same for the
whole btrfs? That would be a big behavior break.

It's really a compatibility problem, and it would take a long time to
find an acceptable compromise, but never a sudden change.


You can of course complain about the vision that one fs should report
the same st_dev no matter what, but my counterargument is that each
subvolume really is a different tree, and btrfs combines the PV/VG/LV
layers into one.

Thus either we treat subvolumes as LVs, so they get different device
numbers from each other (just like what we do now, and still what I
believe we should do).

Or we treat the whole btrfs as a VG, which should still get a device
number different from all the PVs (a made-up device id, but shared
between all subvolumes, breaking the existing behavior).

But never treating a btrfs as a PV, because that makes no sense.
>
>>
>>> If userspace requests STATX_SUBVOLUME in the request mask, the two
>>> filesystems raise STATX_SUBVOLUME in the statx result mask and then also
>>> return the _real_ device number of the superblock and stop exposing that
>>> made up device number.
>>
>> What is a "real" device number?
>
> The device number of the superblock of the btrfs filesystem and not some
> made-up device number.

Then again, which device for a multi-device btrfs?

The one with the lowest devid? That can go away with device rm.
The one used for mounting? That can go away too.

A made-up one? Then what's the difference? We'd be going the VG way,
breaking the existing programs, and achieving nothing.

Thanks,
Qu

>
> I care about not making a btrfs specific problem the vfs's problem by
> hoisting that whole problem space a level up by mapping subvolumes to
> vfsmounts.
Christian Brauner Nov. 6, 2023, 10:59 a.m. UTC | #44
> > They
> > all know that btrfs subvolumes are special. They will need to know that
> > btrfs subvolumes are special in the future even if they were vfsmounts.
> > They would likely end up with another kind of confusion because suddenly
> > vfsmounts have device numbers that aren't associated with the superblock
> > that vfsmount belongs to.
> 
> This looks like you are asking user space programs (especially legacy
> ones) to do special handling for btrfs, which I don't believe is the
> standard way.

I think spending time engaging this claim isn't worth it. This is just
easily falsifiable via a simple grep for btrfs in systemd, lxc, runc,
util-linux.

And yes, I'm definitely asking userspace to change behavior if they want
to retrieve additional information about btrfs subvolumes. We're
exposing a new api.

You get the same problem if you make subvolumes vfsmounts. Userspace
will have to adapt anyway. New APIs don't come for free and especially
not ones that suddenly pop 10 vfsmounts into your mountinfo during a
simple lookup operation.

> 
> > 
> > So nothing is really solved by vfsmounts either. The only thing that we
> > achieved is that we somehow accommodated that st_dev hack. And that I
> > consider nakable.
> 
> I think this is the problem.
> 
> If we keep the existing behavior, at least old programs won't complain
> and we're still POSIX compatible, but limited number of subvolumes
> (which can be more or less worked around, and is there for a while).
> 
> If we change the st_dev, firstly to what value? All the same for the
> same btrfs? Then a big behavior break.
> 
> It's really a compatibility problem, and it would take a long time to
> find a acceptable compromise, but never a sudden change.

This is a mischaracterization. And I'm repeating from my last mail,
st_dev wouldn't need to change. You can keep doing what you're doing
right now if you want to. We're talking about a new api to allow
differentiating subvolumes that is purely opt-in through statx().

> You can of course complain about the vision that one fs should report
> the same st_dev no matter what, but my counter argument is, for
> subvolume it's really a different tree for each one, and btrfs is
> combining the PV/VG/LV into one layer.
> 
> Thus either we go treat subvolumes as LVs, thus they would have
> different devices numbers from each other. (just like what we do for
> now, and still what I believe we should go)
> 
> Or we treat it as a VG, which should still a different device number
> from all the PVs. (A made-up device id, but shared between all
> subvolumes, and break up the existing behavior)
> 
> But never treating a btrfs as a PV, because that makes no sense.

Whatever this paragraph is supposed to tell me I don't get it.

You are reporting a single st_dev for every single btrfs mount right now
including bind mounts in mountinfo.

What you're asking is to make each subvolume a vfsmount, then show these
vfsmounts in mountinfo and report made-up device numbers for them. Which
is a massive uapi change.
Jan Kara Nov. 6, 2023, 12:22 p.m. UTC | #45
On Mon 06-11-23 10:52:24, Christian Brauner wrote:
> On Mon, Nov 06, 2023 at 10:03:55AM +0100, Jan Kara wrote:
> > On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> > > On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > > > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > > > But at that point we really need to ask if it makes sense to use
> > > > > vfsmounts per subvolume in the first place:
> > > > > 
> > > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > > > >     system up to deadlocks.
> > > > > (3) We change btrfs semantics drastically to the point where they need a
> > > > >     new mount, module, or Kconfig option.
> > > > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > > > >     because you need to create a mount for the subvolume.
> > > > > 
> > > > > So right now, I don't see how we can make this work even if the concept
> > > > > doesn't seem necessarily wrong.
> > > > 
> > > > How else do you want to solve it?  Crossing a mount point is the
> > > > only legitimate boundary for changing st_dev and having a new inode
> > > > number space.  And we can't fix that retroactively.
> > > 
> > > I think the idea of using vfsmounts for this makes some sense if the
> > > goal is to retroactively justify and accommodate the idea that a
> > > subvolume is to be treated as equivalent to a separate device.
> > > 
> > > I question that premise though. I think marking them with separate
> > > device numbers is bringing us nothing but pain at this point and this
> > > solution is basically bending the vfs to make that work somehow.
> > > 
> > > And the worst thing is that I think that treating subvolumes like
> > > vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> > > 
> > > Right now all that vfsmounts technically are is a topological
> > > abstraction on top of filesystem objects such as files, directories,
> > > sockets, even devices that are exposed as filesystems objects. None of
> > > them get to muck with core properties of what a vfsmount is though.
> > > 
> > > Additionally, vfsmount are tied to a superblock and share the device
> > > numbers with the superblock they belong to.
> > > 
> > > If we make subvolumes and vfsmounts equivalent we break both properties.
> > > And I think that's wrong or at least really ugly.
> > > 
> > > And I already see that the suggested workaround for (2) will somehow end
> > > up being stashing device numbers in struct mount or struct vfsmount so
> > > we can show it in mountinfo and if that's the case I want to express a
> > > proactive nak for that solution.
> > > 
> > > The way I see it is that a subvolume at the end is nothing but a
> > > subclass of directories a special one but whatever.
> > 
> > As far as I understand the problem, subvolumes indeed seem closer to
> > special directories than anything else. They slightly resemble what ext4 &
> > xfs implement with project quotas (were each inode can have additional
> > recursively inherited "project id"). What breaks this "special directory"
> > kind of view for btrfs is that subvolumes have overlapping inode numbers.
> > Since we don't seem to have a way of getting out of the current situation
> > in a "seamless" way anyway, I wonder if implementing a btrfs feature to
> > provide unique inode numbers across all subvolumes would not be the
> > cleanest way out...
> > 
> > > I would feel much more comfortable if the two filesystems that expose
> > > these objects give us something like STATX_SUBVOLUME that userspace can
> > > raise in the request mask of statx().
> > > 
> > > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > > return the _real_ device number of the superblock and stop exposing that
> > > made up device number.
> > > 
> > > This can be accompanied by a vfs ioctl that is identical for both btrfs
> > > and bcachefs and returns $whatever unique property to mark the inode
> > > space of the subvolume.
> > > 
> > > And then we leave innocent vfs objects alone and we also avoid
> > > bringing in all that heavy vfsmount machinery on top of subvolumes.
> > 
> > Well, but this requires application knowledge of a new type of object - a
> > subvolume. So you'd have to teach all applications that try to identify
> > whether two "filenames" point to the same object or not about this and that
> > seems like a neverending story. Hence either we will live with fake devices
> 
> But that is what's happening today already, no? All tools need to figure
> out that they are on a btrfs subvolume somehow whenever they want to do
> something meaningful to it. systemd code is full of special btrfs
> handling code.

Yes, for systemd, util-linux or similar tools, there's probably no way they
can avoid knowing about btrfs. If your API makes life easier for them, sure
we can do it. But I was speaking more about tools like diff or tar which
want to find out if two paths lead to the same object (inode) or not. For
such tools I'd hope we can avoid introducing the special subvolume
awareness...

> I don't understand why we're bending and breaking ourselves to somehow
> make a filesystem specific, special object fit into standard apis when
> it clearly breaks standard apis?

Firstly, I'm not hung up on any particular solution (or even keeping status
quo). I was under the impression (maybe wrong) that Christoph would like to
eventually get rid of reporting different st_dev in stat(2) for different
subvolumes and different fsids in statfs(2) as well. So I was thinking
about possibilities for that.

								Honza
'Christoph Hellwig' Nov. 6, 2023, 12:25 p.m. UTC | #46
On Mon, Nov 06, 2023 at 06:48:11PM +1030, Qu Wenruo wrote:
> > st_dev has only been very historically about treating something as
> > a device.  For userspace the most important part is that it designates
> > a separate domain for inode numbers.  And that's something that's simply
> > broken in btrfs.
> 
> In fact, I'm not sure if the "treating something as a device" thing is even
> correct long before btrfs.

It never really has been.  There are just two APIs that ever did this,
ustat and the old quotactl.  Both were deprecated a long time ago and
never had wide use.

> Take, for example, an ext4 filesystem with an external log device. Thankfully
> it's still more or less obvious that we should use the device number of the
> main fs, not the log device, but such examples already existed.

More relevant (as the log device never has persistent data) is the XFS
realtime device.
'Christoph Hellwig' Nov. 6, 2023, 12:29 p.m. UTC | #47
On Mon, Nov 06, 2023 at 11:03:37AM +0100, Christian Brauner wrote:
> But why do we care?
> Current code already does need to know it is on a btrfs subvolume. They
> all know that btrfs subvolumes are special.

"they all know" is a bit vague.  How do you know "all" code knows?

> They will need to know that
> btrfs subvolumes are special in the future even if they were vfsmounts.
> They would likely end up with another kind of confusion because suddenly
> vfsmounts have device numbers that aren't associated with the superblock
> that vfsmount belongs to.

Let's take a step back.  POSIX says st_ino is unique for a given
st_dev, and per POSIX a mount point is defined as any file that
has a different st_dev from its parent.  So by the POSIX definition
btrfs subvolume roots are mount points, which is an obvious clash
with the Linux definition based on vfsmounts.

> > > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > > return the _real_ device number of the superblock and stop exposing that
> > > made up device number.
> > 
> > What is a "real" device number?
> 
> The device number of the superblock of the btrfs filesystem and not some
> made-up device number.

The block device st_dev is just as made up.

> I care about not making a btrfs specific problem the vfs's problem by
> hoisting that whole problem space a level up by mapping subvolumes to
> vfsmounts.

While I'd love to fix it, and even more to not have more of this
crap sneak in (*cough* bcachefs *cough*), I'm ok with that stance.
But that also means we can't let this creep into the vfs by other
means, which is what started the thread.
'Christoph Hellwig' Nov. 6, 2023, 12:30 p.m. UTC | #48
On Mon, Nov 06, 2023 at 11:59:22AM +0100, Christian Brauner wrote:
> > > They
> > > all know that btrfs subvolumes are special. They will need to know that
> > > btrfs subvolumes are special in the future even if they were vfsmounts.
> > > They would likely end up with another kind of confusion because suddenly
> > > vfsmounts have device numbers that aren't associated with the superblock
> > > that vfsmount belongs to.
> > 
> > This looks like you are asking user space programs (especially legacy
> > ones) to do special handling for btrfs, which I don't believe is the
> > standard way.
> 
> I think spending time engaging this claim isn't worth it. This is just
> easily falsifiable via a simple grep for btrfs in systemd, lxc, runc,
> util-linux.

Maybe you need to get out of your little bubble.  There is plenty of
code outside the fast-moving Linux desktop and container bubbles that
takes decades to adapt to new file systems, and then it'll take time
again to find bugs exposed by such odd behavior.
Christian Brauner Nov. 6, 2023, 1:05 p.m. UTC | #49
On Mon, Nov 06, 2023 at 04:30:57AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 11:59:22AM +0100, Christian Brauner wrote:
> > > > They
> > > > all know that btrfs subvolumes are special. They will need to know that
> > > > btrfs subvolumes are special in the future even if they were vfsmounts.
> > > > They would likely end up with another kind of confusion because suddenly
> > > > vfsmounts have device numbers that aren't associated with the superblock
> > > > that vfsmount belongs to.
> > > 
> > > This looks like you are asking user space programs (especially legacy
> > > ones) to do special handling for btrfs, which I don't believe is the
> > > standard way.
> > 
> > I think spending time engaging this claim isn't worth it. This is just
> > easily falsifiable via a simple grep for btrfs in systemd, lxc, runc,
> > util-linux.
> 
> Maybe you need to get out of your little bubble.  There is plenty of

Unnecessary personal comment, let alone that I'm not in any specific
bubble just because I'm trying to be aware of what is currently going on
in userspace.

> code outside the fast-moving Linux desktop and container bubbles that
> takes decades to adapt to new file systems, and then it'll take time
> again to find bugs exposed by such odd behavior.

So what?

Whatever you do here: vfsmounts or any other solution will force changes
in userspace on a larger scale and changes to the filesystem itself. If
you accommodate tar then you are fscking over other parts of userspace
which are equally important. There is no free lunch.

Second, we're not going to shy away from changes just because it takes
a long time for them to be adopted. That's just not a position for us to take.
Christian Brauner Nov. 6, 2023, 1:47 p.m. UTC | #50
On Mon, Nov 06, 2023 at 04:29:23AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 11:03:37AM +0100, Christian Brauner wrote:
> > But why do we care?
> > Current code already does need to know it is on a btrfs subvolume. They
> > all know that btrfs subvolumes are special.
> 
> "they all know" is a bit vague.  How do you know "all" code knows?

Granted, an over-generalization, but not in any way different from
claiming that currently no one needs to know about btrfs subvolumes or
that the proposed vfsmount solution will magically make it so that no
one needs to care anymore.

Tools will have to change either way; that is my point. And a lot of
tools already handle subvolumes specially, exactly because of the
non-unique inode situation. And if they don't, they can still get confused
by seeing st_dev numbers they can't associate with a filesystem.

> > They will need to know that
> > btrfs subvolumes are special in the future even if they were vfsmounts.
> > They would likely end up with another kind of confusion because suddenly
> > vfsmounts have device numbers that aren't associated with the superblock
> > that vfsmount belongs to.
> 
> Let's take a step back.  POSIX says st_ino is unique for a given
> st_dev, and per POSIX a mount point is defined as any file that
> has a different st_dev from its parent.  So by the POSIX definition
> btrfs subvolume roots are mount points, which is an obvious clash
> with the Linux definition based on vfsmounts.

3.229 Mount Point
Either the system root directory or a directory for which the st_dev
field of structure stat differs from that of its parent directory.

I think that's just an argument against mapping subvolumes to vfsmounts.
Because bind-mounts don't change the device number - and they very much
shouldn't.

> 
> > > > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > > > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > > > return the _real_ device number of the superblock and stop exposing that
> > > > made up device number.
> > > 
> > > What is a "real" device number?
> > 
> > The device number of the superblock of the btrfs filesystem and not some
> > made-up device number.
> 
> The block device st_dev is just as made up.
> 
> > I care about not making a btrfs specific problem the vfs's problem by
> > hoisting that whole problem space a level up by mapping subvolumes to
> > vfsmounts.
> 
> While I'd love to fix it, and even more not have more of this
> crap sneak in (*cough* bcachefs, *cough*). I'm ok with that stance.
> But that also means we can't let this creep into the vfs by other
> means, which is what started the thread.

The thing is I'm not even sure there's anything to fix.

This discussion started with btrfs maybe getting an alternative way to
uniquify an inode independent of st_dev.

I'm not sure that is such a massive problem.

If we give both btrfs and bcachefs a single flag in statx() that allows
_interested_ userspace to query whether a file is located on a subvolume
that shouldn't be a problem (we have STATX_ATTR_* which identifies
additional properties that are restricted to a few filesystems).

And all the specific gobbledigook can be implemented as an ioctl() -
ideally both btrfs and bcachefs agree on something - that the vfs
doesn't have to care about at all.

I genuinely don't care if they report a fake st_dev from stat(). I
genuinely _do_ care that we don't make vfsmounts privy to this.

Let alone that automounts are a giant pain. Not only do they, IIRC, allow
the creation of shadow mounts, they also interact with namespace and
container creation.

If you spawn thousands of containers each with a private mount namespace
- which is the default - you now trigger automounts in thousands of
containers when triggering a lookup on btrfs. If you have mount
propagation turned on each automount may also propagate into god knows
how many other mount namespaces. That's just nasty.

IOW, making subvolumes vfsmounts will also have wider semantic
implications for using btrfs as a filesystem.
'Christoph Hellwig' Nov. 6, 2023, 5:10 p.m. UTC | #51
On Mon, Nov 06, 2023 at 02:05:45PM +0100, Christian Brauner wrote:
> > > I think spending time engaging this claim isn't worth it. This is just
> > > easily falsifiable via a simple grep for btrfs in systemd, lxc, runc,
> > > util-linux.
> > 
> > Maybe you need to get out of your little bubble.  There is plenty of
> 
> Unnecessary personal comment, let alone that I'm not in any specific
> bubble just because I'm trying to be aware of what is currently going on
> in userspace.

Maybe you're just taking it too personally?  A place where systemd, lxc,
runc, and util-linux are "all software" is very much a bubble, as you
won't find much userspace that stays more up to date with the particular
quirks of modern Linux features.

> Whatever you do here: vfsmounts or any other solution will force changes
> in userspace on a larger scale and changes to the filesystem itself. If
> you accommodate tar then you are fscking over other parts of userspace
> which are equally important. There is no free lunch.

It works for everything that knows that a Linux mount point, as exposed
in /proc/mounts and /proc/self/mountinfo, corresponds to the POSIX
definition of a mount point, and to the one used on basically every
other Unix system.  It might not work as-is for software that actually
knows how to manage btrfs subvolumes specifically, but those are, by
definition, not the problem anyway.

It's things like backup tools that run into random ino_t duplicates.
That's an example we had in the past, and I would not be at all
surprised if there are more of those hiding right now.
'Christoph Hellwig' Nov. 6, 2023, 5:13 p.m. UTC | #52
On Mon, Nov 06, 2023 at 02:47:16PM +0100, Christian Brauner wrote:
> Granted, an over-generalization, but not in any way different from
> claiming that currently no one needs to know about btrfs subvolumes or
> that the proposed vfsmount solution will magically make it so that no
> one needs to care anymore.

I don't think anyone has claimed "no one" needs to care any more.  What
the vfsmounts buy us is that software that doesn't know and
shouldn't know about btrfs subvolumes isn't silently broken.  Software
that actually wants to do something fancy with them always needs special
casing.

> Tools will have to change either way; that is my point. And a lot of
> tools already handle subvolumes specially, exactly because of the
> non-unique inode situation. And if they don't, they can still get confused
> by seeing st_dev numbers they can't associate with a filesystem.

Again, tools that actually are related to subvolume features are
not even the problem.
Josef Bacik Nov. 6, 2023, 10:42 p.m. UTC | #53
On Mon, Nov 06, 2023 at 09:13:19AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 02:47:16PM +0100, Christian Brauner wrote:
> > Granted, an over-generalization, but not in any way different from
> > claiming that currently no one needs to know about btrfs subvolumes or
> > that the proposed vfsmount solution will magically make it so that no
> > one needs to care anymore.
> 
> I don't think anyone has claimed "no one" needs to care any more.  What
> the vfsmounts buy us is that software that doesn't know and
> shouldn't know about btrfs subvolumes isn't silently broken.  Software
> that actually wants to do something fancy with them always needs special
> casing.

Again, this is where I'm confused, because this doesn't change anything, we're
still going to report st_dev as being different, which is what you hate.

You pointed out above that user space thinks that different st_dev means
different inode space.  That is why Chris put the st_dev hack in, because rsync
got confused, and this made it not confused.

So we're actually doing exactly what user space wants, letting them know that
they've wandered into a different inode number space when they cross into a
subvolume.

But that's the crux of the problem, new subvolume == new inode number space.
I'm not changing this.  I don't think you want us to change this.

We want to let user space that doesn't care to know about btrfs be able to
operate cleanly.  We have that with the st_dev hack.  Changing to vfsmounts
doesn't fix this.  Userspace doesn't know it's wandered into a new directory,
and in fact if you look at the rsync code they have a special python script you
have to use if you want to exclude bind mounts, which is all the automount thing
would accomplish.

At this point I don't care; tell me what you want me to do so you'll stop
complaining any time we try to expose more btrfs-specific information to user
space and I'll do it.  I'm tired of having this argument.  You would have had
automount patches in your inbox last week, but it took me longer to get the new
mount API conversion done than anticipated.

But it doesn't appear to me there's agreement on the way forward.  vfsmounts
aren't going to do anything from what I can tell, but I could very well be
missing some detail.  And Christian doesn't seem to want that as a solution.
Thanks,

Josef
Christian Brauner Nov. 7, 2023, 8:58 a.m. UTC | #54
On Mon, Nov 06, 2023 at 09:10:28AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 02:05:45PM +0100, Christian Brauner wrote:
> > > > I think spending time engaging this claim isn't worth it. This is just
> > > > easily falsifiable via a simple grep for btrfs in systemd, lxc, runc,
> > > > util-linux.
> > > 
> > > Maybe you need to get out of your little bubble.  There is plenty of
> > 
> > Unnecessary personal comment, let alone that I'm not in any specific
> > bubble just because I'm trying to be aware of what is currently going on
> > in userspace.
> 
> Maybe you're just taking it too personally?  A place where systemd, lxc,

Of course I'm taking that personally. It's a personal comment that
unnecessarily accompanies the message that you think I discount software
that isn't all that modern. Which is a fair point. It doesn't have to
come with garnish about me living in a bubble.

Which is also a little insulting because you very well know that I spend
hours each cycle fixing all kinds of weird regressions from software
from the stone age.

> runc, and util-linux are "all software" is very much a bubble, as you
> won't find much userspace that stays more up to date with the particular
> quirks of modern Linux features.

You assume that your solution doesn't break things. And it will. For
btrfs users and for other userspace tools as well. As detailed in other
parts of the thread.

> 
> > Whatever you do here: vfsmounts or any other solution will force changes
> > in userspace on a larger scale and changes to the filesystem itself. If
> > you accommodate tar then you are fscking over other parts of userspace
> > which are equally important. There is no free lunch.
> 
> It works for everything that knows that a Linux mount point, as exposed
> in /proc/mounts and /proc/self/mountinfo, corresponds to the POSIX
> definition of a mount point, and to the one used on basically every
> other Unix system.  It might not work as-is for software that actually
> knows how to manage btrfs subvolumes specifically, but those are, by
> definition, not the problem anyway.

On current systems and since forever bind-mounts do not change device
numbers unless they are a new filesystem mount. Making subvolumes
vfsmounts breaks that. That'll mean a uapi change for
/proc/<pid>/mountinfo for a start.

It also risks immediately breaking basic stuff such as associating
vfsmounts with the superblock they belong to via device numbers from
parsing /proc/<pid>/mountinfo.

And see other mails for other side-effects of this.

All of this also discounts the invasive effects that it will have when
you suddenly start plopping automounts into the mount tables of
processes on lookup and propagating subvolumes as vfsmounts into random
mount namespaces. I've detailed problems with automounts that btrfs
would get themselves into elsewhere so I'm not going to repeat it here.
Christian Brauner Nov. 7, 2023, 9:06 a.m. UTC | #55
> But it doesn't appear to me there's agreement on the way forward.  vfsmounts
> aren't going to do anything from what I can tell, but I could very well be
> missing some detail.  And Christian doesn't seem to want that as a solution.

No, I really don't.

I still think that it is fine to add a statx() flag indicating that a
file is located on a subvolume. That allows interested userspace to
bounce to an ioctl() where you can give it additional information in
whatever form you like.
'Christoph Hellwig' Nov. 8, 2023, 7:51 a.m. UTC | #56
On Mon, Nov 06, 2023 at 05:42:10PM -0500, Josef Bacik wrote:
> Again, this is where I'm confused, because this doesn't change anything, we're
> still going to report st_dev as being different, which is what you hate.

It's not something I hate.  It's that changing it without a mount point
has broken things and will probably still break things.
'Christoph Hellwig' Nov. 8, 2023, 7:52 a.m. UTC | #57
On Tue, Nov 07, 2023 at 10:06:18AM +0100, Christian Brauner wrote:
> > But it doesn't appear to me there's agreement on the way forward.  vfsmounts
> > aren't going to do anything from what I can tell, but I could very well be
> > missing some detail.  And Christian doesn't seem to want that as a solution.
> 
> No, I really don't.
> 
> I still think that it is fine to add a statx() flag indicating that a
> file is located on a subvolume. That allows interested userspace to
> bounce to an ioctl() where you can give it additional information in
> whatever form you like.

What is that flag going to buy us?
'Christoph Hellwig' Nov. 8, 2023, 7:56 a.m. UTC | #58
On Tue, Nov 07, 2023 at 09:58:48AM +0100, Christian Brauner wrote:
> > Maybe you're just taking it too personally?  A place where systemd, lxc,
> 
> Of course I'm taking that personal. It's a personal comment that
> unnecessarily accompanies the message that you think I discount software
> that isn't all that modern. Which is a fair point. It doesn't have to
> come with garnish about me living in a bubble.

I'm not sure why you're trying to overload words with meanings that
weren't said.  If you list software that is, both by its place in the
food chain and by its developer community, very up to date on low-level
Linux quirks (sometimes too much, btw - we really should accommodate
them better), it shows a somewhat limited view, which is the definition
of a bubble.  There is absolutely no implication that this is intentional
or even malicious.

> > definition of a mount point, and that one used on basically every
> > other unix system.  It might not work as-is for software that actually
> > particularly knows how to manage btrfs subvolumes, but those are, by
> > definition, not the problem anyway.
> 
> On current systems and since forever bind-mounts do not change device
> numbers unless they are a new filesystem mount. Making subvolumes
> vfsmounts breaks that. That'll mean a uapi change for
> /proc/<pid>/mountinfo for a start.

a bind mount can of course change the dev_t - if it points to a
different super block at the moment.
Christian Brauner Nov. 8, 2023, 8:09 a.m. UTC | #59
> of a bubble.  There is absolutely no implication that this is intentional
> or even malicious.

Ok, sometimes it's easy to miss nuances in mail which is why such
comments are easy to misread.

> 
> > > definition of a mount point, and that one used on basically every
> > > other unix system.  It might not work as-is for software that actually
> > > particularly knows how to manage btrfs subvolumes, but those are, by
> > > definition, not the problem anyway.
> > 
> > On current systems and since forever bind-mounts do not change device
> > numbers unless they are a new filesystem mount. Making subvolumes
> > vfsmounts breaks that. That'll mean a uapi change for
> > /proc/<pid>/mountinfo for a start.
> 
> a bind mount can of course change the dev_t - if it points to a
> different super block at the moment.

No, a bind mount just takes an existing directory tree of an existing
filesystem and makes it visible at some location in the filesystem
hierarchy. It doesn't change the device number; it inherits it from
the superblock it belongs to. mount -t xfs /dev/sda /mnt creates a new
filesystem and a first mount for that filesystem. Any other additional
bind-mount off of that will inherit the same device id in
/proc/<pid>/mountinfo.
'Christoph Hellwig' Nov. 8, 2023, 8:12 a.m. UTC | #60
On Wed, Nov 08, 2023 at 09:09:10AM +0100, Christian Brauner wrote:
> > a bind mount can of course change the dev_t - if it points to a
> > different super block at the moment.
> 
> No, a bind mount just takes an existing directory tree of an existing
> filesystem and makes it visible at some location in the filesystem
> hierarchy. It doesn't change the device number; it inherits it from
> the superblock it belongs to.

That's what I'm trying to say.

So if you have:

	/mnt/1/ with dev 8
	/mnt/2/ with dev 23

then a bind mount of

	/mnt/1/foo to /mnt/2/bar will get you dev 8 for /mnt/2/bar

So the device number changes at the mount point here, bind mount or not.
Christian Brauner Nov. 8, 2023, 8:22 a.m. UTC | #61
On Wed, Nov 08, 2023 at 12:12:36AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 09:09:10AM +0100, Christian Brauner wrote:
> > > a bind mount can of course change the dev_t - if it points to a
> > > different super block at the moment.
> > 
> > No, a bind mount just takes an existing directory tree of an existing
> > filesystem and makes it visible on some location in the filesystem
> > hierarchy. It doesn't change the device number it will inherit it from
> > the superblock it belongs.
> 
> That's what I'm trying to say.
> 
> So if you have:
> 
> 	/mnt/1/ with dev 8
> 	/mnt/2/ with dev 23
> 
> then a bind mount of
> 
> 	/mnt/1/foo to /mnt/2/bar will get your dev 8 for /mnt/2/bar
> 
> So the device number changes at the mount point here, bind mount or not.

Yes, I know. But /mnt/2/ will have the device number of the
superblock/filesystem it belongs to and so will /mnt/1. Creating a
bind-mount won't suddenly change the device number, decoupling it
from the superblock it is a bind-mount of.
Christian Brauner Nov. 8, 2023, 8:27 a.m. UTC | #62
On Tue, Nov 07, 2023 at 11:52:18PM -0800, Christoph Hellwig wrote:
> On Tue, Nov 07, 2023 at 10:06:18AM +0100, Christian Brauner wrote:
> > > But it doesn't appear to me there's agreement on the way forward.  vfsmounts
> > > aren't going to do anything from what I can tell, but I could very well be
> > > missing some detail.  And Christian doesn't seem to want that as a solution.
> > 
> > No, I really don't.
> > 
> > I still think that it is fine to add a statx() flag indicating that a
> > file is located on a subvolume. That allows interested userspace to
> > bounce to an ioctl() where you can give it additional information in
> > whatever form you like.
> 
> What is that flag going to buy us?

The initial list that Josef provided in
https://lore.kernel.org/linux-btrfs/20231025210654.GA2892534@perftesting
asks to give users a way to figure out whether a file is located on a
subvolume. Which I think is reasonable and there's a good chunk of
software out there that could really benefit from this. Now all of the
additional info that is requested doesn't need to live in statx(). But
that flag can serve as an indicator for userspace that they are on a
subvolume and that they can go to btrfs specific ioctls if they want to
figure out more details.
Jan Kara Nov. 8, 2023, 11:08 a.m. UTC | #63
On Tue 07-11-23 23:51:58, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 05:42:10PM -0500, Josef Bacik wrote:
> > Again, this is where I'm confused, because this doesn't change anything, we're
> > still going to report st_dev as being different, which is what you hate.
> 
> It's not something I hate.  It's that changing it without a mount point
> has broken things and will probably still break things.

So let me maybe return to what has started this thread. For fanotify we
return a <fsid, fhandle> pair with events to identify the object where something
happened. The fact that fsid is not uniform for all inodes of a superblock
on btrfs is what triggered this series because we are then faced with the
problem that caching fsid per superblock for "superblock marks" (to save
CPU overhead when generating events) can lead to somewhat confusing results
on btrfs. Whether we have vfsmount in the places where inodes' st_dev /
fsid change is irrelevant for this fanotify issue. As far as I'm following
the discussion it seems the non-uniform fsids per superblock are going to
stay with us on btrfs so fanotify code should better accommodate them? At
least by making sure the behavior is somewhat consistent and documented...

								Honza
'Christoph Hellwig' Nov. 8, 2023, 2:07 p.m. UTC | #64
On Wed, Nov 08, 2023 at 09:22:33AM +0100, Christian Brauner wrote:
> > 	/mnt/1/foo to /mnt/2/bar will get your dev 8 for /mnt/2/bar
> > 
> > So the device number changes at the mount point here, bind mount or not.
> 
> Yes, I know. But /mnt/2/ will have the device number of the
> superblock/filesystem it belongs to and so will /mnt/1. Creating a
> bind-mount won't suddenly change the device number and decoupling it
> from the superblock it is a bind-mount of.

It doesn't any more than just changing st_dev.  But at least it aligns
to the boundary that such a change always aligned to in not just Linux
but most (if not all?) Unix variants and thus where it is expected.
'Christoph Hellwig' Nov. 8, 2023, 2:08 p.m. UTC | #65
On Wed, Nov 08, 2023 at 09:27:44AM +0100, Christian Brauner wrote:
> > What is that flag going to buy us?
> 
> The initial list that Josef provided in
> https://lore.kernel.org/linux-btrfs/20231025210654.GA2892534@perftesting
> asks to give users a way to figure out whether a file is located on a
> subvolume. Which I think is reasonable and there's a good chunk of
> software out there that could really benefit from this. Now all of the
> additional info that is requested doesn't need to live in statx(). But
> that flag can serve as an indicator for userspace that they are on a
> subvolume and that they can go to btrfs specific ioctls if they want to
> figure out more details.

Well, if we want to legitimize the historic btrfs behavior, the way to
find out is whether st_dev changes without that being a mount point, so
an extra flag would be redundant.
'Christoph Hellwig' Nov. 8, 2023, 2:11 p.m. UTC | #66
On Wed, Nov 08, 2023 at 12:08:14PM +0100, Jan Kara wrote:
> On Tue 07-11-23 23:51:58, Christoph Hellwig wrote:
> > On Mon, Nov 06, 2023 at 05:42:10PM -0500, Josef Bacik wrote:
> > > Again, this is where I'm confused, because this doesn't change anything, we're
> > > still going to report st_dev as being different, which is what you hate.
> > 
> > It's not something I hate.  It's that changing it without a mount point
> > has broken things and will probably still break things.
> 
> So let me maybe return to what has started this thread. For fanotify we
> return a <fsid, fhandle> pair with events to identify the object where something
> happened. The fact that fsid is not uniform for all inodes of a superblock
> on btrfs is what triggered this series because we are then faced with the
> problem that caching fsid per superblock for "superblock marks" (to save
> CPU overhead when generating events) can lead to somewhat confusing results
> on btrfs. Whether we have vfsmount in the places where inodes' st_dev /
> fsid change is irrelevant for this fanotify issue. As far as I'm following
> the discussion it seems the non-uniform fsids per superblock are going to
> stay with us on btrfs so fanotify code should better accommodate them? At
> least by making sure the behavior is somewhat consistent and documented...

I'd say if you want fanotify to work properly you must not switch st_dev
and diverge from the known behavior.  Just like you already do
for tons of other things that use sb->s_dev or an identifier derived
from it, as we've got plenty of those.
Christian Brauner Nov. 8, 2023, 3:57 p.m. UTC | #67
On Wed, Nov 08, 2023 at 06:07:10AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 09:22:33AM +0100, Christian Brauner wrote:
> > > 	/mnt/1/foo to /mnt/2/bar will get your dev 8 for /mnt/2/bar
> > > 
> > > So the device number changes at the mount point here, bind mount or not.
> > 
> > Yes, I know. But /mnt/2/ will have the device number of the
> > superblock/filesystem it belongs to and so will /mnt/1. Creating a
> > bind-mount won't suddenly change the device number and decoupling it
> > from the superblock it is a bind-mount of.
> 
> It doesn't any more than just changing st_dev.  But at least it aligns
> to the boundary that such a change always aligned to in not just Linux
> but most (if not all?) Unix variants and thus where it is expected.

I'm not parsing that sentence fully tbh. But the point stands that this
alters how mountinfo works. bind-mounts don't deviate from the device
number of the superblock they belong to, and there's no reason to tie
that btrfs-specific st_dev change in stat() to vfsmounts. That's
not going to happen and has been rejected before.

And that is on top of all other problems with making subvolumes
automounted vfsmounts that were outlined.
Christian Brauner Nov. 8, 2023, 4:16 p.m. UTC | #68
On Wed, Nov 08, 2023 at 06:08:09AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 09:27:44AM +0100, Christian Brauner wrote:
> > > What is that flag going to buy us?
> > 
> > The initial list that Josef provided in
> > https://lore.kernel.org/linux-btrfs/20231025210654.GA2892534@perftesting
> > asks to give users a way to figure out whether a file is located on a
> > subvolume. Which I think is reasonable and there's a good chunk of
> > software out there that could really benefit from this. Now all of the
> > additional info that is requested doesn't need to live in statx(). But
> > that flag can serve as an indicator for userspace that they are on a
> > subvolume and that they can go to btrfs specific ioctls if they want to
> > figure out more details.
> 
> Well, if we want to legitimize the historic btrfs behavior the way to
> find out is if st_dev changes without that being a mount point, so
> an extra flag would be redundant.

The device number may also change on overlayfs per directory in certain
circumstances, so it doesn't work in the general case.

Plus that requires a lot of gymnastics in the general case as you need
to statx() the file, call statfs() to figure out that it is a btrfs
filesystem, retrieve the device number of the superblock/filesystem and
compare that with the device number returned from stat(). And that's the
btrfs specific case. For bcachefs this doesn't work because it doesn't
seem to change st_dev.
Christian Brauner Nov. 8, 2023, 4:20 p.m. UTC | #69
On Wed, Nov 08, 2023 at 05:16:38PM +0100, Christian Brauner wrote:
> On Wed, Nov 08, 2023 at 06:08:09AM -0800, Christoph Hellwig wrote:
> > On Wed, Nov 08, 2023 at 09:27:44AM +0100, Christian Brauner wrote:
> > > > What is that flag going to buy us?
> > > 
> > > The initial list that Josef provided in
> > > https://lore.kernel.org/linux-btrfs/20231025210654.GA2892534@perftesting
> > > asks to give users a way to figure out whether a file is located on a
> > > subvolume. Which I think is reasonable and there's a good chunk of
> > > software out there that could really benefit from this. Now all of the

I explained myself badly here. What I mean and what is immediately
useful is to add STATX_ATTR_SUBVOLUME_ROOT which works for both btrfs
and bcachefs and makes it easy for userspace to figure out whether an
inode is the root of a subvolume:

(This won't compile obviously.)

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 166d8d8abe68..fce8603d37b0 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -776,6 +776,10 @@ static int bch2_getattr(struct mnt_idmap *idmap,
                stat->attributes |= STATX_ATTR_NODUMP;
        stat->attributes_mask    |= STATX_ATTR_NODUMP;

+       if (bch2_is_subvol_root(inode))
+               stat->attributes |= STATX_ATTR_SUBVOLUME_ROOT;
+       stat->attributes_mask |= STATX_ATTR_SUBVOLUME_ROOT;
+
        return 0;
 }

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e3fccddde0c..c339a9a08d7e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8657,10 +8657,14 @@ static int btrfs_getattr(struct mnt_idmap *idmap,
        if (bi_ro_flags & BTRFS_INODE_RO_VERITY)
                stat->attributes |= STATX_ATTR_VERITY;

+       if (btrfs_is_subvol_root(inode))
+               stat->attributes        |= STATX_ATTR_SUBVOLUME_ROOT;
+
        stat->attributes_mask |= (STATX_ATTR_APPEND |
                                  STATX_ATTR_COMPRESSED |
                                  STATX_ATTR_IMMUTABLE |
-                                 STATX_ATTR_NODUMP);
+                                 STATX_ATTR_NODUMP |
+                                 STATX_ATTR_SUBVOLUME_ROOT);

        generic_fillattr(idmap, request_mask, inode, stat);
        stat->dev = BTRFS_I(inode)->root->anon_dev;
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..24d493babe63 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -187,6 +187,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED           0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT           0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_MOUNT_ROOT          0x00002000 /* Root of a mount */
+#define STATX_ATTR_SUBVOLUME_ROOT      0x00004000 /* Root of a subvolume */
 #define STATX_ATTR_VERITY              0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX                 0x00200000 /* File is currently in DAX state */

This would be a pretty big help for userspace already.

Right now all code needs to do a stat() and a statfs() and then check
the inode number. And that likely only works for btrfs.

This would also allow tools that want to detect when they're crossing
into a new subvolume - be it on btrfs or bcachefs - to take appropriate
measures, deciding what they want to do relying just on statx() without
any additional system calls.

And I think that's something we should do.
'Christoph Hellwig' Nov. 9, 2023, 6:53 a.m. UTC | #70
On Wed, Nov 08, 2023 at 05:16:32PM +0100, Christian Brauner wrote:
> > Well, if we want to legitimize the historic btrfs behavior the way to
> > find out is if st_dev changes without that being a mount point, so
> > an extra flag would be redundant.
> 
> The device number may also change on overlayfs per directory in certain
> circumstances so it doesn't work in the general case.
> 
> Plus that requires a lot of gymnastics in the general case as you need
> to statx() the file, call statfs() to figure out that it is a btrfs
> filesystem, retrieve the device number of the superblock/filesystem and
> compare that with the device number returned from stat(). And that's the
> btrfs specific case.

Why would you care about the device of the super block?  You need to
compare it with the parent.

> For bcachefs this doesn't work because it doesn't
> seem to change st_dev.

Well, that's going to be really broken then.  But hey, if we merge
these kinds of things without review we'll have to live with it :(

But maybe we just need to throw in the towel when we have three
file systems now that think they can just do random undocumented and
not backward compatible things with their inode numbers and declare
the inode number a completely meaningless cookie..
'Christoph Hellwig' Nov. 9, 2023, 6:55 a.m. UTC | #71
On Wed, Nov 08, 2023 at 05:20:06PM +0100, Christian Brauner wrote:
> This would also allow tools that want to detect when they're crossing
> into a new subvolume - be it on btrfs or bcachefs - to take appropriate
> measures, deciding what they want to do relying just on statx() without
> any additional system calls.

How?  If they want to only rely on Posix and not just the historical
unix/linux behavior they need to compare st_dev for the inode and its
parent to see if it is the Posix concept of a mount point (not to be
confused with the Linux concept of a mountpoint apparently) because
that allows the file system to use a new inode number namespace.
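The parent comparison described above can be sketched in a few lines; this is a hypothetical illustration (the helper name `crosses_device` is invented here, not part of any existing tool):

```python
import os

def crosses_device(path):
    """Return True if `path` has a different st_dev than its parent
    directory, i.e. it is a boundary in the Posix mount-point sense
    (a check a btrfs subvolume root would also satisfy)."""
    path = os.path.abspath(path)
    parent = os.path.dirname(path) or "/"
    # lstat so that a symlink is compared as itself, not its target.
    return os.lstat(path).st_dev != os.lstat(parent).st_dev
```

Note that, as discussed in this thread, this check only identifies where st_dev changes; it cannot by itself distinguish a Linux mount from a btrfs subvolume or an overlayfs non-uniform st_dev case.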
Christian Brauner Nov. 9, 2023, 9:07 a.m. UTC | #72
On Wed, Nov 08, 2023 at 10:55:52PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 08, 2023 at 05:20:06PM +0100, Christian Brauner wrote:
> > This would also allow tools that want to detect when they're crossing
> > into a new subvolume - be it on btrfs or bcachefs - and take appropriate
> > measures deciding what they want to do just relying on statx() without
> > any additional system calls.
> 
> How?  If they want to only rely on Posix and not just the historical
> unix/linux behavior they need to compare st_dev for the inode and its
> parent to see if it is the Posix concept of a mount point (not to be
> confused with the Linux concept of a mountpoint apparently) because
> that allows the file system to use a new inode number namespace.

That doesn't work anymore. Both overlayfs and btrfs make this
impossible or at least inconsistent.

Tools that want to rely on that definition can continue to do so and
really just ignore any of the new features. But tools that want to know
about this and adjust behavior can really benefit from this. Just
marking an inode as a subvolume root is worth it without committing to
any filesystem specifics now that we have two of them.
'Christoph Hellwig' Nov. 9, 2023, 2:41 p.m. UTC | #73
On Thu, Nov 09, 2023 at 10:07:35AM +0100, Christian Brauner wrote:
> > How?  If they want to only rely on Posix and not just the historical
> > unix/linux behavior they need to compare st_dev for the inode and its
> > parent to see if it is the Posix concept of a mount point (not to be
> > confused with the Linux concept of a mountpoint apparently) because
> > that allows the file system to use a new inode number namespace.
> 
> That doesn't work anymore. Both overlayfs and btrfs make this
> impossible or at least inconsistent.

Once you hit a different st_dev on btrfs you'll stay on that until
you hit a mount point or another (nested) subvolume.  Can't comment
on overlayfs.  But if it keeps mixing things back and forth, what would
the semantics of the flag be anyway?
Christian Brauner Nov. 10, 2023, 9:33 a.m. UTC | #74
> you hit a mount point or another (nested) subvolume.  Can't comment
> on overlayfs.  But if it keeps mixing things back and forth, what would

Overlayfs shows that this st_dev switching happens on things other than
btrfs. It has nothing to do with subvolumes.
Amir Goldstein Nov. 10, 2023, 10:31 a.m. UTC | #75
On Fri, Nov 10, 2023 at 11:33 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > you hit a mount point or another (nested) subvolume.  Can't comment
> > on overlayfs.  But if it keeps mixing things back and forth, what would
>
> Overlayfs shows that this st_dev switching happens on things other than
> btrfs. It has nothing to do with subvolumes.

IMO, overlayfs is a good example of pragmatism.

On some filesystems, following the POSIX standard verbatim is not possible.
We care more about the users and about how the POSIX standard affects
real-life applications than about actually following the standard.

This table is complex, but it explains the tradeoffs that overlayfs
makes when following strict POSIX is not possible:
https://docs.kernel.org/filesystems/overlayfs.html#inode-properties

In the Legacy case, where Uniform st_dev is not possible,
overlayfs preserves these important *practical* rules:
1. Unique st_dev,st_ino
2. Uniform st_dev across all directories (non-persistent st_ino)
3. Persistent st_ino for non-directories (non-uniform st_dev)

Rule #2 is important for traversal boundaries (e.g. find -xdev, du -x)
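The boundary check that rule #2 enables is the same pruning `find -xdev` and `du -x` perform; a minimal sketch (the helper name `walk_one_device` is hypothetical, not an existing API):

```python
import os

def walk_one_device(root):
    """Yield paths under `root`, pruning any subtree whose st_dev
    differs from root's -- the -xdev / -x traversal boundary."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place edit of dirnames prunes os.walk's descent into
        # directories that sit on a different device.
        dirnames[:] = [d for d in dirnames
                       if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev]
        yield dirpath
        for name in filenames:
            yield os.path.join(dirpath, name)
```

With uniform st_dev across directories (rule #2), a whole overlayfs instance stays inside one such traversal instead of being split at arbitrary points.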

It's not mentioned in this table, but overlayfs f_fsid is and always
was uniform. Since v6.6, overlayfs f_fsid is also unique among
overlayfs instances.

I don't know of a good way to stop a thread where all that needs
to be said has been said, but in my opinion this thread lost
focus a long time ago.

I will post a new patch for fanotify with Jan's proposal to address
Christoph's concerns:
- no ->get_fsid() method
- no sb/mount mark on btrfs subvol

Thanks,
Amir.