Message ID: 20210322171118.446536-1-amir73il@gmail.com (mailing list archive)
State: New, archived
Series: xfs: use a unique and persistent value for f_fsid
On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
>
> xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
>
> Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.

The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....

> The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.

To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but can't use it for anything useful, i.e. its use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and it was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...

IMO, if fanotify needs a persistent filesystem ID on Linux, it should be using something common across all filesystems from the Linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in its entirety. The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to, because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....

> Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.

It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes. However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...

And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....

> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Guys,
>
> This change would be useful for fanotify users.
> Do you see any problems with that minor change of uapi?

Yes.

IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem-specific behaviour of a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on, and it can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...

Cheers, Dave.
On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> >
> > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> >
> > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
>
> The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....

Let's take it from the start. There is no requirement for fanotify to get a persistent fs id; we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.

See quote from fanotify.7:

" The fields of the fanotify_event_info_fid structure are as follows:
...
fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).

file_handle
       This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
..."

So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch), and a secondary objective is to resolve a path from the identifier, which requires extra privileges.

This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while the filesystem is mounted should suffice, as that is also the lifetime of the fanotify mark.
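[Editor's sketch: the fanotify_event_info_fid record quoted from fanotify.7 above pairs an f_fsid with a file_handle in each event. This toy parser shows how a listener might split one such record; the layout is simplified from the linux/fanotify.h and fcntl.h uapi headers, and the packing/offsets here are an illustration under that assumption, not an authoritative definition.]

```python
import struct

# Simplified layout of struct fanotify_event_info_fid (linux/fanotify.h):
#   struct fanotify_event_info_header { __u8 info_type; __u8 pad; __u16 len; }
#   __kernel_fsid_t fsid;          /* two 32-bit words, same value as statfs(2) f_fsid */
#   struct file_handle {           /* as used by open_by_handle_at(2) */
#       __u32 handle_bytes; int handle_type; unsigned char f_handle[]; }
INFO_HDR = struct.Struct("<BBH")   # info_type, pad, len
FSID = struct.Struct("<ii")        # __kernel_fsid_t
FH_HDR = struct.Struct("<Ii")      # handle_bytes, handle_type

def parse_info_fid(buf):
    """Split one FAN_EVENT_INFO_TYPE_FID record into (fsid, handle_type, f_handle)."""
    info_type, _pad, _len = INFO_HDR.unpack_from(buf, 0)
    fsid = FSID.unpack_from(buf, INFO_HDR.size)
    off = INFO_HDR.size + FSID.size
    handle_bytes, handle_type = FH_HDR.unpack_from(buf, off)
    f_handle = buf[off + FH_HDR.size : off + FH_HDR.size + handle_bytes]
    return fsid, handle_type, f_handle
```

The fsid half is only meaningful to a listener that has sampled f_fsid via statfs(2) beforehand, which is exactly the usage pattern being discussed in this thread.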
But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per se that xfs filesystems can change and reuse f_fsid; it is just less friendly, that's all.

I am trying to understand your objection to making this "friendly" change.

> > The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.
>
> To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but it can't use it for anything useful.

It is a unique identifier and that is a useful thing - see demo code:
* Index watches by fanotify fid: https://github.com/amir73il/inotify-tools/commit/ed82098b54b847e3c2d46b32d35b2aa537a9ba94
* Handle global watches on several filesystems: https://github.com/amir73il/inotify-tools/commit/1188ef00dc84964de58afb32c91e19930ad1e2e8

> i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...

Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local filesystems. This is not exposing anything new at all. I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.

> IMO, if fanotify needs a persistent filesystem ID on Linux, it

It does not *need* that. It's just nicer for f_fsid to use a persistent fs identifier if one is anyway available.

> should be using something common across all filesystems from the linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in it's entirety.

Yes, nfsd is using dark internal and AFAIK undocumented magic to pick that identifier (Bruce, am I wrong?).

> The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....

nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.

> > Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.
>
> It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes.

Yes. I know. But it's still much better than the bdev number, IMO.

> However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
>
> And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....

Both of the caveats of uuid you mentioned are not a big concern for fanotify, because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).

> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Guys,
> >
> > This change would be useful for fanotify users.
> > Do you see any problems with that minor change of uapi?
>
> Yes.
>
> IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...

The fanotify uapi guarantee is to provide the same value of f_fsid observed by the statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:

"...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."

Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from the uuid?

One thing we could do is, in the "-o nouuid" case that you mentioned, continue to use the bdev number for f_fsid. Would you like me to make that change?

Thanks, Amir.
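[Editor's sketch: the "sample f_fsid before setting the multi-fs watch" approach described above can be demonstrated with plain statvfs(3). Python's os.statvfs exposes f_fsid since Python 3.7 (glibc derives it from the statfs(2) f_fsid); this helper is an illustration of the collision check a listener could do, not code from the patch.]

```python
import os

def fsid_collisions(paths):
    """Sample f_fsid for each mount point up front and report any collisions
    a multi-filesystem fanotify listener would have to work around (e.g. two
    "-o nouuid" snapshot mounts reporting the same uuid-derived fsid)."""
    by_fsid = {}
    for p in paths:
        fsid = os.statvfs(p).f_fsid   # f_fsid attribute: Python >= 3.7
        by_fsid.setdefault(fsid, []).append(p)
    # Only fsids claimed by more than one path are ambiguous.
    return {fsid: ps for fsid, ps in by_fsid.items() if len(ps) > 1}
```

A listener that finds a non-empty result can fall back to another strategy, such as the separate-listener-per-filesystem approach mentioned later in the thread.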
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > >
> > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > >
> > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> >
> > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
>
> Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.
>
> See quote from fanotify.7:
>
> " The fields of the fanotify_event_info_fid structure are as follows:
> ...
> fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
>
> file_handle
>        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).

Hmmmm.... so I guess you'd /like/ a file handle that will survive across unmount/mount cycles, and possibly even a reboot?

I looked at the first commit, and I guess you use name_to_handle_at, which returns a mount_id that is.... that weird number in the leftmost column of /proc/mountinfo, which increments monotonically for each mount and definitely doesn't survive a remount cycle, let alone a reboot? Hence wanting to use something less volatile than mnt_id_ida...?

My natural inclination is "just use whatever NFS does", but ... then I saw fh_compose and realized that the fsid part of an NFS handle depends on the export options and probably isn't all that easy for someone who isn't an nfs client to extract. Was that how you arrived at using the statfs fsid field?

...except XFS doesn't guarantee that fsid is particularly unique or stable, since a reboot can enumerate blockdevs in a different order and hence the dev_t will change. UUIDs also aren't a great idea because you can snapshot an fs and mount it with nouuid, and now a "unique" file handle can map ambiguously to two different files.

Urgh, I'm gonna have to think about this one, all the options suck. fanotify might be smart enough to handle ambiguous file handles but now I wonder how dumb programs react to that.

Also it's 23:30 here. :)

--D

> ..."
>
> So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch) and a secondary objective is to resolve a path from the identifier, which requires extra privileges.
>
> This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while filesystem is mounted should suffice as that is also the lifetime of the fanotify mark.
>
> But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per-se that xfs filesystems can change and reuse f_fsid, it is just less friendly that's all.
>
> I am trying to understand your objection to making this "friendly" change.
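[Editor's sketch: the mount_id that name_to_handle_at(2) fills in, and the opaque handle that carries no fsid of its own, can both be inspected from unprivileged userspace. This ctypes sketch uses the glibc wrapper and the struct file_handle layout from the open_by_handle_at(2) man page; it is illustrative only, and reading a handle back with open_by_handle_at(2) would additionally require CAP_DAC_READ_SEARCH.]

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
AT_FDCWD = -100        # from fcntl.h
MAX_HANDLE_SZ = 128    # from the open_by_handle_at(2) man page

class FileHandle(ctypes.Structure):
    # struct file_handle: handle_bytes, handle_type, then opaque f_handle[]
    _fields_ = [("handle_bytes", ctypes.c_uint),
                ("handle_type", ctypes.c_int),
                ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ)]

def path_to_handle(path):
    """Return (mount_id, opaque handle bytes) for path via name_to_handle_at(2).
    The mount_id is the volatile /proc/mountinfo number discussed above; the
    handle bytes identify the object only within its filesystem, which is why
    fanotify pairs the handle with f_fsid."""
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    if libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                              ctypes.byref(fh), ctypes.byref(mount_id), 0):
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return mount_id.value, bytes(fh.f_handle[:fh.handle_bytes])
```

Running this across a remount of the same filesystem shows the handle bytes staying stable while the mount_id changes, which is the volatility Darrick is pointing at.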
> > > The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.
> >
> > To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but it can't use it for anything useful.
>
> It is a unique identifier and that is a useful thing - see demo code:
> * Index watches by fanotify fid: https://github.com/amir73il/inotify-tools/commit/ed82098b54b847e3c2d46b32d35b2aa537a9ba94
> * Handle global watches on several filesystems: https://github.com/amir73il/inotify-tools/commit/1188ef00dc84964de58afb32c91e19930ad1e2e8
>
> > i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...
>
> Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local fs. This is not exposing anything new at all. I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.
>
> > IMO, if fanotify needs a persistent filesystem ID on Linux, it
>
> It does not *need* that. It's just nicer for f_fsid to use a persistent fs identifier if one is anyway available.
>
> > should be using something common across all filesystems from the linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in it's entirety.
>
> Yes, nfsd is using dark internal and AFAIK undocumented magic to pick that identifier (Bruce, am I wrong?).
>
> > The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....
>
> nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.
>
> > > Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.
> >
> > It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes.
>
> Yes. I know. But it's still much better than the bdev number IMO.
>
> > However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
> >
> > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....
>
> Both of the caveats of uuid you mentioned are not a big concern for fanotify because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).
>
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >
> > > Guys,
> > >
> > > This change would be useful for fanotify users.
> > > Do you see any problems with that minor change of uapi?
> >
> > Yes.
> >
> > IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...
>
> The fanotify uapi guarantee is to provide the same value of f_fsid observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:
>
> "...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."
>
> Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from uuid?
>
> One thing we could do is in the "-o nouuid" case that you mentioned we continue to use the bdev number for f_fsid.
> Would you like me to make that change?
>
> Thanks,
> Amir.
On Tue, Mar 23, 2021 at 8:35 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > > >
> > > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > > >
> > > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> > >
> > > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
> >
> > Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.
> >
> > See quote from fanotify.7:
> >
> > " The fields of the fanotify_event_info_fid structure are as follows:
> > ...
> > fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
> >
> > file_handle
> >        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
>
> Hmmmm.... so I guess you'd /like/ a file handle that will survive across unmount/mount cycles, and possibly even a reboot?
>
> I looked at the first commit, and I guess you use name_to_handle_at, which returns a mount_id that is.... that weird number in the leftmost column of /proc/mountinfo, which increments monotonically for each mount and definitely doesn't survive a remount cycle, let alone a reboot?
>
> Hence wanting to use something less volatile than mnt_id_ida...?
>
> My natural inclination is "just use whatever NFS does", but ... then I saw fh_compose and realized that the fsid part of an NFS handle depends on the export options and probably isn't all that easy for someone who isn't an nfs client to extract.
>
> Was that how you arrived at using the statfs fsid field?

Yes. Exactly.

> ...except XFS doesn't guarantee that fsid is particularly unique or stable, since a reboot can enumerate blockdevs in a different order and hence the dev_t will change. UUIDs also aren't a great idea because you can snapshot an fs and mount it with nouuid, and now a "unique" file handle can map ambiguously to two different files.

As I explained in my reply to Dave, that's not a big issue. If a program wants to listen on events from multiple filesystems, the program will sample the f_fsid of both filesystems before setting up the filesystem watches. If there is a collision, there are other ways to handle this case (i.e. run two separate listener programs).

Also, as I wrote to Dave, I can easily handle the special case of "-o nouuid" by leaving the bdev number for f_fsid in that case.

> Urgh, I'm gonna have to think about this one, all the options suck. fanotify might be smart enough to handle ambiguous file handles but now

Actually, fanotify is completely dumb about this; it's the user of fanotify that needs to be able to do something useful with this information. There is a demo I wrote based on inotifywatch that demonstrates that [1].

Thanks, Amir.

[1] https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > >
> > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > >
> > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> >
> > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
>
> Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.

So why change the code then? If it ain't broke, don't fix it...

> See quote from fanotify.7:
>
> " The fields of the fanotify_event_info_fid structure are as follows:
> ...
> fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
>
> file_handle
>        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
> ..."
>
> So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch) and a secondary objective is to resolve a path from the identifier, which requires extra privileges.
>
> This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while filesystem is mounted should suffice as that is also the lifetime of the fanotify mark.
>
> But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per-se that xfs filesystems can change and reuse f_fsid, it is just less friendly that's all.
>
> I am trying to understand your objection to making this "friendly" change.

"friendly" isn't a useful way to describe whether a change is desirable or whether code works correctly or not. If it's broken or not fit for purpose, it doesn't matter how "friendly" it might be - it's still broken...

I'm asking why using a device encoding is a problem, because it will not change across mount/unmount cycles as the backing device doesn't change. It *may* change across reboots, but then what does fanotify care about in that case? Something has to re-establish all watches from scratch at that point, so who cares if the fsid has changed?

So what problem is "persistence" across reboots solving? You say it's "friendly" but that has no technical definition I know of....

> > i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...
>
> Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local fs. This is not exposing anything new at all.

I'm not suggesting that it is. I'm asking you to explain what the problem it solves is, so I have the context necessary to evaluate what the impact of making such a userspace-visible change might be....

> I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.

No, it's most definitely not. Userspace never sees the inode generation number. The inode generation can only be changed when an inode goes through a life cycle. If you change it in the middle of a referenced inode's life, then you invalidate valid file handles (i.e. ESTALE) for no good reason. But we can change the way we modify the generation number when the inode cycles out of existence without any impact on external APIs, because the change of generation number is necessary to invalidate the filehandle in that case.

So the way we have changed generation number selection over time does not impact the actual users of file handles - they still only get ESTALE when the inode is unlinked and is no longer accessible, and nobody even notices that we changed generation number selection algorithms.

> > The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....
>
> nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.

Userspace filehandles have exactly the same persistence requirements as NFS file handles. fanotify might not have the same requirements as NFS, but that doesn't change the persistence requirements for userspace handles....

> > However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
> >
> > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....
>
> Both of the caveats of uuid you mentioned are not a big concern for fanotify because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).

Sorry, I don't understand what "the nature of f_fsid can be understood" means. What meaning are you trying to infer from 8 bytes of opaque data in f_fsid?

> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >
> > > Guys,
> > >
> > > This change would be useful for fanotify users.
> > > Do you see any problems with that minor change of uapi?
> >
> > Yes.
> >
> > IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...
>
> The fanotify uapi guarantee is to provide the same value of f_fsid observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:
>
> "...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."

Mixed messaging! You start by saying "f_fsid needs persistence", then say "it doesn't need persistence", then say "device number based f_fsid is sub-optimal", then say "the statfs() defined f_fsid is good enough" despite the fact it says "(a variation on) the device number" is a documented way of implementing it and you've said that is sub-optimal. I've got no clue what fanotify wants or needs from f_fsid now.

> Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from uuid?

I'm not insisting that it needs to be the bdev number. I'm trying to understand why it needs to be changed, what the impact of that change is, and whether there are other alternatives before I form an opinion on whether we should make this user visible filesystem identifier change or not...

> One thing we could do is in the "-o nouuid" case that you mentioned we continue to use the bdev number for f_fsid.
> Would you like me to make that change?

No, I need you to stop rushing around in a hurry to change code and assuming that everyone knows every little detail of fanotify and the problems that need to be solved. Slow down and explain clearly and concisely why f_fsid needs to be persistent, how it gets used to optimise <something> when it is persistent, and what the impact of it not being persistent is.

Cheers, Dave.
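[Editor's sketch: the generation-number lifecycle Dave describes above - gen only changes when an inode slot is recycled, which is exactly when old handles must start returning ESTALE - can be captured in a toy model. This is purely illustrative Python, not kernel code; the class and names are invented for the illustration.]

```python
import errno

class InodeTable:
    """Toy model: a file handle embeds (ino, gen); gen bumps only when an
    inode number is recycled, so handles never go stale mid-life and the
    generation selection algorithm can change without breaking users."""
    def __init__(self):
        self.live = {}     # ino -> current generation of the live inode
        self.nextgen = {}  # ino -> next generation to hand out on reuse

    def alloc(self, ino):
        gen = self.nextgen.get(ino, 0)
        self.nextgen[ino] = gen + 1
        self.live[ino] = gen
        return (ino, gen)          # the "file handle"

    def unlink(self, ino):
        del self.live[ino]

    def open_by_handle(self, handle):
        ino, gen = handle
        if self.live.get(ino) != gen:
            raise OSError(errno.ESTALE, "stale file handle")
        return ino
```

A handle taken before an unlink keeps working until the inode number is reused with a new generation, at which point it (and only it) gets ESTALE - the invariant Dave argues f_fsid changes do not share.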
On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote: > > > > Some filesystems on persistent storage backend use a digest of the > > > > filesystem's persistent uuid as the value for f_fsid returned by > > > > statfs(2). > > > > > > > > xfs, as many other filesystem provide the non-persistent block device > > > > number as the value of f_fsid. > > > > > > > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID > > > > for identifying objects using file_handle and f_fsid in events. > > > > > > The filesystem id is encoded into the VFS filehandle - it does not > > > need some special external identifier to identify the filesystem it > > > belongs to.... > > > > > > > Let's take it from the start. > > There is no requirement for fanotify to get a persistent fs id, we just need > > a unique fs id that is known to userspace, so the statfs API is good enough > > for our needs. > > So why change the code then? If it ain't broke, don't fix it... > Fair enough. I am asking for a "nice to have" change. I will try to explain why. [...] > > I am trying to understand your objection to making this "friendly" change. > > "friendly" isn't a useful way to describe whether a change is > desirable or whether code works correctly or not. If it's broken or > not fit for purpose, it doesn't matter how "friendly" it might be - > it's still broken... > > I'm asking why using a device encoding is a problem, because > it will not change across mount/unmount cycles as the backing device > doesn't change. It *may* change across reboots, but then what does > fanotify care about in that case? Something has to re-establish all > watches from scratch at that point, so who cares if the fsid has > changed? > This is correct.
First of all, we need to acknowledge that the fanotify FAN_REPORT_FID interface was introduced in kernel v5.1 and I don't know of any specific users and how future users will be using it. > So what problem is "persistence" across reboots solving? You say > it's "friendly" but that has no technical definition I know of.... > For most use cases, getting a unique fsid that is not "persistent" would be fine. Many use cases will probably be watching a single filesystem and then the value of fsid in the event doesn't matter at all. If, however, at some point in the future, someone were to write a listener that stores events in a persistent queue for later processing it would be more "convenient" if fsid values were "persistent". I'm sorry, but it's hard for me to describe a property of fsid that is nice-to-have without using "convenient" and "friendly" terminology. Technically, there is a way for userland to get the same outcome without making this change in XFS, but I suspect it's not going to be important enough for anyone to care and the end result would be that the end users will suffer unpredictable behavior. > > > i.e. its use is entirely isolated to > > > the file handle interface for identifying the filesystem the handle > > > belongs to. This is messy, but XFS inherited this "fixed fsid" > > > interface from Irix filehandles and was needed to port > > > xfsdump/xfsrestore to Linux. Realistically, it is not functionality > > > that should be duplicated/exposed more widely on Linux... > > > > Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 > > and many more. XFS is really the exception among the big local fs. > > This is not exposing anything new at all. > > I'm not suggesting that it is. I'm asking you to explain what the > problem it solves is so I have the context necessary to evaluate what the > impact of making such a userspace visible change might be.... > I understand.
As I tried to explain, this change does not solve any clear and immediate problem. I hope I was able to explain why it might be beneficial. [...] > > > However, changing the uuid on XFS is an offline (unmounted) > > > operation, so there will be no fanotify marks present when it is > > > changed. Hence when it is remounted, there will be a new f_fsid > > > returned in statvfs(), just like what happens now, and all > > > applications dependent on "persistent" fsids (and persistent > > > filehandles for that matter) will now get ESTALE errors... > > > > > > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) > > > are not unique if you've got snapshots and they've been mounted via > > > "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of > > > the reasons that the duplicate checking exists - so that fshandles > > > are unique and resolve to a single filesystem.... > > > > Both of the caveats of uuid you mentioned are not a big concern for > > fanotify because the nature of f_fsid can be understood by the event > > listener before setting the multi-fs watch (i.e. in case of fsid collision). > > Sorry, I don't understand what "the nature of f_fsid can be > understood" means. What meaning are you trying to infer from 8 bytes > of opaque data in f_fsid? > When the program is requested to watch multiple filesystems, it starts by querying their fsid. In case of an fsid collision, the program knows that it will not be able to tell which filesystem the event originated in, so the program can print a descriptive error to the user. [...] > > The fanotify uapi guarantee is to provide the same value of f_fsid > > observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is > > a bit vague, but it's good enough for our needs: > > > > "...The general idea is that f_fsid contains some random stuff such that the > > pair (f_fsid,ino) uniquely determines a file. 
Some operating systems use > > (a variation on) the device number, or the device number combined with the > > filesystem type..." > > Mixed messaging! > > You start by saying "f_fsid needs persistence", then say "it doesn't > need persistence", then say "device number based f_fsid is > sub-optimal", then saying "the statfs() defined f_fsid is good > enough" despite the fact it says "(a variation on) the device > number" is a documented way of implementing it and you've said that > is sub-optimal. > > I've got no clue what fanotify wants or needs from f_fsid now. > fanotify has two requirements for f_fsid and they are both met for XFS. From fanotify_mark.2: " ENODEV The filesystem object indicated by pathname is not associated with a filesystem that supports fsid (e.g., tmpfs(5)). ... EXDEV The filesystem object indicated by pathname resides within a filesystem subvolume (e.g., btrfs(5)) which uses a different fsid than its root superblock. " The reason I am proposing this change in XFS is not for the needs of fanotify itself but for the needs of its future downstream users. We could defer the decision of making this change to that future time when and if users start complaining. XFS is certainly not the only filesystem that might need this sort of change to f_fsid.
> > No, I need you to stop rushing around in a hurry to change code and > assuming that everyone knows every little detail of fanotify and the > problems that need to be solved. Slow down and explain clearly and > concisely why f_fsid needs to be persistent, how it gets used to > optimise <something> when it is persistent, and what the impact of > it not being persistent is. > Fair enough. I'm in a position of disadvantage having no real users to request this change. XFS is certainly "not broken", so the argument for "not fixing" it is valid. Nevertheless, the bdev number can change on modern systems even without a reboot, for example for loop-mounted images, so please consider investing the time in forming an opinion about making this change for the sake of making f_fsid more meaningful for any caller of statfs(2) not only for fanotify listeners. Leaving fanotify out of the picture, the question that the prospective user is trying to answer is: "Is the object at $PATH or at $FD the same object that was observed at 'an earlier time'?" With XFS, that question can be answered (< 100% certainty) using the XFS_IOC_PATH_TO_FSHANDLE interface. name_to_handle_at(2) + statfs(2) is a generic interface that provides this answer with less certainty, but it could provide the answer with the same certainty for XFS. Thanks, Amir.
On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > For most use cases, getting a unique fsid that is not "persistent" > would be fine. Many use cases will probably be watching a single > filesystem and then the value of fsid in the event doesn't matter at all. > > If, however, at some point in the future, someone were to write > a listener that stores events in a persistent queue for later processing > it would be more "convenient" if fsid values were "persistent". Ok, that is what I suspected - that you want to write filehandles to a userspace database of some sort for later processing and so you need to also store the filesystem that the filehandle belongs to. FWIW, that's something XFS started doing a couple of decades ago for DMAPI based proprietary HSM implementations. They were built around a massive userspace database that indexed the contents of the filesystem via file handles and were kept up to date via notifications through the DMAPI interface. > > > > However, changing the uuid on XFS is an offline (unmounted) > > > > operation, so there will be no fanotify marks present when it is > > > > changed. Hence when it is remounted, there will be a new f_fsid > > > > returned in statvfs(), just like what happens now, and all > > > > applications dependent on "persistent" fsids (and persistent > > > > filehandles for that matter) will now get ESTALE errors... > > > > > > > > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) > > > > are not unique if you've got snapshots and they've been mounted via > > > > "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of > > > > the reasons that the duplicate checking exists - so that fshandles > > > > are unique and resolve to a single filesystem....
> > > > > > Both of the caveats of uuid you mentioned are not a big concern for > > > fanotify because the nature of f_fsid can be understood by the event > > > listener before setting the multi-fs watch (i.e. in case of fsid collision). > > > > Sorry, I don't understand what "the nature of f_fsid can be > > understood" means. What meaning are you trying to infer from 8 bytes > > of opaque data in f_fsid? > > > > When the program is requested to watch multiple filesystems, it starts by > querying their fsid. In case of an fsid collision, the program knows that it > will not be able to tell which filesystem the event originated in, so the > program can print a descriptive error to the user. Ok, so it can handle collisions, but it cannot detect things like two filesystems swapping fsids because device ordering changed at boot time. i.e. there is no way to determine the difference between f_fsid change vs the same filesystems with stable f_fsid being mounted in different locations.... > > > Regardless of the fanotify uapi and whether it's good or bad, do you insist > > > that the value of f_fsid exposed by xfs needs to be the bdev number and > > > not derived from uuid? > > > > I'm not insisting that it needs to be the bdev number. I'm trying to > > understand why it needs to be changed, what the impact of that > > change is and whether there are other alternatives before I form an > > opinion on whether we should make this user visible filesystem > > identifier change or not... > > > > > One thing we could do is in the "-o nouuid" case that you mentioned > > > we continue to use the bdev number for f_fsid. > > > Would you like me to make that change? > > > > No, I need you to stop rushing around in a hurry to change code and > > assuming that everyone knows every little detail of fanotify and the > > problems that need to be solved.
> > Slow down and explain clearly and > > concisely why f_fsid needs to be persistent, how it gets used to > > optimise <something> when it is persistent, and what the impact of > > it not being persistent is. > > > > Fair enough. > I'm in a position of disadvantage having no real users to request this change. > XFS is certainly "not broken", so the argument for "not fixing" it is valid. > > Nevertheless, the bdev number can change on modern systems even without a reboot, > for example for loop-mounted images, so please consider investing the time > in forming an opinion about making this change for the sake of making f_fsid > more meaningful for any caller of statfs(2) not only for fanotify listeners. > > Leaving fanotify out of the picture, the question that the prospective user is > trying to answer is: > "Is the object at $PATH or at $FD the same object that was observed at > 'an earlier time'?" > > With XFS, that question can be answered (< 100% certainty) > using the XFS_IOC_PATH_TO_FSHANDLE interface. Actually, there's a bit more to it than that. See below. > name_to_handle_at(2) + statfs(2) is a generic interface that provides > this answer with less certainty, but it could provide the answer > with the same certainty for XFS. Let me see if I get this straight.... Because the VFS filehandle interface does not cater for this by giving you a fshandle that is persistent, you have to know what path the filehandle was derived from to be able to open a mountfd for open_by_handle_at() on the file handle you have stored in userspace. And name_to_handle_at() returns an ephemeral mount ID, so the kernel does not provide what you need to use open_by_handle_at() immediately. To turn this ephemeral mount ID into a stable identifier you have to look up /proc/self/mounts to find the mount point, then statvfs() the mount point to get the f_fsid.
To use the handle, you then need to open the path to the stored mount point, check that f_fsid still matches what you originally looked up, then you can run open_by_handle_at() on the file handle. If you have an open fd on the filesystem and f_fsid matches, you have the filesystem pinned until you close the mount fd, and so you can just sort your queued filehandles by f_fsid and process them all while you have the mount fd open.... Is that right? But that still leaves a problem in that the VFS filehandle does not contain a filesystem identifier itself, so you can never actually verify that the filehandle belongs to the mount that you opened for that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), which filesystems use to encode inode/gen/parent information. Nothing else is placed in the VFS handle, so the file handle cannot be used to identify what filesystem it came from. These seem like fundamental problems for storing VFS handles across reboots: identifying the filesystem is not atomic with the file handle generation and that identification is not encoded into the file handle for later verification. IOWs, if you get the fsid translation wrong, the filehandle will end up being used on the wrong filesystem and userspace has no way of knowing that this occurred - it will get ESTALE or data that isn't what it expected. Either way, it'll look like data corruption to the application(s). Using f_fsid for this seems fragile to me and has potential to break in unexpected ways in highly dynamic environments. The XFS filehandle exposed by the ioctls, and the NFS filehandle for that matter, both include an identifier for the filesystem they belong to in the file handle. This identifier matches the stable filesystem identifier held by the filesystem (or the NFS export table), and hence the kernel could resolve whether the filehandle itself has been directed at the correct filesystem.
The XFS ioctls do not do this fshandle checking - this is something performed by the libhandle library (part of xfsprogs). libhandle knows the format of the XFS filehandles, so it peeks inside them to extract the fsid to determine where to direct them. Applications must first initialise filesystems that file handles can be used on by calling path_to_fshandle() to populate an open file cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to associate a {fd, path, fshandle} tuple for that filesystem. The libhandle operations then match the fsid embedded in the file handle to the known open fshandles, and if they match it uses the associated fd to issue the ioctl to the correct filesystem. This fshandle fd is held open for as long as the application is running, so it pins the filesystem and so the fshandle obtained at init time is guaranteed to be valid until the application exits. Hence on startup an app simply needs to walk the paths it is interested in and call path_to_fshandle() on all of them, but otherwise it does not need to know what filesystem a filehandle belongs to - the libhandle implementation takes care of that entirely.... IOWs, this whole "find the right filesystem for the file handle" implementation is largely abstracted away from userspace by libhandle. Hence just looking at what the XFS ioctls do does not give the whole picture of how stable filehandles were actually used by applications... I suspect that the right thing to do here is extend the VFS filehandles to contain an 8 byte fsid prefix (supplied by a new export op) and an AT_FSID flag for name_to_handle_at() to return just the 8 byte fsid that is used by handles on that filesystem. This provides the filesystems with a well defined API for providing a stable identifier instead of redefining what filesystems need to return in some other UAPI.
This also means that userspace can be entirely filesystem agnostic and it doesn't need to rely on parsing proc files to translate ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is stable enough that it doesn't get the destination wrong. It also means that fanotify UAPI probably no longer needs to supply a f_fsid with the filehandle because it is built into the filehandle.... Cheers, Dave.
On Wed, Mar 24, 2021 at 2:54 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > For most use cases, getting a unique fsid that is not "persistent" > > would be fine. Many use cases will probably be watching a single > > filesystem and then the value of fsid in the event doesn't matter at all. > > > > If, however, at some point in the future, someone were to write > > a listener that stores events in a persistent queue for later processing > > it would be more "convenient" if fsid values were "persistent". > > Ok, that is what I suspected - that you want to write filehandles to > a userspace database of some sort for later processing and so you > need to also store the filesystem that the filehandle belongs to. > > FWIW, that's something XFS started doing a couple of decades ago for > DMAPI based proprietary HSM implementations. They were built around > a massive userspace database that indexed the contents of the > filesystem via file handles and were kept up to date via > notifications through the DMAPI interface. > History repeats itself, but with different storage tiering layers ;-) [...] > > When the program is requested to watch multiple filesystems, it starts by > > querying their fsid. In case of an fsid collision, the program knows that it > > will not be able to tell which filesystem the event originated in, so the > > program can print a descriptive error to the user. > > Ok, so it can handle collisions, but it cannot detect things like > two filesystems swapping fsids because device ordering changed at > boot time. i.e.
there is no way to determine the difference between > f_fsid change vs the same filesystems with stable f_fsid being > mounted in different locations.... > Correct. [...] > > Leaving fanotify out of the picture, the question that the prospective user is > > trying to answer is: > > "Is the object at $PATH or at $FD the same object that was observed at > > 'an earlier time'?" > > > > With XFS, that question can be answered (< 100% certainty) > > using the XFS_IOC_PATH_TO_FSHANDLE interface. > > Actually, there's a bit more to it than that. See below. > > > name_to_handle_at(2) + statfs(2) is a generic interface that provides > > this answer with less certainty, but it could provide the answer > > with the same certainty for XFS. > > Let me see if I get this straight.... > > Because the VFS filehandle interface does not cater for this by > giving you a fshandle that is persistent, you have to know what path > the filehandle was derived from to be able to open a mountfd for > open_by_handle_at() on the file handle you have stored in userspace. That is what NFS and DMAPI need, but this is not what I asked for. I specifically asked for the ability to answer the question: "Is the object at $PATH or at $FD the same object that was observed at 'an earlier time'?" Note that as opposed to open_by_handle_at(), which requires capabilities, checking the identity of the object does not require any capabilities beyond search/read access permissions to the object. Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) and fstatfs() there are none of the races you mention below and fanotify obviously captures a valid {fsid,fhandle} tuple. > And name_to_handle_at() returns an ephemeral mount ID, so the > kernel does not provide what you need to use open_by_handle_at() > immediately. > > To turn this ephemeral mount ID into a stable identifier you have to > look up /proc/self/mounts to find the mount point, then statvfs() > the mount point to get the f_fsid.
> > To use the handle, you then need to open the path to the stored > mount point, check that f_fsid still matches what you originally > looked up, then you can run open_by_handle_at() on the file handle. > If you have an open fd on the filesystem and f_fsid matches, you > have the filesystem pinned until you close the mount fd, and so you > can just sort your queued filehandles by f_fsid and process them all > while you have the mount fd open.... > > Is that right? It's not wrong, but it's irrelevant to the requirement, which was to *identify* the object, not to *access* the object. See more below... > > But that still leaves a problem in that the VFS filehandle does not > contain a filesystem identifier itself, so you can never actually > verify that the filehandle belongs to the mount that you opened for > that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), > which filesystems use to encode inode/gen/parent information. > Nothing else is placed in the VFS handle, so the file handle cannot > be used to identify what filesystem it came from. > > These seem like fundamental problems for storing VFS handles > across reboots: identifying the filesystem is not atomic with the > file handle generation and that identification is not encoded > into the file handle for later verification. > > IOWs, if you get the fsid translation wrong, the filehandle will end > up being used on the wrong filesystem and userspace has no way of > knowing that this occurred - it will get ESTALE or data that isn't > what it expected. Either way, it'll look like data corruption to the > application(s). Using f_fsid for this seems fragile to me and has > potential to break in unexpected ways in highly dynamic > environments. > The potential damage sounds bad when you put it this way, but in fact it really depends on the use case. For the use case of NFS client it's true you MUST NOT get the wrong object when resolving file handles. With fanotify, this is not the case.
When a listener gets an event with an object identifier, the listener cannot infer the path of that object. If the listener has several objects open (e.g. tail -f A B C) then when getting an event, the identifier can be used to match the open file with certainty (having verified no collisions of identifiers after opening the files). If the listener is watching multiple directories (e.g. inotifywatch --recursive) then the listener has two options: 1. Keep open fds for all watched dirs - this is what inotify_add_watch() does internally (not fds per se but keeping an elevated i_count) 2. Keep a fid->path map for all watched dirs and accept the fact that the cached path information may be stale The 2nd option is valid for applications that use the events as hints to take action. An indexer application, for example, doesn't care if it will scan a directory where there were no changes as long as it will get the correct hint eventually. So if an indexer application were to act on FAN_MOVE events by scanning the entire subtree under the parent dir where an entry was renamed, the index will be eventually consistent, regardless of all the events on objects with stale path cache that may have been received after the rename. > The XFS filehandle exposed by the ioctls, and the NFS filehandle for > that matter, both include an identifier for the filesystem they > belong to in the file handle. This identifier matches the stable > filesystem identifier held by the filesystem (or the NFS export > table), and hence the kernel could resolve whether the filehandle > itself has been directed at the correct filesystem. > > The XFS ioctls do not do this fshandle checking - this is something > performed by the libhandle library (part of xfsprogs). libhandle > knows the format of the XFS filehandles, so it peeks inside them to > extract the fsid to determine where to direct them.
> > Applications must first initialise filesystems that file handles can > be used on by calling path_to_fshandle() to populate an open file > cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to > associate a {fd, path, fshandle} tuple for that filesystem. The > libhandle operations then match the fsid embedded in the file handle > to the known open fshandles, and if they match it uses the > associated fd to issue the ioctl to the correct filesystem. > > This fshandle fd is held open for as long as the application is > running, so it pins the filesystem and so the fshandle obtained at > init time is guaranteed to be valid until the application exits. > Hence on startup an app simply needs to walk the paths it is > interested in and call path_to_fshandle() on all of them, but > otherwise it does not need to know what filesystem a filehandle > belongs to - the libhandle implementation takes care of that > entirely.... > > IOWs, this whole "find the right filesystem for the file handle" > implementation is largely abstracted away from userspace by > libhandle. Hence just looking at what the XFS ioctls do does not > give the whole picture of how stable filehandles were actually used > by applications... > > I suspect that the right thing to do here is extend the VFS > filehandles to contain an 8 byte fsid prefix (supplied by a new > export op) and an AT_FSID flag for name_to_handle_at() to return > just the 8 byte fsid that is used by handles on that filesystem. > This provides the filesystems with a well defined API for providing > a stable identifier instead of redefining what filesystems need to > return in some other UAPI. > > This also means that userspace can be entirely filesystem agnostic > and it doesn't need to rely on parsing proc files to translate > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > stable enough that it doesn't get the destination wrong.
It also > means that fanotify UAPI probably no longer needs to supply a > f_fsid with the filehandle because it is built into the > filehandle.... > That is one option. Let's call it the "bullet proof" option. Another option, let's call it the "pragmatic" option, is that you accept that my patch shouldn't break anything and agree to apply it. In that case, a future indexer (or whatever) application author can use fanotify, name_to_handle_at() and fstatfs() as is and document that after a mount cycle, the indexer may get confused and miss changes in obscure filesystems that nobody uses on desktops and servers. The third option, let's call it the "sad" option, is that we do nothing and said future indexer application author will need to find ways to work around this deficiency or document that after a mount cycle, the indexer may get confused and miss changes in commonly used desktop and server filesystems (i.e. XFS). <side bar> I think that what an indexer author would really want is not "persistent fsid" but rather a "persistent change journal" [1]. I have not abandoned this effort and I have a POC [2] for a new fsnotify backend (not fanotify) based on inputs that you also provided in LSFMM. In this POC, which is temporarily reusing the code of overlayfs index, the persistent identifier of an object is {s_uuid,fhandle}. </side bar> Would you be willing to accept the "pragmatic" option? Thanks, Amir. [1] https://lwn.net/Articles/755277/ [2] https://github.com/amir73il/linux/commits/ovl-watch
On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > This also means that userspace can be entirely filesystem agnostic > > and it doesn't need to rely on parsing proc files to translate > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > stable enough that it doesn't get the destination wrong. It also > > means that fanotify UAPI probably no longer needs to supply a > > f_fsid with the filehandle because it is built into the > > filehandle.... > > > > That is one option. Let's call it the "bullet proof" option. > > Another option, let's call it the "pragmatic" options, is that you accept > that my patch shouldn't break anything and agree to apply it. Your patch may very well break something. Most Linux file systems do store the dev_t in the fsid and userspace may for whatever silly reasons depend on it. Also trying to use the fsid for anything persistent is plain stupid, 64-bits are not enough entropy for such an identifier. You at least need a 128-bit UUID-like identifier for that. So I think this whole discussion is going in the wrong direction. Is exposing a stable file system identifier useful? Yes, for many reasons. Is repurposing the fsid for that a good idea? Hell no.
On Wed, Mar 24, 2021 at 9:43 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > > This also means that userspace can be entirely filesystem agnostic > > > and it doesn't need to rely on parsing proc files to translate > > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > > stable enough that it doesn't get the destination wrong. It also > > > means that fanotify UAPI probably no longer needs to supply a > > > f_fsid with the filehandle because it is built into the > > > filehandle.... > > > > > > > That is one option. Let's call it the "bullet proof" option. > > > > Another option, let's call it the "pragmatic" option, is that you accept > > that my patch shouldn't break anything and agree to apply it. > > Your patch may very well break something. Most Linux file systems do > store the dev_t in the fsid and userspace may for whatever silly > reasons depend on it. > I acknowledge that. I do not claim that my change carries zero risk of breakage. However, if such userspace dependency exists, it would break on ext4, btrfs, ocfs2, ceph and many more fs, so it would have to be a dependency that is tightly coupled with a specific fs. The probability of that is rather low IMO. I propose an opt-in mount option "-o fixed_fsid" for this behavior to make everyone sleep better. > Also trying to use the fsid for anything persistent is plain stupid, > 64-bits are not enough entropy for such an identifier. You at least > need a 128-bit UUID-like identifier for that. > That's a strong claim to make without providing statistical analysis and the description of the use case. The requirement is for a unique identifier of a mounted fs within a single system. > So I think this whole discussion is going in the wrong direction. > Is exposing a stable file system identifier useful? Yes, for many > reasons. Is repurposing the fsid for that a good idea? Hell no. Again.
Strong reaction not backed up by equally strong technical arguments. I am not at all opposed to a new API for stable FS_HANDLE, but no, I am not going to offer writing this new API myself at this point. Applications that use statfs() to identify a filesystem may very well already exist in the wild, so the fixed_fsid discussion is independent of the new API. Considering the fact that many relevant filesystems do provide a stable f_fsid and considering the fact that a blockdev number of devices that are instantiated at boot time are often practically stable, the users of those applications could be unaware of the instability of f_fsid or maybe they can live with it. I find it hard to argue with the "all-or-nothing" attitude in the reaction to my proposed change. What I propose is an opt-in mount option "-o fixed_fsid". IMO, NACKing this proposal is not going to improve anything for anyone, but I don't know what more to say. Thanks, Amir.
On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > On Wed, Mar 24, 2021 at 2:54 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > > > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > Leaving fanotify out of the picture, the question that the prospect user is > > > trying answer is: > > > "Is the object at $PATH or at $FD the same object that was observed at > > > 'an earlier time'?" > > > > > > With XFS, that question can be answered (< 100% certainty) > > > using the XFS_IOC_PATH_TO_FSHANDLE interface. > > > > Actually, there's a bit more to it than that. See below. > > > > > name_to_handle_at(2) + statfs(2) is a generic interface that provides > > > this answer with less certainty, but it could provide the answer > > > with the same certainty for XFS. > > > > Let me see if I get this straight.... > > > > Because the VFS filehandle interface does not cater for this by > > giving you a fshandle that is persistent, you have to know what path > > the filehandle was derived from to be able to open a mountfd for > > open_by_handle_at() on the file handle you have stored in userspace. > > That is what NFS and DMAPI need, but this is not what I asked for. > I specifically asked for the ability to answer the question: > "Is the object at $PATH or at $FD the same object that was observed at > 'an earlier time'?" Define "at an earlier time". Once it means "across system reboots" then the problem scope becomes a lot larger... > Note that as opposed to open_by_handle_at(), which requires > capabilities, checking the identity of the object does not require any > capabilities beyond search/read access permissions to the object. How do you check the identity of a file handle without a filesystem identifier in it? 
having an external f_fsid isn't really sufficient - you need to query the filesystem for its identifier and determine if the file handle contains that same identifier... > > Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) > and fstatfs() there are none of the races you mention below and > fanotify obviously captures a valid {fsid,fhandle} tuple. Except that fsid is not guaranteed to be the same across mounts, let alone reboots. There is no guarantee of uniqueness, either. IOWs, in the bigger picture, f_fsid isn't something that can provide a guaranteed answer to your "is $obj the same as it was at $TIME"... You can't even infer a path from the fsid, even if it is unique. The fsid doesn't tell you what *mount point* it refers to. i.e. in the presence of bind mounts, there can be multiple disjoint directory hierarchies that correspond to the same fsid... > > And that open_by_handle_at() returns an ephemeral mount ID, so the > > kernel does not provide what you need to use open_by_handle_at() > > immediately. > > > > To turn this ephemeral mount ID into a stable identifier you have to > > look up /proc/self/mounts to find the mount point, then statvfs() > > the mount point to get the f_fsid. > > > > To use the handle, you then need to open the path to the stored > > mount point, check that f_fsid still matches what you originally > > looked up, then you can run open_by_handle_at() on the file handle. > > If you have an open fd on the filesystem and f_fsid matches, you > > have the filesystem pinned until you close the mount fd, and so you > > can just sort your queued filehandles by f_fsid and process them all > > while you have the mount fd open.... > > > > Is that right? > > It's not wrong, but it's irrelevant to the requirement, which was to > *identify* the object, not to *access* the object. > See more below... 
> > > > > But that still leaves a problem in that the VFS filehandle does not > > contain a filesystem identifier itself, so you can never actually > > verify that the filehandle belongs to the mount that you opened for > > that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), > > which filesystems use to encode inode/gen/parent information. > > Nothing else is placed in the VFS handle, so the file handle cannot > > be used to identify what filesystem it came from. > > > > These seem like fundamental problems for storing VFS handles > > across reboots: identifying the filesystem is not atomic with the > > file handle generation and that identification is not encoded > > into the file handle for later verification. > > > > IOWs, if you get the fsid translation wrong, the filehandle will end > > up being used on the wrong filesystem and userspace has no way of > > knowing that this occurred - it will get ESTALE or data that isn't > > what it expected. Either way, it'll look like data corruption to the > > application(s). Using f_fsid for this seems fragile to me and has > > potential to break in unexpected ways in highly dynamic > > environments. > > > > The potential damage sounds bad when you put it this way, but in fact > it really depends on the use case. For the use case of NFS client it's true > you MUST NOT get the wrong object when resolving file handles. > > With fanotify, this is not the case. > When a listener gets an event with an object identifier, the listener cannot > infer the path of that object. > > If the listener has several objects open (e.g. tail -f A B C) then when getting > an event, the identifier can be used to match the open file with certainty > (having verified no collisions of identifiers after opening the files). Sorry, you've lost me. How do you reliably match a {fsid, fhandle} to an open file descriptor? You've got to have more information available than just a fd, fsid and a fhandle... 
> If the listener is watching multiple directories (e.g. inotifywatch --recursive) > then the listener has two options: > 1. Keep open fds for all watched dirs - this is what inotify_add_watch() > does internally (not fds per-se but keeping an elevated i_count) > 2. Keep fid->path map for all watched dirs and accept the fact that the > cached path information may be stale > > The 2nd option is valid for applications that use the events as hints > to take action. An indexer application, for example, doesn't care if > it will scan a directory where there were no changes as long as it will > get the correct hint eventually. > > So if an indexer application were to act on FAN_MOVE events by > scanning the entire subtree under the parent dir where an entry was > renamed, the index will be eventually consistent, regardless of all > the events on objects with stale path cache that may have been > received after the rename. Sure, but you don't need a file handle for this. You can just scan the directory hierarchy any time you get a notification for that fsid. Even if you have multiple directory hierarchies that you are watching on a given mount. I'm -guessing- that you are using the filehandle to differentiate between different watched hierarchies, and you do that by taking a name_to_handle_at() snapshot of the path when the watch is set, yes? AFAICT, the application cannot care about whether it loses events across reboot because the indexer already needs to scan after boot so that it is coherent with whatever state the filesystem is in after recovery. > > The XFS filehandle exposed by the ioctls, and the NFS filehandle for > > that matter, both include an identifier for the filesystem they > > belong to in the file handle. This identifier matches the stable > > filesystem identifier held by the filesystem (or the NFS export > > table), and hence the kernel could resolve whether the filehandle > > itself has been directed at the correct filesystem. 
> > > > The XFS ioctls do not do this fshandle checking - this is something > > performed by the libhandle library (part of xfsprogs). libhandle > > knows the format of the XFS filehandles, so it peeks inside them to > > extract the fsid to determine where to direct them. > > > > Applications must first initialise filesystems that file handles can > > be used on by calling path_to_fshandle() to populate an open file > > cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to > > associate a {fd, path, fshandle} tuple for that filesystem. The > > libhandle operations then match the fsid embedded in the file handle > > to the known open fshandles, and if they match it uses the > > associated fd to issue the ioctl to the correct filesystem. > > > > This fshandle fd is held open for as long as the application is > > running, so it pins the filesystem and so the fshandle obtained at > > init time is guaranteed to be valid until the application exits. > > Hence on startup an app simply needs to walk the paths it is > > interested in and call path_to_fshandle() on all of them, but > > otherwise it does not need to know what filesystem a filehandle > > belongs to - the libhandle implementation takes care of that > > entirely.... > > > > IOWs, this whole "find the right filesystem for the file handle" > > implementation is largely abstracted away from userspace by > > libhandle. Hence just looking at what the XFS ioctls do does not > > give the whole picture of how stable filehandles were actually used > > by applications... > > > > I suspect that the right thing to do here is extend the VFS > > filehandles to contain an 8 byte fsid prefix (supplied by a new > > export op) and an AT_FSID flag for name_to_handle_at() to return > > just the 8 byte fsid that is used by handles on that filesystem. 
> > This provides the filesystems with a well defined API for providing > > a stable identifier instead of redefining what filesystems need to > > return in some other UAPI. > > > > This also means that userspace can be entirely filesystem agnostic > > and it doesn't need to rely on parsing proc files to translate > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > stable enough that it doesn't get the destination wrong. It also > > means that fanotify UAPI probably no longer needs to supply a > > f_fsid with the filehandle because it is built into the > > filehandle.... > > > > That is one option. Let's call it the "bullet proof" option. "Reliable". "Provides well defined behaviour". "Guarantees". > Another option, let's call it the "pragmatic" option, is that you accept > that my patch shouldn't break anything and agree to apply it. "shouldn't break anything" is the problem. You can assert all you want that nothing will break, but history tells us that even the most benign UAPI changes can break unexpected stuff in unexpected ways. That's the fundamental problem. We *know* that what you are trying to do with filehandles and fsid has -explicit, well known- issues. What we have here is a new interface that is .... problematic, and now it needs to redefine other parts of the UAPI to make "problems" with the new interface go away. Yes, we really suck at APIs, but that doesn't mean hacking a UAPI around to work around problems in another UAPI is the right answer. > In that case, a future indexer (or whatever) application author can use > fanotify, name_to_handle_at() and fstatfs() as is and document that after > mount cycle, the indexer may get confused and miss changes in obscure > filesystems that nobody uses on desktops and servers. But anyone wanting to use this for an HSM style application that needs a guarantee that the filehandle can be resolved to the correct filesystem for open_by_handle_at() is SOL? 
IOWs, this proposal is not really fixing the underlying problem, it's just kicking the can down the road. > The third option, let's call it the "sad" option, is that we do nothing > and said future indexer application author will need to find ways to > work around this deficiency or document that after mount cycle, the > indexer may get confused and miss changes in commonly used > desktop and server filesystems (i.e. XFS). Which it already needs to do, because there are many, many filesystems out there whose f_fsid changes on every mount. > <side bar> > I think that what indexer author would really want is not "persistent fsid" > but rather a "persistent change journal" [1]. > I have not abandoned this effort and I have a POC [2] for a new fsnotify > backend (not fanotify) based on inputs that you also provided in LSFMM. > In this POC, which is temporarily reusing the code of overlayfs index, > the persistent identifier of an object is {s_uuid,fhandle}. > </side bar> Yup, that's pretty much what HSMs on top of DMAPI did. But then the app developers realised that they can still miss events, especially when the system crashes. Not to mention that the filesystem may not actually replay all the changes it reports to userspace during journal recovery because it is using asynchronous journalling and so much of the pending in memory change was lost, even though change events were reported to userspace.... Hence doing stuff like "fanotify intents" doesn't actually solve any "missing event" problems - it just creates more complex coherency problems because you cannot co-ordinate "intent done" events with filesystem journal completion points sanely. The fanotify journal needs to change state atomically with the filesystem journal state, and that's not really something that can be done by a layer above the filesystem.... Hence the introduction of the bulkstat interface in XFS for fast, efficient scanning of all the inodes in the filesystem for changes. 
The HSMs -always- scanned the filesystems after an unclean mount (i.e. not having a registered unmount event recorded in the userspace application event database) because the filesystem and the userspace database state are never in sync after a crash event. And, well, because userspace applications can crash and/or have bugs that lose events, these HSMs would also do periodic scans to determine if it had missed anything. When you are indexing hundreds of millions of files and many petabytes of storage across disk and tape (this was the typical scale of DMAPI installations in the mid 2000s), you care about proactively catching that one event that was missed because of a transient memory allocation event under heavy production load.... > Would you be willing to accept the "pragmatic" option? It doesn't really seem "pragmatic" to me. It looks "convenient" for fanotify to redefine what f_fsid means, but I'm not convinced that we should be changing another UAPI to paper over a sub-optimal UAPI in relatively new and largely unused fanotify functionality.... Cheers, Dave.
On Wed, Mar 24, 2021 at 11:18:36AM +0200, Amir Goldstein wrote: > On Wed, Mar 24, 2021 at 9:43 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > > > This also means that userspace can be entirely filesystem agnostic > > > > and it doesn't need to rely on parsing proc files to translate > > > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > > > stable enough that it doesn't get the destination wrong. It also > > > > means that fanotify UAPI probably no longer needs to supply a > > > > f_fsid with the filehandle because it is built into the > > > > filehandle.... > > > > > > > > > > That is one option. Let's call it the "bullet proof" option. > > > > > > Another option, let's call it the "pragmatic" option, is that you accept > > > that my patch shouldn't break anything and agree to apply it. > > > > Your patch may very well break something. Most Linux file systems do > > store the dev_t in the fsid and userspace may for whatever silly > > reasons depend on it. > > > > I acknowledge that. > I do not claim that my change carries zero risk of breakage. > However, if such userspace dependency exists, it would break on ext4, > btrfs, ocfs2, ceph and many more fs, so it would have to be a > dependency that is tightly coupled with a specific fs. > The probability of that is rather low IMO. > > I propose an opt-in mount option "-o fixed_fsid" for this behavior to make > everyone sleep better. Layering hacks on top of hacks to avoid fixing the fanotify UAPI limitations isn't a very palatable option. Especially those that require adding mount options we'll have to support forever more... Cheers, Dave.
> How do you check the identity of a file handle without a filesystem > identifier in it? having an external f_fsid isn't really sufficient > - you need to query the filesystem for its identifier and determine > if the file handle contains that same identifier... > > > Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) > > and fstatfs() there are none of the races you mention below and > > fanotify obviously captures a valid {fsid,fhandle} tuple. > > Except that fsid is not guaranteed to be the same across mounts, let > alone reboots. There is no guarantee of uniqueness, either. IOWs, in > the bigger picture, f_fsid isn't something that can provide a > guaranteed answer to your "is $obj the same as it was at $TIME"... > > You can't even infer a path from the fsid, even if it is unique. The > fsid doesn't tell you what *mount point* it refers to. i.e. in > the presence of bind mounts, there can be multiple disjoint directory > hierarchies that correspond to the same fsid... > This conversation is not converging. Partly because I may have failed to explain the requirements. I am telling you that I need to do X and you are telling me that I cannot do Y. I did not say that I needed to infer the path of an object nor that I cared which mount point the object was observed from. The application is able to cope with not being able to correctly resolve fid to path, but it is "inconvenient". In technical terms, "inconvenient" translates to the "cost" of IO and time to re-crawl the filesystem. I hear your main arguments against re-purposing f_fsid and potential breakage of applications loud and clear. On their own, they are perfectly valid and acceptable reasons to NACK my patch, so I am not going to pursue it. OTOH, I find your argument to leave f_fsid "less useful than it can be" because "it is not perfect", far less convincing, but it doesn't look like I am going to change your mind, so no point in arguing about that. [...] 
> > If the listener has several objects open (e.g. tail -f A B C) then when getting > > an event, the identifier can be used to match the open file with certainty > > (having verified no collisions of identifiers after opening the files). > > Sorry, you've lost me. How do you reliably match a {fsid, fhandle} > to an open file descriptor? You've got to have more information > available than just a fd, fsid and a fhandle... > This was just an example of the simple case where the listener is started after opening the files and capturing the {fsid, fhandle} tuple from fd. The events received by the listener contain {fsid, fhandle} tuples that should match one of the captured fids. For this example, the stable fsid is irrelevant. > > If the listener is watching multiple directories (e.g. inotifywatch --recursive) > > then the listener has two options: > > 1. Keep open fds for all watched dirs - this is what inotify_add_watch() > > does internally (not fds per-se but keeping an elevated i_count) > > 2. Keep fid->path map for all watched dirs and accept the fact that the > > cached path information may be stale > > > > The 2nd option is valid for applications that use the events as hints > > to take action. An indexer application, for example, doesn't care if > > it will scan a directory where there were no changes as long as it will > > get the correct hint eventually. > > > > So if an indexer application were to act on FAN_MOVE events by > > scanning the entire subtree under the parent dir where an entry was > > renamed, the index will be eventually consistent, regardless of all > > the events on objects with stale path cache that may have been > > received after the rename. > > Sure, but you don't need a file handle for this. you can just scan > the directory hierarchy any time you get a notification for that > fsid. Even if you have multiple directory hierarchies that you > are watching on a given mount. 
> > I'm -guessing- that you are using the filehandle to differentiate > between different watched hierarchies, and you do that by taking a > name_to_handle_at() snapshot of the path when the watch is set, yes? > More or less, yes. > AFAICT, the application cannot care about whether it loses > events across reboot because the indexer already needs to scan after > boot so that it is coherent with whatever state the filesystem is > in after recovery. > That is absolutely right. But a listener can watch several mounted filesystems on a live system and mounts can come and go (this is standard for e.g. an AV scanner). When a filesystem is mounted, the application cannot KNOW that the filesystem was not mounted somewhere else and modified without supervision, so the safest thing would be a full crawl. But users should be able to configure that they trust that the filesystem was not mounted elsewhere and that they trust that the listener has subscribed to watch for changes, before untrusted users got access to make modifications in the watched filesystem. Under those specific circumstances, it is "inconvenient" and potentially very expensive for the identity of the filesystem to change across a mount cycle (e.g. when using loop mount). That said, a good application can use libblkid to overcome the "inconvenience" and detect fsid changes. The argument of "many other filesystem have unstable fsid" is not relevant. The users making all the assumptions above would know what filesystem is used, and if they know that the filesystem's fsid can be trusted to be stable (because they read the code and/or documentation), they can use that knowledge to improve their system. [...] > > That is one option. Let's call it the "bullet proof" option. > > "Reliable". "Provides well defined behaviour". "Guarantees". > Yes, and also "Application is able to do the right thing without admin configuration". 
> > Another option, let's call it the "pragmatic" option, is that you accept > > that my patch shouldn't break anything and agree to apply it. > > "shouldn't break anything" is the problem. You can assert all you > want that nothing will break, but history tells us that even the > most benign UAPI changes can break unexpected stuff in unexpected > ways. > No arguments here. My only claim was that risk is not high. > That's the fundamental problem. We *know* that what you are trying > to do with filehandles and fsid has -explicit, well known- issues. > What we have here is a new interface that > is .... problematic, and now it needs to redefine other parts > of the UAPI to make "problems" with the new interface go away. > > Yes, we really suck at APIs, but that doesn't mean hacking a UAPI > around to work around problems in another UAPI is the right answer. > All very very true. And yet, those very true words mask the fact that the change that I proposed, IMO, slightly improves an existing UAPI and aligns xfs behavior with the behavior of other "prominent" Linux local filesystems. It's really a matter of perspective how we choose to present it. > > In that case, a future indexer (or whatever) application author can use > > fanotify, name_to_handle_at() and fstatfs() as is and document that after > > mount cycle, the indexer may get confused and miss changes in obscure > > filesystems that nobody uses on desktops and servers. > > But anyone wanting to use this for an HSM style application that > needs a guarantee that the filehandle can be resolved to the correct > filesystem for open_by_handle_at() is SOL? > > IOWs, this proposal is not really fixing the underlying problem, > it's just kicking the can down the road. > It's not a kick, it's just a nudge ;-) And I am not abandoning the HSM users. It's just a much bigger project that also involves persistent change intent journal. 
> > The third option, let's call it the "sad" option, is that we do nothing > > and said future indexer application author will need to find ways to > > work around this deficiency or document that after mount cycle, the > > indexer may get confused and miss changes in commonly used > > desktop and server filesystems (i.e. XFS). > > Which it already needs to do, because there are many, many > filesystems out there that have f_fsid that change on every mount. > Not any of the filesystems that matter to desktop/server users. IOW, no other filesystem that comes as the default fs of any major distro. > > <side bar> > > I think that what indexer author would really want is not "persistent fsid" > > but rather a "persistent change journal" [1]. > > I have not abandoned this effort and I have a POC [2] for a new fsnotify > > backend (not fanotify) based on inputs that also you provided in LSFMM. > > In this POC, which is temporarily reusing the code of overlayfs index, > > the persistent identifier of an object is {s_uuid,fhandle}. > > </side bar> > > Yup, that's pretty much what HSMs on top of DMAPI did. But then the > app developers realised that they can still miss events, especially > when the system crashes. Not to mention that the filesystem may not > actually replay all the changes it reports to userspace during > journal recovery because it is using asynchronous journalling and so > much of the pending in memory change was lost, even though change > events were reported to userspace.... > Right... > Hence doing stuff like "fanotify intents" doesn't actually solve any > "missing event" problems - it just creates more complex coherency > problems because you cannot co-ordinate "intent done" events with > filesystem journal completion points sanely. The fanotify journal > needs to change state atomically with the filesystem journal state, > and that's not really something that can be done by a layer above > the filesystem.... 
> Surely, a filesystem can do it more efficiently internally, but perhaps a generic intent journal subsystem can be done...? My POC is a kernel subsystem (not fanotify). Intents are created in {mnt_want,{sb,file}_start}_write() call sites *before* fs freeze lock. When configured correctly, the backend store of the "modify intent" map is on the same filesystem that is being watched for changes and the ONLY information contained in the "modify intents" is the {uuid,fhandle} tuple of directories wherein modifications may have occurred. The "modify intent" records themselves are also metadata (directory entries), so after a crash, if there is no "modify intent" record for any given directory, it should be safe to assume that there were no modifications made under that directory. This assertion should hold also with crashes that "roll back" changes. My POC application uses that "map"/"index" of modified directories to perform a "pruned tree scan" after reboot. Do you see any flaws in this design? > Hence the introduction of the bulkstat interface in XFS for fast, > efficient scanning of all the inodes in the filesystem for changes. > The HSMs -always- scanned the filesystems after an unclean mount > (i.e. not having a registered unmount event recorded in the > userspace application event database) because the filesystem and the > userspace database state are never in sync after a crash event. > > And, well, because userspace applications can crash and/or have bugs > that lose events, these HSMs would also do periodic scans to > determine if it had missed anything. When you are indexing hundreds > of millions of files and many petabytes of storage across disk and > tape (this was the typical scale of DMAPI installations in the mid > 2000s), you care about proactively catching that one event that was > missed because of a transient memory allocation event under heavy > production load.... > Yes, I know all about that. 
The game is trying to reduce the number of cases when an fs scan is needed and reduce the cost of the scan to a minimum. I'll be happy to share more information about my POC if anyone asks and/or in LSFMM when/if that eventually happens. w.r.t $SUBJECT, if nobody else shares my POV that this MAY be a beneficial and benign change that is worth doing regardless of fanotify, I am not going to argue for it more. Thanks, Amir.
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > should be using something common across all filesystems from the > > linux superblock, not deep dark internal filesystem magic. The > > export interfaces that generate VFS (and NFS) filehandles already > > have a persistent fsid associated with them, which may in fact be > > the filesystem UUID in it's entirety. > > > > Yes, nfsd is using dark internal and AFAIK undocumented magic to > pick that identifier (Bruce, am I wrong?). Sorry, I kept putting off catching up with this thread and only now noticed the question. It's actually done mostly in userspace (rpc.mountd), so "dark internal" might not be fair, but it is rather complicated. There are several options (UUID, device number, number provided by the user with fsid= option), and I don't recall the logic behind which we use when. I don't *think* we have good comprehensive documentation of it anywhere. I wish we did. It'd take a little time to put together. Starting points would be linux/fs/nfsd/nfsfh.c and nfs-utils/support/export/cache.c. --b.
I've been reading through this whole thread, and it appears to me that the only real, long-term solution is to rely on file system UUIDs (for those file systems that support real 128-bit UUIDs), and optionally, for those file systems which support it, a new, "snapshot" or "fs-instance" UUID. The FS UUID is pretty simple; we just need an ioctl (or similar interface) which returns the UUID for a particular file system. The Snapshot UUID is the one which is a bit trickier. If the underlying block device can supply something unique --- for example, the WWN or WWID which is defined by FC, ATA, SATA, SCSI, NVMe, etc. --- then that plus a partition identifier could be hashed to form a Snapshot UUID. LVM volumes have a LV UUID that could be used for a similar purpose. If you have a block device which doesn't reliably provide a WWN or WWID, we could imagine that a file system has a field in the superblock, and a file system specific program could be used to set that field to a random UUID, and that becomes part of the snapshot process. This may be problematic for remote/network file systems which don't have such a concept, but life's a bitch.... With that, then userspace can fetch the st_dev, st_ino, the FS UUID, and the Snapshot UUID, and use some combination of those fields (as available) to try to determine whether two objects are unique or not. Is this perfect? Heck, no. But ultimately, this is a hard problem, and trying to wave our hands and create something that works given one set of assumptions --- and completely breaks in a different operating environment --- is going to lead to angry users blaming the fs developers. It's a messy problem, and I think all we can do is expose the entire mess to userspace, and make it be a userspace problem. That way, the angry users can blame the userspace developers instead. :-) - Ted
On Fri, Mar 26, 2021 at 10:15 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > should be using something common across all filesystems from the
> > > linux superblock, not deep dark internal filesystem magic. The
> > > export interfaces that generate VFS (and NFS) filehandles already
> > > have a persistent fsid associated with them, which may in fact be
> > > the filesystem UUID in it's entirety.
> > >
> >
> > Yes, nfsd is using dark internal and AFAIK undocumented magic to
> > pick that identifier (Bruce, am I wrong?).
>
> Sorry, I kept putting off catching up with this thread and only now
> noticed the question.
>
> It's actually done mostly in userspace (rpc.mountd), so "dark internal"
> might not be fair, but it is rather complicated.  There are several
> options (UUID, device number, number provided by the user with fsid=
> option), and I don't recall the logic behind which we use when.
>

I'll take back "dark internal" then and replace it with "light
external" ;-) which is also a problem.

If userspace is involved in declaring the id of the *export*, then from
the NFS client's POV that is not a problem, but from the fsnotify POV,
that identifier cannot be determined when an event happens on an inode
NOT via the NFS client.

As a matter of fact, the fanotify requirements about fsid are even more
strict than being able to get the fsid from the inode.
From fanotify_mark.2:

"
EXDEV  The filesystem object indicated by pathname resides within a
       filesystem subvolume (e.g., btrfs(5)) which uses a different
       fsid than its root superblock...
"

> I don't *think* we have good comprehensive documentation of it anywhere.
> I wish we did.  It'd take a little time to put together.  Starting
> points would be linux/fs/nfsd/nfsfh.c and
> nfs-utils/support/export/cache.c.
At least as far as fanotify is concerned, the documentation is not
going to matter.  The only thing needed is an fsid value that is
queryable via a userspace API.  f_fsid meets this criteria, which is
why it was chosen for fanotify.

Having the fsid reported by fanotify also be stable is a nice-to-have
feature for very selective use cases, which is why I posted this xfs
patch.

Thanks,
Amir.
On Sat, Mar 27, 2021 at 1:34 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I've been reading through this whole thread, and it appears to me that
> the only real, long-term solution is to rely on file system UUIDs
> (for those file systems that support real 128-bit UUIDs), and
> optionally, for those file systems which support it, a new, "snapshot"
> or "fs-instance" UUID.
>
> The FS UUID is pretty simple; we just need an ioctl (or similar
> interface) which returns the UUID for a particular file system.
>
> The Snapshot UUID is the one which is a bit trickier.  If the
> underlying block device can supply something unique --- for example,
> the WWN or WWID which is defined by FC, ATA, SATA, SCSI, NVMe, etc. ---
> then that plus a partition identifier could be hashed to form a
> Snapshot UUID.  LVM volumes have an LV UUID that could be used for a
> similar purpose.
>
> If you have a block device which doesn't reliably provide a WWN or
> WWID, we could imagine that a file system has a field in the
> superblock, and a file system specific program could be used to set
> that field to a random UUID, so that this becomes part of the snapshot
> process.  This may be problematic for remote/network file systems
> which don't have such a concept, but life's a bitch....
>
> With that, userspace can fetch the st_dev, st_ino, the FS UUID,
> and the Snapshot UUID, and use some combination of those fields (as
> available) to determine whether two objects are unique or not.
>
> Is this perfect?  Heck, no.  But ultimately, this is a hard problem,
> and trying to wave our hands and create something that works given one
> set of assumptions --- and completely breaks in a different operating
> environment --- is going to lead to angry users blaming the fs
> developers.  It's a messy problem, and I think all we can do is expose
> the entire mess to userspace, and make it be a userspace problem.
> That way, the angry users can blame the userspace developers
> instead.  :-)
Sounds like a plan ;-)

FWIW, if and when we have a standard userspace API (fsinfo()?) to
export the fs instance uuid to userspace, fanotify can use the uuid
instead of fsid when available (opt-in via a new fanotify_init() flag).

The fanotify_event_info_header contains an "info_type" field, so it's
not a problem for some events to report fsid (as today) and for other
events to report uuid, depending on availability of the information
per filesystem.

Thanks,
Amir.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e5e0713bebcd..37f8417b78c4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -790,7 +790,7 @@ xfs_fs_statfs(
 	struct xfs_mount	*mp = XFS_M(dentry->d_sb);
 	xfs_sb_t		*sbp = &mp->m_sb;
 	struct xfs_inode	*ip = XFS_I(d_inode(dentry));
-	uint64_t		fakeinos, id;
+	uint64_t		fakeinos;
 	uint64_t		icount;
 	uint64_t		ifree;
 	uint64_t		fdblocks;
@@ -800,8 +800,8 @@ xfs_fs_statfs(
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 
-	id = huge_encode_dev(mp->m_ddev_targp->bt_dev);
-	statp->f_fsid = u64_to_fsid(id);
+	statp->f_fsid.val[0] = mp->m_fixedfsid[0];
+	statp->f_fsid.val[1] = mp->m_fixedfsid[1];
 
 	icount = percpu_counter_sum(&mp->m_icount);
 	ifree = percpu_counter_sum(&mp->m_ifree);
Some filesystems on a persistent storage backend use a digest of the
filesystem's persistent uuid as the value for f_fsid returned by
statfs(2).

xfs, like many other filesystems, provides the non-persistent block
device number as the value of f_fsid.

Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID
for identifying objects using file_handle and f_fsid in events.

The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an
fsid to exported file handles, but it is not the same fsid exported
via statfs(2) - it is a persistent fsid based on the filesystem's uuid.

Use the same persistent value for f_fsid, so object identifiers in
fanotify events will describe the objects more uniquely.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Guys,

This change would be useful for fanotify users.
Do you see any problems with that minor change of uapi?

The way I see it, right now the f_fsid of an xfs filesystem can change
on reboots.  With this change, it will change once more and never
again.

I did not find any kernel internal user other than fanotify, and as
for userland expectations, there should not be many expectations about
the value of f_fsid, as it is persistent for some filesystems and the
bdev number for others - there is no standard.

Thanks,
Amir.

 fs/xfs/xfs_super.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)