Message ID: 20210322171118.446536-1-amir73il@gmail.com (mailing list archive)
State: New, archived
Series: xfs: use a unique and persistent value for f_fsid
On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
>
> xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
>
> Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.

The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....

> The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.

To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but can't use it for anything useful, i.e. its use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and it was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...

IMO, if fanotify needs a persistent filesystem ID on Linux, it should be using something common across all filesystems from the Linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in its entirety. The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to, because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....

> Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.

It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes. However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...

And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....

> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Guys,
>
> This change would be useful for fanotify users.
> Do you see any problems with that minor change of uapi?

Yes.

IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem-specific behaviour of a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on, and it can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...

Cheers, Dave.
On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> >
> > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> >
> > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
>
> The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....

Let's take it from the start. There is no requirement for fanotify to get a persistent fs id; we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.

See quote from fanotify.7:

" The fields of the fanotify_event_info_fid structure are as follows:
...
fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).

file_handle
       This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
..."

So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch), and a secondary objective is to resolve a path from the identifier, which requires extra privileges.

This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while the filesystem is mounted should suffice, as that is also the lifetime of the fanotify mark.
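[Editor's sketch: the fanotify_event_info_fid record quoted from fanotify.7 above pairs an f_fsid with a file_handle in each event. This toy parser shows how a listener might split one such record; the layout is simplified from the linux/fanotify.h and fcntl.h uapi headers, and the packing/offsets here are an illustration under that assumption, not an authoritative definition.]

```python
import struct

# Simplified layout of struct fanotify_event_info_fid (linux/fanotify.h):
#   struct fanotify_event_info_header { __u8 info_type; __u8 pad; __u16 len; }
#   __kernel_fsid_t fsid;          /* two 32-bit words, same value as statfs(2) f_fsid */
#   struct file_handle {           /* as used by open_by_handle_at(2) */
#       __u32 handle_bytes; int handle_type; unsigned char f_handle[]; }
INFO_HDR = struct.Struct("<BBH")   # info_type, pad, len
FSID = struct.Struct("<ii")        # __kernel_fsid_t
FH_HDR = struct.Struct("<Ii")      # handle_bytes, handle_type

def parse_info_fid(buf):
    """Split one FAN_EVENT_INFO_TYPE_FID record into (fsid, handle_type, f_handle)."""
    info_type, _pad, _len = INFO_HDR.unpack_from(buf, 0)
    fsid = FSID.unpack_from(buf, INFO_HDR.size)
    off = INFO_HDR.size + FSID.size
    handle_bytes, handle_type = FH_HDR.unpack_from(buf, off)
    f_handle = buf[off + FH_HDR.size : off + FH_HDR.size + handle_bytes]
    return fsid, handle_type, f_handle
```

The fsid half is only meaningful to a listener that has sampled f_fsid via statfs(2) beforehand, which is exactly the usage pattern being discussed in this thread.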
But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per se that xfs filesystems can change and reuse f_fsid; it is just less friendly, that's all.

I am trying to understand your objection to making this "friendly" change.

> > The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.
>
> To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but it can't use it for anything useful.

It is a unique identifier and that is a useful thing - see demo code:
* Index watches by fanotify fid: https://github.com/amir73il/inotify-tools/commit/ed82098b54b847e3c2d46b32d35b2aa537a9ba94
* Handle global watches on several filesystems: https://github.com/amir73il/inotify-tools/commit/1188ef00dc84964de58afb32c91e19930ad1e2e8

> i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...

Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local filesystems. This is not exposing anything new at all. I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.

> IMO, if fanotify needs a persistent filesystem ID on Linux, it

It does not *need* that. It's just nicer for f_fsid to use a persistent fs identifier if one is anyway available.

> should be using something common across all filesystems from the linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in it's entirety.

Yes, nfsd is using dark internal and AFAIK undocumented magic to pick that identifier (Bruce, am I wrong?).

> The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....

nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.

> > Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.
>
> It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes.

Yes. I know. But it's still much better than the bdev number, IMO.

> However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
>
> And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....

Both of the caveats of uuid you mentioned are not a big concern for fanotify, because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).

> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Guys,
> >
> > This change would be useful for fanotify users.
> > Do you see any problems with that minor change of uapi?
>
> Yes.
>
> IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...

The fanotify uapi guarantee is to provide the same value of f_fsid observed by the statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:

"...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."

Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from the uuid?

One thing we could do is, in the "-o nouuid" case that you mentioned, continue to use the bdev number for f_fsid. Would you like me to make that change?

Thanks, Amir.
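[Editor's sketch: the "sample f_fsid before setting the multi-fs watch" approach described above can be demonstrated with plain statvfs(3). Python's os.statvfs exposes f_fsid since Python 3.7 (glibc derives it from the statfs(2) f_fsid); this helper is an illustration of the collision check a listener could do, not code from the patch.]

```python
import os

def fsid_collisions(paths):
    """Sample f_fsid for each mount point up front and report any collisions
    a multi-filesystem fanotify listener would have to work around (e.g. two
    "-o nouuid" snapshot mounts reporting the same uuid-derived fsid)."""
    by_fsid = {}
    for p in paths:
        fsid = os.statvfs(p).f_fsid   # f_fsid attribute: Python >= 3.7
        by_fsid.setdefault(fsid, []).append(p)
    # Only fsids claimed by more than one path are ambiguous.
    return {fsid: ps for fsid, ps in by_fsid.items() if len(ps) > 1}
```

A listener that finds a non-empty result can fall back to another strategy, such as the separate-listener-per-filesystem approach mentioned later in the thread.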
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > >
> > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > >
> > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> >
> > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
>
> Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.
>
> See quote from fanotify.7:
>
> " The fields of the fanotify_event_info_fid structure are as follows:
> ...
> fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
>
> file_handle
>        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).

Hmmmm.... so I guess you'd /like/ a file handle that will survive across unmount/mount cycles, and possibly even a reboot?

I looked at the first commit, and I guess you use name_to_handle_at, which returns a mount_id that is.... that weird number in the leftmost column of /proc/mountinfo, which increments monotonically for each mount and definitely doesn't survive a remount cycle, let alone a reboot? Hence wanting to use something less volatile than mnt_id_ida...?

My natural inclination is "just use whatever NFS does", but ... then I saw fh_compose and realized that the fsid part of an NFS handle depends on the export options and probably isn't all that easy for someone who isn't an nfs client to extract. Was that how you arrived at using the statfs fsid field?

...except XFS doesn't guarantee that fsid is particularly unique or stable, since a reboot can enumerate blockdevs in a different order and hence the dev_t will change. UUIDs also aren't a great idea because you can snapshot an fs and mount it with nouuid, and now a "unique" file handle can map ambiguously to two different files.

Urgh, I'm gonna have to think about this one, all the options suck. fanotify might be smart enough to handle ambiguous file handles but now I wonder how dumb programs react to that.

Also it's 23:30 here. :)

--D

> ..."
>
> So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch) and a secondary objective is to resolve a path from the identifier, which requires extra privileges.
>
> This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while filesystem is mounted should suffice as that is also the lifetime of the fanotify mark.
>
> But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per-se that xfs filesystems can change and reuse f_fsid, it is just less friendly that's all.
>
> I am trying to understand your objection to making this "friendly" change.
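[Editor's sketch: the mount_id that name_to_handle_at(2) fills in, and the opaque handle that carries no fsid of its own, can both be inspected from unprivileged userspace. This ctypes sketch uses the glibc wrapper and the struct file_handle layout from the open_by_handle_at(2) man page; it is illustrative only, and reading a handle back with open_by_handle_at(2) would additionally require CAP_DAC_READ_SEARCH.]

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)
AT_FDCWD = -100        # from fcntl.h
MAX_HANDLE_SZ = 128    # from the open_by_handle_at(2) man page

class FileHandle(ctypes.Structure):
    # struct file_handle: handle_bytes, handle_type, then opaque f_handle[]
    _fields_ = [("handle_bytes", ctypes.c_uint),
                ("handle_type", ctypes.c_int),
                ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ)]

def path_to_handle(path):
    """Return (mount_id, opaque handle bytes) for path via name_to_handle_at(2).
    The mount_id is the volatile /proc/mountinfo number discussed above; the
    handle bytes identify the object only within its filesystem, which is why
    fanotify pairs the handle with f_fsid."""
    fh = FileHandle(handle_bytes=MAX_HANDLE_SZ)
    mount_id = ctypes.c_int()
    if libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                              ctypes.byref(fh), ctypes.byref(mount_id), 0):
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return mount_id.value, bytes(fh.f_handle[:fh.handle_bytes])
```

Running this across a remount of the same filesystem shows the handle bytes staying stable while the mount_id changes, which is the volatility Darrick is pointing at.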
> > > The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an fsid to exported file handles, but it is not the same fsid exported via statfs(2) - it is a persistent fsid based on the filesystem's uuid.
> >
> > To actually use that {fshandle,fhandle} tuple for anything requires CAP_SYS_ADMIN. A user can read the fshandle, but it can't use it for anything useful.
>
> It is a unique identifier and that is a useful thing - see demo code:
> * Index watches by fanotify fid: https://github.com/amir73il/inotify-tools/commit/ed82098b54b847e3c2d46b32d35b2aa537a9ba94
> * Handle global watches on several filesystems: https://github.com/amir73il/inotify-tools/commit/1188ef00dc84964de58afb32c91e19930ad1e2e8
>
> > i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...
>
> Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local fs. This is not exposing anything new at all. I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.
>
> > IMO, if fanotify needs a persistent filesystem ID on Linux, it
>
> It does not *need* that. It's just nicer for f_fsid to use a persistent fs identifier if one is anyway available.
>
> > should be using something common across all filesystems from the linux superblock, not deep dark internal filesystem magic. The export interfaces that generate VFS (and NFS) filehandles already have a persistent fsid associated with them, which may in fact be the filesystem UUID in it's entirety.
>
> Yes, nfsd is using dark internal and AFAIK undocumented magic to pick that identifier (Bruce, am I wrong?).
>
> > The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....
>
> nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.
>
> > > Use the same persistent value for f_fsid, so object identifiers in fanotify events will describe the objects more uniquely.
> >
> > It's not persistent as in "will never change". The moment a user changes the XFS filesystem uuid, the f_fsid changes.
>
> Yes. I know. But it's still much better than the bdev number IMO.
>
> > However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
> >
> > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....
>
> Both of the caveats of uuid you mentioned are not a big concern for fanotify because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).
>
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >
> > > Guys,
> > >
> > > This change would be useful for fanotify users.
> > > Do you see any problems with that minor change of uapi?
> >
> > Yes.
> >
> > IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...
>
> The fanotify uapi guarantee is to provide the same value of f_fsid observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:
>
> "...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."
>
> Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from uuid?
>
> One thing we could do is in the "-o nouuid" case that you mentioned we continue to use the bdev number for f_fsid.
> Would you like me to make that change?
>
> Thanks,
> Amir.
On Tue, Mar 23, 2021 at 8:35 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > > >
> > > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > > >
> > > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> > >
> > > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
> >
> > Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.
> >
> > See quote from fanotify.7:
> >
> > " The fields of the fanotify_event_info_fid structure are as follows:
> > ...
> > fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
> >
> > file_handle
> >        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
>
> Hmmmm.... so I guess you'd /like/ a file handle that will survive across unmount/mount cycles, and possibly even a reboot?
>
> I looked at the first commit, and I guess you use name_to_handle_at, which returns a mount_id that is.... that weird number in the leftmost column of /proc/mountinfo, which increments monotonically for each mount and definitely doesn't survive a remount cycle, let alone a reboot?
>
> Hence wanting to use something less volatile than mnt_id_ida...?
>
> My natural inclination is "just use whatever NFS does", but ... then I saw fh_compose and realized that the fsid part of an NFS handle depends on the export options and probably isn't all that easy for someone who isn't an nfs client to extract.
>
> Was that how you arrived at using the statfs fsid field?

Yes. Exactly.

> ...except XFS doesn't guarantee that fsid is particularly unique or stable, since a reboot can enumerate blockdevs in a different order and hence the dev_t will change. UUIDs also aren't a great idea because you can snapshot an fs and mount it with nouuid, and now a "unique" file handle can map ambiguously to two different files.

As I explained in my reply to Dave, that's not a big issue. If a program wants to listen on events from multiple filesystems, the program will sample the f_fsid of both filesystems before setting up the filesystem watches. If there is a collision, there are other ways to handle this case (i.e. run two separate listener programs).

Also, as I wrote to Dave, I can easily handle the special case of "-o nouuid" by leaving the bdev number for f_fsid in that case.

> Urgh, I'm gonna have to think about this one, all the options suck. fanotify might be smart enough to handle ambiguous file handles but now

Actually, fanotify is completely dumb about this; it's the user of fanotify that needs to be able to do something useful with this information. There is a demo I wrote based on inotifywatch that demonstrates that [1].

Thanks, Amir.

[1] https://github.com/amir73il/inotify-tools/commits/fanotify_name_fid
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote:
> > > Some filesystems on persistent storage backend use a digest of the filesystem's persistent uuid as the value for f_fsid returned by statfs(2).
> > >
> > > xfs, like many other filesystems, provides the non-persistent block device number as the value of f_fsid.
> > >
> > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using file_handle and f_fsid in events.
> >
> > The filesystem id is encoded into the VFS filehandle - it does not need some special external identifier to identify the filesystem it belongs to....
>
> Let's take it from the start. There is no requirement for fanotify to get a persistent fs id, we just need a unique fs id that is known to userspace, so the statfs API is good enough for our needs.

So why change the code then? If it ain't broke, don't fix it...

> See quote from fanotify.7:
>
> " The fields of the fanotify_event_info_fid structure are as follows:
> ...
> fsid   This is a unique identifier of the filesystem containing the object associated with the event. It is a structure of type __kernel_fsid_t and contains the same value as f_fsid when calling statfs(2).
>
> file_handle
>        This is a variable length structure of type struct file_handle. It is an opaque handle that corresponds to a specified object on a filesystem as returned by name_to_handle_at(2). It can be used to uniquely identify a file on a filesystem and can be passed as an argument to open_by_handle_at(2).
> ..."
>
> So the main objective is to "uniquely identify an object" which was observed before (i.e. at the time of setting a watch) and a secondary objective is to resolve a path from the identifier, which requires extra privileges.
>
> This definition does not specify the lifetime of the identifier and indeed, in most cases, uniqueness in the system while filesystem is mounted should suffice as that is also the lifetime of the fanotify mark.
>
> But the fanotify group can outlive the mounted filesystem and it can be used to watch multiple filesystems. It's not really a problem per-se that xfs filesystems can change and reuse f_fsid, it is just less friendly that's all.
>
> I am trying to understand your objection to making this "friendly" change.

"friendly" isn't a useful way to describe whether a change is desirable or whether code works correctly or not. If it's broken or not fit for purpose, it doesn't matter how "friendly" it might be - it's still broken...

I'm asking why using a device encoding is a problem, because it will not change across mount/unmount cycles as the backing device doesn't change. It *may* change across reboots, but then what does fanotify care about in that case? Something has to re-establish all watches from scratch at that point, so who cares if the fsid has changed?

So what problem is "persistence" across reboots solving? You say it's "friendly" but that has no technical definition I know of....

> > i.e. it's use is entirely isolated to the file handle interface for identifying the filesystem the handle belongs to. This is messy, but XFS inherited this "fixed fsid" interface from Irix filehandles and was needed to port xfsdump/xfsrestore to Linux. Realistically, it is not functionality that should be duplicated/exposed more widely on Linux...
>
> Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 and many more. XFS is really the exception among the big local fs. This is not exposing anything new at all.

I'm not suggesting that it is. I'm asking you to explain what the problem it solves is, so I have the context necessary to evaluate what the impact of making such a userspace-visible change might be....

> I would say it is more similar to the way that the generation part of the file handle has improved over the years in different filesystems to provide better uniqueness guarantees.

No, it's most definitely not. Userspace never sees the inode generation number. The inode generation can only be changed when an inode goes through a life cycle. If you change it in the middle of a referenced inode's life, then you invalidate valid file handles (i.e. ESTALE) for no good reason. But we can change the way we modify the generation number when the inode cycles out of existence without any impact on external APIs, because the change of generation number is necessary to invalidate the filehandle in that case.

So the way we have changed generation number selection over time does not impact the actual users of file handles - they still only get ESTALE when the inode is unlinked and is no longer accessible, and nobody even notices that we changed generation number selection algorithms.

> > The export-derived "filesystem ID" is what should be exported to userspace in combination with the file handle to identify the fs the handle belongs to because then you have consistent behaviour and a change that invalidates the filehandle will also invalidate the fshandle....
>
> nfsd has a much stronger persistent file handles requirement than fanotify. There is no need to make things more complicated than they need to be.

Userspace filehandles have exactly the same persistence requirements as NFS file handles. fanotify might not have the same requirements as NFS, but that doesn't change the persistence requirements for userspace handles....

> > However, changing the uuid on XFS is an offline (unmounted) operation, so there will be no fanotify marks present when it is changed. Hence when it is remounted, there will be a new f_fsid returned in statvfs(), just like what happens now, and all applications dependent on "persistent" fsids (and persistent filehandles for that matter) will now get ESTALE errors...
> >
> > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) are not unique if you've got snapshots and they've been mounted via "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of the reasons that the duplicate checking exists - so that fshandles are unique and resolve to a single filesystem....
>
> Both of the caveats of uuid you mentioned are not a big concern for fanotify because the nature of f_fsid can be understood by the event listener before setting the multi-fs watch (i.e. in case of fsid collision).

Sorry, I don't understand what "the nature of f_fsid can be understood" means. What meaning are you trying to infer from 8 bytes of opaque data in f_fsid?

> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >
> > > Guys,
> > >
> > > This change would be useful for fanotify users.
> > > Do you see any problems with that minor change of uapi?
> >
> > Yes.
> >
> > IMO, we shouldn't be making a syscall interface rely on the undefined, filesystem specific behaviour a value some other syscall exposes to userspace. This means the fsid has no defined or standardised behaviour applications can rely on and can't be guaranteed unique and unchanging by fanotify. This seems like a lose-lose situation for everyone...
>
> The fanotify uapi guarantee is to provide the same value of f_fsid observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is a bit vague, but it's good enough for our needs:
>
> "...The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type..."

Mixed messaging! You start by saying "f_fsid needs persistence", then say "it doesn't need persistence", then say "device number based f_fsid is sub-optimal", then say "the statfs() defined f_fsid is good enough" despite the fact it says "(a variation on) the device number" is a documented way of implementing it and you've said that is sub-optimal. I've got no clue what fanotify wants or needs from f_fsid now.

> Regardless of the fanotify uapi and whether it's good or bad, do you insist that the value of f_fsid exposed by xfs needs to be the bdev number and not derived from uuid?

I'm not insisting that it needs to be the bdev number. I'm trying to understand why it needs to be changed, what the impact of that change is, and whether there are other alternatives before I form an opinion on whether we should make this user visible filesystem identifier change or not...

> One thing we could do is in the "-o nouuid" case that you mentioned we continue to use the bdev number for f_fsid.
> Would you like me to make that change?

No, I need you to stop rushing around in a hurry to change code and assuming that everyone knows every little detail of fanotify and the problems that need to be solved. Slow down and explain clearly and concisely why f_fsid needs to be persistent, how it gets used to optimise <something> when it is persistent, and what the impact of it not being persistent is.

Cheers, Dave.
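[Editor's sketch: the generation-number lifecycle Dave describes above - gen only changes when an inode slot is recycled, which is exactly when old handles must start returning ESTALE - can be captured in a toy model. This is purely illustrative Python, not kernel code; the class and names are invented for the illustration.]

```python
import errno

class InodeTable:
    """Toy model: a file handle embeds (ino, gen); gen bumps only when an
    inode number is recycled, so handles never go stale mid-life and the
    generation selection algorithm can change without breaking users."""
    def __init__(self):
        self.live = {}     # ino -> current generation of the live inode
        self.nextgen = {}  # ino -> next generation to hand out on reuse

    def alloc(self, ino):
        gen = self.nextgen.get(ino, 0)
        self.nextgen[ino] = gen + 1
        self.live[ino] = gen
        return (ino, gen)          # the "file handle"

    def unlink(self, ino):
        del self.live[ino]

    def open_by_handle(self, handle):
        ino, gen = handle
        if self.live.get(ino) != gen:
            raise OSError(errno.ESTALE, "stale file handle")
        return ino
```

A handle taken before an unlink keeps working until the inode number is reused with a new generation, at which point it (and only it) gets ESTALE - the invariant Dave argues f_fsid changes do not share.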
On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Mar 22, 2021 at 07:11:18PM +0200, Amir Goldstein wrote: > > > > Some filesystems on persistent storage backend use a digest of the > > > > filesystem's persistent uuid as the value for f_fsid returned by > > > > statfs(2). > > > > > > > > xfs, as many other filesystem provide the non-persistent block device > > > > number as the value of f_fsid. > > > > > > > > Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID > > > > for identifying objects using file_handle and f_fsid in events. > > > > > > The filesystem id is encoded into the VFS filehandle - it does not > > > need some special external identifier to identify the filesystem it > > > belongs to.... > > > > > > > Let's take it from the start. > > There is no requirement for fanotify to get a persistent fs id, we just need > > a unique fs id that is known to userspace, so the statfs API is good enough > > for our needs. > > So why change the code then? If it ain't broke, don't fix it... > Fair enough. I am asking for a "nice to have" change. I will try to explain why. [...] > > I am trying to understand your objection to making this "friendly" change. > > "friendly" isn't a useful way to describe whether a change is > desirable or whether code works correctly or not. If it's broken or > not fit for purpose, it doesn't matter how "friendly" it might be - > it's still broken... > > I'm asking why using a device encoding is a problem, because > it will not change across mount/unmount cycles as the backing device > doesn't change. It *may* change across reboots, but then what does > fanotify care about in that case? Something has to re-establish all > watches from scratch at that point, so who cares if the fsid has > changed? > This is correct.
First of all, we need to acknowledge that the fanotify FAN_REPORT_FID interface was introduced in kernel v5.1 and I don't know of any specific users and how future users will be using it. > So what problem is "persistence" across reboots solving? You say > it's "friendly" but that has no technical definition I know of.... > For most use cases, getting a unique fsid that is not "persistent" would be fine. Many use cases will probably be watching a single filesystem and then the value of fsid in the event doesn't matter at all. If, however, at some point in the future, someone were to write a listener that stores events in a persistent queue for later processing it would be more "convenient" if fsid values were "persistent". I'm sorry, but it's hard for me to describe a property of fsid that is nice-to-have without using "convenient" and "friendly" terminology. Technically, there is a way for userland to get the same outcome without making this change in XFS, but I suspect it's not going to be important enough for anyone to care and the end result would be that the end users will suffer unpredictable behavior. > > > i.e. its use is entirely isolated to > > > the file handle interface for identifying the filesystem the handle > > > belongs to. This is messy, but XFS inherited this "fixed fsid" > > > interface from Irix filehandles and was needed to port > > > xfsdump/xfsrestore to Linux. Realistically, it is not functionality > > > that should be duplicated/exposed more widely on Linux... > > > > Other filesystems expose a uuid digest as f_fsid: ext4, btrfs, ocfs2 > > and many more. XFS is really the exception among the big local fs. > > This is not exposing anything new at all. > > I'm not suggesting that it is. I'm asking you to explain what the > problem it solves is so I have the context necessary to evaluate what the > impact of making such a userspace visible change might be.... > I understand.
As I tried to explain, this change does not solve any clear and immediate problem. I hope I was able to explain why it might be beneficial. [...] > > > However, changing the uuid on XFS is an offline (unmounted) > > > operation, so there will be no fanotify marks present when it is > > > changed. Hence when it is remounted, there will be a new f_fsid > > > returned in statvfs(), just like what happens now, and all > > > applications dependent on "persistent" fsids (and persistent > > > filehandles for that matter) will now get ESTALE errors... > > > > > > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) > > > are not unique if you've got snapshots and they've been mounted via > > > "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of > > > the reasons that the duplicate checking exists - so that fshandles > > > are unique and resolve to a single filesystem.... > > > > Both of the caveats of uuid you mentioned are not a big concern for > > fanotify because the nature of f_fsid can be understood by the event > > listener before setting the multi-fs watch (i.e. in case of fsid collision). > > Sorry, I don't understand what "the nature of f_fsid can be > understood" means. What meaning are you trying to infer from 8 bytes > of opaque data in f_fsid? > When the program is requested to watch multiple filesystems, it starts by querying their fsid. In case of an fsid collision, the program knows that it will not be able to tell which filesystem the event originated in, so the program can print a descriptive error to the user. [...] > > The fanotify uapi guarantee is to provide the same value of f_fsid > > observed by statfs() uapi. The statfs() uapi guarantee about f_fsid is > > a bit vague, but it's good enough for our needs: > > > > "...The general idea is that f_fsid contains some random stuff such that the > > pair (f_fsid,ino) uniquely determines a file. 
Some operating systems use > > (a variation on) the device number, or the device number combined with the > > filesystem type..." > > Mixed messaging! > > You start by saying "f_fsid needs persistence", then say "it doesn't > need persistence", then say "device number based f_fsid is > sub-optimal", then saying "the statfs() defined f_fsid is good > enough" despite the fact it says "(a variation on) the device > number" is a documented way of implementing it and you've said that > is sub-optimal. > > I've got no clue what fanotify wants or needs from f_fsid now. > fanotify has two requirements for f_fsid and they are both met for XFS. From fanotify_mark.2: " ENODEV The filesystem object indicated by pathname is not associated with a filesystem that supports fsid (e.g., tmpfs(5)). ... EXDEV The filesystem object indicated by pathname resides within a filesystem subvolume (e.g., btrfs(5)) which uses a different fsid than its root superblock. " The reason I am proposing this change in XFS is not for the needs of fanotify itself but for the needs of its future downstream users. We could defer the decision of making this change to that future time when and if users start complaining. XFS is certainly not the only filesystem that might need this sort of change to f_fsid.
> > No, I need you to stop rushing around in a hurry to change code and > assuming that everyone knows every little detail of fanotify and the > problems that need to be solved. Slow down and explain clearly and > concisely why f_fsid needs to be persistent, how it gets used to > optimise <something> when it is persistent, and what the impact of > it not being persistent is. > Fair enough. I'm in a position of disadvantage having no real users to request this change. XFS is certainly "not broken", so the argument for "not fixing" it is valid. Nevertheless, the bdev number can change on modern systems even without a reboot, for example for loop-mounted images, so please consider investing the time in forming an opinion about making this change for the sake of making f_fsid more meaningful for any caller of statfs(2) not only for fanotify listeners. Leaving fanotify out of the picture, the question that the prospective user is trying to answer is: "Is the object at $PATH or at $FD the same object that was observed at 'an earlier time'?" With XFS, that question can be answered (< 100% certainty) using the XFS_IOC_PATH_TO_FSHANDLE interface. name_to_handle_at(2) + statfs(2) is a generic interface that provides this answer with less certainty, but it could provide the answer with the same certainty for XFS. Thanks, Amir.
On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > For most use cases, getting a unique fsid that is not "persistent" > would be fine. Many use cases will probably be watching a single > filesystem and then the value of fsid in the event doesn't matter at all. > > If, however, at some point in the future, someone were to write > a listener that stores events in a persistent queue for later processing > it would be more "convenient" if fsid values were "persistent". Ok, that is what I suspected - that you want to write filehandles to a userspace database of some sort for later processing and so you need to also store the filesystem that the filehandle belongs to. FWIW, that's something XFS started doing a couple of decades ago for DMAPI based proprietary HSM implementations. They were built around a massive userspace database that indexed the contents of the filesystem via file handles and were kept up to date via notifications through the DMAPI interface. > > > > However, changing the uuid on XFS is an offline (unmounted) > > > > operation, so there will be no fanotify marks present when it is > > > > changed. Hence when it is remounted, there will be a new f_fsid > > > > returned in statvfs(), just like what happens now, and all > > > > applications dependent on "persistent" fsids (and persistent > > > > filehandles for that matter) will now get ESTALE errors... > > > > > > > > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general) > > > > are not unique if you've got snapshots and they've been mounted via > > > > "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of > > > > the reasons that the duplicate checking exists - so that fshandles > > > > are unique and resolve to a single filesystem....
> > > > > > Both of the caveats of uuid you mentioned are not a big concern for > > > fanotify because the nature of f_fsid can be understood by the event > > > listener before setting the multi-fs watch (i.e. in case of fsid collision). > > > > Sorry, I don't understand what "the nature of f_fsid can be > > understood" means. What meaning are you trying to infer from 8 bytes > > of opaque data in f_fsid? > > > > When the program is requested to watch multiple filesystems, it starts by > querying their fsid. In case of an fsid collision, the program knows that it > will not be able to tell which filesystem the event originated in, so the > program can print a descriptive error to the user. Ok, so it can handle collisions, but it cannot detect things like two filesystems swapping fsids because device ordering changed at boot time. i.e. there is no way to determine the difference between f_fsid change vs the same filesystems with stable f_fsid being mounted in different locations.... > > > Regardless of the fanotify uapi and whether it's good or bad, do you insist > > > that the value of f_fsid exposed by xfs needs to be the bdev number and > > > not derived from uuid? > > > > I'm not insisting that it needs to be the bdev number. I'm trying to > > understand why it needs to be changed, what the impact of that > > change is and whether there are other alternatives before I form an > > opinion on whether we should make this user visible filesystem > > identifier change or not... > > > > > One thing we could do is in the "-o nouuid" case that you mentioned > > > we continue to use the bdev number for f_fsid. > > > Would you like me to make that change? > > > > No, I need you to stop rushing around in a hurry to change code and > > assuming that everyone knows every little detail of fanotify and the > > problems that need to be solved.
> > Slow down and explain clearly and > > concisely why f_fsid needs to be persistent, how it gets used to > > optimise <something> when it is persistent, and what the impact of > > it not being persistent is. > > > > Fair enough. > I'm in a position of disadvantage having no real users to request this change. > XFS is certainly "not broken", so the argument for "not fixing" it is valid. > > Nevertheless, the bdev number can change on modern systems even without a reboot, > for example for loop-mounted images, so please consider investing the time > in forming an opinion about making this change for the sake of making f_fsid > more meaningful for any caller of statfs(2) not only for fanotify listeners. > > Leaving fanotify out of the picture, the question that the prospective user is > trying to answer is: > "Is the object at $PATH or at $FD the same object that was observed at > 'an earlier time'?" > > With XFS, that question can be answered (< 100% certainty) > using the XFS_IOC_PATH_TO_FSHANDLE interface. Actually, there's a bit more to it than that. See below. > name_to_handle_at(2) + statfs(2) is a generic interface that provides > this answer with less certainty, but it could provide the answer > with the same certainty for XFS. Let me see if I get this straight.... Because the VFS filehandle interface does not cater for this by giving you a fshandle that is persistent, you have to know what path the filehandle was derived from to be able to open a mountfd for open_by_handle_at() on the file handle you have stored in userspace. And name_to_handle_at() returns an ephemeral mount ID, so the kernel does not provide what you need to use open_by_handle_at() immediately. To turn this ephemeral mount ID into a stable identifier you have to look up /proc/self/mounts to find the mount point, then statvfs() the mount point to get the f_fsid.
To use the handle, you then need to open the path to the stored mount point, check that f_fsid still matches what you originally looked up, then you can run open_by_handle_at() on the file handle. If you have an open fd on the filesystem and f_fsid matches, you have the filesystem pinned until you close the mount fd, and so you can just sort your queued filehandles by f_fsid and process them all while you have the mount fd open.... Is that right? But that still leaves a problem in that the VFS filehandle does not contain a filesystem identifier itself, so you can never actually verify that the filehandle belongs to the mount that you opened for that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), which filesystems use to encode inode/gen/parent information. Nothing else is placed in the VFS handle, so the file handle cannot be used to identify what filesystem it came from. These seem like fundamental problems for storing VFS handles across reboots: identifying the filesystem is not atomic with the file handle generation and that identification is not encoded into the file handle for later verification. IOWs, if you get the fsid translation wrong, the filehandle will end up being used on the wrong filesystem and userspace has no way of knowing that this occurred - it will get ESTALE or data that isn't what it expected. Either way, it'll look like data corruption to the application(s). Using f_fsid for this seems fragile to me and has potential to break in unexpected ways in highly dynamic environments. The XFS filehandle exposed by the ioctls, and the NFS filehandle for that matter, both include an identifier for the filesystem they belong to in the file handle. This identifier matches the stable filesystem identifier held by the filesystem (or the NFS export table), and hence the kernel could resolve whether the filehandle itself has been directed at the correct filesystem.
The XFS ioctls do not do this fshandle checking - this is something performed by the libhandle library (part of xfsprogs). libhandle knows the format of the XFS filehandles, so it peeks inside them to extract the fsid to determine where to direct them. Applications must first initialise filesystems that file handles can be used on by calling path_to_fshandle() to populate an open file cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to associate a {fd, path, fshandle} tuple for that filesystem. The libhandle operations then match the fsid embedded in the file handle to the known open fshandles, and if they match it uses the associated fd to issue the ioctl to the correct filesystem. This fshandle fd is held open for as long as the application is running, so it pins the filesystem and so the fshandle obtained at init time is guaranteed to be valid until the application exits. Hence on startup an app simply needs to walk the paths it is interested in and call path_to_fshandle() on all of them, but otherwise it does not need to know what filesystem a filehandle belongs to - the libhandle implementation takes care of that entirely.... IOWs, this whole "find the right filesystem for the file handle" implementation is largely abstracted away from userspace by libhandle. Hence just looking at what the XFS ioctls do does not give the whole picture of how stable filehandles were actually used by applications... I suspect that the right thing to do here is extend the VFS filehandles to contain an 8 byte fsid prefix (supplied by a new export op) and an AT_FSID flag for name_to_handle_at() to return just the 8 byte fsid that is used by handles on that filesystem. This provides the filesystems with a well defined API for providing a stable identifier instead of redefining what filesystems need to return in some other UAPI.
This also means that userspace can be entirely filesystem agnostic and it doesn't need to rely on parsing proc files to translate ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is stable enough that it doesn't get the destination wrong. It also means that fanotify UAPI probably no longer needs to supply a f_fsid with the filehandle because it is built into the filehandle.... Cheers, Dave.
On Wed, Mar 24, 2021 at 2:54 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > For most use cases, getting a unique fsid that is not "persistent" > > would be fine. Many use cases will probably be watching a single > > filesystem and then the value of fsid in the event doesn't matter at all. > > > > If, however, at some point in the future, someone were to write > > a listener that stores events in a persistent queue for later processing > > it would be more "convenient" if fsid values were "persistent". > > Ok, that is what I suspected - that you want to write filehandles to > a userspace database of some sort for later processing and so you > need to also store the filesystem that the filehandle belongs to. > > FWIW, that's something XFS started doing a couple of decades ago for > DMAPI based proprietary HSM implementations. They were built around > a massive userspace database that indexed the contents of the > filesystem via file handles and were kept up to date via > notifications through the DMAPI interface. > History repeats itself, but with different storage tiering layers ;-) [...] > > When the program is requested to watch multiple filesystems, it starts by > > querying their fsid. In case of an fsid collision, the program knows that it > > will not be able to tell which filesystem the event originated in, so the > > program can print a descriptive error to the user. > > Ok, so it can handle collisions, but it cannot detect things like > two filesystems swapping fsids because device ordering changed at > boot time. i.e.
there is no way to determine the difference between > f_fsid change vs the same filesystems with stable f_fsid being > mounted in different locations.... > Correct. [...] > > Leaving fanotify out of the picture, the question that the prospective user is > > trying to answer is: > > "Is the object at $PATH or at $FD the same object that was observed at > > 'an earlier time'?" > > > > With XFS, that question can be answered (< 100% certainty) > > using the XFS_IOC_PATH_TO_FSHANDLE interface. > > Actually, there's a bit more to it than that. See below. > > > name_to_handle_at(2) + statfs(2) is a generic interface that provides > > this answer with less certainty, but it could provide the answer > > with the same certainty for XFS. > > Let me see if I get this straight.... > > Because the VFS filehandle interface does not cater for this by > giving you a fshandle that is persistent, you have to know what path > the filehandle was derived from to be able to open a mountfd for > open_by_handle_at() on the file handle you have stored in userspace. That is what NFS and DMAPI need, but this is not what I asked for. I specifically asked for the ability to answer the question: "Is the object at $PATH or at $FD the same object that was observed at 'an earlier time'?" Note that as opposed to open_by_handle_at(), which requires capabilities, checking the identity of the object does not require any capabilities beyond search/read access permissions to the object. Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) and fstatfs() there are none of the races you mention below and fanotify obviously captures a valid {fsid,fhandle} tuple. > And name_to_handle_at() returns an ephemeral mount ID, so the > kernel does not provide what you need to use open_by_handle_at() > immediately. > > To turn this ephemeral mount ID into a stable identifier you have to > look up /proc/self/mounts to find the mount point, then statvfs() > the mount point to get the f_fsid.
> > To use the handle, you then need to open the path to the stored > mount point, check that f_fsid still matches what you originally > looked up, then you can run open_by_handle_at() on the file handle. > If you have an open fd on the filesystem and f_fsid matches, you > have the filesystem pinned until you close the mount fd, and so you > can just sort your queued filehandles by f_fsid and process them all > while you have the mount fd open.... > > Is that right? It's not wrong, but it's irrelevant to the requirement, which was to *identify* the object, not to *access* the object. See more below... > > But that still leaves a problem in that the VFS filehandle does not > contain a filesystem identifier itself, so you can never actually > verify that the filehandle belongs to the mount that you opened for > that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), > which filesystems use to encode inode/gen/parent information. > Nothing else is placed in the VFS handle, so the file handle cannot > be used to identify what filesystem it came from. > > These seem like fundamental problems for storing VFS handles > across reboots: identifying the filesystem is not atomic with the > file handle generation and that identification is not encoded > into the file handle for later verification. > > IOWs, if you get the fsid translation wrong, the filehandle will end > up being used on the wrong filesystem and userspace has no way of > knowing that this occurred - it will get ESTALE or data that isn't > what it expected. Either way, it'll look like data corruption to the > application(s). Using f_fsid for this seems fragile to me and has > potential to break in unexpected ways in highly dynamic > environments. > The potential damage sounds bad when you put it this way, but in fact it really depends on the use case. For the use case of NFS client it's true you MUST NOT get the wrong object when resolving file handles. With fanotify, this is not the case.
When a listener gets an event with an object identifier, the listener cannot infer the path of that object. If the listener has several objects open (e.g. tail -f A B C) then when getting an event, the identifier can be used to match the open file with certainty (having verified no collisions of identifiers after opening the files). If the listener is watching multiple directories (e.g. inotifywatch --recursive) then the listener has two options: 1. Keep open fds for all watched dirs - this is what inotify_add_watch() does internally (not fds per se but keeping an elevated i_count) 2. Keep a fid->path map for all watched dirs and accept the fact that the cached path information may be stale The 2nd option is valid for applications that use the events as hints to take action. An indexer application, for example, doesn't care if it will scan a directory where there were no changes as long as it will get the correct hint eventually. So if an indexer application were to act on FAN_MOVE events by scanning the entire subtree under the parent dir where an entry was renamed, the index will be eventually consistent, regardless of all the events on objects with stale path cache that may have been received after the rename. > The XFS filehandle exposed by the ioctls, and the NFS filehandle for > that matter, both include an identifier for the filesystem they > belong to in the file handle. This identifier matches the stable > filesystem identifier held by the filesystem (or the NFS export > table), and hence the kernel could resolve whether the filehandle > itself has been directed at the correct filesystem. > > The XFS ioctls do not do this fshandle checking - this is something > performed by the libhandle library (part of xfsprogs). libhandle > knows the format of the XFS filehandles, so it peeks inside them to > extract the fsid to determine where to direct them.
> > Applications must first initialise filesystems that file handles can > be used on by calling path_to_fshandle() to populate an open file > cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to > associate a {fd, path, fshandle} tuple for that filesystem. The > libhandle operations then match the fsid embedded in the file handle > to the known open fshandles, and if they match it uses the > associated fd to issue the ioctl to the correct filesystem. > > This fshandle fd is held open for as long as the application is > running, so it pins the filesystem and so the fshandle obtained at > init time is guaranteed to be valid until the application exits. > Hence on startup an app simply needs to walk the paths it is > interested in and call path_to_fshandle() on all of them, but > otherwise it does not need to know what filesystem a filehandle > belongs to - the libhandle implementation takes care of that > entirely.... > > IOWs, this whole "find the right filesystem for the file handle" > implementation is largely abstracted away from userspace by > libhandle. Hence just looking at what the XFS ioctls do does not > give the whole picture of how stable filehandles were actually used > by applications... > > I suspect that the right thing to do here is extend the VFS > filehandles to contain an 8 byte fsid prefix (supplied by a new > export op) and an AT_FSID flag for name_to_handle_at() to return > just the 8 byte fsid that is used by handles on that filesystem. > This provides the filesystems with a well defined API for providing > a stable identifier instead of redefining what filesystems need to > return in some other UAPI. > > This also means that userspace can be entirely filesystem agnostic > and it doesn't need to rely on parsing proc files to translate > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > stable enough that it doesn't get the destination wrong.
It also > means that fanotify UAPI probably no longer needs to supply a > f_fsid with the filehandle because it is built into the > filehandle.... > That is one option. Let's call it the "bullet proof" option. Another option, let's call it the "pragmatic" option, is that you accept that my patch shouldn't break anything and agree to apply it. In that case, a future indexer (or whatever) application author can use fanotify, name_to_handle_at() and fstatfs() as is and document that after a mount cycle, the indexer may get confused and miss changes in obscure filesystems that nobody uses on desktops and servers. The third option, let's call it the "sad" option, is that we do nothing and said future indexer application author will need to find ways to work around this deficiency or document that after a mount cycle, the indexer may get confused and miss changes in commonly used desktop and server filesystems (i.e. XFS). <side bar> I think that what an indexer author would really want is not "persistent fsid" but rather a "persistent change journal" [1]. I have not abandoned this effort and I have a POC [2] for a new fsnotify backend (not fanotify) based on inputs that you also provided in LSFMM. In this POC, which is temporarily reusing the code of overlayfs index, the persistent identifier of an object is {s_uuid,fhandle}. </side bar> Would you be willing to accept the "pragmatic" option? Thanks, Amir. [1] https://lwn.net/Articles/755277/ [2] https://github.com/amir73il/linux/commits/ovl-watch
On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > This also means that userspace can be entirely filesystem agnostic > > and it doesn't need to rely on parsing proc files to translate > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > stable enough that it doesn't get the destination wrong. It also > > means that fanotify UAPI probably no longer needs to supply a > > f_fsid with the filehandle because it is built into the > > filehandle.... > > > > That is one option. Let's call it the "bullet proof" option. > > Another option, let's call it the "pragmatic" options, is that you accept > that my patch shouldn't break anything and agree to apply it. Your patch may very well break something. Most Linux file systems do store the dev_t in the fsid and userspace may for whatever silly reasons depend on it. Also trying to use the fsid for anything persistent is plain stupid, 64-bits are not enough entropy for such an identifier. You at least need a 128-bit UUID-like identifier for that. So I think this whole discussion is going in the wrong direction. Is exposing a stable file system identifier useful? Yes, for many reasons. Is repurposing the fsid for that a good idea? Hell no.
On Wed, Mar 24, 2021 at 9:43 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > > This also means that userspace can be entirely filesystem agnostic > > > and it doesn't need to rely on parsing proc files to translate > > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > > stable enough that it doesn't get the destination wrong. It also > > > means that fanotify UAPI probably no longer needs to supply a > > > f_fsid with the filehandle because it is built into the > > > filehandle.... > > > > > > > That is one option. Let's call it the "bullet proof" option. > > > > Another option, let's call it the "pragmatic" option, is that you accept > > that my patch shouldn't break anything and agree to apply it. > > Your patch may very well break something. Most Linux file systems do > store the dev_t in the fsid and userspace may for whatever silly > reasons depend on it. > I acknowledge that. I do not claim that my change carries zero risk of breakage. However, if such userspace dependency exists, it would break on ext4, btrfs, ocfs2, ceph and many more fs, so it would have to be a dependency that is tightly coupled with a specific fs. The probability of that is rather low IMO. I propose an opt-in mount option "-o fixed_fsid" for this behavior to make everyone sleep better. > Also trying to use the fsid for anything persistent is plain stupid, > 64-bits are not enough entropy for such an identifier. You at least > need a 128-bit UUID-like identifier for that. > That's a strong claim to make without providing statistical analysis and the description of the use case. The requirement is for a unique identifier of a mounted fs within a single system. > So I think this whole discussion is going in the wrong direction. > Is exposing a stable file system identifier useful? Yes, for many > reasons. Is repurposing the fsid for that a good idea? Hell no. Again.
Strong reaction not backed up by equally strong technical arguments. I am not at all opposed to a new API for stable FS_HANDLE, but no, I am not going to offer writing this new API myself at this point. Applications that use statfs() to identify a filesystem may very well already exist in the wild, so the fixed_fsid discussion is independent of the new API. Considering the fact that many relevant filesystems do provide a stable f_fsid and considering the fact that a blockdev number of devices that are instantiated at boot time are often practically stable, the users of those applications could be unaware of the instability of f_fsid or maybe they can live with it. I find it hard to argue with the "all-or-nothing" attitude in the reaction to my proposed change. What I propose is an opt-in mount option "-o fixed_fsid". IMO, NACKing this proposal is not going to improve anything for anyone, but I don't know what more to say. Thanks, Amir.
On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > On Wed, Mar 24, 2021 at 2:54 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote: > > > On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > > > Leaving fanotify out of the picture, the question that the prospect user is > > > trying answer is: > > > "Is the object at $PATH or at $FD the same object that was observed at > > > 'an earlier time'?" > > > > > > With XFS, that question can be answered (< 100% certainty) > > > using the XFS_IOC_PATH_TO_FSHANDLE interface. > > > > Actually, there's a bit more to it than that. See below. > > > > > name_to_handle_at(2) + statfs(2) is a generic interface that provides > > > this answer with less certainty, but it could provide the answer > > > with the same certainty for XFS. > > > > Let me see if I get this straight.... > > > > Because the VFS filehandle interface does not cater for this by > > giving you a fshandle that is persistent, you have to know what path > > the filehandle was derived from to be able to open a mountfd for > > open_by_handle_at() on the file handle you have stored in userspace. > > That is what NFS and DMAPI need, but this is not what I asked for. > I specifically asked for the ability to answer the question: > "Is the object at $PATH or at $FD the same object that was observed at > 'an earlier time'?" Define "at an earlier time". Once it means "across system reboots" then the problem scope becomes a lot larger... > Note that as opposed to open_by_handle_at(), which requires > capabilities, checking the identity of the object does not require any > capabilities beyond search/read access permissions to the object. How do you check the identity of a file handle without a filesystem identifier in it? 
having an external f_fsid isn't really sufficient - you need to query the filesystem for its identifier and determine if the file handle contains that same identifier... > > Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) > and fstatfs() there are none of the races you mention below and > fanotify obviously captures a valid {fsid,fhandle} tuple. Except that fsid is not guaranteed to be the same across mounts, let alone reboots. There is no guarantee of uniqueness, either. IOWs, in the bigger picture, f_fsid isn't something that can provide a guaranteed answer to your "is $obj the same as it was at $TIME"... You can't even infer a path from the fsid, even if it is unique. The fsid doesn't tell you what *mount point* it refers to. i.e. in the presence of bind mounts, there can be multiple disjoint directory hierarchies that correspond to the same fsid... > > And that open_by_handle_at() returns an ephemeral mount ID, so the > > kernel does not provide what you need to use open_by_handle_at() > > immediately. > > > > To turn this ephemeral mount ID into a stable identifier you have to > > look up /proc/self/mounts to find the mount point, then statvfs() > > the mount point to get the f_fsid. > > > > To use the handle, you then need to open the path to the stored > > mount point, check that f_fsid still matches what you originally > > looked up, then you can run open_by_handle_at() on the file handle. > > If you have an open fd on the filesystem and f_fsid matches, you > > have the filesystem pinned until you close the mount fd, and so you > > can just sort your queued filehandles by f_fsid and process them all > > while you have the mount fd open.... > > > > Is that right? > > It's not wrong, but it's irrelevant to the requirement, which was to > *identify* the object, not to *access* the object. > See more below... 
> > > > > But that still leaves a problem in that the VFS filehandle does not > > contain a filesystem identifier itself, so you can never actually > > verify that the filehandle belongs to the mount that you opened for > > that f_fsid. i.e. The file handle is built by exportfs_encode_fh(), > > which filesystems use to encode inode/gen/parent information. > > Nothing else is placed in the VFS handle, so the file handle cannot > > be used to identify what filesystem it came from. > > > > These seem like fundamental problems for storing VFS handles > > across reboots: identifying the filesystem is not atomic with the > > file handle generation and that identification is not encoded > > into the file handle for later verification. > > > > IOWs, if you get the fsid translation wrong, the filehandle will end > > up being used on the wrong filesystem and userspace has no way of > > knowing that this occurred - it will get ESTALE or data that isn't > > what it expected. Either way, it'll look like data corruption to the > > application(s). Using f_fsid for this seems fragile to me and has > > potential to break in unexpected ways in highly dynamic > > environments. > > > > The potential damage sounds bad when you put it this way, but in fact > it really depends on the use case. For the use case of NFS client it's true > you MUST NOT get the wrong object when resolving file handles. > > With fanotify, this is not the case. > When a listener gets an event with an object identifier, the listener cannot > infer the path of that object. > > If the listener has several objects open (e.g. tail -f A B C) then when getting > an event, the identifier can be used to match the open file with certainty > (having verified no collisions of identifiers after opening the files). Sorry, you've lost me. How do you reliably match a {fsid, fhandle} to an open file descriptor? You've got to have more information available than just a fd, fsid and a fhandle... 
> If the listener is watching multiple directories (e.g. inotifywatch --recursive) > then the listener has two options: > 1. Keep open fds for all watched dirs - this is what inotify_add_watch() > does internally (not fds per-se but keeping an elevated i_count) > 2. Keep fid->path map for all watched dirs and accept the fact that the > cached path information may be stale > > The 2nd option is valid for applications that use the events as hints > to take action. An indexer application, for example, doesn't care if > it will scan a directory where there were no changes as long as it will > get the correct hint eventually. > > So if an indexer application were to act on FAN_MOVE events by > scanning the entire subtree under the parent dir where an entry was > renamed, the index will be eventually consistent, regardless of all > the events on objects with stale path cache that may have been > received after the rename. Sure, but you don't need a file handle for this. You can just scan the directory hierarchy any time you get a notification for that fsid. Even if you have multiple directory hierarchies that you are watching on a given mount. I'm -guessing- that you are using the filehandle to differentiate between different watched hierarchies, and you do that by taking a name_to_handle_at() snapshot of the path when the watch is set, yes? AFAICT, the application cannot care about whether it loses events across reboot because the indexer already needs to scan after boot so that it is coherent with whatever state the filesystem is in after recovery. > > The XFS filehandle exposed by the ioctls, and the NFS filehandle for > > that matter, both include an identifier for the filesystem they > > belong to in the file handle. This identifier matches the stable > > filesystem identifier held by the filesystem (or the NFS export > > table), and hence the kernel could resolve whether the filehandle > > itself has been directed at the correct filesystem. 
> > > > The XFS ioctls do not do this fshandle checking - this is something > > performed by the libhandle library (part of xfsprogs). libhandle > > knows the format of the XFS filehandles, so it peeks inside them to > > extract the fsid to determine where to direct them. > > > > Applications must first initialise filesystems that file handles can > > be used on by calling path_to_fshandle() to populate an open file > > cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to > > associate a {fd, path, fshandle} tuple for that filesystem. The > > libhandle operations then match the fsid embedded in the file handle > > to the known open fshandles, and if they match it uses the > > associated fd to issue the ioctl to the correct filesystem. > > > > This fshandle fd is held open for as long as the application is > > running, so it pins the filesystem and so the fshandle obtained at > > init time is guaranteed to be valid until the application exits. > > Hence on startup an app simply needs to walk the paths it is > > interested in and call path_to_fshandle() on all of them, but > > otherwise it does not need to know what filesystem a filehandle > > belongs to - the libhandle implementation takes care of that > > entirely.... > > > > IOWs, this whole "find the right filesystem for the file handle" > > implementation is largely abstracted away from userspace by > > libhandle. Hence just looking at what the XFS ioctls do does not > > give the whole picture of how stable filehandles were actually used > > by applications... > > > > I suspect that the right thing to do here is extend the VFS > > filehandles to contain an 8 byte fsid prefix (supplied by a new > > export op) and an AT_FSID flag for name_to_handle_at() to return > > just the 8 byte fsid that is used by handles on that filesystem. 
> > This provides the filesystems with a well defined API for providing > > a stable identifier instead of redefining what filesystems need to > > return in some other UAPI. > > > > This also means that userspace can be entirely filesystem agnostic > > and it doesn't need to rely on parsing proc files to translate > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > stable enough that it doesn't get the destination wrong. It also > > means that fanotify UAPI probably no longer needs to supply a > > f_fsid with the filehandle because it is built into the > > filehandle.... > > > > That is one option. Let's call it the "bullet proof" option. "Reliable". "Provides well defined behaviour". "Guarantees". > Another option, let's call it the "pragmatic" option, is that you accept > that my patch shouldn't break anything and agree to apply it. "shouldn't break anything" is the problem. You can assert all you want that nothing will break, but history tells us that even the most benign UAPI changes can break unexpected stuff in unexpected ways. That's the fundamental problem. We *know* that what you are trying to do with filehandles and fsid has -explicit, well known- issues. What we have here is a new interface that is .... problematic, and now it needs to redefine other parts of the UAPI to make "problems" with the new interface go away. Yes, we really suck at APIs, but that doesn't mean hacking a UAPI around to work around problems in another UAPI is the right answer. > In that case, a future indexer (or whatever) application author can use > fanotify, name_to_handle_at() and fstatfs() as is and document that after > mount cycle, the indexer may get confused and miss changes in obscure > filesystems that nobody uses on desktops and servers. But anyone wanting to use this for an HSM style application that needs a guarantee that the filehandle can be resolved to the correct filesystem for open_by_handle_at() is SOL? 
IOWs, this proposal is not really fixing the underlying problem, it's just kicking the can down the road. > The third option, let's call it the "sad" option, is that we do nothing > and said future indexer application author will need to find ways to > work around this deficiency or document that after mount cycle, the > indexer may get confused and miss changes in commonly used > desktop and server filesystems (i.e. XFS). Which it already needs to do, because there are many, many filesystems out there whose f_fsid changes on every mount. > <side bar> > I think that what indexer author would really want is not "persistent fsid" > but rather a "persistent change journal" [1]. > I have not abandoned this effort and I have a POC [2] for a new fsnotify > backend (not fanotify) based on inputs that you also provided in LSFMM. > In this POC, which is temporarily reusing the code of overlayfs index, > the persistent identifier of an object is {s_uuid,fhandle}. > </side bar> Yup, that's pretty much what HSMs on top of DMAPI did. But then the app developers realised that they can still miss events, especially when the system crashes. Not to mention that the filesystem may not actually replay all the changes it reports to userspace during journal recovery because it is using asynchronous journalling and so much of the pending in memory change was lost, even though change events were reported to userspace.... Hence doing stuff like "fanotify intents" doesn't actually solve any "missing event" problems - it just creates more complex coherency problems because you cannot co-ordinate "intent done" events with filesystem journal completion points sanely. The fanotify journal needs to change state atomically with the filesystem journal state, and that's not really something that can be done by a layer above the filesystem.... Hence the introduction of the bulkstat interface in XFS for fast, efficient scanning of all the inodes in the filesystem for changes. 
The HSMs -always- scanned the filesystems after an unclean mount (i.e. not having a registered unmount event recorded in the userspace application event database) because the filesystem and the userspace database state are never in sync after a crash event. And, well, because userspace applications can crash and/or have bugs that lose events, these HSMs would also do periodic scans to determine if it had missed anything. When you are indexing hundreds of millions of files and many petabytes of storage across disk and tape (this was the typical scale of DMAPI installations in the mid 2000s), you care about proactively catching that one event that was missed because of a transient memory allocation event under heavy production load.... > Would you be willing to accept the "pragmatic" option? It doesn't really seem "pragmatic" to me. It looks "convenient" for fanotify to redefine what f_fsid means, but I'm not convinced that we should be changing another UAPI to paper over a sub-optimal UAPI in relatively new and largely unused fanotify functionality.... Cheers, Dave.
On Wed, Mar 24, 2021 at 11:18:36AM +0200, Amir Goldstein wrote: > On Wed, Mar 24, 2021 at 9:43 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Wed, Mar 24, 2021 at 08:53:25AM +0200, Amir Goldstein wrote: > > > > This also means that userspace can be entirely filesystem agnostic > > > > and it doesn't need to rely on parsing proc files to translate > > > > ephemeral mount IDs to paths, statvfs() and hoping that f_fsid is > > > > stable enough that it doesn't get the destination wrong. It also > > > > means that fanotify UAPI probably no longer needs to supply a > > > > f_fsid with the filehandle because it is built into the > > > > filehandle.... > > > > > > > > > > That is one option. Let's call it the "bullet proof" option. > > > > > > Another option, let's call it the "pragmatic" option, is that you accept > > > that my patch shouldn't break anything and agree to apply it. > > > > Your patch may very well break something. Most Linux file systems do > > store the dev_t in the fsid and userspace may for whatever silly > > reasons depend on it. > > > > I acknowledge that. > I do not claim that my change carries zero risk of breakage. > However, if such userspace dependency exists, it would break on ext4, > btrfs, ocfs2, ceph and many more fs, so it would have to be a > dependency that is tightly coupled with a specific fs. > The probability of that is rather low IMO. > > I propose an opt-in mount option "-o fixed_fsid" for this behavior to make > everyone sleep better. Layering hacks on top of hacks to avoid fixing the fanotify UAPI limitations isn't a very palatable option. Especially those that require adding mount options we'll have to support forever more... Cheers, Dave.
> How do you check the identity of a file handle without a filesystem > identifier in it? having an external f_fsid isn't really sufficient > - you need to query the filesystem for its identifier and determine > if the file handle contains that same identifier... > > > Furthermore, with name_to_handle_at(fd, ..., AT_EMPTY_PATH) > > and fstatfs() there are none of the races you mention below and > > fanotify obviously captures a valid {fsid,fhandle} tuple. > > Except that fsid is not guaranteed to be the same across mounts, let > alone reboots. There is no guarantee of uniqueness, either. IOWs, in > the bigger picture, f_fsid isn't something that can provide a > guaranteed answer to your "is $obj the same as it was at $TIME"... > > You can't even infer a path from the fsid, even if it is unique. The > fsid doesn't tell you what *mount point* it refers to. i.e. in > the presence of bind mounts, there can be multiple disjoint directory > hierarchies that correspond to the same fsid... > This conversation is not converging. Partly because I may have failed to explain the requirements. I am telling you that I need to do X and you are telling me that I cannot do Y. I did not say that I needed to infer the path of an object nor that I cared which mount point the object was observed from. The application is able to cope with not being able to correctly resolve fid to path, but it is "inconvenient". In technical terms, "inconvenient" translates to the "cost" of IO and time to re-crawl the filesystem. I hear your main arguments against re-purposing f_fsid and potential breakage of applications loud and clear. On their own, they are perfectly valid and acceptable reasons to NACK my patch, so I am not going to pursue it. OTOH, I find your argument to leave f_fsid "less useful than it can be" because "it is not perfect", far less convincing, but it doesn't look like I am going to change your mind, so no point in arguing about that. [...] 
> > If the listener has several objects open (e.g. tail -f A B C) then when getting > > an event, the identifier can be used to match the open file with certainty > > (having verified no collisions of identifiers after opening the files). > > Sorry, you've lost me. How do you reliably match a {fsid, fhandle} > to an open file descriptor? You've got to have more information > available than just a fd, fsid and a fhandle... > This was just an example of the simple case where the listener is started after opening the files and capturing the {fsid, fhandle} tuple from fd. The events received by the listener contain {fsid, fhandle} tuples that should match one of the captured fids. For this example, the stable fsid is irrelevant. > > If the listener is watching multiple directories (e.g. inotifywatch --recursive) > > then the listener has two options: > > 1. Keep open fds for all watched dirs - this is what inotify_add_watch() > > does internally (not fds per-se but keeping an elevated i_count) > > 2. Keep fid->path map for all watched dirs and accept the fact that the > > cached path information may be stale > > > > The 2nd option is valid for applications that use the events as hints > > to take action. An indexer application, for example, doesn't care if > > it will scan a directory where there were no changes as long as it will > > get the correct hint eventually. > > > > So if an indexer application were to act on FAN_MOVE events by > > scanning the entire subtree under the parent dir where an entry was > > renamed, the index will be eventually consistent, regardless of all > > the events on objects with stale path cache that may have been > > received after the rename. > > Sure, but you don't need a file handle for this. you can just scan > the directory hierarchy any time you get a notification for that > fsid. Even if you have multiple directory hierarchies that you > are watching on a given mount. 
> > I'm -guessing- that you are using the filehandle to differentiate > between different watched hierarchies, and you do that by taking a > name_to_handle_at() snapshot of the path when the watch is set, yes? > More or less, yes. > AFAICT, the application cannot care about whether it loses > events across reboot because the indexer already needs to scan after > boot so that it is coherent with whatever state the filesystem is > in after recovery. > That is absolutely right. But a listener can watch several mounted filesystems on a live system and mounts can come and go (this is standard for e.g. an AV scanner). When a filesystem is mounted, the application cannot KNOW that the filesystem was not mounted somewhere else and modified without supervision, so the safest thing would be a full crawl. But users should be able to configure that they trust that the filesystem was not mounted elsewhere and that they trust that the listener has subscribed to watch for changes, before untrusted users got access to make modifications in the watched filesystem. Under those specific circumstances, it is "inconvenient" and potentially very expensive for the identity of the filesystem to change across a mount cycle (e.g. when using loop mount). That said, a good application can use libblkid to overcome the "inconvenience" and detect fsid changes. The argument of "many other filesystem have unstable fsid" is not relevant. The users making all the assumptions above would know what filesystem is used, and if they know that the filesystem's fsid can be trusted to be stable (because they read the code and/or documentation), they can use that knowledge to improve their system. [...] > > That is one option. Let's call it the "bullet proof" option. > > "Reliable". "Provides well defined behaviour". "Guarantees". > Yes, and also "Application is able to do the right thing without admin configuration". 
> > Another option, let's call it the "pragmatic" option, is that you accept > > that my patch shouldn't break anything and agree to apply it. > > "shouldn't break anything" is the problem. You can assert all you > want that nothing will break, but history tells us that even the > most benign UAPI changes can break unexpected stuff in unexpected > ways. > No arguments here. My only claim was that risk is not high. > That's the fundamental problem. We *know* that what you are trying > to do with filehandles and fsid has -explicit, well known- issues. > What we have here is a new interface that > is .... problematic, and now it needs to redefine other parts > of the UAPI to make "problems" with the new interface go away. > > Yes, we really suck at APIs, but that doesn't mean hacking a UAPI > around to work around problems in another UAPI is the right answer. > All very very true. And yet, those very true words mask the fact that the change that I proposed, IMO, slightly improves an existing UAPI and aligns xfs behavior with the behavior of other "prominent" Linux local filesystems. It's really a matter of perspective how we choose to present it. > > In that case, a future indexer (or whatever) application author can use > > fanotify, name_to_handle_at() and fstatfs() as is and document that after > > mount cycle, the indexer may get confused and miss changes in obscure > > filesystems that nobody uses on desktops and servers. > > But anyone wanting to use this for an HSM style application that > needs a guarantee that the filehandle can be resolved to the correct > filesystem for open_by_handle_at() is SOL? > > IOWs, this proposal is not really fixing the underlying problem, > it's just kicking the can down the road. > It's not a kick, it's just a nudge ;-) And I am not abandoning the HSM users. It's just a much bigger project that also involves persistent change intent journal. 
> > The third option, let's call it the "sad" option, is that we do nothing > > and said future indexer application author will need to find ways to > > work around this deficiency or document that after mount cycle, the > > indexer may get confused and miss changes in commonly used > > desktop and server filesystems (i.e. XFS). > > Which it already needs to do, because there are many, many > filesystems out there that have f_fsid that change on every mount. > Not any of the filesystems that matter to desktop/server users. IOW, no other filesystem that comes as the default fs of any major distro. > > <side bar> > > I think that what indexer author would really want is not "persistent fsid" > > but rather a "persistent change journal" [1]. > > I have not abandoned this effort and I have a POC [2] for a new fsnotify > > backend (not fanotify) based on inputs that also you provided in LSFMM. > > In this POC, which is temporarily reusing the code of overlayfs index, > > the persistent identifier of an object is {s_uuid,fhandle}. > > </side bar> > > Yup, that's pretty much what HSMs on top of DMAPI did. But then the > app developers realised that they can still miss events, especially > when the system crashes. Not to mention that the filesystem may not > actually replay all the changes it reports to userspace during > journal recovery because it is using asynchronous journalling and so > much of the pending in memory change was lost, even though change > events were reported to userspace.... > Right... > Hence doing stuff like "fanotify intents" doesn't actually solve any > "missing event" problems - it just creates more complex coherency > problems because you cannot co-ordinate "intent done" events with > filesystem journal completion points sanely. The fanotify journal > needs to change state atomically with the filesystem journal state, > and that's not really something that can be done by a layer above > the filesystem.... 
> Surely, a filesystem can do it more efficiently internally, but perhaps a generic intent journal subsystem can be done...? My POC is a kernel subsystem (not fanotify). Intents are created in {mnt_want,{sb,file}_start}_write() call sites *before* fs freeze lock. When configured correctly, the backend store of the "modify intent" map is on the same filesystem that is being watched for changes and the ONLY information contained in the "modify intents" is the {uuid,fhandle} tuple of directories wherein modifications may have occurred. The "modify intent" records themselves are also metadata (directory entries), so after a crash, if there is no "modify intent" record for any given directory, it should be safe to assume that there were no modifications made under that directory. This assertion should hold also with crashes that "roll back" changes. My POC application uses that "map"/"index" of modified directories to perform a "pruned tree scan" after reboot. Do you see any flaws in this design? > Hence the introduction of the bulkstat interface in XFS for fast, > efficient scanning of all the inodes in the filesystem for changes. > The HSMs -always- scanned the filesystems after an unclean mount > (i.e. not having a registered unmount event recorded in the > userspace application event database) because the filesystem and the > userspace database state are never in sync after a crash event. > > And, well, because userspace applications can crash and/or have bugs > that lose events, these HSMs would also do periodic scans to > determine if it had missed anything. When you are indexing hundreds > of millions of files and many petabytes of storage across disk and > tape (this was the typical scale of DMAPI installations in the mid > 2000s), you care about proactively catching that one event that was > missed because of a transient memory allocation event under heavy > production load.... > Yes, I know all about that. 
The game is trying to reduce the number of cases when an fs scan is needed and reduce the cost of the scan to a minimum. I'll be happy to share more information about my POC if anyone asks and/or in LSFMM when/if that eventually happens. w.r.t $SUBJECT, if nobody else shares my POV that this MAY be a beneficial and benign change that is worth doing regardless of fanotify, I am not going to argue for it more. Thanks, Amir.
On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote: > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote: > > should be using something common across all filesystems from the > > linux superblock, not deep dark internal filesystem magic. The > > export interfaces that generate VFS (and NFS) filehandles already > > have a persistent fsid associated with them, which may in fact be > > the filesystem UUID in it's entirety. > > > > Yes, nfsd is using dark internal and AFAIK undocumented magic to > pick that identifier (Bruce, am I wrong?). Sorry, I kept putting off catching up with this thread and only now noticed the question. It's actually done mostly in userspace (rpc.mountd), so "dark internal" might not be fair, but it is rather complicated. There are several options (UUID, device number, number provided by the user with fsid= option), and I don't recall the logic behind which we use when. I don't *think* we have good comprehensive documentation of it anywhere. I wish we did. It'd take a little time to put together. Starting points would be linux/fs/nfsd/nfsfh.c and nfs-utils/support/export/cache.c. --b.
I've been reading through this whole thread, and it appears to me that the only real, long-term solution is to rely on file system UUIDs (for those file systems that support real 128-bit UUIDs), and optionally, for those file systems which support it, a new, "snapshot" or "fs-instance" UUID. The FS UUID is pretty simple; we just need an ioctl (or similar interface) which returns the UUID for a particular file system. The Snapshot UUID is the one which is a bit trickier. If the underlying block device can supply something unique --- for example, the WWN or WWID which is defined by FC, ATA, SATA, SCSI, NVMe, etc. --- then that plus a partition identifier could be hashed to form a Snapshot UUID. LVM volumes have a LV UUID that could be used for a similar purpose. If you have a block device which doesn't reliably provide a WWN or WWID, we could imagine that a file system has a field in the superblock, and a file system specific program could be used to set that field to a random UUID, and that becomes part of the snapshot process. This may be problematic for remote/network file systems which don't have such a concept, but life's a bitch.... With that, then userspace can fetch the st_dev, st_ino, the FS UUID, and the Snapshot UUID, and use some combination of those fields (as available) to try to determine whether two objects are unique or not. Is this perfect? Heck, no. But ultimately, this is a hard problem, and trying to wave our hands and create something that works given one set of assumptions --- and completely breaks in a different operating environment --- is going to lead to angry users blaming the fs developers. It's a messy problem, and I think all we can do is expose the entire mess to userspace, and make it be a userspace problem. That way, the angry users can blame the userspace developers instead. :-) - Ted
On Fri, Mar 26, 2021 at 10:15 PM J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > should be using something common across all filesystems from the
> > > linux superblock, not deep dark internal filesystem magic. The
> > > export interfaces that generate VFS (and NFS) filehandles already
> > > have a persistent fsid associated with them, which may in fact be
> > > the filesystem UUID in it's entirety.
> > >
> >
> > Yes, nfsd is using dark internal and AFAIK undocumented magic to
> > pick that identifier (Bruce, am I wrong?).
>
> Sorry, I kept putting off catching up with this thread and only now
> noticed the question.
>
> It's actually done mostly in userspace (rpc.mountd), so "dark internal"
> might not be fair, but it is rather complicated.  There are several
> options (UUID, device number, number provided by the user with fsid=
> option), and I don't recall the logic behind which we use when.
>

I'll take back "dark internal" then and replace it with "light
external" ;-) which is also a problem.

If userspace is involved in declaring the id of the *export*, then from
the NFS client's POV that is not a problem, but from the fsnotify POV,
that identifier cannot be determined when an event happens on an inode
NOT via the NFS client.

As a matter of fact, the fanotify requirements about fsid are even more
strict than being able to get the fsid from the inode.
From fanotify_mark.2:

"
EXDEV  The filesystem object indicated by pathname resides within a
       filesystem subvolume (e.g., btrfs(5)) which uses a different
       fsid than its root superblock...
"

> I don't *think* we have good comprehensive documentation of it anywhere.
> I wish we did.  It'd take a little time to put together.  Starting
> points would be linux/fs/nfsd/nfsfh.c and
> nfs-utils/support/export/cache.c.
At least as far as fanotify is concerned, the documentation is not
going to matter.  The only thing needed is an fsid value that is
queryable via a userspace API.  f_fsid meets this criteria, which is
why it was chosen for fanotify.

Having the fsid reported by fanotify also be stable is a nice-to-have
feature for very selective use cases, which is why I posted this xfs
patch.

Thanks,
Amir.
On Sat, Mar 27, 2021 at 1:34 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I've been reading through this whole thread, and it appears to me that
> the only real, long-term solution is to rely on file system UUIDs
> (for those file systems that support real 128-bit UUIDs), and
> optionally, for those file systems which support it, a new, "snapshot"
> or "fs-instance" UUID.
>
> The FS UUID is pretty simple; we just need an ioctl (or similar
> interface) which returns the UUID for a particular file system.
>
> The Snapshot UUID is the one which is a bit trickier.  If the
> underlying block device can supply something unique --- for example,
> the WWN or WWID which is defined by FC, ATA, SATA, SCSI, NVMe, etc. ---
> then that plus a partition identifier could be hashed to form a
> Snapshot UUID.  LVM volumes have an LV UUID that could be used for a
> similar purpose.
>
> If you have a block device which doesn't reliably provide a WWN or
> WWID, we could imagine that a file system has a field in the
> superblock, and a file system specific program could be used to set
> that field to a random UUID, so that this becomes part of the snapshot
> process.  This may be problematic for remote/network file systems
> which don't have such a concept, but life's a bitch....
>
> With that, userspace can fetch the st_dev, st_ino, the FS UUID,
> and the Snapshot UUID, and use some combination of those fields (as
> available) to determine whether two objects are unique or not.
>
> Is this perfect?  Heck, no.  But ultimately, this is a hard problem,
> and trying to wave our hands and create something that works given one
> set of assumptions --- and completely breaks in a different operating
> environment --- is going to lead to angry users blaming the fs
> developers.  It's a messy problem, and I think all we can do is expose
> the entire mess to userspace, and make it be a userspace problem.
> That way, the angry users can blame the userspace developers
> instead.  :-)
Sounds like a plan ;-)

FWIW, if and when we have a standard userspace API (fsinfo()?) to
export the fs instance uuid to userspace, fanotify can use the uuid
instead of fsid when available (opt-in via a new fanotify_init() flag).

The fanotify_event_info_header contains an "info_type" field, so it's
not a problem for some events to report fsid (as today) and for other
events to report uuid, depending on availability of the information
per filesystem.

Thanks,
Amir.
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e5e0713bebcd..37f8417b78c4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -790,7 +790,7 @@ xfs_fs_statfs(
 	struct xfs_mount	*mp = XFS_M(dentry->d_sb);
 	xfs_sb_t		*sbp = &mp->m_sb;
 	struct xfs_inode	*ip = XFS_I(d_inode(dentry));
-	uint64_t		fakeinos, id;
+	uint64_t		fakeinos;
 	uint64_t		icount;
 	uint64_t		ifree;
 	uint64_t		fdblocks;
@@ -800,8 +800,8 @@ xfs_fs_statfs(
 	statp->f_type = XFS_SUPER_MAGIC;
 	statp->f_namelen = MAXNAMELEN - 1;
 
-	id = huge_encode_dev(mp->m_ddev_targp->bt_dev);
-	statp->f_fsid = u64_to_fsid(id);
+	statp->f_fsid.val[0] = mp->m_fixedfsid[0];
+	statp->f_fsid.val[1] = mp->m_fixedfsid[1];
 
 	icount = percpu_counter_sum(&mp->m_icount);
 	ifree = percpu_counter_sum(&mp->m_ifree);
Some filesystems on a persistent storage backend use a digest of the
filesystem's persistent uuid as the value for f_fsid returned by
statfs(2).

xfs, like many other filesystems, provides the non-persistent block
device number as the value of f_fsid.

Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID
for identifying objects using file_handle and f_fsid in events.

The xfs specific ioctl XFS_IOC_PATH_TO_FSHANDLE similarly attaches an
fsid to exported file handles, but it is not the same fsid exported
via statfs(2) - it is a persistent fsid based on the filesystem's uuid.

Use the same persistent value for f_fsid, so object identifiers in
fanotify events will describe the objects more uniquely.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Guys,

This change would be useful for fanotify users.
Do you see any problems with that minor change of uapi?

The way I see it, right now the f_fsid of an xfs filesystem can change
on reboots.  With this change, it will change once more and never
again.

I did not find any kernel internal user other than fanotify, and as
for userland expectations, there should not be many expectations about
the value of f_fsid, as it is persistent for some filesystems and the
bdev number for others - there is no standard.

Thanks,
Amir.

 fs/xfs/xfs_super.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)