Message ID | 20221017105709.10830-10-jlayton@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs: clean up handling of i_version counter | expand |
On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > From: Jeff Layton <jlayton@redhat.com> > > Claim one of the spare fields in struct statx to hold a 64-bit inode > version attribute. When userland requests STATX_VERSION, copy the > value from the kstat struct there, and stop masking off > STATX_ATTR_VERSION_MONOTONIC. Can we please make the name more sepcific than "version"? It's way too generic and - we already have userspace facing "version" fields for inodes that refer to the on-disk format version exposed in various UAPIs. It's common for UAPI structures used for file operations to have a "version" field that refers to the *UAPI structure version* rather than file metadata or data being retrieved from the file in question. The need for an explanatory comment like this: > + __u64 stx_version; /* Inode change attribute */ demonstrates it is badly named. If you want it known as an inode change attribute, then don't name the variable "version". In reality, it really needs to be an opaque cookie, not something applications need to decode directly to make sense of. > Update the test-statx sample program to output the change attr and > MountId. > > Reviewed-by: NeilBrown <neilb@suse.de> > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > fs/stat.c | 12 +++--------- > include/linux/stat.h | 9 --------- > include/uapi/linux/stat.h | 6 ++++-- > samples/vfs/test-statx.c | 8 ++++++-- > 4 files changed, 13 insertions(+), 22 deletions(-) > > Posting this as an RFC as we're still trying to sort out what semantics > we want to present to userland. In particular, this patch leaves the > problem of crash resilience in to userland applications on filesystems > that don't report as MONOTONIC. Firstly, if userspace wants to use the change attribute, they are going to have to detect crashes themselves anyway because no fs in the kernel can set the MONOTONIC flag right now and it may be years before kernels/filesystems actually support it in production systems. But more fundamentally, I think this monotonic increase guarantee is completely broken by the presence of snapshots and snapshot rollbacks. If you change something, then a while later decide it broke (e.g. a production system upgrade went awry) and you roll back the filesystem to the pre-upgrade snapshot, then all the change counters and m/ctimes are guaranteed to go backwards because they will revert to the snapshot values. Maybe the filesystem can bump some internal counter for the snapshot when the revert happens, but until that is implemented, filesystems that support snapshots and rollback can't assert MONOTONIC. And that's worse for other filesystems, because if you put them on dm-thinp and roll them back, they are completely unaware of the fact that a rollback happened and there's *nothing* the filesystem can do about this. Indeed, snapshots are suppose to be done on clean filesystems so snapshot images don't require journal recovery, so any crash detection put in the filesystem recovery code to guarantee MONOTONIC behaviour will be soundly defeated by such block device snapshot rollbacks. Hence I think MONOTONIC is completely unworkable for most existing filesystems because snapshots and rollbacks completely break the underlying assumption MONOTONIC relies on: that filesystem modifications always move forwards in both the time and modification order dimensions.... This means that monotonicity is probably not acheivable by any existing filesystem and so should not ever be mentioned in the UAPI. I think userspace semantics can be simplified down to "if the change cookie does not match exactly, caches are invalid" combined with "applications are responsible for detecting temporal discontiguities in filesystem presentation at start up (e.g. after a crash, unclean shutdown, restoration from backup, snapshot rollback, etc) for persistent cache invalidation purposes".... > Trond is of the opinion that monotonicity is a hard requirement, and > that we should not allow filesystems that can't provide that quality to > report STATX_VERSION at all. His rationale is that one of the main uses > for this is for backup applications, and for those a counter that could > go backward is worse than useless. From the perspective of a backup program doing incremental backups, an inode with a change counter that has a different value to the current backup inventory means the file contains different information than what the current backup inventory holds. Again, snapshots, rollbacks, etc. Therefore, regardless of whether the change counter has gone forwards or backwards, the backup program needs to back up this current version of the file in this backup because it is different to the inventory copy. Hence if the backup program fails to back it up, it will not be creating an exact backup of the user's data at the point in time the backup is run... Hence I don't see that MONOTONIC is a requirement for backup programs - they really do have to be able to handle filesystems that have modifications that move backwards in time as well as forwards... Cheers, Dave.
On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > From: Jeff Layton <jlayton@redhat.com> > > > > Claim one of the spare fields in struct statx to hold a 64-bit inode > > version attribute. When userland requests STATX_VERSION, copy the > > value from the kstat struct there, and stop masking off > > STATX_ATTR_VERSION_MONOTONIC. > > Can we please make the name more sepcific than "version"? It's way > too generic and - we already have userspace facing "version" fields > for inodes that refer to the on-disk format version exposed in > various UAPIs. It's common for UAPI structures used for file > operations to have a "version" field that refers to the *UAPI > structure version* rather than file metadata or data being retrieved > from the file in question. > > The need for an explanatory comment like this: > > > + __u64 stx_version; /* Inode change attribute */ > > demonstrates it is badly named. If you want it known as an inode > change attribute, then don't name the variable "version". In > reality, it really needs to be an opaque cookie, not something > applications need to decode directly to make sense of. > Fair enough. I started with this being named stx_change_attr and other people objected. I then changed to stx_ino_version, but the "_ino" seemed redundant. I'm open to suggestions here. Naming things like this is hard. > > Update the test-statx sample program to output the change attr and > > MountId. > > > > Reviewed-by: NeilBrown <neilb@suse.de> > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > fs/stat.c | 12 +++--------- > > include/linux/stat.h | 9 --------- > > include/uapi/linux/stat.h | 6 ++++-- > > samples/vfs/test-statx.c | 8 ++++++-- > > 4 files changed, 13 insertions(+), 22 deletions(-) > > > > Posting this as an RFC as we're still trying to sort out what semantics > > we want to present to userland. In particular, this patch leaves the > > problem of crash resilience in to userland applications on filesystems > > that don't report as MONOTONIC. > > Firstly, if userspace wants to use the change attribute, they are > going to have to detect crashes themselves anyway because no fs in > the kernel can set the MONOTONIC flag right now and it may be years > before kernels/filesystems actually support it in production > systems. > We can turn it on today in CephFS, NFS and tmpfs. Maybe also btrfs (modulo the issue you point out with snapshots, of course). > But more fundamentally, I think this monotonic increase guarantee is > completely broken by the presence of snapshots and snapshot > rollbacks. If you change something, then a while later decide it > broke (e.g. a production system upgrade went awry) and you roll back > the filesystem to the pre-upgrade snapshot, then all the change > counters and m/ctimes are guaranteed to go backwards because they > will revert to the snapshot values. Maybe the filesystem can bump > some internal counter for the snapshot when the revert happens, but > until that is implemented, filesystems that support snapshots and > rollback can't assert MONOTONIC. > > And that's worse for other filesystems, because if you put them on > dm-thinp and roll them back, they are completely unaware of the fact > that a rollback happened and there's *nothing* the filesystem can do > about this. Indeed, snapshots are suppose to be done on clean > filesystems so snapshot images don't require journal recovery, so > any crash detection put in the filesystem recovery code to guarantee > MONOTONIC behaviour will be soundly defeated by such block device > snapshot rollbacks. > > Hence I think MONOTONIC is completely unworkable for most existing > filesystems because snapshots and rollbacks completely break the > underlying assumption MONOTONIC relies on: that filesystem > modifications always move forwards in both the time and modification > order dimensions.... > > This means that monotonicity is probably not acheivable by any > existing filesystem and so should not ever be mentioned in the UAPI. > I think userspace semantics can be simplified down to "if the change > cookie does not match exactly, caches are invalid" combined with > "applications are responsible for detecting temporal discontiguities > in filesystem presentation at start up (e.g. after a crash, unclean > shutdown, restoration from backup, snapshot rollback, etc) for > persistent cache invalidation purposes".... > I don't think we can make any sort of blanket statement about monotonicity in the face of snapshots. Restoring a snapshot (or a backup for that matter) means restoring the filesystem to a particular point in time in the past. I think it's reasonable to expect that the change attrs may roll backward in the face of these sorts of events. > > Trond is of the opinion that monotonicity is a hard requirement, and > > that we should not allow filesystems that can't provide that quality to > > report STATX_VERSION at all. His rationale is that one of the main uses > > for this is for backup applications, and for those a counter that could > > go backward is worse than useless. > > From the perspective of a backup program doing incremental backups, > an inode with a change counter that has a different value to the > current backup inventory means the file contains different > information than what the current backup inventory holds. Again, > snapshots, rollbacks, etc. > > Therefore, regardless of whether the change counter has gone > forwards or backwards, the backup program needs to back up this > current version of the file in this backup because it is different > to the inventory copy. Hence if the backup program fails to back it > up, it will not be creating an exact backup of the user's data at > the point in time the backup is run... > > Hence I don't see that MONOTONIC is a requirement for backup > programs - they really do have to be able to handle filesystems that > have modifications that move backwards in time as well as forwards... Rolling backward is not a problem in and of itself. The big issue is that after a crash, we can end up with a change attr seen before the crash that is now associated with a completely different inode state. The scenario is something like: - Change attr for an empty file starts at 1 - Write "A" to file, change attr goes to 2 - Read and statx happens (client sees "A" with change attr 2) - Crash (before last change is logged to disk) - Machine reboots, inode is empty, change attr back to 1 - Write "B" to file, change attr goes to 2 - Client stat's file, sees change attr 2 and assumes its cache is correct when it isn't (should be "B" not "A" now). The real danger comes not from the thing going backward, but the fact that it can march forward again after going backward, and then the client can see two different inode states associated with the same change attr value. Jumping all the change attributes forward by a significant amount after a crash should avoid this issue.
On Tue 18-10-22 06:35:14, Jeff Layton wrote: > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > that we should not allow filesystems that can't provide that quality to > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > for this is for backup applications, and for those a counter that could > > > go backward is worse than useless. > > > > From the perspective of a backup program doing incremental backups, > > an inode with a change counter that has a different value to the > > current backup inventory means the file contains different > > information than what the current backup inventory holds. Again, > > snapshots, rollbacks, etc. > > > > Therefore, regardless of whether the change counter has gone > > forwards or backwards, the backup program needs to back up this > > current version of the file in this backup because it is different > > to the inventory copy. Hence if the backup program fails to back it > > up, it will not be creating an exact backup of the user's data at > > the point in time the backup is run... > > > > Hence I don't see that MONOTONIC is a requirement for backup > > programs - they really do have to be able to handle filesystems that > > have modifications that move backwards in time as well as forwards... > > Rolling backward is not a problem in and of itself. The big issue is > that after a crash, we can end up with a change attr seen before the > crash that is now associated with a completely different inode state. > > The scenario is something like: > > - Change attr for an empty file starts at 1 > > - Write "A" to file, change attr goes to 2 > > - Read and statx happens (client sees "A" with change attr 2) > > - Crash (before last change is logged to disk) > > - Machine reboots, inode is empty, change attr back to 1 > > - Write "B" to file, change attr goes to 2 > > - Client stat's file, sees change attr 2 and assumes its cache is > correct when it isn't (should be "B" not "A" now). > > The real danger comes not from the thing going backward, but the fact > that it can march forward again after going backward, and then the > client can see two different inode states associated with the same > change attr value. Jumping all the change attributes forward by a > significant amount after a crash should avoid this issue. As Dave pointed out, the problem with change attr having the same value for a different inode state (after going backwards) holds not only for the crashes but also for restore from backups, fs snapshots, device snapshots etc. So relying on change attr only looks a bit fragile. It works for the common case but the edge cases are awkward and there's no easy way to detect you are in the edge case. So I think any implementation caring about data integrity would have to include something like ctime into the picture anyway. Or we could just completely give up any idea of monotonicity and on each mount select random prime P < 2^64 and instead of doing inc when advancing the change attribute, we'd advance it by P. That makes collisions after restore / crash fairly unlikely. Honza
On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > that we should not allow filesystems that can't provide that quality to > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > for this is for backup applications, and for those a counter that could > > > > go backward is worse than useless. > > > > > > From the perspective of a backup program doing incremental backups, > > > an inode with a change counter that has a different value to the > > > current backup inventory means the file contains different > > > information than what the current backup inventory holds. Again, > > > snapshots, rollbacks, etc. > > > > > > Therefore, regardless of whether the change counter has gone > > > forwards or backwards, the backup program needs to back up this > > > current version of the file in this backup because it is different > > > to the inventory copy. Hence if the backup program fails to back it > > > up, it will not be creating an exact backup of the user's data at > > > the point in time the backup is run... > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > programs - they really do have to be able to handle filesystems that > > > have modifications that move backwards in time as well as forwards... > > > > Rolling backward is not a problem in and of itself. The big issue is > > that after a crash, we can end up with a change attr seen before the > > crash that is now associated with a completely different inode state. > > > > The scenario is something like: > > > > - Change attr for an empty file starts at 1 > > > > - Write "A" to file, change attr goes to 2 > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > - Crash (before last change is logged to disk) > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > - Write "B" to file, change attr goes to 2 > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > correct when it isn't (should be "B" not "A" now). > > > > The real danger comes not from the thing going backward, but the fact > > that it can march forward again after going backward, and then the > > client can see two different inode states associated with the same > > change attr value. Jumping all the change attributes forward by a > > significant amount after a crash should avoid this issue. > > As Dave pointed out, the problem with change attr having the same value for > a different inode state (after going backwards) holds not only for the > crashes but also for restore from backups, fs snapshots, device snapshots > etc. So relying on change attr only looks a bit fragile. It works for the > common case but the edge cases are awkward and there's no easy way to > detect you are in the edge case. > This is true. In fact in the snapshot case you can't even rely on doing anything at reboot since you won't necessarily need to reboot to make it roll backward. Whether that obviates the use of this value altogether, I'm not sure. > So I think any implementation caring about data integrity would have to > include something like ctime into the picture anyway. Or we could just > completely give up any idea of monotonicity and on each mount select random > prime P < 2^64 and instead of doing inc when advancing the change > attribute, we'd advance it by P. That makes collisions after restore / > crash fairly unlikely. Part of the goal (at least for NFS) is to avoid unnecessary cache invalidations. If we just increment it by a particular offset on every reboot, then every time the server reboots, the clients will invalidate all of their cached inodes, and proceed to hammer the server with READ calls just as it's having to populate its own caches from disk. IOW, that will not be good for performance. Doing that after a crash is also less than ideal, but crashes should (hopefully) be rare enough that it's not a major issue. In any case, we need to decide whether and what to present to userland. There is a lot of disagreement here, and we need to come to a consensus. I think we have to answer 2 questions: 1/ Is this counter useful enough on its own, without any baked-in rollback resilience to justify exposing it via statx()? 2/ if the answer above is "yes", then is there any value to the MONOTONIC flag, given that we can't really do anything about snapshot rollbacks and the like? I tend to be in the camp of "let's make the raw counter available and leave it up to userland to deal with the potential issues". After all, the c/mtime are still widely used today to detect changes and they have many of the same problems. Trying to do anything more elaborate than that will lead to a lot of extra complexity in the kernel, and make it a lot more difficult for any filesystem to report this at all.
On Tue, 2022-10-18 at 06:35 -0400, Jeff Layton wrote: > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > From: Jeff Layton <jlayton@redhat.com> > > > > > > Claim one of the spare fields in struct statx to hold a 64-bit inode > > > version attribute. When userland requests STATX_VERSION, copy the > > > value from the kstat struct there, and stop masking off > > > STATX_ATTR_VERSION_MONOTONIC. > > > > Can we please make the name more sepcific than "version"? It's way > > too generic and - we already have userspace facing "version" fields > > for inodes that refer to the on-disk format version exposed in > > various UAPIs. It's common for UAPI structures used for file > > operations to have a "version" field that refers to the *UAPI > > structure version* rather than file metadata or data being retrieved > > from the file in question. > > > > The need for an explanatory comment like this: > > > > > + __u64 stx_version; /* Inode change attribute */ > > > > demonstrates it is badly named. If you want it known as an inode > > change attribute, then don't name the variable "version". In > > reality, it really needs to be an opaque cookie, not something > > applications need to decode directly to make sense of. > > > > Fair enough. I started with this being named stx_change_attr and other > people objected. I then changed to stx_ino_version, but the "_ino" > seemed redundant. > > I'm open to suggestions here. Naming things like this is hard. > How about: STATX_CHANGE / statx->stx_change / STATX_ATTR_CHANGE_MONOTONIC ?
On Tue 18-10-22 10:21:08, Jeff Layton wrote: > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > that we should not allow filesystems that can't provide that quality to > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > for this is for backup applications, and for those a counter that could > > > > > go backward is worse than useless. > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > an inode with a change counter that has a different value to the > > > > current backup inventory means the file contains different > > > > information than what the current backup inventory holds. Again, > > > > snapshots, rollbacks, etc. > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > forwards or backwards, the backup program needs to back up this > > > > current version of the file in this backup because it is different > > > > to the inventory copy. Hence if the backup program fails to back it > > > > up, it will not be creating an exact backup of the user's data at > > > > the point in time the backup is run... > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > programs - they really do have to be able to handle filesystems that > > > > have modifications that move backwards in time as well as forwards... > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > that after a crash, we can end up with a change attr seen before the > > > crash that is now associated with a completely different inode state. > > > > > > The scenario is something like: > > > > > > - Change attr for an empty file starts at 1 > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > - Crash (before last change is logged to disk) > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > correct when it isn't (should be "B" not "A" now). > > > > > > The real danger comes not from the thing going backward, but the fact > > > that it can march forward again after going backward, and then the > > > client can see two different inode states associated with the same > > > change attr value. Jumping all the change attributes forward by a > > > significant amount after a crash should avoid this issue. > > > > As Dave pointed out, the problem with change attr having the same value for > > a different inode state (after going backwards) holds not only for the > > crashes but also for restore from backups, fs snapshots, device snapshots > > etc. So relying on change attr only looks a bit fragile. It works for the > > common case but the edge cases are awkward and there's no easy way to > > detect you are in the edge case. > > > > This is true. In fact in the snapshot case you can't even rely on doing > anything at reboot since you won't necessarily need to reboot to make it > roll backward. > > Whether that obviates the use of this value altogether, I'm not sure. > > > So I think any implementation caring about data integrity would have to > > include something like ctime into the picture anyway. Or we could just > > completely give up any idea of monotonicity and on each mount select random > > prime P < 2^64 and instead of doing inc when advancing the change > > attribute, we'd advance it by P. That makes collisions after restore / > > crash fairly unlikely. > > Part of the goal (at least for NFS) is to avoid unnecessary cache > invalidations. > > If we just increment it by a particular offset on every reboot, then > every time the server reboots, the clients will invalidate all of their > cached inodes, and proceed to hammer the server with READ calls just as > it's having to populate its own caches from disk. Note that I didn't propose to increment by offset on every reboot or mount. I have proposed that inode_maybe_inc_iversion() would not increment i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected on filesystem mount. This will not cause cache invalidation after a clean unmount + remount. It will cause cache invalidation after a crash, snapshot rollback etc., only for inodes where i_version changed. If P is suitably selected (e.g. as being a prime), then the chances of collisions (even after a snapshot rollback) are very low (on the order of 2^(-50) if my piece of envelope calculations are right). So this should nicely deal with all the problems we've spotted so far. But I may be missing something... Honza
On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote: > On Tue 18-10-22 10:21:08, Jeff Layton wrote: > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > > that we should not allow filesystems that can't provide that quality to > > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > > for this is for backup applications, and for those a counter that could > > > > > > go backward is worse than useless. > > > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > > an inode with a change counter that has a different value to the > > > > > current backup inventory means the file contains different > > > > > information than what the current backup inventory holds. Again, > > > > > snapshots, rollbacks, etc. > > > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > > forwards or backwards, the backup program needs to back up this > > > > > current version of the file in this backup because it is different > > > > > to the inventory copy. Hence if the backup program fails to back it > > > > > up, it will not be creating an exact backup of the user's data at > > > > > the point in time the backup is run... > > > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > > programs - they really do have to be able to handle filesystems that > > > > > have modifications that move backwards in time as well as forwards... > > > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > > that after a crash, we can end up with a change attr seen before the > > > > crash that is now associated with a completely different inode state. > > > > > > > > The scenario is something like: > > > > > > > > - Change attr for an empty file starts at 1 > > > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > > > - Crash (before last change is logged to disk) > > > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > > correct when it isn't (should be "B" not "A" now). > > > > > > > > The real danger comes not from the thing going backward, but the fact > > > > that it can march forward again after going backward, and then the > > > > client can see two different inode states associated with the same > > > > change attr value. Jumping all the change attributes forward by a > > > > significant amount after a crash should avoid this issue. > > > > > > As Dave pointed out, the problem with change attr having the same value for > > > a different inode state (after going backwards) holds not only for the > > > crashes but also for restore from backups, fs snapshots, device snapshots > > > etc. So relying on change attr only looks a bit fragile. It works for the > > > common case but the edge cases are awkward and there's no easy way to > > > detect you are in the edge case. > > > > > > > This is true. In fact in the snapshot case you can't even rely on doing > > anything at reboot since you won't necessarily need to reboot to make it > > roll backward. > > > > Whether that obviates the use of this value altogether, I'm not sure. > > > > > So I think any implementation caring about data integrity would have to > > > include something like ctime into the picture anyway. Or we could just > > > completely give up any idea of monotonicity and on each mount select random > > > prime P < 2^64 and instead of doing inc when advancing the change > > > attribute, we'd advance it by P. That makes collisions after restore / > > > crash fairly unlikely. > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache > > invalidations. > > > > If we just increment it by a particular offset on every reboot, then > > every time the server reboots, the clients will invalidate all of their > > cached inodes, and proceed to hammer the server with READ calls just as > > it's having to populate its own caches from disk. > > Note that I didn't propose to increment by offset on every reboot or mount. > I have proposed that inode_maybe_inc_iversion() would not increment > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected > on filesystem mount. > > This will not cause cache invalidation after a clean unmount + remount. It > will cause cache invalidation after a crash, snapshot rollback etc., only for > inodes where i_version changed. If P is suitably selected (e.g. as being a > prime), then the chances of collisions (even after a snapshot rollback) are > very low (on the order of 2^(-50) if my piece of envelope calculations are > right). > > So this should nicely deal with all the problems we've spotted so far. But > I may be missing something... Got it! That makes a lot more sense. Thinking about this some more... What sort of range for P would be suitable? Every increment would need to be by (shifted) P, so we can't choose too large a number. Queries are pretty rare vs. writes though, so that mitigates the issue somewhat. There are 31 primes between 1 and 127. Worst case, we'd still have 2^48 increments before the counter wraps. Let me think about this some more, but maybe that's good enough to ensure uniqueness.
On Tue 18-10-22 13:04:34, Jeff Layton wrote: > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote: > > On Tue 18-10-22 10:21:08, Jeff Layton wrote: > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > > > that we should not allow filesystems that can't provide that quality to > > > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > > > for this is for backup applications, and for those a counter that could > > > > > > > go backward is worse than useless. > > > > > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > > > an inode with a change counter that has a different value to the > > > > > > current backup inventory means the file contains different > > > > > > information than what the current backup inventory holds. Again, > > > > > > snapshots, rollbacks, etc. > > > > > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > > > forwards or backwards, the backup program needs to back up this > > > > > > current version of the file in this backup because it is different > > > > > > to the inventory copy. Hence if the backup program fails to back it > > > > > > up, it will not be creating an exact backup of the user's data at > > > > > > the point in time the backup is run... > > > > > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > > > programs - they really do have to be able to handle filesystems that > > > > > > have modifications that move backwards in time as well as forwards... > > > > > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > > > that after a crash, we can end up with a change attr seen before the > > > > > crash that is now associated with a completely different inode state. > > > > > > > > > > The scenario is something like: > > > > > > > > > > - Change attr for an empty file starts at 1 > > > > > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > > > > > - Crash (before last change is logged to disk) > > > > > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > > > correct when it isn't (should be "B" not "A" now). > > > > > > > > > > The real danger comes not from the thing going backward, but the fact > > > > > that it can march forward again after going backward, and then the > > > > > client can see two different inode states associated with the same > > > > > change attr value. Jumping all the change attributes forward by a > > > > > significant amount after a crash should avoid this issue. > > > > > > > > As Dave pointed out, the problem with change attr having the same value for > > > > a different inode state (after going backwards) holds not only for the > > > > crashes but also for restore from backups, fs snapshots, device snapshots > > > > etc. So relying on change attr only looks a bit fragile. It works for the > > > > common case but the edge cases are awkward and there's no easy way to > > > > detect you are in the edge case. > > > > > > > > > > This is true. In fact in the snapshot case you can't even rely on doing > > > anything at reboot since you won't necessarily need to reboot to make it > > > roll backward. > > > > > > Whether that obviates the use of this value altogether, I'm not sure. > > > > > > > So I think any implementation caring about data integrity would have to > > > > include something like ctime into the picture anyway. Or we could just > > > > completely give up any idea of monotonicity and on each mount select random > > > > prime P < 2^64 and instead of doing inc when advancing the change > > > > attribute, we'd advance it by P. That makes collisions after restore / > > > > crash fairly unlikely. > > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache > > > invalidations. > > > > > > If we just increment it by a particular offset on every reboot, then > > > every time the server reboots, the clients will invalidate all of their > > > cached inodes, and proceed to hammer the server with READ calls just as > > > it's having to populate its own caches from disk. > > > > Note that I didn't propose to increment by offset on every reboot or mount. > > I have proposed that inode_maybe_inc_iversion() would not increment > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected > > on filesystem mount. > > > > This will not cause cache invalidation after a clean unmount + remount. It > > will cause cache invalidation after a crash, snapshot rollback etc., only for > > inodes where i_version changed. If P is suitably selected (e.g. as being a > > prime), then the chances of collisions (even after a snapshot rollback) are > > very low (on the order of 2^(-50) if my piece of envelope calculations are > > right). > > > > So this should nicely deal with all the problems we've spotted so far. But > > I may be missing something... > > > Got it! That makes a lot more sense. Thinking about this some more... > > What sort of range for P would be suitable? > > Every increment would need to be by (shifted) P, so we can't choose too > large a number. Queries are pretty rare vs. writes though, so that > mitigates the issue somewhat. Well, I agree that for large P the counter would wrap earlier. But is that a problem? Note that if P is a prime (indivisible by 2 is enough), then the counter would get to already used value still only after 2^63 steps. Thus if we give up monotonicity and just treat the counter as an opaque cookie, we do not have to care about wrapping. Sure given different P is selected for each mount the wrapping argument does not hold 100% but here comes the advantage of primes - if you have two different primes P and Q, then a collision means that k*P mod 2^63 = l*Q mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the chances of early collision even after selecting a different prime on each mount are *very* low. So I think we should select from a relatively large set of primes so that the chance of randomly selecting the same prime (and thus reissuing the same change attr for different inode state sometime later) are small. Honza
On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote: > On Tue 18-10-22 13:04:34, Jeff Layton wrote: > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote: > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote: > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > > > > that we should not allow filesystems that can't provide that quality to > > > > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > > > > for this is for backup applications, and for those a counter that could > > > > > > > > go backward is worse than useless. > > > > > > > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > > > > an inode with a change counter that has a different value to the > > > > > > > current backup inventory means the file contains different > > > > > > > information than what the current backup inventory holds. Again, > > > > > > > snapshots, rollbacks, etc. > > > > > > > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > > > > forwards or backwards, the backup program needs to back up this > > > > > > > current version of the file in this backup because it is different > > > > > > > to the inventory copy. Hence if the backup program fails to back it > > > > > > > up, it will not be creating an exact backup of the user's data at > > > > > > > the point in time the backup is run... > > > > > > > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > > > > programs - they really do have to be able to handle filesystems that > > > > > > > have modifications that move backwards in time as well as forwards... > > > > > > > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > > > > that after a crash, we can end up with a change attr seen before the > > > > > > crash that is now associated with a completely different inode state. > > > > > > > > > > > > The scenario is something like: > > > > > > > > > > > > - Change attr for an empty file starts at 1 > > > > > > > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > > > > > > > - Crash (before last change is logged to disk) > > > > > > > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > > > > correct when it isn't (should be "B" not "A" now). > > > > > > > > > > > > The real danger comes not from the thing going backward, but the fact > > > > > > that it can march forward again after going backward, and then the > > > > > > client can see two different inode states associated with the same > > > > > > change attr value. Jumping all the change attributes forward by a > > > > > > significant amount after a crash should avoid this issue. > > > > > > > > > > As Dave pointed out, the problem with change attr having the same value for > > > > > a different inode state (after going backwards) holds not only for the > > > > > crashes but also for restore from backups, fs snapshots, device snapshots > > > > > etc. So relying on change attr only looks a bit fragile. It works for the > > > > > common case but the edge cases are awkward and there's no easy way to > > > > > detect you are in the edge case. > > > > > > > > > > > > > This is true. In fact in the snapshot case you can't even rely on doing > > > > anything at reboot since you won't necessarily need to reboot to make it > > > > roll backward. > > > > > > > > Whether that obviates the use of this value altogether, I'm not sure. > > > > > > > > > So I think any implementation caring about data integrity would have to > > > > > include something like ctime into the picture anyway. Or we could just > > > > > completely give up any idea of monotonicity and on each mount select random > > > > > prime P < 2^64 and instead of doing inc when advancing the change > > > > > attribute, we'd advance it by P. That makes collisions after restore / > > > > > crash fairly unlikely. > > > > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache > > > > invalidations. > > > > > > > > If we just increment it by a particular offset on every reboot, then > > > > every time the server reboots, the clients will invalidate all of their > > > > cached inodes, and proceed to hammer the server with READ calls just as > > > > it's having to populate its own caches from disk. > > > > > > Note that I didn't propose to increment by offset on every reboot or mount. > > > I have proposed that inode_maybe_inc_iversion() would not increment > > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P > > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected > > > on filesystem mount. > > > > > > This will not cause cache invalidation after a clean unmount + remount. It > > > will cause cache invalidation after a crash, snapshot rollback etc., only for > > > inodes where i_version changed. If P is suitably selected (e.g. as being a > > > prime), then the chances of collisions (even after a snapshot rollback) are > > > very low (on the order of 2^(-50) if my piece of envelope calculations are > > > right). > > > > > > So this should nicely deal with all the problems we've spotted so far. But > > > I may be missing something... > > > > > > Got it! That makes a lot more sense. Thinking about this some more... > > > > What sort of range for P would be suitable? > > > > Every increment would need to be by (shifted) P, so we can't choose too > > large a number. Queries are pretty rare vs. writes though, so that > > mitigates the issue somewhat. > > Well, I agree that for large P the counter would wrap earlier. But is that > a problem? Note that if P is a prime (indivisible by 2 is enough), then the > counter would get to already used value still only after 2^63 steps. Thus if > we give up monotonicity and just treat the counter as an opaque cookie, we > do not have to care about wrapping. > > Sure given different P is selected for each mount the wrapping argument > does not hold 100% but here comes the advantage of primes - if you have two > different primes P and Q, then a collision means that k*P mod 2^63 = l*Q > mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the > chances of early collision even after selecting a different prime on each > mount are *very* low. > I think we'll have to start avoiding 1 as a value for P if we do this, but the rest makes sense. I like this idea, Jan! > So I think we should select from a relatively large set of primes so that > the chance of randomly selecting the same prime (and thus reissuing the > same change attr for different inode state sometime later) are small. > Monotonicity allows you to discard "old" attr updates. For instance, sometimes a NFS GETATTR response may be delayed for various reasons. If the client sees a change attr that is provably older than one it has already seen, it can discard the update. So, there is value in servers advertising that property, and NFSv4.2 has a way to do that. The Linux NFS client (at least) uses the same trick we do with jiffies to handle wrapping for MONOTONIC values. We should be able to advertise MONOTONIC as long as the client isn't comparing values that are more than ~2^62 apart. Once we start talking about applications storing these values for incremental backups, then the time between checks could be very long. So, I think we don't want _too_ large a value for P. The big question is how many individual change attr increments do we need to account for? We have 64 bits total (it's an atomic64_t). We consume the lowest bit for the QUERIED flag. That leaves us 63 bits of counter (currently). When we increment by a larger value, we're effectively decreasing the size of the counter. Let's assume a worst case of one increment per microsecond, interleaved by queries (so that they have to be real increments). 2^48 microseconds is close to 9 years. That leaves 15 bits for the P, which is primes from 3..32749. Is that a large enough pool of prime numbers? It looks like the kernel already has some infrastructure for handling primes in lib/math/prime_numbers.c. We could just select a global P value to use on every reboot, or just have filesystems set their own (maybe in a new field in the superblock?)
On Wed 19-10-22 14:47:48, Jeff Layton wrote: > On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote: > > On Tue 18-10-22 13:04:34, Jeff Layton wrote: > > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote: > > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote: > > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > > > > > that we should not allow filesystems that can't provide that quality to > > > > > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > > > > > for this is for backup applications, and for those a counter that could > > > > > > > > > go backward is worse than useless. > > > > > > > > > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > > > > > an inode with a change counter that has a different value to the > > > > > > > > current backup inventory means the file contains different > > > > > > > > information than what the current backup inventory holds. Again, > > > > > > > > snapshots, rollbacks, etc. > > > > > > > > > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > > > > > forwards or backwards, the backup program needs to back up this > > > > > > > > current version of the file in this backup because it is different > > > > > > > > to the inventory copy. Hence if the backup program fails to back it > > > > > > > > up, it will not be creating an exact backup of the user's data at > > > > > > > > the point in time the backup is run... > > > > > > > > > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > > > > > programs - they really do have to be able to handle filesystems that > > > > > > > > have modifications that move backwards in time as well as forwards... > > > > > > > > > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > > > > > that after a crash, we can end up with a change attr seen before the > > > > > > > crash that is now associated with a completely different inode state. > > > > > > > > > > > > > > The scenario is something like: > > > > > > > > > > > > > > - Change attr for an empty file starts at 1 > > > > > > > > > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > > > > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > > > > > > > > > - Crash (before last change is logged to disk) > > > > > > > > > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > > > > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > > > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > > > > > correct when it isn't (should be "B" not "A" now). > > > > > > > > > > > > > > The real danger comes not from the thing going backward, but the fact > > > > > > > that it can march forward again after going backward, and then the > > > > > > > client can see two different inode states associated with the same > > > > > > > change attr value. Jumping all the change attributes forward by a > > > > > > > significant amount after a crash should avoid this issue. > > > > > > > > > > > > As Dave pointed out, the problem with change attr having the same value for > > > > > > a different inode state (after going backwards) holds not only for the > > > > > > crashes but also for restore from backups, fs snapshots, device snapshots > > > > > > etc. So relying on change attr only looks a bit fragile. It works for the > > > > > > common case but the edge cases are awkward and there's no easy way to > > > > > > detect you are in the edge case. > > > > > > > > > > > > > > > > This is true. In fact in the snapshot case you can't even rely on doing > > > > > anything at reboot since you won't necessarily need to reboot to make it > > > > > roll backward. > > > > > > > > > > Whether that obviates the use of this value altogether, I'm not sure. > > > > > > > > > > > So I think any implementation caring about data integrity would have to > > > > > > include something like ctime into the picture anyway. Or we could just > > > > > > completely give up any idea of monotonicity and on each mount select random > > > > > > prime P < 2^64 and instead of doing inc when advancing the change > > > > > > attribute, we'd advance it by P. That makes collisions after restore / > > > > > > crash fairly unlikely. > > > > > > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache > > > > > invalidations. > > > > > > > > > > If we just increment it by a particular offset on every reboot, then > > > > > every time the server reboots, the clients will invalidate all of their > > > > > cached inodes, and proceed to hammer the server with READ calls just as > > > > > it's having to populate its own caches from disk. > > > > > > > > Note that I didn't propose to increment by offset on every reboot or mount. > > > > I have proposed that inode_maybe_inc_iversion() would not increment > > > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P > > > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected > > > > on filesystem mount. > > > > > > > > This will not cause cache invalidation after a clean unmount + remount. It > > > > will cause cache invalidation after a crash, snapshot rollback etc., only for > > > > inodes where i_version changed. If P is suitably selected (e.g. as being a > > > > prime), then the chances of collisions (even after a snapshot rollback) are > > > > very low (on the order of 2^(-50) if my piece of envelope calculations are > > > > right). > > > > > > > > So this should nicely deal with all the problems we've spotted so far. But > > > > I may be missing something... > > > > > > > > > Got it! That makes a lot more sense. Thinking about this some more... > > > > > > What sort of range for P would be suitable? > > > > > > Every increment would need to be by (shifted) P, so we can't choose too > > > large a number. Queries are pretty rare vs. writes though, so that > > > mitigates the issue somewhat. > > > > Well, I agree that for large P the counter would wrap earlier. But is that > > a problem? Note that if P is a prime (indivisible by 2 is enough), then the > > counter would get to already used value still only after 2^63 steps. Thus if > > we give up monotonicity and just treat the counter as an opaque cookie, we > > do not have to care about wrapping. > > > > Sure given different P is selected for each mount the wrapping argument > > does not hold 100% but here comes the advantage of primes - if you have two > > different primes P and Q, then a collision means that k*P mod 2^63 = l*Q > > mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the > > chances of early collision even after selecting a different prime on each > > mount are *very* low. > > I think we'll have to start avoiding 1 as a value for P if we do this, > but the rest makes sense. I like this idea, Jan! Yes, 1 is kind of special so we should better avoid it in this scheme. Especially if we're going to select only smaller primes. > > So I think we should select from a relatively large set of primes so that > > the chance of randomly selecting the same prime (and thus reissuing the > > same change attr for different inode state sometime later) are small. > > > > Monotonicity allows you to discard "old" attr updates. For instance, > sometimes a NFS GETATTR response may be delayed for various reasons. If > the client sees a change attr that is provably older than one it has > already seen, it can discard the update. So, there is value in servers > advertising that property, and NFSv4.2 has a way to do that. > > The Linux NFS client (at least) uses the same trick we do with jiffies > to handle wrapping for MONOTONIC values. We should be able to advertise > MONOTONIC as long as the client isn't comparing values that are more > than ~2^62 apart. > > Once we start talking about applications storing these values for > incremental backups, then the time between checks could be very long. > > So, I think we don't want _too_ large a value for P. The big question is > how many individual change attr increments do we need to account for? > > We have 64 bits total (it's an atomic64_t). We consume the lowest bit > for the QUERIED flag. That leaves us 63 bits of counter (currently). > When we increment by a larger value, we're effectively decreasing the > size of the counter. Yes, the larger value of P we take the sooner it will wrap which defeats comparisons attempting to establish any ordering of change cookie values. > Let's assume a worst case of one increment per microsecond, interleaved > by queries (so that they have to be real increments). 2^48 microseconds > is close to 9 years. > > That leaves 15 bits for the P, which is primes from 3..32749. Is that a > large enough pool of prime numbers? Well, there are ~3000 primes in this range so that gives you a 1/3000 chance that after a crash, backup restore, snapshot rollback etc. you will pick the same prime which results in collisions of change cookies and thus possibility of data corruption. Is that low enough chance? The events I mention above should be relatively rare but given the number of machines running this code I would think the collision is bound to happen and the consequences could be ... unpleasant. That's why I would prefer to pick primes at least say upto 1m (there are ~78k of those). But that makes wrapping more frequent (~100 days with 1us update period). Probably still usable for NFS but not really for backup purposes. So I'm not sure we should be advertising the values have any ordering. If the last used value would be persisted (e.g. in the filesystem's superblock), we could easily make sure the next selected P is different so in that case we could get away with a smaller set of primes but it would require filesystem on-disk format changes which has its own drawbacks. But that would be at least some path forward for providing change cookies that can be ordered on larger timescales. > It looks like the kernel already has some infrastructure for handling > primes in lib/math/prime_numbers.c. We could just select a global P > value to use on every reboot, or just have filesystems set their own > (maybe in a new field in the superblock?) IMO P needs to be selected on each mount to reliably solve the "restore from backup" and "snapshot rollback" scenarios. I agree it can be a new field in the VFS part of the superblock so that it is accessible by the iversion handling code. Honza
On Thu, 2022-10-20 at 12:39 +0200, Jan Kara wrote: > On Wed 19-10-22 14:47:48, Jeff Layton wrote: > > On Wed, 2022-10-19 at 19:23 +0200, Jan Kara wrote: > > > On Tue 18-10-22 13:04:34, Jeff Layton wrote: > > > > On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote: > > > > > On Tue 18-10-22 10:21:08, Jeff Layton wrote: > > > > > > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote: > > > > > > > On Tue 18-10-22 06:35:14, Jeff Layton wrote: > > > > > > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote: > > > > > > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote: > > > > > > > > > > Trond is of the opinion that monotonicity is a hard requirement, and > > > > > > > > > > that we should not allow filesystems that can't provide that quality to > > > > > > > > > > report STATX_VERSION at all. His rationale is that one of the main uses > > > > > > > > > > for this is for backup applications, and for those a counter that could > > > > > > > > > > go backward is worse than useless. > > > > > > > > > > > > > > > > > > From the perspective of a backup program doing incremental backups, > > > > > > > > > an inode with a change counter that has a different value to the > > > > > > > > > current backup inventory means the file contains different > > > > > > > > > information than what the current backup inventory holds. Again, > > > > > > > > > snapshots, rollbacks, etc. > > > > > > > > > > > > > > > > > > Therefore, regardless of whether the change counter has gone > > > > > > > > > forwards or backwards, the backup program needs to back up this > > > > > > > > > current version of the file in this backup because it is different > > > > > > > > > to the inventory copy. Hence if the backup program fails to back it > > > > > > > > > up, it will not be creating an exact backup of the user's data at > > > > > > > > > the point in time the backup is run... > > > > > > > > > > > > > > > > > > Hence I don't see that MONOTONIC is a requirement for backup > > > > > > > > > programs - they really do have to be able to handle filesystems that > > > > > > > > > have modifications that move backwards in time as well as forwards... > > > > > > > > > > > > > > > > Rolling backward is not a problem in and of itself. The big issue is > > > > > > > > that after a crash, we can end up with a change attr seen before the > > > > > > > > crash that is now associated with a completely different inode state. > > > > > > > > > > > > > > > > The scenario is something like: > > > > > > > > > > > > > > > > - Change attr for an empty file starts at 1 > > > > > > > > > > > > > > > > - Write "A" to file, change attr goes to 2 > > > > > > > > > > > > > > > > - Read and statx happens (client sees "A" with change attr 2) > > > > > > > > > > > > > > > > - Crash (before last change is logged to disk) > > > > > > > > > > > > > > > > - Machine reboots, inode is empty, change attr back to 1 > > > > > > > > > > > > > > > > - Write "B" to file, change attr goes to 2 > > > > > > > > > > > > > > > > - Client stat's file, sees change attr 2 and assumes its cache is > > > > > > > > correct when it isn't (should be "B" not "A" now). > > > > > > > > > > > > > > > > The real danger comes not from the thing going backward, but the fact > > > > > > > > that it can march forward again after going backward, and then the > > > > > > > > client can see two different inode states associated with the same > > > > > > > > change attr value. Jumping all the change attributes forward by a > > > > > > > > significant amount after a crash should avoid this issue. > > > > > > > > > > > > > > As Dave pointed out, the problem with change attr having the same value for > > > > > > > a different inode state (after going backwards) holds not only for the > > > > > > > crashes but also for restore from backups, fs snapshots, device snapshots > > > > > > > etc. So relying on change attr only looks a bit fragile. It works for the > > > > > > > common case but the edge cases are awkward and there's no easy way to > > > > > > > detect you are in the edge case. > > > > > > > > > > > > > > > > > > > This is true. In fact in the snapshot case you can't even rely on doing > > > > > > anything at reboot since you won't necessarily need to reboot to make it > > > > > > roll backward. > > > > > > > > > > > > Whether that obviates the use of this value altogether, I'm not sure. > > > > > > > > > > > > > So I think any implementation caring about data integrity would have to > > > > > > > include something like ctime into the picture anyway. Or we could just > > > > > > > completely give up any idea of monotonicity and on each mount select random > > > > > > > prime P < 2^64 and instead of doing inc when advancing the change > > > > > > > attribute, we'd advance it by P. That makes collisions after restore / > > > > > > > crash fairly unlikely. > > > > > > > > > > > > Part of the goal (at least for NFS) is to avoid unnecessary cache > > > > > > invalidations. > > > > > > > > > > > > If we just increment it by a particular offset on every reboot, then > > > > > > every time the server reboots, the clients will invalidate all of their > > > > > > cached inodes, and proceed to hammer the server with READ calls just as > > > > > > it's having to populate its own caches from disk. > > > > > > > > > > Note that I didn't propose to increment by offset on every reboot or mount. > > > > > I have proposed that inode_maybe_inc_iversion() would not increment > > > > > i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P > > > > > << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected > > > > > on filesystem mount. > > > > > > > > > > This will not cause cache invalidation after a clean unmount + remount. It > > > > > will cause cache invalidation after a crash, snapshot rollback etc., only for > > > > > inodes where i_version changed. If P is suitably selected (e.g. as being a > > > > > prime), then the chances of collisions (even after a snapshot rollback) are > > > > > very low (on the order of 2^(-50) if my piece of envelope calculations are > > > > > right). > > > > > > > > > > So this should nicely deal with all the problems we've spotted so far. But > > > > > I may be missing something... > > > > > > > > > > > > Got it! That makes a lot more sense. Thinking about this some more... > > > > > > > > What sort of range for P would be suitable? > > > > > > > > Every increment would need to be by (shifted) P, so we can't choose too > > > > large a number. Queries are pretty rare vs. writes though, so that > > > > mitigates the issue somewhat. > > > > > > Well, I agree that for large P the counter would wrap earlier. But is that > > > a problem? Note that if P is a prime (indivisible by 2 is enough), then the > > > counter would get to already used value still only after 2^63 steps. Thus if > > > we give up monotonicity and just treat the counter as an opaque cookie, we > > > do not have to care about wrapping. > > > > > > Sure given different P is selected for each mount the wrapping argument > > > does not hold 100% but here comes the advantage of primes - if you have two > > > different primes P and Q, then a collision means that k*P mod 2^63 = l*Q > > > mod 2^63 and that holds for exactly one pair k,l from 1..2^63 range. So the > > > chances of early collision even after selecting a different prime on each > > > mount are *very* low. > > > > I think we'll have to start avoiding 1 as a value for P if we do this, > > but the rest makes sense. I like this idea, Jan! > > Yes, 1 is kind of special so we should better avoid it in this scheme. > Especially if we're going to select only smaller primes. > > > > So I think we should select from a relatively large set of primes so that > > > the chance of randomly selecting the same prime (and thus reissuing the > > > same change attr for different inode state sometime later) are small. > > > > > > > Monotonicity allows you to discard "old" attr updates. For instance, > > sometimes a NFS GETATTR response may be delayed for various reasons. If > > the client sees a change attr that is provably older than one it has > > already seen, it can discard the update. So, there is value in servers > > advertising that property, and NFSv4.2 has a way to do that. > > > > The Linux NFS client (at least) uses the same trick we do with jiffies > > to handle wrapping for MONOTONIC values. We should be able to advertise > > MONOTONIC as long as the client isn't comparing values that are more > > than ~2^62 apart. > > > > Once we start talking about applications storing these values for > > incremental backups, then the time between checks could be very long. > > > > So, I think we don't want _too_ large a value for P. The big question is > > how many individual change attr increments do we need to account for? > > > > We have 64 bits total (it's an atomic64_t). We consume the lowest bit > > for the QUERIED flag. That leaves us 63 bits of counter (currently). > > When we increment by a larger value, we're effectively decreasing the > > size of the counter. > > Yes, the larger value of P we take the sooner it will wrap which defeats > comparisons attempting to establish any ordering of change cookie values. > > > Let's assume a worst case of one increment per microsecond, interleaved > > by queries (so that they have to be real increments). 2^48 microseconds > > is close to 9 years. > > > > That leaves 15 bits for the P, which is primes from 3..32749. Is that a > > large enough pool of prime numbers? > > Well, there are ~3000 primes in this range so that gives you a 1/3000 > chance that after a crash, backup restore, snapshot rollback etc. you will > pick the same prime which results in collisions of change cookies and thus > possibility of data corruption. Is that low enough chance? The events I > mention above should be relatively rare but given the number of machines > running this code I would think the collision is bound to happen and the > consequences could be ... unpleasant. That's why I would prefer to pick > primes at least say upto 1m (there are ~78k of those). But that makes > wrapping more frequent (~100 days with 1us update period). Probably still > usable for NFS but not really for backup purposes. So I'm not sure we > should be advertising the values have any ordering. > Ok. I'll aim for using values between 3 and 1M and see how that looks. We should be able to tune this to some degree as well. > If the last used value would be persisted (e.g. in the filesystem's > superblock), we could easily make sure the next selected P is different so > in that case we could get away with a smaller set of primes but it would > require filesystem on-disk format changes which has its own drawbacks. But > that would be at least some path forward for providing change cookies that > can be ordered on larger timescales. > Persisting just the last one might not be sufficient. If the machine crashes several times then you could still end up re-using the same P value. i_version is only incremented on changes, so if you're unlucky and it's only incremented again when the duplicate value of P comes up, you're back to the same problem. We might want to keep a record of the last several P values? OTOH, there is always going to be _some_ way to defeat this. At some point we just have to decide that a scenario is unlikely enough that we can ignore it. > > It looks like the kernel already has some infrastructure for handling > > primes in lib/math/prime_numbers.c. We could just select a global P > > value to use on every reboot, or just have filesystems set their own > > (maybe in a new field in the superblock?) > > IMO P needs to be selected on each mount to reliably solve the "restore > from backup" and "snapshot rollback" scenarios. I agree it can be a new > field in the VFS part of the superblock so that it is accessible by the > iversion handling code. > Sounds good. FWIW, I think I'm going to have to approach this in at least three patchsets: 1) clean up the presentation of the value, and plumb it through struct kstat (aiming for v6.2 for this part). 2) start incrementing the value after a write in addition to, or instead of before a write. (I have patches for tmpfs, ext4 and btrfs -- maybe v6.3?) 3) change the increment to use a prime number we select at mount time to ward off rollback issues. (still scoping out this part) Thanks!
diff --git a/fs/stat.c b/fs/stat.c index e7f8cd4b24e1..8396c372022f 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -593,11 +593,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) memset(&tmp, 0, sizeof(tmp)); - /* STATX_VERSION is kernel-only for now */ - tmp.stx_mask = stat->result_mask & ~STATX_VERSION; + tmp.stx_mask = stat->result_mask; tmp.stx_blksize = stat->blksize; - /* STATX_ATTR_VERSION_MONOTONIC is kernel-only for now */ - tmp.stx_attributes = stat->attributes & ~STATX_ATTR_VERSION_MONOTONIC; + tmp.stx_attributes = stat->attributes; tmp.stx_nlink = stat->nlink; tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid); tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid); @@ -621,6 +619,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) tmp.stx_mnt_id = stat->mnt_id; tmp.stx_dio_mem_align = stat->dio_mem_align; tmp.stx_dio_offset_align = stat->dio_offset_align; + tmp.stx_version = stat->version; return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0; } @@ -636,11 +635,6 @@ int do_statx(int dfd, struct filename *filename, unsigned int flags, if ((flags & AT_STATX_SYNC_TYPE) == AT_STATX_SYNC_TYPE) return -EINVAL; - /* STATX_VERSION is kernel-only for now. Ignore requests - * from userland. - */ - mask &= ~STATX_VERSION; - error = vfs_statx(dfd, filename, flags, &stat, mask); if (error) return error; diff --git a/include/linux/stat.h b/include/linux/stat.h index 4e9428d86a3a..69c79e4fd1b1 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -54,13 +54,4 @@ struct kstat { u32 dio_offset_align; u64 version; }; - -/* These definitions are internal to the kernel for now. Mainly used by nfsd. */ - -/* mask values */ -#define STATX_VERSION 0x40000000U /* Want/got stx_change_attr */ - -/* file attribute values */ -#define STATX_ATTR_VERSION_MONOTONIC 0x8000000000000000ULL /* version monotonically increases */ - #endif diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7cab2c65d3d7..4a0a1f27c059 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -127,7 +127,8 @@ struct statx { __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */ __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */ /* 0xa0 */ - __u64 __spare3[12]; /* Spare space for future expansion */ + __u64 stx_version; /* Inode change attribute */ + __u64 __spare3[11]; /* Spare space for future expansion */ /* 0x100 */ }; @@ -154,6 +155,7 @@ struct statx { #define STATX_BTIME 0x00000800U /* Want/got stx_btime */ #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */ #define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */ +#define STATX_VERSION 0x00004000U /* Want/got stx_version */ #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */ @@ -189,6 +191,6 @@ struct statx { #define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */ #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */ #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */ - +#define STATX_ATTR_VERSION_MONOTONIC 0x00400000 /* stx_version increases w/ every change */ #endif /* _UAPI_LINUX_STAT_H */ diff --git a/samples/vfs/test-statx.c b/samples/vfs/test-statx.c index 49c7a46cee07..868c9394e038 100644 --- a/samples/vfs/test-statx.c +++ b/samples/vfs/test-statx.c @@ -107,6 +107,8 @@ static void dump_statx(struct statx *stx) printf("Device: %-15s", buffer); if (stx->stx_mask & STATX_INO) printf(" Inode: %-11llu", (unsigned long long) stx->stx_ino); + if (stx->stx_mask & STATX_MNT_ID) + printf(" MountId: %llx", stx->stx_mnt_id); if (stx->stx_mask & STATX_NLINK) printf(" Links: %-5u", stx->stx_nlink); if (stx->stx_mask & STATX_TYPE) { @@ -145,7 +147,9 @@ static void dump_statx(struct statx *stx) if (stx->stx_mask & STATX_CTIME) print_time("Change: ", &stx->stx_ctime); if (stx->stx_mask & STATX_BTIME) - print_time(" Birth: ", &stx->stx_btime); + print_time("Birth: ", &stx->stx_btime); + if (stx->stx_mask & STATX_VERSION) + printf("Inode Version: %llu\n", stx->stx_version); if (stx->stx_attributes_mask) { unsigned char bits, mbits; @@ -218,7 +222,7 @@ int main(int argc, char **argv) struct statx stx; int ret, raw = 0, atflag = AT_SYMLINK_NOFOLLOW; - unsigned int mask = STATX_BASIC_STATS | STATX_BTIME; + unsigned int mask = STATX_BASIC_STATS | STATX_BTIME | STATX_MNT_ID | STATX_VERSION; for (argv++; *argv; argv++) { if (strcmp(*argv, "-F") == 0) {