Message ID | 20230518114742.128950-3-jlayton@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs: implement multigrain timestamps | expand |
On Thu 18-05-23 07:47:35, Jeff Layton wrote: > The VFS always uses coarse-grained timestamp updates for filling out the > ctime and mtime after a change. This has the benefit of allowing > filesystems to optimize away a lot metadata updates, down to around 1 > per jiffy, even when a file is under heavy writes. > > Unfortunately, this has always been an issue when we're exporting via > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > lot of exported filesystems don't properly support a change attribute > and are subject to the same problems with timestamp granularity. Other > applications have similar issues (e.g backup applications). > > Switching to always using fine-grained timestamps would improve the > situation, but that becomes rather expensive, as the underlying > filesystem will have to log a lot more metadata updates. > > What we need is a way to only use fine-grained timestamps when they are > being actively queried. > > The kernel always stores normalized ctime values, so only the first 30 > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > ctime must also change. > > Use the 31st bit of the ctime tv_nsec field to indicate that something > has queried the inode for the i_mtime or i_ctime. When this flag is set, > on the next timestamp update, the kernel can fetch a fine-grained > timestamp instead of the usual coarse-grained one. > > This patch adds the infrastructure this scheme. Filesytems can opt > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > Later patches will convert individual filesystems over to use it. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> So there are two things I dislike about this series because I think they are fragile: 1) If we have a filesystem supporting multigrain ts and someone accidentally directly uses the value of inode->i_ctime, he can get bogus value (with QUERIED flag). This mistake is very easy to do. So I think we should rename i_ctime to something like __i_ctime and always use accessor function for it. 2) As I already commented in a previous version of the series, the scheme with just one flag for both ctime and mtime and flag getting cleared in current_time() relies on the fact that filesystems always do an equivalent of: inode->i_mtime = inode->i_ctime = current_time(); Otherwise we can do coarse grained update where we should have done a fine grained one. Filesystems often update timestamps like this but not universally. Grepping shows some instances where only inode->i_mtime is set from current_time() e.g. in autofs or bfs. Again a mistake that is rather easy to make and results in subtle issues. I think this would be also nicely solved by renaming i_ctime to __i_ctime and using a function to set ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). I understand this is quite some churn but a very mechanical one that could be just done with Coccinelle and a few manual fixups. So IMHO it is worth the more robust result. Some more nits below. > +/** > + * current_mg_time - Return FS time (possibly fine-grained) > + * @inode: inode. > + * > + * Return the current time truncated to the time granularity supported by > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > + * as having been QUERIED, get a fine-grained timestamp. > + */ The comment should also mention that QUERIED flag is cleared from the ctime. > +static struct timespec64 current_mg_time(struct inode *inode) > +{ > + struct timespec64 now; > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec; > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec); > + > + if (nsec & I_CTIME_QUERIED) { > + ktime_get_real_ts64(&now); > + } else { > + struct timespec64 ctime; > + > + ktime_get_coarse_real_ts64(&now); > + > + /* > + * If we've recently fetched a fine-grained timestamp > + * then the coarse-grained one may still be earlier than the > + * existing one. Just keep the existing ctime if so. > + */ > + ctime = ctime_peek(inode); > + if (timespec64_compare(&ctime, &now) > 0) > + now = ctime; > + } > + > + return now; > +} > + ... > +/** > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field > + * @inode: inode to fetch the ctime from > + * > + * Grab the current ctime tv_nsec field from the inode, mask off the > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by > + * internal consumers of the ctime that aren't concerned with ensuring a > + * fine-grained update on the next change (e.g. when preparing to store > + * the value in the backing store for later retrieval). > + * > + * This is safe to call regardless of whether the underlying filesystem > + * is using multigrain timestamps. > + */ > +static inline long ctime_nsec_peek(const struct inode *inode) > +{ > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED; This is somewhat unusual spacing. I'd use: inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED > +} > + > +/** > + * ctime_peek - peek at (but don't query) the ctime > + * @inode: inode to fetch the ctime from > + * > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For > + * use by internal consumers that don't require a fine-grained update on > + * the next change. > + * > + * This is safe to call regardless of whether the underlying filesystem > + * is using multigrain timestamps. > + */ > +static inline struct timespec64 ctime_peek(const struct inode *inode) > +{ > + struct timespec64 ctime; > + > + ctime.tv_sec = inode->i_ctime.tv_sec; > + ctime.tv_nsec = ctime_nsec_peek(inode); > + > + return ctime; > +} Given this is in a header that gets included in a lot of places, maybe we should call it like inode_ctime_peek() or inode_ctime_get() to reduce chances of a name clash? Honza
On Tue 23-05-23 12:02:40, Jan Kara wrote: > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > The VFS always uses coarse-grained timestamp updates for filling out the > > ctime and mtime after a change. This has the benefit of allowing > > filesystems to optimize away a lot metadata updates, down to around 1 > > per jiffy, even when a file is under heavy writes. > > > > Unfortunately, this has always been an issue when we're exporting via > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > lot of exported filesystems don't properly support a change attribute > > and are subject to the same problems with timestamp granularity. Other > > applications have similar issues (e.g backup applications). > > > > Switching to always using fine-grained timestamps would improve the > > situation, but that becomes rather expensive, as the underlying > > filesystem will have to log a lot more metadata updates. > > > > What we need is a way to only use fine-grained timestamps when they are > > being actively queried. > > > > The kernel always stores normalized ctime values, so only the first 30 > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > ctime must also change. > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > on the next timestamp update, the kernel can fetch a fine-grained > > timestamp instead of the usual coarse-grained one. > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > Later patches will convert individual filesystems over to use it. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > So there are two things I dislike about this series because I think they > are fragile: > > 1) If we have a filesystem supporting multigrain ts and someone > accidentally directly uses the value of inode->i_ctime, he can get bogus > value (with QUERIED flag). This mistake is very easy to do. So I think we > should rename i_ctime to something like __i_ctime and always use accessor > function for it. > > 2) As I already commented in a previous version of the series, the scheme > with just one flag for both ctime and mtime and flag getting cleared in > current_time() relies on the fact that filesystems always do an equivalent > of: > > inode->i_mtime = inode->i_ctime = current_time(); > > Otherwise we can do coarse grained update where we should have done a fine > grained one. Filesystems often update timestamps like this but not > universally. Grepping shows some instances where only inode->i_mtime is set > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > easy to make and results in subtle issues. I think this would be also > nicely solved by renaming i_ctime to __i_ctime and using a function to set > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > I understand this is quite some churn but a very mechanical one that could > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > the more robust result. Also as I'm thinking about it your current scheme is slightly racy. Suppose the filesystem does: CPU1 CPU2 statx() inode->i_ctime = current_time() current_mg_time() nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec) nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec) if (nsec & QUERIED) - not set ktime_get_coarse_real_ts64(&now) return timestamp_truncate(now, inode); - QUERIED flag in the inode->i_ctime gets overwritten by the assignment => we need not update ctime due to granularity although it was queried One more reason to use explicit function to update inode->i_ctime ;) Honza
On Tue, May 23, 2023 at 12:02:40PM +0200, Jan Kara wrote: > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > The VFS always uses coarse-grained timestamp updates for filling out the > > ctime and mtime after a change. This has the benefit of allowing > > filesystems to optimize away a lot metadata updates, down to around 1 > > per jiffy, even when a file is under heavy writes. > > > > Unfortunately, this has always been an issue when we're exporting via > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > lot of exported filesystems don't properly support a change attribute > > and are subject to the same problems with timestamp granularity. Other > > applications have similar issues (e.g backup applications). > > > > Switching to always using fine-grained timestamps would improve the > > situation, but that becomes rather expensive, as the underlying > > filesystem will have to log a lot more metadata updates. > > > > What we need is a way to only use fine-grained timestamps when they are > > being actively queried. > > > > The kernel always stores normalized ctime values, so only the first 30 > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > ctime must also change. > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > on the next timestamp update, the kernel can fetch a fine-grained > > timestamp instead of the usual coarse-grained one. > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > Later patches will convert individual filesystems over to use it. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > So there are two things I dislike about this series because I think they > are fragile: > > 1) If we have a filesystem supporting multigrain ts and someone > accidentally directly uses the value of inode->i_ctime, he can get bogus > value (with QUERIED flag). This mistake is very easy to do. So I think we > should rename i_ctime to something like __i_ctime and always use accessor > function for it. > > 2) As I already commented in a previous version of the series, the scheme > with just one flag for both ctime and mtime and flag getting cleared in > current_time() relies on the fact that filesystems always do an equivalent > of: > > inode->i_mtime = inode->i_ctime = current_time(); > > Otherwise we can do coarse grained update where we should have done a fine > grained one. Filesystems often update timestamps like this but not > universally. Grepping shows some instances where only inode->i_mtime is set > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > easy to make and results in subtle issues. I think this would be also > nicely solved by renaming i_ctime to __i_ctime and using a function to set > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > I understand this is quite some churn but a very mechanical one that could > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > the more robust result. Yeah, these are all good points. > > Some more nits below. > > > +/** > > + * current_mg_time - Return FS time (possibly fine-grained) > > + * @inode: inode. > > + * > > + * Return the current time truncated to the time granularity supported by > > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > > + * as having been QUERIED, get a fine-grained timestamp. > > + */ > > The comment should also mention that QUERIED flag is cleared from the ctime. > > > +static struct timespec64 current_mg_time(struct inode *inode) > > +{ > > + struct timespec64 now; > > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec; > > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec); > > + > > + if (nsec & I_CTIME_QUERIED) { > > + ktime_get_real_ts64(&now); > > + } else { > > + struct timespec64 ctime; > > + > > + ktime_get_coarse_real_ts64(&now); > > + > > + /* > > + * If we've recently fetched a fine-grained timestamp > > + * then the coarse-grained one may still be earlier than the > > + * existing one. Just keep the existing ctime if so. > > + */ > > + ctime = ctime_peek(inode); > > + if (timespec64_compare(&ctime, &now) > 0) > > + now = ctime; > > + } > > + > > + return now; > > +} > > + > > ... > > > +/** > > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field > > + * @inode: inode to fetch the ctime from > > + * > > + * Grab the current ctime tv_nsec field from the inode, mask off the > > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by > > + * internal consumers of the ctime that aren't concerned with ensuring a > > + * fine-grained update on the next change (e.g. when preparing to store > > + * the value in the backing store for later retrieval). > > + * > > + * This is safe to call regardless of whether the underlying filesystem > > + * is using multigrain timestamps. > > + */ > > +static inline long ctime_nsec_peek(const struct inode *inode) > > +{ > > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED; > > This is somewhat unusual spacing. I'd use: > > inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED > > > +} > > + > > +/** > > + * ctime_peek - peek at (but don't query) the ctime > > + * @inode: inode to fetch the ctime from > > + * > > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For > > + * use by internal consumers that don't require a fine-grained update on > > + * the next change. > > + * > > + * This is safe to call regardless of whether the underlying filesystem > > + * is using multigrain timestamps. > > + */ > > +static inline struct timespec64 ctime_peek(const struct inode *inode) > > +{ > > + struct timespec64 ctime; > > + > > + ctime.tv_sec = inode->i_ctime.tv_sec; > > + ctime.tv_nsec = ctime_nsec_peek(inode); > > + > > + return ctime; > > +} > > Given this is in a header that gets included in a lot of places, maybe we > should call it like inode_ctime_peek() or inode_ctime_get() to reduce > chances of a name clash? I think I mentioned this in an earlier comment. Independent of this series, it would be kinda nice if we could start moving stuff out of fs.h so we end up with a finer grained split of fs.h.
On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote: > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > The VFS always uses coarse-grained timestamp updates for filling out the > > ctime and mtime after a change. This has the benefit of allowing > > filesystems to optimize away a lot metadata updates, down to around 1 > > per jiffy, even when a file is under heavy writes. > > > > Unfortunately, this has always been an issue when we're exporting via > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > lot of exported filesystems don't properly support a change attribute > > and are subject to the same problems with timestamp granularity. Other > > applications have similar issues (e.g backup applications). > > > > Switching to always using fine-grained timestamps would improve the > > situation, but that becomes rather expensive, as the underlying > > filesystem will have to log a lot more metadata updates. > > > > What we need is a way to only use fine-grained timestamps when they are > > being actively queried. > > > > The kernel always stores normalized ctime values, so only the first 30 > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > ctime must also change. > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > on the next timestamp update, the kernel can fetch a fine-grained > > timestamp instead of the usual coarse-grained one. > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > Later patches will convert individual filesystems over to use it. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > So there are two things I dislike about this series because I think they > are fragile: > > 1) If we have a filesystem supporting multigrain ts and someone > accidentally directly uses the value of inode->i_ctime, he can get bogus > value (with QUERIED flag). This mistake is very easy to do. So I think we > should rename i_ctime to something like __i_ctime and always use accessor > function for it. > We could do this, but it'll be quite invasive. We'd have to change any place that touches i_ctime (and there are a lot of them), even on filesystems that are not being converted. > 2) As I already commented in a previous version of the series, the scheme > with just one flag for both ctime and mtime and flag getting cleared in > current_time() relies on the fact that filesystems always do an equivalent > of: > > inode->i_mtime = inode->i_ctime = current_time(); > > Otherwise we can do coarse grained update where we should have done a fine > grained one. Filesystems often update timestamps like this but not > universally. Grepping shows some instances where only inode->i_mtime is set > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > easy to make and results in subtle issues. I think this would be also > nicely solved by renaming i_ctime to __i_ctime and using a function to set > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > I understand this is quite some churn but a very mechanical one that could > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > the more robust result. AFAICT, under POSIX, you must _always_ set the ctime when you set the mtime, but the reverse is not true. That's why keeping the flag in the ctime makes sense. If we're updating the mtime, then we necessarily must update the ctime. > Some more nits below. > > > +/** > > + * current_mg_time - Return FS time (possibly fine-grained) > > + * @inode: inode. > > + * > > + * Return the current time truncated to the time granularity supported by > > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged > > + * as having been QUERIED, get a fine-grained timestamp. > > + */ > > The comment should also mention that QUERIED flag is cleared from the ctime. > Fair point. I can fix that up. > > +static struct timespec64 current_mg_time(struct inode *inode) > > +{ > > + struct timespec64 now; > > + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec; > > + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec); > > + > > + if (nsec & I_CTIME_QUERIED) { > > + ktime_get_real_ts64(&now); > > + } else { > > + struct timespec64 ctime; > > + > > + ktime_get_coarse_real_ts64(&now); > > + > > + /* > > + * If we've recently fetched a fine-grained timestamp > > + * then the coarse-grained one may still be earlier than the > > + * existing one. Just keep the existing ctime if so. > > + */ > > + ctime = ctime_peek(inode); > > + if (timespec64_compare(&ctime, &now) > 0) > > + now = ctime; > > + } > > + > > + return now; > > +} > > + > > ... > > > +/** > > + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field > > + * @inode: inode to fetch the ctime from > > + * > > + * Grab the current ctime tv_nsec field from the inode, mask off the > > + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by > > + * internal consumers of the ctime that aren't concerned with ensuring a > > + * fine-grained update on the next change (e.g. when preparing to store > > + * the value in the backing store for later retrieval). > > + * > > + * This is safe to call regardless of whether the underlying filesystem > > + * is using multigrain timestamps. > > + */ > > +static inline long ctime_nsec_peek(const struct inode *inode) > > +{ > > + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED; > > This is somewhat unusual spacing. I'd use: > > inode->i_ctime.tv_nsec & ~I_CTIME_QUERIED > Yeah, I don't know what happened there. I'll fix that up. > > +} > > + > > +/** > > + * ctime_peek - peek at (but don't query) the ctime > > + * @inode: inode to fetch the ctime from > > + * > > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For > > + * use by internal consumers that don't require a fine-grained update on > > + * the next change. > > + * > > + * This is safe to call regardless of whether the underlying filesystem > > + * is using multigrain timestamps. > > + */ > > +static inline struct timespec64 ctime_peek(const struct inode *inode) > > +{ > > + struct timespec64 ctime; > > + > > + ctime.tv_sec = inode->i_ctime.tv_sec; > > + ctime.tv_nsec = ctime_nsec_peek(inode); > > + > > + return ctime; > > +} > > Given this is in a header that gets included in a lot of places, maybe we > should call it like inode_ctime_peek() or inode_ctime_get() to reduce > chances of a name clash? I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me. We don't really use that nomenclature elsewhere.
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote: > On Tue 23-05-23 12:02:40, Jan Kara wrote: > > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > > The VFS always uses coarse-grained timestamp updates for filling out the > > > ctime and mtime after a change. This has the benefit of allowing > > > filesystems to optimize away a lot metadata updates, down to around 1 > > > per jiffy, even when a file is under heavy writes. > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > > lot of exported filesystems don't properly support a change attribute > > > and are subject to the same problems with timestamp granularity. Other > > > applications have similar issues (e.g backup applications). > > > > > > Switching to always using fine-grained timestamps would improve the > > > situation, but that becomes rather expensive, as the underlying > > > filesystem will have to log a lot more metadata updates. > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > being actively queried. > > > > > > The kernel always stores normalized ctime values, so only the first 30 > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > > ctime must also change. > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > > on the next timestamp update, the kernel can fetch a fine-grained > > > timestamp instead of the usual coarse-grained one. > > > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > > > Later patches will convert individual filesystems over to use it. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > So there are two things I dislike about this series because I think they > > are fragile: > > > > 1) If we have a filesystem supporting multigrain ts and someone > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > should rename i_ctime to something like __i_ctime and always use accessor > > function for it. > > > > 2) As I already commented in a previous version of the series, the scheme > > with just one flag for both ctime and mtime and flag getting cleared in > > current_time() relies on the fact that filesystems always do an equivalent > > of: > > > > inode->i_mtime = inode->i_ctime = current_time(); > > > > Otherwise we can do coarse grained update where we should have done a fine > > grained one. Filesystems often update timestamps like this but not > > universally. Grepping shows some instances where only inode->i_mtime is set > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > > easy to make and results in subtle issues. I think this would be also > > nicely solved by renaming i_ctime to __i_ctime and using a function to set > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > > > I understand this is quite some churn but a very mechanical one that could > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > > the more robust result. > > Also as I'm thinking about it your current scheme is slightly racy. Suppose > the filesystem does: > > CPU1 CPU2 > > statx() > inode->i_ctime = current_time() > current_mg_time() > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec) > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec) > if (nsec & QUERIED) - not set > ktime_get_coarse_real_ts64(&now) > return timestamp_truncate(now, inode); > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment > => we need not update ctime due to granularity although it was queried > > One more reason to use explicit function to update inode->i_ctime ;) When we store the new time in the i_ctime field, the flag gets cleared because at that point we're storing a new (unseen) time. However, you're correct: if the i_ctime in your above example starts at the same value that is currently being returned by ktime_get_coarse_real_ts64, then we'll lose the flag set in statx. I think the right fix there would be to not update the ctime at all if it's a coarse grained time, and the value wouldn't have an apparent change to an observer. That would leave the flag intact. That does mean we'd need to move to a function that does clock fetch and assigns it to i_ctime in one go (like you suggest). Something like: inode_update_ctime(inode); How we do that with atomic operations over two values (the tv_sec and tv_nsec) is a bit tricky. I'll have to think about it. Christian, given Jan's concerns do you want to drop this series for now and let me respin it? Thanks,
On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote: > On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote: > > On Tue 23-05-23 12:02:40, Jan Kara wrote: > > > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > > > The VFS always uses coarse-grained timestamp updates for filling out the > > > > ctime and mtime after a change. This has the benefit of allowing > > > > filesystems to optimize away a lot metadata updates, down to around 1 > > > > per jiffy, even when a file is under heavy writes. > > > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > > > lot of exported filesystems don't properly support a change attribute > > > > and are subject to the same problems with timestamp granularity. Other > > > > applications have similar issues (e.g backup applications). > > > > > > > > Switching to always using fine-grained timestamps would improve the > > > > situation, but that becomes rather expensive, as the underlying > > > > filesystem will have to log a lot more metadata updates. > > > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > > being actively queried. > > > > > > > > The kernel always stores normalized ctime values, so only the first 30 > > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > > > ctime must also change. > > > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > > > on the next timestamp update, the kernel can fetch a fine-grained > > > > timestamp instead of the usual coarse-grained one. > > > > > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > > > > > Later patches will convert individual filesystems over to use it. > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > So there are two things I dislike about this series because I think they > > > are fragile: > > > > > > 1) If we have a filesystem supporting multigrain ts and someone > > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > > should rename i_ctime to something like __i_ctime and always use accessor > > > function for it. > > > > > > 2) As I already commented in a previous version of the series, the scheme > > > with just one flag for both ctime and mtime and flag getting cleared in > > > current_time() relies on the fact that filesystems always do an equivalent > > > of: > > > > > > inode->i_mtime = inode->i_ctime = current_time(); > > > > > > Otherwise we can do coarse grained update where we should have done a fine > > > grained one. Filesystems often update timestamps like this but not > > > universally. Grepping shows some instances where only inode->i_mtime is set > > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > > > easy to make and results in subtle issues. I think this would be also > > > nicely solved by renaming i_ctime to __i_ctime and using a function to set > > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > > > > > I understand this is quite some churn but a very mechanical one that could > > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > > > the more robust result. > > > > Also as I'm thinking about it your current scheme is slightly racy. Suppose > > the filesystem does: > > > > CPU1 CPU2 > > > > statx() > > inode->i_ctime = current_time() > > current_mg_time() > > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec) > > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec) > > if (nsec & QUERIED) - not set > > ktime_get_coarse_real_ts64(&now) > > return timestamp_truncate(now, inode); > > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment > > => we need not update ctime due to granularity although it was queried > > > > One more reason to use explicit function to update inode->i_ctime ;) > > When we store the new time in the i_ctime field, the flag gets cleared > because at that point we're storing a new (unseen) time. > > However, you're correct: if the i_ctime in your above example starts at > the same value that is currently being returned by > ktime_get_coarse_real_ts64, then we'll lose the flag set in statx. > > I think the right fix there would be to not update the ctime at all if > it's a coarse grained time, and the value wouldn't have an apparent > change to an observer. That would leave the flag intact. > > That does mean we'd need to move to a function that does clock fetch and > assigns it to i_ctime in one go (like you suggest). Something like: > > inode_update_ctime(inode); > > How we do that with atomic operations over two values (the tv_sec and > tv_nsec) is a bit tricky. I'll have to think about it. > > Christian, given Jan's concerns do you want to drop this series for now > and let me respin it? I deliberately put it into a vfs.unstable.* branch. I would leave it there until you send a new one then drop it. If we get lucky the bots that run on -next will have time to report potential perf issues while it's not currently causing conflicts.
On Tue, 2023-05-23 at 13:01 +0200, Christian Brauner wrote: > On Tue, May 23, 2023 at 06:56:11AM -0400, Jeff Layton wrote: > > On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote: > > > On Tue 23-05-23 12:02:40, Jan Kara wrote: > > > > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > > > > The VFS always uses coarse-grained timestamp updates for filling out the > > > > > ctime and mtime after a change. This has the benefit of allowing > > > > > filesystems to optimize away a lot metadata updates, down to around 1 > > > > > per jiffy, even when a file is under heavy writes. > > > > > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > > > > lot of exported filesystems don't properly support a change attribute > > > > > and are subject to the same problems with timestamp granularity. Other > > > > > applications have similar issues (e.g backup applications). > > > > > > > > > > Switching to always using fine-grained timestamps would improve the > > > > > situation, but that becomes rather expensive, as the underlying > > > > > filesystem will have to log a lot more metadata updates. > > > > > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > > > being actively queried. > > > > > > > > > > The kernel always stores normalized ctime values, so only the first 30 > > > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > > > > ctime must also change. > > > > > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > > > > on the next timestamp update, the kernel can fetch a fine-grained > > > > > timestamp instead of the usual coarse-grained one. > > > > > > > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > > > > > > > Later patches will convert individual filesystems over to use it. > > > > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > So there are two things I dislike about this series because I think they > > > > are fragile: > > > > > > > > 1) If we have a filesystem supporting multigrain ts and someone > > > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > > > should rename i_ctime to something like __i_ctime and always use accessor > > > > function for it. > > > > > > > > 2) As I already commented in a previous version of the series, the scheme > > > > with just one flag for both ctime and mtime and flag getting cleared in > > > > current_time() relies on the fact that filesystems always do an equivalent > > > > of: > > > > > > > > inode->i_mtime = inode->i_ctime = current_time(); > > > > > > > > Otherwise we can do coarse grained update where we should have done a fine > > > > grained one. Filesystems often update timestamps like this but not > > > > universally. Grepping shows some instances where only inode->i_mtime is set > > > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > > > > easy to make and results in subtle issues. I think this would be also > > > > nicely solved by renaming i_ctime to __i_ctime and using a function to set > > > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > > > > > > > I understand this is quite some churn but a very mechanical one that could > > > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > > > > the more robust result. > > > > > > Also as I'm thinking about it your current scheme is slightly racy. Suppose > > > the filesystem does: > > > > > > CPU1 CPU2 > > > > > > statx() > > > inode->i_ctime = current_time() > > > current_mg_time() > > > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec) > > > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec) > > > if (nsec & QUERIED) - not set > > > ktime_get_coarse_real_ts64(&now) > > > return timestamp_truncate(now, inode); > > > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment > > > => we need not update ctime due to granularity although it was queried > > > > > > One more reason to use explicit function to update inode->i_ctime ;) > > > > When we store the new time in the i_ctime field, the flag gets cleared > > because at that point we're storing a new (unseen) time. > > > > However, you're correct: if the i_ctime in your above example starts at > > the same value that is currently being returned by > > ktime_get_coarse_real_ts64, then we'll lose the flag set in statx. > > > > I think the right fix there would be to not update the ctime at all if > > it's a coarse grained time, and the value wouldn't have an apparent > > change to an observer. That would leave the flag intact. > > > > That does mean we'd need to move to a function that does clock fetch and > > assigns it to i_ctime in one go (like you suggest). Something like: > > > > inode_update_ctime(inode); > > > > How we do that with atomic operations over two values (the tv_sec and > > tv_nsec) is a bit tricky. I'll have to think about it. > > > > Christian, given Jan's concerns do you want to drop this series for now > > and let me respin it? > > I deliberately put it into a vfs.unstable.* branch. I would leave it > there until you send a new one then drop it. If we get lucky the bots > that run on -next will have time to report potential perf issues while > it's not currently causing conflicts. Sounds good to me. Thanks!
On Tue 23-05-23 06:40:08, Jeff Layton wrote: > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote: > > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > > The VFS always uses coarse-grained timestamp updates for filling out the > > > ctime and mtime after a change. This has the benefit of allowing > > > filesystems to optimize away a lot metadata updates, down to around 1 > > > per jiffy, even when a file is under heavy writes. > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > > lot of exported filesystems don't properly support a change attribute > > > and are subject to the same problems with timestamp granularity. Other > > > applications have similar issues (e.g backup applications). > > > > > > Switching to always using fine-grained timestamps would improve the > > > situation, but that becomes rather expensive, as the underlying > > > filesystem will have to log a lot more metadata updates. > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > being actively queried. > > > > > > The kernel always stores normalized ctime values, so only the first 30 > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > > ctime must also change. > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > > on the next timestamp update, the kernel can fetch a fine-grained > > > timestamp instead of the usual coarse-grained one. > > > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > > > Later patches will convert individual filesystems over to use it. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > So there are two things I dislike about this series because I think they > > are fragile: > > > > 1) If we have a filesystem supporting multigrain ts and someone > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > should rename i_ctime to something like __i_ctime and always use accessor > > function for it. > > > > We could do this, but it'll be quite invasive. We'd have to change any > place that touches i_ctime (and there are a lot of them), even on > filesystems that are not being converted. Yes, that's why I suggested Coccinelle to deal with this. > > 2) As I already commented in a previous version of the series, the scheme > > with just one flag for both ctime and mtime and flag getting cleared in > > current_time() relies on the fact that filesystems always do an equivalent > > of: > > > > inode->i_mtime = inode->i_ctime = current_time(); > > > > Otherwise we can do coarse grained update where we should have done a fine > > grained one. Filesystems often update timestamps like this but not > > universally. Grepping shows some instances where only inode->i_mtime is set > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > > easy to make and results in subtle issues. I think this would be also > > nicely solved by renaming i_ctime to __i_ctime and using a function to set > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > > > I understand this is quite some churn but a very mechanical one that could > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > > the more robust result. > > AFAICT, under POSIX, you must _always_ set the ctime when you set the > mtime, but the reverse is not true. That's why keeping the flag in the > ctime makes sense. If we're updating the mtime, then we necessarily must > update the ctime. Yes, but technically speaking: inode->i_mtime = current_time(); inode->i_ctime = current_time(); follows these requirements but is broken with your changes in unobvious way. And if mtime update is hidden in some function or condition, it is not even that suboptimal coding pattern. > > > +} > > > + > > > +/** > > > + * ctime_peek - peek at (but don't query) the ctime > > > + * @inode: inode to fetch the ctime from > > > + * > > > + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For > > > + * use by internal consumers that don't require a fine-grained update on > > > + * the next change. > > > + * > > > + * This is safe to call regardless of whether the underlying filesystem > > > + * is using multigrain timestamps. > > > + */ > > > +static inline struct timespec64 ctime_peek(const struct inode *inode) > > > +{ > > > + struct timespec64 ctime; > > > + > > > + ctime.tv_sec = inode->i_ctime.tv_sec; > > > + ctime.tv_nsec = ctime_nsec_peek(inode); > > > + > > > + return ctime; > > > +} > > > > Given this is in a header that gets included in a lot of places, maybe we > > should call it like inode_ctime_peek() or inode_ctime_get() to reduce > > chances of a name clash? > > I'd be fine with that, but "ctime" sort of implies inode->i_ctime to me. > We don't really use that nomenclature elsewhere. Yes, I don't insist here but we do have 'ctime' e.g. in IPC code (sem_ctime, shm_ctime, msg_ctime), we have ctime in md layer, ctime(3) is also a glibc function. So it is not *completely* impossible the clash happens but we can deal with it when it happens. Honza
On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote: > On Tue 23-05-23 06:40:08, Jeff Layton wrote: > > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote: > > > > > > So there are two things I dislike about this series because I think they > > > are fragile: > > > > > > 1) If we have a filesystem supporting multigrain ts and someone > > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > > should rename i_ctime to something like __i_ctime and always use accessor > > > function for it. > > > > > > > We could do this, but it'll be quite invasive. We'd have to change any > > place that touches i_ctime (and there are a lot of them), even on > > filesystems that are not being converted. > > Yes, that's why I suggested Coccinelle to deal with this. I've done the work to convert all of the accesses of i_ctime into accessor functions in the kernel. The current state of it is here: https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime As expected, it touches a lot of code, all over the place. So far I have most of the conversion in one giant patch, and I need to split it up (probably per-subsystem). What's the best way to feed this change into mainline? Should I try to get subsystem maintainers to pick these up, or are we better off feeding this in via a separate branch? For reference, the diffstat for the big conversion patch is below: arch/powerpc/platforms/cell/spufs/inode.c | 2 +- arch/s390/hypfs/inode.c | 4 +- drivers/android/binderfs.c | 8 ++-- drivers/infiniband/hw/qib/qib_fs.c | 4 +- drivers/misc/ibmasm/ibmasmfs.c | 2 +- drivers/misc/ibmvmc.c | 2 +- drivers/usb/core/devio.c | 16 ++++---- drivers/usb/gadget/function/f_fs.c | 6 +-- drivers/usb/gadget/legacy/inode.c | 3 +- fs/9p/vfs_inode.c | 6 ++- fs/9p/vfs_inode_dotl.c | 11 +++--- fs/adfs/inode.c | 4 +- fs/affs/amigaffs.c | 6 +-- fs/affs/inode.c | 17 ++++---- fs/afs/dynroot.c | 2 +- fs/afs/inode.c | 6 +-- fs/attr.c | 2 +- fs/autofs/inode.c | 2 +- fs/autofs/root.c | 6 +-- fs/bad_inode.c | 3 +- fs/befs/linuxvfs.c | 2 +- fs/bfs/dir.c | 16 ++++---- fs/bfs/inode.c | 6 +-- fs/binfmt_misc.c | 3 +- fs/btrfs/delayed-inode.c | 10 +++-- fs/btrfs/file.c | 24 ++++------- fs/btrfs/inode.c | 66 ++++++++++++------------ ------- fs/btrfs/ioctl.c | 2 +- fs/btrfs/reflink.c | 7 ++-- fs/btrfs/transaction.c | 3 +- fs/btrfs/tree-log.c | 4 +- fs/btrfs/xattr.c | 4 +- fs/ceph/acl.c | 2 +- fs/ceph/caps.c | 2 +- fs/ceph/inode.c | 17 ++++---- fs/ceph/snap.c | 2 +- fs/ceph/xattr.c | 2 +- fs/coda/coda_linux.c | 2 +- fs/coda/dir.c | 2 +- fs/coda/file.c | 2 +- fs/coda/inode.c | 2 +- fs/configfs/inode.c | 6 +-- fs/cramfs/inode.c | 2 +- fs/debugfs/inode.c | 2 +- fs/devpts/inode.c | 6 +-- fs/ecryptfs/inode.c | 2 +- fs/efivarfs/file.c | 2 +- fs/efivarfs/inode.c | 2 +- fs/efs/inode.c | 5 ++- fs/erofs/inode.c | 16 ++++---- fs/exfat/file.c | 4 +- fs/exfat/inode.c | 6 +-- fs/exfat/namei.c | 29 +++++++------- fs/exfat/super.c | 4 +- fs/ext2/acl.c | 2 +- fs/ext2/dir.c | 6 +-- fs/ext2/ialloc.c | 2 +- fs/ext2/inode.c | 11 +++--- fs/ext2/ioctl.c | 4 +- fs/ext2/namei.c | 8 ++-- fs/ext2/super.c | 2 +- fs/ext2/xattr.c | 2 +- fs/ext4/acl.c | 2 +- fs/ext4/ext4.h | 20 ++++++++++ fs/ext4/extents.c | 12 +++--- fs/ext4/ialloc.c | 2 +- fs/ext4/inline.c | 4 +- fs/ext4/inode.c | 16 ++++---- fs/ext4/ioctl.c | 9 +++-- fs/ext4/namei.c | 26 ++++++------ fs/ext4/super.c | 2 +- fs/ext4/xattr.c | 6 +-- fs/f2fs/dir.c | 8 ++-- fs/f2fs/f2fs.h | 5 ++- fs/f2fs/file.c | 16 ++++---- fs/f2fs/inline.c | 2 +- fs/f2fs/inode.c | 10 ++--- fs/f2fs/namei.c | 12 +++--- fs/f2fs/recovery.c | 4 +- fs/f2fs/super.c | 2 +- fs/f2fs/xattr.c | 2 +- fs/fat/inode.c | 8 ++-- fs/fat/misc.c | 7 +++- fs/freevxfs/vxfs_inode.c | 4 +- fs/fuse/control.c | 2 +- fs/fuse/dir.c | 8 ++-- fs/fuse/inode.c | 18 +++++---- fs/gfs2/acl.c | 2 +- fs/gfs2/bmap.c | 11 +++--- fs/gfs2/dir.c | 15 +++---- fs/gfs2/file.c | 2 +- fs/gfs2/glops.c | 4 +- fs/gfs2/inode.c | 8 ++-- fs/gfs2/super.c | 4 +- fs/gfs2/xattr.c | 8 ++-- fs/hfs/catalog.c | 8 ++-- fs/hfs/dir.c | 2 +- fs/hfs/inode.c | 13 +++--- fs/hfs/sysdep.c | 2 +- fs/hfsplus/catalog.c | 8 ++-- fs/hfsplus/dir.c | 6 +-- fs/hfsplus/inode.c | 14 +++---- fs/hostfs/hostfs_kern.c | 5 ++- fs/hpfs/dir.c | 8 ++-- fs/hpfs/inode.c | 6 +-- fs/hpfs/namei.c | 26 ++++++------ fs/hpfs/super.c | 5 ++- fs/hugetlbfs/inode.c | 12 +++--- fs/inode.c | 12 ++++-- fs/isofs/inode.c | 4 +- fs/isofs/rock.c | 16 ++++---- fs/jffs2/dir.c | 19 ++++----- fs/jffs2/file.c | 3 +- fs/jffs2/fs.c | 10 ++--- fs/jffs2/os-linux.h | 2 +- fs/jfs/acl.c | 2 +- fs/jfs/inode.c | 2 +- fs/jfs/ioctl.c | 2 +- fs/jfs/jfs_imap.c | 8 ++-- fs/jfs/jfs_inode.c | 4 +- fs/jfs/namei.c | 25 ++++++------ fs/jfs/super.c | 2 +- fs/jfs/xattr.c | 2 +- fs/kernfs/inode.c | 4 +- fs/libfs.c | 32 ++++++++------- fs/minix/bitmap.c | 2 +- fs/minix/dir.c | 6 +-- fs/minix/inode.c | 11 +++--- fs/minix/itree_common.c | 4 +- fs/minix/namei.c | 6 +-- fs/nfs/callback_proc.c | 2 +- fs/nfs/fscache.h | 4 +- fs/nfs/inode.c | 21 +++++----- fs/nfsd/nfsctl.c | 2 +- fs/nfsd/nfsfh.c | 4 +- fs/nfsd/vfs.c | 2 +- fs/nilfs2/dir.c | 6 +-- fs/nilfs2/inode.c | 12 +++--- fs/nilfs2/ioctl.c | 2 +- fs/nilfs2/namei.c | 8 ++-- fs/nsfs.c | 2 +- fs/ntfs/inode.c | 15 +++---- fs/ntfs/mft.c | 3 +- fs/ntfs3/file.c | 6 +-- fs/ntfs3/frecord.c | 4 +- fs/ntfs3/inode.c | 14 ++++--- fs/ntfs3/namei.c | 10 ++--- fs/ntfs3/xattr.c | 4 +- fs/ocfs2/acl.c | 6 +-- fs/ocfs2/alloc.c | 6 +-- fs/ocfs2/aops.c | 2 +- fs/ocfs2/dir.c | 8 ++-- fs/ocfs2/dlmfs/dlmfs.c | 4 +- fs/ocfs2/dlmglue.c | 10 +++-- fs/ocfs2/file.c | 16 ++++---- fs/ocfs2/inode.c | 14 ++++--- fs/ocfs2/move_extents.c | 6 +-- fs/ocfs2/namei.c | 22 ++++++----- fs/ocfs2/refcounttree.c | 14 +++---- fs/ocfs2/xattr.c | 6 +-- fs/omfs/dir.c | 4 +- fs/omfs/inode.c | 10 ++--- fs/openpromfs/inode.c | 4 +- fs/orangefs/namei.c | 2 +- fs/orangefs/orangefs-utils.c | 6 +-- fs/overlayfs/file.c | 7 +++- fs/overlayfs/util.c | 2 +- fs/pipe.c | 2 +- fs/posix_acl.c | 2 +- fs/proc/base.c | 2 +- fs/proc/inode.c | 2 +- fs/proc/proc_sysctl.c | 2 +- fs/proc/self.c | 2 +- fs/proc/thread_self.c | 2 +- fs/pstore/inode.c | 4 +- fs/qnx4/inode.c | 4 +- fs/qnx6/inode.c | 4 +- fs/ramfs/inode.c | 6 +-- fs/reiserfs/inode.c | 14 +++---- fs/reiserfs/ioctl.c | 4 +- fs/reiserfs/namei.c | 21 +++++----- fs/reiserfs/stree.c | 4 +- fs/reiserfs/super.c | 2 +- fs/reiserfs/xattr.c | 5 ++- fs/reiserfs/xattr_acl.c | 2 +- fs/romfs/super.c | 4 +- fs/smb/client/file.c | 4 +- fs/smb/client/fscache.h | 5 ++- fs/smb/client/inode.c | 15 ++++--- fs/smb/client/smb2ops.c | 2 +- fs/smb/server/smb2pdu.c | 8 ++-- fs/squashfs/inode.c | 2 +- fs/stack.c | 2 +- fs/stat.c | 2 +- fs/sysv/dir.c | 6 +-- fs/sysv/ialloc.c | 2 +- fs/sysv/inode.c | 6 +-- fs/sysv/itree.c | 4 +- fs/sysv/namei.c | 6 +-- fs/tracefs/inode.c | 2 +- fs/ubifs/debug.c | 4 +- fs/ubifs/dir.c | 39 +++++++++--------- fs/ubifs/file.c | 16 ++++---- fs/ubifs/ioctl.c | 2 +- fs/ubifs/journal.c | 4 +- fs/ubifs/super.c | 4 +- fs/ubifs/xattr.c | 6 +-- fs/udf/ialloc.c | 2 +- fs/udf/inode.c | 17 ++++---- fs/udf/namei.c | 24 +++++------ fs/ufs/dir.c | 6 +-- fs/ufs/ialloc.c | 2 +- fs/ufs/inode.c | 23 ++++++----- fs/ufs/namei.c | 8 ++-- fs/vboxsf/utils.c | 4 +- fs/xfs/libxfs/xfs_inode_buf.c | 4 +- fs/xfs/libxfs/xfs_trans_inode.c | 2 +- fs/xfs/xfs_acl.c | 2 +- fs/xfs/xfs_bmap_util.c | 6 ++- fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_inode_item.c | 2 +- fs/xfs/xfs_iops.c | 4 +- fs/xfs/xfs_itable.c | 4 +- fs/zonefs/super.c | 8 ++-- include/linux/fs.h | 1 + include/linux/fs_stack.h | 2 +- ipc/mqueue.c | 20 +++++----- kernel/bpf/inode.c | 4 +- mm/shmem.c | 28 +++++++------ net/sunrpc/rpc_pipe.c | 2 +- security/apparmor/apparmorfs.c | 6 +-- security/apparmor/policy_unpack.c | 4 +- security/inode.c | 2 +- security/selinux/selinuxfs.c | 2 +- 234 files changed, 851 insertions(+), 808 deletions(-)
On Tue, 2023-05-23 at 12:17 +0200, Jan Kara wrote: > On Tue 23-05-23 12:02:40, Jan Kara wrote: > > On Thu 18-05-23 07:47:35, Jeff Layton wrote: > > > The VFS always uses coarse-grained timestamp updates for filling out the > > > ctime and mtime after a change. This has the benefit of allowing > > > filesystems to optimize away a lot metadata updates, down to around 1 > > > per jiffy, even when a file is under heavy writes. > > > > > > Unfortunately, this has always been an issue when we're exporting via > > > NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a > > > lot of exported filesystems don't properly support a change attribute > > > and are subject to the same problems with timestamp granularity. Other > > > applications have similar issues (e.g backup applications). > > > > > > Switching to always using fine-grained timestamps would improve the > > > situation, but that becomes rather expensive, as the underlying > > > filesystem will have to log a lot more metadata updates. > > > > > > What we need is a way to only use fine-grained timestamps when they are > > > being actively queried. > > > > > > The kernel always stores normalized ctime values, so only the first 30 > > > bits of the tv_nsec field are ever used. Whenever the mtime changes, the > > > ctime must also change. > > > > > > Use the 31st bit of the ctime tv_nsec field to indicate that something > > > has queried the inode for the i_mtime or i_ctime. When this flag is set, > > > on the next timestamp update, the kernel can fetch a fine-grained > > > timestamp instead of the usual coarse-grained one. > > > > > > This patch adds the infrastructure this scheme. Filesytems can opt > > > into it by setting the FS_MULTIGRAIN_TS flag in the fstype. > > > > > > Later patches will convert individual filesystems over to use it. > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > > So there are two things I dislike about this series because I think they > > are fragile: > > > > 1) If we have a filesystem supporting multigrain ts and someone > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > should rename i_ctime to something like __i_ctime and always use accessor > > function for it. > > > > 2) As I already commented in a previous version of the series, the scheme > > with just one flag for both ctime and mtime and flag getting cleared in > > current_time() relies on the fact that filesystems always do an equivalent > > of: > > > > inode->i_mtime = inode->i_ctime = current_time(); > > > > Otherwise we can do coarse grained update where we should have done a fine > > grained one. Filesystems often update timestamps like this but not > > universally. Grepping shows some instances where only inode->i_mtime is set > > from current_time() e.g. in autofs or bfs. Again a mistake that is rather > > easy to make and results in subtle issues. I think this would be also > > nicely solved by renaming i_ctime to __i_ctime and using a function to set > > ctime. Mtime could then be updated with inode->i_mtime = ctime_peek(). > > > > I understand this is quite some churn but a very mechanical one that could > > be just done with Coccinelle and a few manual fixups. So IMHO it is worth > > the more robust result. > > Also as I'm thinking about it your current scheme is slightly racy. Suppose > the filesystem does: > > CPU1 CPU2 > > statx() > inode->i_ctime = current_time() > current_mg_time() > nsec = atomic_long_fetch_andnot(QUERIED, &inode->i_ctime.tv_nsec) > nsec = atomic_long_fetch_or(QUERIED, &inode->i_ctime.tv_nsec) > if (nsec & QUERIED) - not set > ktime_get_coarse_real_ts64(&now) > return timestamp_truncate(now, inode); > - QUERIED flag in the inode->i_ctime gets overwritten by the assignment > => we need not update ctime due to granularity although it was queried > > One more reason to use explicit function to update inode->i_ctime ;) Thinking about this some more, I think we can fix the race you pointed out by just not clearing the queried flag when we fetch the i_ctime.tv_nsec field when we're updating. So, instead of atomic_long_fetch_andnot, we'd just want to use an atomic_long_fetch there, and just let the eventual assignment of inode->__i_ctime.tv_nsec be what clears the flag. Any task that goes to update the time during the interim window will fetch a fine-grained time, but that's what we want anyway. Since you bring up races though, there are a couple of other things we should be aware of. Note that both problems exist today too: 1) it's possible for two tasks to race in such a way that the ctime goes backward. There's no synchronization between tasks doing the updating, so an older time can overwrite a newer one. I think you'd need a pretty tight race to observe this though. 2) it's possible to fetch a "torn" timestamp out of the inode. timespec64 is two words, and we don't order its loads or stores. We could consider adding a seqcount_t and some barriers and fixing it that way. I'm not sure it's worth it though, given that we haven't worried about this in the past. For now, I propose that we ignore both problems, unless and until we can prove that they are more than theoretical.
On Tue, Jun 13, 2023 at 09:09:29AM -0400, Jeff Layton wrote: > On Tue, 2023-05-23 at 14:46 +0200, Jan Kara wrote: > > On Tue 23-05-23 06:40:08, Jeff Layton wrote: > > > On Tue, 2023-05-23 at 12:02 +0200, Jan Kara wrote: > > > > > > > > So there are two things I dislike about this series because I think they > > > > are fragile: > > > > > > > > 1) If we have a filesystem supporting multigrain ts and someone > > > > accidentally directly uses the value of inode->i_ctime, he can get bogus > > > > value (with QUERIED flag). This mistake is very easy to do. So I think we > > > > should rename i_ctime to something like __i_ctime and always use accessor > > > > function for it. > > > > > > > > > > We could do this, but it'll be quite invasive. We'd have to change any > > > place that touches i_ctime (and there are a lot of them), even on > > > filesystems that are not being converted. > > > > Yes, that's why I suggested Coccinelle to deal with this. > > > I've done the work to convert all of the accesses of i_ctime into > accessor functions in the kernel. The current state of it is here: > > > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=ctime > > As expected, it touches a lot of code, all over the place. So far I have > most of the conversion in one giant patch, and I need to split it up > (probably per-subsystem). Yeah, you have time since it'll be v6.6 material. > > What's the best way to feed this change into mainline? Should I try to > get subsystem maintainers to pick these up, or are we better off feeding > this in via a separate branch? I would prefer if we send them all through the vfs tree since trickle down conversions are otherwise very painful and potentially very slow.
diff --git a/fs/inode.c b/fs/inode.c index 577799b7855f..24769e08fbaa 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2029,6 +2029,7 @@ EXPORT_SYMBOL(file_remove_privs); static int inode_needs_update_time(struct inode *inode, struct timespec64 *now) { int sync_it = 0; + struct timespec64 ctime; /* First try to exhaust all avenues to not sync */ if (IS_NOCMTIME(inode)) @@ -2037,7 +2038,8 @@ static int inode_needs_update_time(struct inode *inode, struct timespec64 *now) if (!timespec64_equal(&inode->i_mtime, now)) sync_it = S_MTIME; - if (!timespec64_equal(&inode->i_ctime, now)) + ctime = ctime_peek(inode); + if (!timespec64_equal(&ctime, now)) sync_it |= S_CTIME; if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode)) @@ -2431,6 +2433,40 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode) } EXPORT_SYMBOL(timestamp_truncate); +/** + * current_mg_time - Return FS time (possibly fine-grained) + * @inode: inode. + * + * Return the current time truncated to the time granularity supported by + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged + * as having been QUERIED, get a fine-grained timestamp. + */ +static struct timespec64 current_mg_time(struct inode *inode) +{ + struct timespec64 now; + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec; + long nsec = atomic_long_fetch_andnot(I_CTIME_QUERIED, pnsec); + + if (nsec & I_CTIME_QUERIED) { + ktime_get_real_ts64(&now); + } else { + struct timespec64 ctime; + + ktime_get_coarse_real_ts64(&now); + + /* + * If we've recently fetched a fine-grained timestamp + * then the coarse-grained one may still be earlier than the + * existing one. Just keep the existing ctime if so. + */ + ctime = ctime_peek(inode); + if (timespec64_compare(&ctime, &now) > 0) + now = ctime; + } + + return now; +} + /** * current_time - Return FS time * @inode: inode. @@ -2445,12 +2481,10 @@ struct timespec64 current_time(struct inode *inode) { struct timespec64 now; - ktime_get_coarse_real_ts64(&now); - - if (unlikely(!inode->i_sb)) { - WARN(1, "current_time() called with uninitialized super_block in the inode"); - return now; - } + if (is_multigrain_ts(inode)) + now = current_mg_time(inode); + else + ktime_get_coarse_real_ts64(&now); return timestamp_truncate(now, inode); } diff --git a/fs/stat.c b/fs/stat.c index 9b513a142a56..74d8283cc5c6 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -26,6 +26,38 @@ #include "internal.h" #include "mount.h" +/** + * fill_multigrain_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED + * @request_mask: STATX_* values requested + * @inode: inode from which to grab the c/mtime + * @stat: where to store the resulting values + * + * Given @inode, grab the ctime and mtime out if it and store the result + * in @stat. When fetching the value, flag it as queried so the next write + * will use a fine-grained timestamp. + */ +void fill_multigrain_cmtime(u32 request_mask, struct inode *inode, + struct kstat *stat) +{ + atomic_long_t *pnsec = (atomic_long_t *)&inode->i_ctime.tv_nsec; + + /* If neither time was requested, then don't report them */ + if (!(request_mask & (STATX_CTIME|STATX_MTIME))) { + stat->result_mask &= ~(STATX_CTIME|STATX_MTIME); + return; + } + + stat->mtime = inode->i_mtime; + stat->ctime.tv_sec = inode->i_ctime.tv_sec; + /* + * Atomically set the QUERIED flag and fetch the new value with + * the flag masked off. + */ + stat->ctime.tv_nsec = atomic_long_fetch_or(I_CTIME_QUERIED, pnsec) & + ~I_CTIME_QUERIED; +} +EXPORT_SYMBOL(fill_multigrain_cmtime); + /** * generic_fillattr - Fill in the basic attributes from the inode struct * @idmap: idmap of the mount the inode was found from @@ -58,11 +90,16 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask, stat->rdev = inode->i_rdev; stat->size = i_size_read(inode); stat->atime = inode->i_atime; - stat->mtime = inode->i_mtime; - stat->ctime = inode->i_ctime; stat->blksize = i_blocksize(inode); stat->blocks = inode->i_blocks; + if (is_multigrain_ts(inode)) { + fill_multigrain_cmtime(request_mask, inode, stat); + } else { + stat->mtime = inode->i_mtime; + stat->ctime = inode->i_ctime; + } + if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) { stat->result_mask |= STATX_CHANGE_COOKIE; stat->change_cookie = inode_query_iversion(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index d5896f90093a..1f670cf1edbd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1474,7 +1474,7 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb, kgid_has_mapping(fs_userns, kgid); } -extern struct timespec64 current_time(struct inode *inode); +struct timespec64 current_time(struct inode *inode); /* * Snapshotting support. @@ -2212,6 +2212,7 @@ struct file_system_type { #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */ #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */ +#define FS_MULTIGRAIN_TS 64 /* Filesystem uses multigrain timestamps */ #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ int (*init_fs_context)(struct fs_context *); const struct fs_parameter_spec *parameters; @@ -2235,6 +2236,67 @@ struct file_system_type { #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME) +/* + * Multigrain timestamps + * + * Conditionally use fine-grained ctime and mtime timestamps when there + * are users actively observing them via getattr. The primary use-case + * for this is NFS clients that use the ctime to distinguish between + * different states of the file, and that are often fooled by multiple + * operations that occur in the same coarse-grained timer tick. + */ +static inline bool is_multigrain_ts(const struct inode *inode) +{ + return inode->i_sb->s_type->fs_flags & FS_MULTIGRAIN_TS; +} + +/* + * The kernel always keeps normalized struct timespec64 values in the ctime, + * which means that only the first 30 bits of the value are used. Use the + * 31st bit of the ctime's tv_nsec field as a flag to indicate that the value + * has been queried since it was last updated. + */ +#define I_CTIME_QUERIED (1L<<30) + +/** + * ctime_nsec_peek - peek at (but don't query) the ctime tv_nsec field + * @inode: inode to fetch the ctime from + * + * Grab the current ctime tv_nsec field from the inode, mask off the + * I_CTIME_QUERIED flag and return it. This is mostly intended for use by + * internal consumers of the ctime that aren't concerned with ensuring a + * fine-grained update on the next change (e.g. when preparing to store + * the value in the backing store for later retrieval). + * + * This is safe to call regardless of whether the underlying filesystem + * is using multigrain timestamps. + */ +static inline long ctime_nsec_peek(const struct inode *inode) +{ + return inode->i_ctime.tv_nsec &~ I_CTIME_QUERIED; +} + +/** + * ctime_peek - peek at (but don't query) the ctime + * @inode: inode to fetch the ctime from + * + * Grab the current ctime from the inode, sans I_CTIME_QUERIED flag. For + * use by internal consumers that don't require a fine-grained update on + * the next change. + * + * This is safe to call regardless of whether the underlying filesystem + * is using multigrain timestamps. + */ +static inline struct timespec64 ctime_peek(const struct inode *inode) +{ + struct timespec64 ctime; + + ctime.tv_sec = inode->i_ctime.tv_sec; + ctime.tv_nsec = ctime_nsec_peek(inode); + + return ctime; +} + extern struct dentry *mount_bdev(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, int (*fill_super)(struct super_block *, void *, int)); @@ -2857,6 +2919,8 @@ extern void page_put_link(void *); extern int page_symlink(struct inode *inode, const char *symname, int len); extern const struct inode_operations page_symlink_inode_operations; extern void kfree_link(void *); +void fill_multigrain_cmtime(u32 request_mask, struct inode *inode, + struct kstat *stat); void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *); void generic_fill_statx_attr(struct inode *inode, struct kstat *stat); extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
The VFS always uses coarse-grained timestamp updates for filling out the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues (e.g backup applications). Switching to always using fine-grained timestamps would improve the situation, but that becomes rather expensive, as the underlying filesystem will have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Whenever the mtime changes, the ctime must also change. Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the i_mtime or i_ctime. When this flag is set, on the next timestamp update, the kernel can fetch a fine-grained timestamp instead of the usual coarse-grained one. This patch adds the infrastructure this scheme. Filesytems can opt into it by setting the FS_MULTIGRAIN_TS flag in the fstype. Later patches will convert individual filesystems over to use it. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/inode.c | 48 ++++++++++++++++++++++++++++----- fs/stat.c | 41 ++++++++++++++++++++++++++-- include/linux/fs.h | 66 +++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 145 insertions(+), 10 deletions(-)