Message ID | 20141127153315.GC14091@thunk.org (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
I don't think this scheme works well. As mentioned earlier XFS doesn't even use vfs dirty tracking at the moment, so introducing this in a hidden way sounds like a bad idea. Probably the same for btrfs. I'd rather keep update_time as-is for now, don't add ->write_time and let btrfs and XFS figure out how to implement the semantics on their own. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 27, 2014 at 08:49:52AM -0800, Christoph Hellwig wrote: > I don't think this scheme works well. As mentioned earlier XFS doesn't > even use vfs dirty tracking at the moment, so introducing this in a > hidden way sounds like a bad idea. Probably the same for btrfs. > > I'd rather keep update_time as-is for now, don't add ->write_time and > let btrfs and XFS figure out how to implement the semantics on their > own. I can do that, but part of the reason why we were doing this rather involved set of changes was to allow other file systems to be able to take advantage of lazytime. I suppose there is value in allowing other file systems, such as jfs, f2fs, etc., to use it, but still, it's a bit of a shame to drop btrfs and xfs support for this feature. I'll note by the way that ext3 and ext4 doesn't really use VFS dirty tracking either --- see my other comments about the naming of "mark_inode_dirty" being a bit misleading, at least for all/most of the major file systems. The problem seems to be that replacement schemes that we've all using are slightly different. :-/ I suppose should let the btrfs folks decide whether they want to add is_readonly() and write_time() function --- or maybe help with the cleanup work so that mark_inode_dirty() can reflect an error to its callers. Chris, David, what do you think? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 27, 2014 at 03:27:31PM -0500, Theodore Ts'o wrote: > I can do that, but part of the reason why we were doing this rather > involved set of changes was to allow other file systems to be able to > take advantage of lazytime. I suppose there is value in allowing > other file systems, such as jfs, f2fs, etc., to use it, but still, > it's a bit of a shame to drop btrfs and xfs support for this feature. I want to see xfs and btrfs support, but I think we're running in some conceptual problems here. I don't have the time right now to fully review the XFS changes for correctness and test them, and I'd rather keep things as-is for a while and then add properly designed and fully teste support in rather than something possible broken. > I'll note by the way that ext3 and ext4 doesn't really use VFS dirty > tracking either --- see my other comments about the naming of > "mark_inode_dirty" being a bit misleading, at least for all/most of > the major file systems. The problem seems to be that replacement > schemes that we've all using are slightly different. :-/ Indeed. It seems all existing ->dirty_inode instances basically just try to work around the problem that the VFS simply updates timestamps by writing into the inode without involving the filesystem. There are all kinds of bugs in different instances, as well as comments mentioning an assumption that this only happens for atime although the VFS also dos this "trick" for c/mtime, including a caller from the page fault code that the filesystems can't even avoid by providing non-default methods everywhere. > I suppose should let the btrfs folks decide whether they want to add > is_readonly() and write_time() function --- or maybe help with the > cleanup work so that mark_inode_dirty() can reflect an error to its > callers. Chris, David, what do you think? The ->is_readonly method seems like a clear winner to me, I'm all for adding it, and thus suggested moving it first in the series. I've read a bit more through the series and would like to suggest the following approach for the rest: - convert ext3/4 to use ->update_time instead of the ->dirty_time callout so it gets and exact notifications (preferably the few remaining filesystems as well, although that shouldn't really be a blocker) - defer timestamp updates for any filesystems not defining ->update_time (or ->dirty_time for now), and allow filesystems using ->update_time to defer the update as well by calling mark_inode_dirty with the I_DIRTY_TIME flag so that XFS and btrfs don't have to opt-in without testing. - Convert xfs, btrfs and the remaining filesystes using ->dirty_inode incrementally. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote: > > The ->is_readonly method seems like a clear winner to me, I'm all for > adding it, and thus suggested moving it first in the series. It's a real winner for me as well, but the reason why I dropped it is because if btrfs() has to keep its ->update_time function, we wouldn't actually have a user for is_readonly(). I suppose we could have update_time() call ->is_readonly() and then ->update_time() if they exist, but it only seemed to add an extra call and a bit of extra overhead without really simplifying things for btrfs. If there were other users of ->is_readonly, then it would make sense, but it seemed better to move into a separate code refactoring series. > I've read a bit more through the series and would like to suggest > the following approach for the rest: > > - convert ext3/4 to use ->update_time instead of the ->dirty_time > callout so it gets and exact notifications (preferably the few > remaining filesystems as well, although that shouldn't really be a > blocker) We could do that, although ext3/4's ->update_time() would be exactly the same as the generic update_time() function, so there would be code duplication. If the goal is to get rid of the magic in -->dirty_inode() being used to work around how the VFS makes changes to fields that end up in the on-disk inode, we would need to audit a lot of extra code paths; at the very least, in how the generic quota code handles updates to i_size and i_blocks (for example). And BTW, we don't actually have a dirty_time() function any more in the current patch series. update_time() is currently looking like this: static int update_time(struct inode *inode, struct timespec *time, int flags) { if (inode->i_op->update_time) return inode->i_op->update_time(inode, time, flags); if (flags & S_ATIME) inode->i_atime = *time; if (flags & S_VERSION) inode_inc_iversion(inode); if (flags & S_CTIME) inode->i_ctime = *time; if (flags & S_MTIME) inode->i_mtime = *time; if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) && !(inode->i_state & I_DIRTY)) __mark_inode_dirty(inode, I_DIRTY_TIME); else __mark_inode_dirty(inode, I_DIRTY_SYNC); return 0; } > - Convert xfs, btrfs and the remaining filesystes using ->dirty_inode > incrementally. Right, so xfs and btrfs (which are the two file systems that have update_time at the moment) can just drop update_time() and then check the ->dirty_time() for (flags & I_DIRTY_TIME). Hmm, I suspect this might be better for xfs, yes? if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) && !(inode->i_state & I_DIRTY)) __mark_inode_dirty(inode, I_DIRTY_TIME); else __mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME); XFS doesn't have a ->dirty_time yet, but that way XFS would be able to use the I_DIRTY_TIME flag to log the journal timestamps if it so desires, and perhaps drop the need for it to use update_time(). (And with XFS doing logical journalling, it may be that you might want to include the timestamp update in the journal if you have a journal transaction open already, so the disk is spun up or likely to be spin up anyway, right?) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote: > On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote: > > > > The ->is_readonly method seems like a clear winner to me, I'm all for > > adding it, and thus suggested moving it first in the series. > > It's a real winner for me as well, but the reason why I dropped it is > because if btrfs() has to keep its ->update_time function, we wouldn't > actually have a user for is_readonly(). I suppose we could have > update_time() call ->is_readonly() and then ->update_time() if they > exist, but it only seemed to add an extra call and a bit of extra > overhead without really simplifying things for btrfs. We would use is_readonly in order to remove some extra checks from btrfs (setxattr, removexattr, possibly setsize). > If there were other users of ->is_readonly, then it would make sense, > but it seemed better to move into a separate code refactoring series. Yeah it would be better addressed separately as it's not the point of lazytime patchset and only turned out to be a good idea during the iterations. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote: > > - convert ext3/4 to use ->update_time instead of the ->dirty_time > > callout so it gets and exact notifications (preferably the few > > remaining filesystems as well, although that shouldn't really be a > > blocker) > > We could do that, although ext3/4's ->update_time() would be exactly > the same as the generic update_time() function, so there would be code > duplication. If the goal is to get rid of the magic in > -->dirty_inode() being used to work around how the VFS makes changes > to fields that end up in the on-disk inode, we would need to audit a > lot of extra code paths; at the very least, in how the generic quota > code handles updates to i_size and i_blocks (for example). > > And BTW, we don't actually have a dirty_time() function any more in > the current patch series. update_time() is currently looking like > this: Sorry, I actually meant ->dirty_inode, which is where ext4 currently hooks in for time updates. ->update_time was introduced to a) more specificly catch the kind of update b) allow the filesystem to take locks or a start a transaction before the inode fields are updated to provide proper atomicy. It seems like the quota code has the same problem, but given that neither XFS nor btrfs use it it seems like no one cared enough to sort it out properly.. > static int update_time(struct inode *inode, struct timespec *time, int flags) > { > if (inode->i_op->update_time) > return inode->i_op->update_time(inode, time, flags); > > if (flags & S_ATIME) > inode->i_atime = *time; > if (flags & S_VERSION) > inode_inc_iversion(inode); > if (flags & S_CTIME) > inode->i_ctime = *time; > if (flags & S_MTIME) > inode->i_mtime = *time; > > if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) && > !(inode->i_state & I_DIRTY)) > __mark_inode_dirty(inode, I_DIRTY_TIME); > else > __mark_inode_dirty(inode, I_DIRTY_SYNC); > return 0; Why do you need the additional I_DIRTY flag? A "lesser" __mark_inode_dirty should never override a stronger one. Otherwise this looks fine to me, except that I would split the default implementation into a new generic_update_time helper. > XFS doesn't have a ->dirty_time yet, but that way XFS would be able to > use the I_DIRTY_TIME flag to log the journal timestamps if it so > desires, and perhaps drop the need for it to use update_time(). We will probably always need a ->update_time to proide proper locking around the timestamp updates. > (And > with XFS doing logical journalling, it may be that you might want to > include the timestamp update in the journal if you have a journal > transaction open already, so the disk is spun up or likely to be spin > up anyway, right?) XFS transactions are explicitly opened and closed, so during the atime updates we'll never have one open. What we could try is to have CIL items that are on "indefinit" hold before they are batched into a checkpoint. We'd still commit them to an in-memory transaction in ->upate_time for that. All this requires a lot of through and will take some time, though. In the current from the generic lazytime might even be a loss for XFS as we're already really good at batching updates from multiple inodes in the same cluster for the in-place writeback, so I really don't want to just enable it without those optimizations without a lot of testing. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote: > Why do you need the additional I_DIRTY flag? A "lesser" > __mark_inode_dirty should never override a stronger one. Agreed, will fix. > Otherwise this looks fine to me, except that I would split the default > implementation into a new generic_update_time helper. Sure, I can do that. > > XFS doesn't have a ->dirty_time yet, but that way XFS would be able to > > use the I_DIRTY_TIME flag to log the journal timestamps if it so > > desires, and perhaps drop the need for it to use update_time(). > > We will probably always need a ->update_time to proide proper locking > around the timestamp updates. Couldn't you let the VFS set the inode timesstamps and then have xfs's ->dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk inode structure under the appropriate lock, or am I missing something? > In the current from the generic lazytime might even be a loss for XFS as > we're already really good at batching updates from multiple inodes in > the same cluster for the in-place writeback, so I really don't want > to just enable it without those optimizations without a lot of testing. Fair enough; it's not surprising that this might be much more effective as an optimization for ext4, for no other reason that timestamp updates are so much heavyweight for us. I suspect that it should be a win for btrfs, though, and it should definitely be a win for those file systems that don't use journalling at all. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/fs.h b/include/linux/fs.h index f4b0ecd..befd5d2 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1545,7 +1545,8 @@ struct inode_operations { int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); int (*is_readonly)(struct inode *); - int (*update_time)(struct inode *, struct timespec *, int); + void (*update_time)(struct inode *); + int (*write_time)(struct inode *); int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); diff --git a/fs/inode.c b/fs/inode.c index 53f0173..94bc908 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1506,9 +1506,6 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (ret) return ret; } - if (inode->i_op->update_time) - return inode->i_op->update_time(inode, time, flags); - if (flags & S_ATIME) inode->i_atime = *time; if (flags & S_VERSION) @@ -1517,6 +1514,10 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) inode->i_ctime = *time; if (flags & S_MTIME) inode->i_mtime = *time; + if (inode->i_op->update_time) + inode->i_op->update_time(inode); + if (inode->i_op->write_time) + return inode->i_op->write_time(inode); mark_inode_dirty_sync(inode); return 0; } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index ec6dcdc..b69493d 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -983,42 +983,42 @@ xfs_vn_setattr( return error; } -STATIC int +STATIC void xfs_vn_update_time( - struct inode *inode, - struct timespec *now, - int flags) + struct inode *inode) +{ + struct xfs_inode *ip = XFS_I(inode); + + trace_xfs_update_time(ip); + xfs_ilock(ip, XFS_ILOCK_EXCL); + ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec; + ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec; + + ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec; + ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec; + + ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec; + ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec; + xfs_iunlock(ip, XFS_ILOCK_EXCL); +} + +STATIC int +xfs_vn_write_time( + struct inode *inode) { struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; struct xfs_trans *tp; int error; - trace_xfs_update_time(ip); - + trace_xfs_write_time(ip); tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS); error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0); if (error) { xfs_trans_cancel(tp, 0); return error; } - xfs_ilock(ip, XFS_ILOCK_EXCL); - if (flags & S_CTIME) { - inode->i_ctime = *now; - ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec; - ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec; - } - if (flags & S_MTIME) { - inode->i_mtime = *now; - ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec; - ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec; - } - if (flags & S_ATIME) { - inode->i_atime = *now; - ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec; - ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec; - } xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP); return xfs_trans_commit(tp, 0); @@ -1130,6 +1130,7 @@ static const struct inode_operations xfs_inode_operations = { .listxattr = xfs_vn_listxattr, .fiemap = xfs_vn_fiemap, .update_time = xfs_vn_update_time, + .write_time = xfs_vn_write_time, }; static const struct inode_operations xfs_dir_inode_operations = { @@ -1157,6 +1158,7 @@ static const struct inode_operations xfs_dir_inode_operations = { .removexattr = generic_removexattr, .listxattr = xfs_vn_listxattr, .update_time = xfs_vn_update_time, + .write_time = xfs_vn_write_time, .tmpfile = xfs_vn_tmpfile, }; @@ -1185,6 +1187,7 @@ static const struct inode_operations xfs_dir_ci_inode_operations = { .removexattr = generic_removexattr, .listxattr = xfs_vn_listxattr, .update_time = xfs_vn_update_time, + .write_time = xfs_vn_write_time, .tmpfile = xfs_vn_tmpfile, }; @@ -1199,6 +1202,7 @@ static const struct inode_operations xfs_symlink_inode_operations = { .removexattr = generic_removexattr, .listxattr = xfs_vn_listxattr, .update_time = xfs_vn_update_time, + .write_time = xfs_vn_write_time, }; STATIC void diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 51372e3..09d261c 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -677,6 +677,7 @@ DEFINE_INODE_EVENT(xfs_file_fsync); DEFINE_INODE_EVENT(xfs_destroy_inode); DEFINE_INODE_EVENT(xfs_evict_inode); DEFINE_INODE_EVENT(xfs_update_time); +DEFINE_INODE_EVENT(xfs_write_time); DEFINE_INODE_EVENT(xfs_dquot_dqalloc); DEFINE_INODE_EVENT(xfs_dquot_dqdetach); diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index b30753c..ee94a66 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -62,7 +62,8 @@ prototypes: ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); - void (*update_time)(struct inode *, struct timespec *, int); + void (*update_time)(struct inode *); + void (*write_time)(struct inode *); int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); @@ -95,6 +96,7 @@ listxattr: no removexattr: yes fiemap: no update_time: no +write_time: no atomic_open: yes tmpfile: no dentry_open: no diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bd46a22..a81a0652 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5563,26 +5563,8 @@ static int btrfs_is_readonly(struct inode *inode) return 0; } -/* - * This is a copy of file_update_time. We need this so we can return error on - * ENOSPC for updating the inode in the case of file write and mmap writes. - */ -static int btrfs_update_time(struct inode *inode, struct timespec *now, - int flags) +static int btrfs_write_time(struct inode *inode) { - struct btrfs_root *root = BTRFS_I(inode)->root; - - if (btrfs_root_readonly(root)) - return -EROFS; - - if (flags & S_VERSION) - inode_inc_iversion(inode); - if (flags & S_CTIME) - inode->i_ctime = *now; - if (flags & S_MTIME) - inode->i_mtime = *now; - if (flags & S_ATIME) - inode->i_atime = *now; return btrfs_dirty_inode(inode); } @@ -9471,7 +9453,7 @@ static const struct inode_operations btrfs_dir_inode_operations = { .get_acl = btrfs_get_acl, .set_acl = btrfs_set_acl, .is_readonly = btrfs_is_readonly, - .update_time = btrfs_update_time, + .write_time = btrfs_write_time, .tmpfile = btrfs_tmpfile, }; static const struct inode_operations btrfs_dir_ro_inode_operations = { @@ -9480,7 +9462,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = { .get_acl = btrfs_get_acl, .set_acl = btrfs_set_acl, .is_readonly = btrfs_is_readonly, - .update_time = btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct file_operations btrfs_dir_file_operations = { @@ -9551,7 +9533,7 @@ static const struct inode_operations btrfs_file_inode_operations = { .get_acl = btrfs_get_acl, .set_acl = btrfs_set_acl, .is_readonly = btrfs_is_readonly, - .update_time = btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_special_inode_operations = { .getattr = btrfs_getattr, @@ -9564,7 +9546,7 @@ static const struct inode_operations btrfs_special_inode_operations = { .get_acl = btrfs_get_acl, .set_acl = btrfs_set_acl, .is_readonly = btrfs_is_readonly, - .update_time = btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_symlink_inode_operations = { .readlink = generic_readlink, @@ -9578,7 +9560,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = { .listxattr = btrfs_listxattr, .removexattr = btrfs_removexattr, .is_readonly = btrfs_is_readonly, - .update_time = btrfs_update_time, + .write_time = btrfs_write_time, }; const struct dentry_operations btrfs_dentry_operations = {