Message ID: 1416599964-21892-3-git-send-email-tytso@mit.edu (mailing list archive)
State: Not Applicable
On Fri, Nov 21, 2014 at 02:59:22PM -0500, Theodore Ts'o wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated when (a) the inode needs to be updated for some non-time
> related change, (b) userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests them via an fsync(2) call.
>
> For workloads which feature a large number of random writes to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDDs, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact 99.9 percentile latencies ---
> which is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
>  fs/inode.c              | 18 ++++++++++++++++++
>  fs/proc_namespace.c     |  1 +
>  fs/sync.c               |  7 +++++++
>  include/linux/fs.h      |  1 +
>  include/uapi/linux/fs.h |  1 +
>  6 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ef9bef1..ce7de22 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
>  		inode->i_state &= ~I_DIRTY_PAGES;
>  	dirty = inode->i_state & I_DIRTY;
> -	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> +	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
>  	spin_unlock(&inode->i_lock);
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
>  	iput(old_inode);
>  }
>
> +/*
> + * This works like wait_sb_inodes(), but it is called *before* we kick
> + * the bdi so the inodes can get written out.
> + */
> +static void flush_sb_dirty_time(struct super_block *sb)
> +{
> +	struct inode *inode, *old_inode = NULL;
> +
> +	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> +	spin_lock(&inode_sb_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		int dirty_time;
> +
> +		spin_lock(&inode->i_lock);
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		dirty_time = inode->i_state & I_DIRTY_TIME;
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&inode_sb_list_lock);
> +
> +		iput(old_inode);
> +		old_inode = inode;
> +
> +		if (dirty_time)
> +			mark_inode_dirty(inode);
> +		cond_resched();
> +		spin_lock(&inode_sb_list_lock);
> +	}
> +	spin_unlock(&inode_sb_list_lock);
> +	iput(old_inode);
> +}

This just seems wrong to me, not to mention extremely expensive when
we have millions of cached inodes on the superblock.

Why can't we just add a function like mark_inode_dirty_time() which
puts the inode on the dirty inode list with a writeback time 24 hours
in the future rather than 30s in the future?
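For illustration, a rough sketch of the mark_inode_dirty_time() idea
being suggested here.  This is not code from the posted patch: the
names mirror the 3.18-era writeback structures, the "already dirty"
handling and lock ordering are simplified, and the 24-hour expiry is
expressed through dirtied_when:

	static void mark_inode_dirty_time(struct inode *inode)
	{
		struct bdi_writeback *wb = &inode->i_sb->s_bdi->wb;

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_DIRTY | I_DIRTY_TIME)) {
			spin_unlock(&inode->i_lock);
			return;		/* already queued somewhere */
		}
		inode->i_state |= I_DIRTY_TIME;
		spin_unlock(&inode->i_lock);

		spin_lock(&wb->list_lock);
		/* 24 hours instead of the usual 30 seconds */
		inode->dirtied_when = jiffies + 24 * 60 * 60 * HZ;
		list_move(&inode->i_wb_list, &wb->b_dirty);
		spin_unlock(&wb->list_lock);
	}

The trade-off of this variant is discussed below: a far-future
dirtied_when interacts badly with the assumption that the dirty list
is time-ordered.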
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -534,6 +534,18 @@ static void evict(struct inode *inode)
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(!list_empty(&inode->i_lru));
>
> +	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
> +		if (inode->i_op->write_time)
> +			inode->i_op->write_time(inode);
> +		else if (inode->i_sb->s_op->write_inode) {
> +			struct writeback_control wbc = {
> +				.sync_mode = WB_SYNC_NONE,
> +			};
> +			mark_inode_dirty(inode);
> +			inode->i_sb->s_op->write_inode(inode, &wbc);
> +		}
> +	}
> +

Eviction is too late for this. I'm pretty sure that it won't get
this far as iput_final() should catch the I_DIRTY_TIME in the !drop
case via write_inode_now().

> int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> {
> +	struct inode *inode = file->f_mapping->host;
> +
>  	if (!file->f_op->fsync)
>  		return -EINVAL;
> +	if (!datasync && inode->i_state & I_DIRTY_TIME) {

FWIW, I'm surprised gcc isn't throwing warnings about that. Please
make sure there isn't any ambiguity in the code by writing those
checks like this:

	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {

> +		spin_lock(&inode->i_lock);
> +		inode->i_state |= I_DIRTY_SYNC;
> +		spin_unlock(&inode->i_lock);
> +	}
>  	return file->f_op->fsync(file, start, end, datasync);

When we mark the inode I_DIRTY_TIME, we should also be marking it
I_DIRTY_SYNC so that all the sync operations know that they should
be writing this inode. That's partly why I also think these inodes
should be tracked on the dirty inode list....

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3633239..489b2f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1721,6 +1721,7 @@ struct super_operations {
>  #define __I_DIO_WAKEUP		9
>  #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
>  #define I_LINKABLE		(1 << 10)
> +#define I_DIRTY_TIME		(1 << 11)
>
>  #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

Shouldn't I_DIRTY also include I_DIRTY_TIME now?
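As an aside on the parenthesization request above: the unparenthesized
test already parses as intended, since bitwise & binds more tightly
than && in C, so the change is purely for readability.  A stand-alone
illustration (not part of the patch):

	#include <assert.h>
	#include <stdbool.h>

	#define I_DIRTY_TIME (1 << 11)

	/* "!datasync && state & I_DIRTY_TIME" parses as
	 * "(!datasync) && (state & I_DIRTY_TIME)". */
	static bool check(bool datasync, unsigned int state)
	{
		bool implicit = !datasync && state & I_DIRTY_TIME;
		bool parenthesized = !datasync && (state & I_DIRTY_TIME);

		assert(implicit == parenthesized);	/* always holds */
		return parenthesized;
	}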
On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > +static void flush_sb_dirty_time(struct super_block *sb)
> > +{
...
> > +}
>
> This just seems wrong to me, not to mention extremely expensive when
> we have millions of cached inodes on the superblock.

#1, It only gets called on a sync(2) or syncfs(2), which can be pretty
expensive as it is, so I didn't really worry about it.

#2, We're already iterating over all of the inodes in the sync(2) or
syncfs(2) path, so the code path in question is already O(N) in the
number of inodes.

> Why can't we just add a function like mark_inode_dirty_time() which
> puts the inode on the dirty inode list with a writeback time 24
> hours in the future rather than 30s in the future?

I was concerned about putting them on the dirty inode list because it
would add extra inodes which the writeback threads would have to skip
over and ignore (since they would not be dirty in the inode or data
pages sense).

Another solution would be to use a separate linked list for dirtytime
inodes, but that means adding some extra fields to the inode
structure, which some might view as bloat.  Given #1 and #2 above,
yes, we're now iterating over all of the inodes twice, which doubles
the CPU cost for sync(2) and syncfs(2), but I believe that is a better
trade-off than bloating the inode structure with an extra set of
linked lists or increasing the CPU cost to the writeback path (which
gets executed way more often than the sync or syncfs paths).

> Eviction is too late for this. I'm pretty sure that it won't get
> this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> case via write_inode_now().

Actually, the tracepoint for fs_lazytime_evict() does get triggered
from time to time; but only when the inode is evicted due to memory
pressure, i.e., via the evict_inodes() path.

I thought about possibly doing this in iput_final(), but that would
mean that whenever we closed the last fd on the file, we would push
the inode out to disk.  For files that we are writing, that's not so
bad; but if we enable strictatime with lazytime, then we would be
updating the atime for inodes that had only been read on every close
--- which, in the case of say all of the files in the kernel tree,
would be a little unfortunate.

Of course, the combination of strictatime and lazytime would result
in a pretty heavy write load on a umount or sync(2), so I suspect
keeping the relatime mode would make sense for most people.  But for
those people who need strict POSIX compliance, it seemed like doing
something that worked well for strictatime plus lazytime would be a
good thing, which is why I tried to defer things as much as possible.

> 	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
>
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state |= I_DIRTY_SYNC;
> > +		spin_unlock(&inode->i_lock);
> > +	}
> > 	return file->f_op->fsync(file, start, end, datasync);
>
> When we mark the inode I_DIRTY_TIME, we should also be marking it
> I_DIRTY_SYNC so that all the sync operations know that they should
> be writing this inode. That's partly why I also think these inodes
> should be tracked on the dirty inode list....

The whole point of what I was doing is that I_DIRTY_TIME is not part
of I_DIRTY, and that in update_time() we set I_DIRTY_TIME instead of
I_DIRTY_SYNC.
The goal is that these inodes would not end up on the dirty list,
because that way they wouldn't be forced out to disk until either (a)
the inode is written out for other reasons (i.e., a change in i_size
or i_blocks, etc.), (b) the inode is explicitly fsync'ed, or (c) the
umount(2), sync(2), or syncfs(2) of the file system.  That way, the
timestamps are in the in-memory copy of the inode, but *not* written
to disk until they are opportunistically written out due to some
other modification of the inode which sets I_DIRTY_SYNC.

If we were to set I_DIRTY_SYNC alongside I_DIRTY_TIME, and put these
inodes on the dirty inode list, then I_DIRTY_TIME would effectively
be a no-op, and there would be no point to this whole exercise.

It may be that lazytime will never be the default, because it is so
different from what we are currently doing.  But I think it is worth
doing, even if it is an optional mount option which is only used
under special circumstances.  For myself, we will be using it in
Google and I will be using it on my laptop, because it definitely
reduces the write load to the SSD --- I've measured this via the
tracepoints.

If there are significant objections to doing this in the VFS layer,
I'm happy to go back to doing this as ext4-specific code.  There were
a few bits that were a bit more dodgy, and I can't make sync(2) and
syncfs(2) flush the dirty timestamps if I do it as an ext4-specific
hack, but for my use cases, I don't really care about those things.
The main reason why I redid this patch set as a VFS-level change was
because Christoph and others specifically requested it.  But if you
strongly object, I can always go back to doing this in the ext4
code....

Cheers,

					- Ted
On 11/25/2014 06:33 AM, Theodore Ts'o wrote:
<>
>
> I was concerned about putting them on the dirty inode list because it
> would add extra inodes which the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).
>
> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.

You could use the same list-head for both lists.

If the inode is on the dirty-inode-list, then there is no need to add
it to the list-for-dirtytime; it will be written soon anyway.  Else
you add it to the list-for-dirtytime.

If you (real)dirty an inode, then you first remove it from the
list-for-dirtytime, and then add it to the dirty-inode-list.

So at any given time it is only on one list.
<>

Cheers
Boaz
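A sketch of what Boaz describes, assuming a hypothetical extra
b_dirty_time list in struct bdi_writeback (the list and function names
are invented for illustration, and the locking is simplified):

	/*
	 * One list_head, two lists: an inode sits on either the real
	 * dirty list or the timestamp-only list, never both.
	 */
	static void lazytime_queue(struct inode *inode, bool time_only)
	{
		struct bdi_writeback *wb = &inode->i_sb->s_bdi->wb;

		spin_lock(&wb->list_lock);
		if (time_only) {
			/* Already on a dirty list?  It will be written
			 * soon anyway; otherwise park it on the
			 * timestamp-only list. */
			if (list_empty(&inode->i_wb_list))
				list_add(&inode->i_wb_list, &wb->b_dirty_time);
		} else {
			/* Real dirtying wins: pull the inode off the
			 * timestamp-only list (if it was there) and
			 * queue it for normal 30s background writeback. */
			inode->dirtied_when = jiffies;
			list_move(&inode->i_wb_list, &wb->b_dirty);
		}
		spin_unlock(&wb->list_lock);
	}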
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
>
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.
>
> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.  For files that we are writing, that's not so
> bad; but if we enable strictatime with lazytime, then we would be
> updating the atime for inodes that had only been read on every close
> --- which, in the case of say all of the files in the kernel tree,
> would be a little unfortunate.

Actually, I'd also prefer to do the writing from iput_final().  My
main reason is that the shrinker starts behaving very differently
when you put inodes with I_DIRTY_TIME on the LRU.  See
inode_lru_isolate() and in particular:

	/*
	 * Referenced or dirty inodes are still in use. Give them another
	 * pass through the LRU as we canot reclaim them now.
	 */
	if (atomic_read(&inode->i_count) ||
	    (inode->i_state & ~I_REFERENCED)) {
		list_del_init(&inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Regarding your concern that we'd write the inode when the file is
closed - that's not true.  We'll write the inode only after the
corresponding dentry is evicted and thus drops its inode reference.
That doesn't seem too bad to me.

								Honza
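A minimal sketch of the iput_final() placement Jan is advocating
(assumed helper, not from the posted patch): flush timestamp-only
dirt when the last reference is dropped, before the inode ever
reaches the LRU, so inode_lru_isolate() never sees I_DIRTY_TIME:

	static void iput_final_flush_time(struct inode *inode)
	{
		/* Unlinked inodes are going away; their times don't matter. */
		if (!inode->i_nlink || !(inode->i_state & I_DIRTY_TIME))
			return;

		/* Promote the timestamp dirt to real dirt and write it
		 * out before the inode is parked on the LRU. */
		mark_inode_dirty_sync(inode);
		write_inode_now(inode, 0);
	}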
On Mon 24-11-14 23:33:35, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
> ...
> > > +}
> >
> > This just seems wrong to me, not to mention extremely expensive when
> > we have millions of cached inodes on the superblock.
>
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
>
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
>
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
>
> I was concerned about putting them on the dirty inode list because it
> would add extra inodes which the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).

I agree this isn't going to work easily.  Currently the flusher relies
on the dirty list being sorted by i_dirtied_when, and that gets harder
to maintain if we ever have inodes with i_dirtied_when in the future
(we'd have to sort-in newly dirtied inodes).

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.  Given #1 and #2 above,
> yes, we're now iterating over all of the inodes twice, which doubles
> the CPU cost for sync(2) and syncfs(2), but I believe that is a
> better trade-off than bloating the inode structure with an extra set
> of linked lists or increasing the CPU cost to the writeback path
> (which gets executed way more often than the sync or syncfs paths).

This would be possible, and as Boaz says, it might be possible to
reuse the same list_head in the inode for this.  Getting rid of the
full scan of all superblock inodes would be nice, as the scan gets
really expensive for large numbers of inodes (think of i_sb_lock
contention) and this makes it twice as bad, so I'd prefer to do this
if possible.

								Honza
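The invariant Jan refers to, in simplified form (illustration only;
the real code is the expiry scan in fs/fs-writeback.c): the scan can
stop at the first inode that was dirtied too recently precisely
because the dirty list is kept time-ordered, and inodes dirtied "24
hours in the future" would break that early exit:

	/* Simplified: assume b_dirty is sorted oldest-first, so the
	 * scan can stop at the first inode dirtied too recently. */
	static void queue_expired(struct list_head *b_dirty,
				  struct list_head *b_io,
				  unsigned long older_than_this)
	{
		struct inode *inode, *tmp;

		list_for_each_entry_safe(inode, tmp, b_dirty, i_wb_list) {
			if (time_after(inode->dirtied_when, older_than_this))
				break;	/* everything after this is newer */
			list_move(&inode->i_wb_list, b_io);
		}
	}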
On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> Actually, I'd also prefer to do the writing from iput_final().  My
> main reason is that the shrinker starts behaving very differently
> when you put inodes with I_DIRTY_TIME on the LRU.  See
> inode_lru_isolate() and in particular:
>
> 	/*
> 	 * Referenced or dirty inodes are still in use. Give them another
> 	 * pass through the LRU as we canot reclaim them now.
> 	 */
> 	if (atomic_read(&inode->i_count) ||
> 	    (inode->i_state & ~I_REFERENCED)) {
> 		list_del_init(&inode->i_lru);
> 		spin_unlock(&inode->i_lock);
> 		this_cpu_dec(nr_unused);
> 		return LRU_REMOVED;
> 	}

I must be missing something; how would the shrinker behave
differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
deliberate, because I didn't want I_DIRTY_TIME to have any effect on
any of the other parts of the writeback or inode management code.

> Regarding your concern that we'd write the inode when the file is
> closed - that's not true.  We'll write the inode only after the
> corresponding dentry is evicted and thus drops its inode reference.
> That doesn't seem too bad to me.

True, fair enough.  It's not quite so lazy, but it should be close
enough.

I'm still not seeing the benefit in waiting until the last possible
minute to write out the timestamps; evict() can block as it is if
there is any writeback that needs to be completed, and if the
writeback happens to pages subject to delalloc, the timestamp update
could happen for free at that point.

					- Ted
On Tue, Nov 25, 2014 at 06:30:40PM +0100, Jan Kara wrote:
> This would be possible, and as Boaz says, it might be possible to
> reuse the same list_head in the inode for this.  Getting rid of the
> full scan of all superblock inodes would be nice, as the scan gets
> really expensive for large numbers of inodes (think of i_sb_lock
> contention) and this makes it twice as bad, so I'd prefer to do this
> if possible.

Fair enough, I'll give this a try.

Personally, I've never been that solicitous towards the efficiency of
sync, since if you ever use it, you tend to destroy performance just
because of the contention for the disk drive head caused by the
writeback, never mind the i_sb_lock contention.  ("I am sync(2), the
destroyer of tail latency SLO's...")

In fact, there has sometimes been discussion about disabling sync(2)
for non-root users, because the opportunity for mischief when a
developer logs in and types sync out of reflex is too high.  Of
course, if we ever did this, I'm sure such a patch would never be
accepted upstream, but that's OK; most people don't seem to care
about tail latency outside of Facebook and Google anyway....

					- Ted
On Tue 25-11-14 12:57:16, Ted Tso wrote:
> On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
> > Actually, I'd also prefer to do the writing from iput_final().  My
> > main reason is that the shrinker starts behaving very differently
> > when you put inodes with I_DIRTY_TIME on the LRU.  See
> > inode_lru_isolate() and in particular:
> >
> > 	/*
> > 	 * Referenced or dirty inodes are still in use. Give them another
> > 	 * pass through the LRU as we canot reclaim them now.
> > 	 */
> > 	if (atomic_read(&inode->i_count) ||
> > 	    (inode->i_state & ~I_REFERENCED)) {
> > 		list_del_init(&inode->i_lru);
> > 		spin_unlock(&inode->i_lock);
> > 		this_cpu_dec(nr_unused);
> > 		return LRU_REMOVED;
> > 	}
>
> I must be missing something; how would the shrinker behave
> differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
> note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
> deliberate, because I didn't want I_DIRTY_TIME to have any effect on
> any of the other parts of the writeback or inode management code.

Sure, but the test checks whether the inode has *any* bit other than
I_REFERENCED set.  So I_DIRTY_TIME will trigger the test and we just
remove the inode from the LRU list.  You could exclude I_DIRTY_TIME
from this test to avoid the problem, but then the shrinker latency
would get much larger because it would suddenly do IO in evict().  So
I still think doing the write in iput_final() is the best solution.

> > Regarding your concern that we'd write the inode when the file is
> > closed - that's not true.  We'll write the inode only after the
> > corresponding dentry is evicted and thus drops its inode reference.
> > That doesn't seem too bad to me.
>
> True, fair enough.  It's not quite so lazy, but it should be close
> enough.
>
> I'm still not seeing the benefit in waiting until the last possible
> minute to write out the timestamps; evict() can block as it is if
> there is any writeback that needs to be completed, and if the
> writeback happens to pages subject to delalloc, the timestamp update
> could happen for free at that point.

Yeah, doing IO from evict() is OK in principle, but the change in
shrinker success rate / latency worries me...  It would certainly need
careful testing under memory pressure & IO load with lots of
outstanding timestamp updates, to see how the shrinker behaves
(change in CPU consumption, numbers of evicted inodes, etc.).

								Honza
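For concreteness, the alternative Jan is rejecting would look
something like this hypothetical tweak to the inode_lru_isolate()
check quoted above (illustration only, not a proposed change):

	/* Hypothetical: treat I_DIRTY_TIME like I_REFERENCED so that
	 * timestamp-dirty inodes stay reclaimable -- at the price of
	 * evict() having to do the deferred IO. */
	if (atomic_read(&inode->i_count) ||
	    (inode->i_state & ~(I_REFERENCED | I_DIRTY_TIME))) {
		list_del_init(&inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}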
On Mon, Nov 24, 2014 at 11:33:35PM -0500, Theodore Ts'o wrote:
> On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > > +static void flush_sb_dirty_time(struct super_block *sb)
> > > +{
> ...
> > > +}
> >
> > This just seems wrong to me, not to mention extremely expensive when
> > we have millions of cached inodes on the superblock.
>
> #1, It only gets called on a sync(2) or syncfs(2), which can be pretty
> expensive as it is, so I didn't really worry about it.
>
> #2, We're already iterating over all of the inodes in the sync(2) or
> syncfs(2) path, so the code path in question is already O(N) in the
> number of inodes.
>
> > Why can't we just add a function like mark_inode_dirty_time() which
> > puts the inode on the dirty inode list with a writeback time 24
> > hours in the future rather than 30s in the future?
>
> I was concerned about putting them on the dirty inode list because it
> would add extra inodes which the writeback threads would have to skip
> over and ignore (since they would not be dirty in the inode or data
> pages sense).

Create another list - we already have multiple dirty inode lists in
the struct bdi_writeback....

> Another solution would be to use a separate linked list for dirtytime
> inodes, but that means adding some extra fields to the inode
> structure, which some might view as bloat.

We already have an i_wb_list in the inode for tracking the dirty list
the inode belongs to.

> Given #1 and #2 above, yes, we're now iterating over all of the
> inodes twice, which doubles the CPU cost for sync(2) and syncfs(2),
> but I believe that is a better trade-off than bloating the inode
> structure with an extra set of linked lists or increasing the CPU
> cost to the writeback path (which gets executed way more often than
> the sync or syncfs paths).

There is no need to bloat the inode at all, therefore we shouldn't be
causing sync/syncfs regressions by enabling lazytime...

> > Eviction is too late for this. I'm pretty sure that it won't get
> > this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> > case via write_inode_now().
>
> Actually, the tracepoint for fs_lazytime_evict() does get triggered
> from time to time; but only when the inode is evicted due to memory
> pressure, i.e., via the evict_inodes() path.

That's indicative of a bug - if it's dirty then you shouldn't be
evicting it.  The LRU shrinker explicitly avoids reclaiming dirty
inodes.  Also, evict_inodes() is only called in the unmount path, and
that happens after a sync_filesystem() call, so it shouldn't be
finding dirty inodes either....

> I thought about possibly doing this in iput_final(), but that would
> mean that whenever we closed the last fd on the file, we would push
> the inode out to disk.

I get the feeling from your responses that you really don't
understand the VFS inode lifecycle or how the writeback code works.
Inodes don't get dropped from the inode cache when the last open fd
on them is closed unless they are an unlinked file.  The dentry cache
still holds a reference to the inode....

> For files that we are writing, that's not so bad; but if we enable
> strictatime with lazytime, then we would be updating the atime for
> inodes that had only been read on every close --- which, in the case
> of say all of the files in the kernel tree, would be a little
> unfortunate.
> Of course, the combination of strictatime and lazytime would result
> in a pretty heavy write load on a umount or sync(2), so I suspect
> keeping the relatime mode would make sense for most people.  But for
> those people who need strict POSIX compliance, it seemed like doing
> something that worked well for strictatime plus lazytime would be a
> good thing, which is why I tried to defer things as much as possible.
>
> > 	if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> >
> > > +		spin_lock(&inode->i_lock);
> > > +		inode->i_state |= I_DIRTY_SYNC;
> > > +		spin_unlock(&inode->i_lock);
> > > +	}
> > > 	return file->f_op->fsync(file, start, end, datasync);
> >
> > When we mark the inode I_DIRTY_TIME, we should also be marking it
> > I_DIRTY_SYNC so that all the sync operations know that they should
> > be writing this inode. That's partly why I also think these inodes
> > should be tracked on the dirty inode list....
>
> The whole point of what I was doing is that I_DIRTY_TIME is not part
> of I_DIRTY, and that in update_time() we set I_DIRTY_TIME instead of
> I_DIRTY_SYNC.

I_DIRTY_SYNC only matters once you get the inode into the fsync code
or deep into the inode writeback code (i.e. __writeback_single_inode()).
If we don't expire the inode at the high level writeback code, then
the only time we'll get into __writeback_single_inode() is through
specific foreground attempts to write the inode.  In which case, we
should be writing the inode if it is I_DIRTY_TIME, and so I_DIRTY_SYNC
will trigger all the correct code paths to be taken to get us to
writing it.  And then we don't need hacks like the above to get fsync
to write the inode....

> The goal is that these inodes would not end up on the dirty list,
> because that way they wouldn't be forced out to disk until either (a)
> the inode is written out for other reasons (i.e., a change in i_size
> or i_blocks, etc.), (b) the inode is explicitly fsync'ed, or (c) the
> umount(2), sync(2), or syncfs(2) of the file system.

Precisely.  If we don't expire the inode from the dirty list, then it
will only get written back when inode writeback is explicitly asked
for.

> That way, the timestamps are in the in-memory copy of the inode, but
> *not* written to disk until they are opportunistically written out
> due to some other modification of the inode which sets I_DIRTY_SYNC.
>
> If we were to set I_DIRTY_SYNC alongside I_DIRTY_TIME, and put these
> inodes on the dirty inode list, then I_DIRTY_TIME would effectively
> be a no-op, and there would be no point to this whole exercise.

Except for the fact that I_DIRTY_TIME would result in the inode being
put on a different dirty list, and then when it's marked dirty
properly it can be moved onto the dirty inode list that will trigger
background writeback in 30s....

> If there are significant objections to doing this in the VFS layer,
> I'm happy to go back to doing this as ext4-specific code.
> The main reason why I redid this patch set as a VFS-level change was
> because Christoph and others specifically requested it.  But if you
> strongly object, I can always go back to doing this in the ext4
> code....

Don't be a drama queen, Ted.  lazytime is something that needs to be
done at the VFS level, but it needs to be done efficiently.

Cheers,

Dave.
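The promotion Dave describes might look roughly like this (sketch
only; the list names are assumed, and in real code the i_state update
would live in __mark_inode_dirty() rather than a standalone helper):

	/* When a timestamp-only inode is subsequently dirtied for
	 * real, move it from the lazy list to b_dirty with a fresh
	 * timestamp so background writeback picks it up within 30s. */
	static void promote_dirty_time_inode(struct inode *inode)
	{
		struct bdi_writeback *wb = &inode->i_sb->s_bdi->wb;

		spin_lock(&inode->i_lock);
		inode->i_state &= ~I_DIRTY_TIME;
		inode->i_state |= I_DIRTY_SYNC;
		spin_unlock(&inode->i_lock);

		spin_lock(&wb->list_lock);
		inode->dirtied_when = jiffies;
		list_move(&inode->i_wb_list, &wb->b_dirty);
		spin_unlock(&wb->list_lock);
	}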
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
 	spin_unlock(&inode->i_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+	spin_lock(&inode_sb_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		int dirty_time;
+
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		dirty_time = inode->i_state & I_DIRTY_TIME;
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_sb_list_lock);
+
+		iput(old_inode);
+		old_inode = inode;
+
+		if (dirty_time)
+			mark_inode_dirty(inode);
+		cond_resched();
+		spin_lock(&inode_sb_list_lock);
+	}
+	spin_unlock(&inode_sb_list_lock);
+	iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr - writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	flush_sb_dirty_time(sb);
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..6e91aca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
+	if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+		if (inode->i_op->write_time)
+			inode->i_op->write_time(inode);
+		else if (inode->i_sb->s_op->write_inode) {
+			struct writeback_control wbc = {
+				.sync_mode = WB_SYNC_NONE,
+			};
+			mark_inode_dirty(inode);
+			inode->i_sb->s_op->write_inode(inode, &wbc);
+		}
+	}
+
 	if (!list_empty(&inode->i_wb_list))
 		inode_wb_list_del(inode);
 
@@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
 		if (flags & S_MTIME)
 			inode->i_mtime = *time;
 	}
+	if (inode->i_sb->s_flags & MS_LAZYTIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+		return 0;
+	}
 	if (inode->i_op->write_time)
 		return inode->i_op->write_time(inode);
 	mark_inode_dirty_sync(inode);
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 73ca174..f98234a 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 		{ MS_SYNCHRONOUS, ",sync" },
 		{ MS_DIRSYNC, ",dirsync" },
 		{ MS_MANDLOCK, ",mand" },
+		{ MS_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
diff --git a/fs/sync.c b/fs/sync.c
index bdc729d..db7930e 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -177,8 +177,15 @@ SYSCALL_DEFINE1(syncfs, int, fd)
  */
 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	struct inode *inode = file->f_mapping->host;
+
 	if (!file->f_op->fsync)
 		return -EINVAL;
+	if (!datasync && inode->i_state & I_DIRTY_TIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_DIRTY_SYNC;
+		spin_unlock(&inode->i_lock);
+	}
 	return file->f_op->fsync(file, start, end, datasync);
 }
 EXPORT_SYMBOL(vfs_fsync_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3633239..489b2f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1721,6 +1721,7 @@ struct super_operations {
 #define __I_DIO_WAKEUP		9
 #define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
+#define I_DIRTY_TIME		(1 << 11)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 3735fa0..cc9713a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -90,6 +90,7 @@ struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
 /* These sb flags are internal to the kernel */
 #define MS_NOSEC	(1<<28)
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) the inode needs to be updated for some non-time
related change, (b) userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests them via an fsync(2) call.

For workloads which feature a large number of random writes to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDDs, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/fs-writeback.c       | 38 +++++++++++++++++++++++++++++++++++++-
 fs/inode.c              | 18 ++++++++++++++++++
 fs/proc_namespace.c     |  1 +
 fs/sync.c               |  7 +++++++
 include/linux/fs.h      |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 65 insertions(+), 1 deletion(-)
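For reference, this is roughly how the new flag would be used from
userspace once the patch lands (sketch; the local MS_LAZYTIME define
comes from the patch above since contemporary libc headers don't have
it, and error handling is elided):

	#include <sys/mount.h>

	#define MS_LAZYTIME	(1<<25)	/* from include/uapi/linux/fs.h above */

	/* Remount a filesystem with lazy on-disk timestamp updates. */
	int enable_lazytime(const char *target)
	{
		return mount(NULL, target, NULL,
			     MS_REMOUNT | MS_LAZYTIME, NULL);
	}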