diff mbox series

[01/13] fs: avoid double-writing inodes on lazytime expiration

Message ID 20210105005452.92521-2-ebiggers@kernel.org (mailing list archive)
State New
Headers show
Series lazytime fixes and cleanups | expand

Commit Message

Eric Biggers Jan. 5, 2021, 12:54 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

When lazytime is enabled and an inode with dirty timestamps is being
expired, either due to dirtytime_expire_interval being exceeded or due
to a sync or syncfs system call, we need to inform the filesystem that
the inode is dirty so that the inode's timestamps can be copied out to
the on-disk data structures.  That's because if the filesystem supports
lazytime, it will have ignored the ->dirty_inode(inode, I_DIRTY_TIME)
notification when the timestamp was modified in memory.

Currently this is accomplished by calling mark_inode_dirty_sync() from
__writeback_single_inode().  However, this has the unfortunate side
effect of also putting the inode the writeback list.  That's not
appropriate in this case, since the inode is already being written.

That causes the inode to remain dirty after a sync.  Normally that's
just wasteful, as it causes the inode to be written twice.  But when
fscrypt is used this bug also partially breaks the
FS_IOC_REMOVE_ENCRYPTION_KEY ioctl, as the ioctl reports that files are
still in-use when they aren't.  For more details, see the original
report at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com

Fix this by calling ->dirty_inode(inode, I_DIRTY_SYNC) directly instead
of mark_inode_dirty_sync().

This fixes xfstest generic/580 when lazytime is enabled.

A later patch will introduce a ->lazytime_expired method to cleanly
separate out the lazytime expiration case, in particular for XFS which
uses the VFS-level dirtiness tracking only for lazytime.  But that's
separate from fixing this bug.  Also, note that XFS will incorrectly
ignore the I_DIRTY_SYNC notification from __writeback_single_inode()
both before and after this patch, as I_DIRTY_TIME was already cleared in
i_state.  Later patches will fix this separate bug.

Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/fs-writeback.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)


base-commit: e71ba9452f0b5b2e8dc8aa5445198cd9214a6a62

Comments

Jan Kara Jan. 7, 2021, 2:47 p.m. UTC | #1
On Mon 04-01-21 16:54:40, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> When lazytime is enabled and an inode with dirty timestamps is being
> expired, either due to dirtytime_expire_interval being exceeded or due
> to a sync or syncfs system call, we need to inform the filesystem that
> the inode is dirty so that the inode's timestamps can be copied out to
> the on-disk data structures.  That's because if the filesystem supports
> lazytime, it will have ignored the ->dirty_inode(inode, I_DIRTY_TIME)
> notification when the timestamp was modified in memory.
> 
> Currently this is accomplished by calling mark_inode_dirty_sync() from
> __writeback_single_inode().  However, this has the unfortunate side
> effect of also putting the inode the writeback list.  That's not
> appropriate in this case, since the inode is already being written.
> 
> That causes the inode to remain dirty after a sync.  Normally that's
> just wasteful, as it causes the inode to be written twice.  But when
> fscrypt is used this bug also partially breaks the
> FS_IOC_REMOVE_ENCRYPTION_KEY ioctl, as the ioctl reports that files are
> still in-use when they aren't.  For more details, see the original
> report at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com
> 
> Fix this by calling ->dirty_inode(inode, I_DIRTY_SYNC) directly instead
> of mark_inode_dirty_sync().
> 
> This fixes xfstest generic/580 when lazytime is enabled.
> 
> A later patch will introduce a ->lazytime_expired method to cleanly
> separate out the lazytime expiration case, in particular for XFS which
> uses the VFS-level dirtiness tracking only for lazytime.  But that's
> separate from fixing this bug.  Also, note that XFS will incorrectly
> ignore the I_DIRTY_SYNC notification from __writeback_single_inode()
> both before and after this patch, as I_DIRTY_TIME was already cleared in
> i_state.  Later patches will fix this separate bug.
> 
> Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Good catch! It could also cause issues with filesystem freezing which kind
of assumes that the filesystem will be clean after sync_filesystem()
(otherwise writeback threads can get stalled on frozen filesystem while
holding some locks and generally the system behavior becomes kind of
awkward).

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index acfb55834af23..081e335cdee47 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1509,11 +1509,22 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  
>  	spin_unlock(&inode->i_lock);
>  
> -	if (dirty & I_DIRTY_TIME)
> -		mark_inode_dirty_sync(inode);
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & ~I_DIRTY_PAGES) {
> -		int err = write_inode(inode, wbc);
> +		int err;
> +
> +		/*
> +		 * If the inode is being written due to a lazytime timestamp
> +		 * expiration, then the filesystem needs to be notified about it
> +		 * so that e.g. the filesystem can update on-disk fields and
> +		 * journal the timestamp update.  Just calling write_inode()
> +		 * isn't enough.  Don't call mark_inode_dirty_sync(), as that
> +		 * would put the inode back on the dirty list.
> +		 */
> +		if ((dirty & I_DIRTY_TIME) && inode->i_sb->s_op->dirty_inode)
> +			inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_SYNC);
> +
> +		err = write_inode(inode, wbc);
>  		if (ret == 0)
>  			ret = err;
>  	}

I have to say I dislike this special call of ->dirty_inode(). It works but
it makes me wonder, didn't we forget about something or won't we forget in
the future? Because it's very easy to miss this special case...

I think attached patch (compile-tested only) should actually fix the
problem as well without this special ->dirty_inode() call. It basically
only moves the mark_inode_dirty_sync() before inode->i_state clearing.
Because conceptually mark_inode_dirty_sync() is IMO the right function to
call. It will take care of clearing I_DIRTY_TIME flag (because we are
setting I_DIRTY_SYNC), it will also not touch inode->i_io_list if the inode
is queued for sync (I_SYNC_QUEUED is set in that case). The only problem
with calling it was that it was called *after* clearing dirty bits from
i_state... What do you think?

								Honza
Matthew Wilcox Jan. 7, 2021, 2:58 p.m. UTC | #2
On Thu, Jan 07, 2021 at 03:47:09PM +0100, Jan Kara wrote:
> I think attached patch (compile-tested only) should actually fix the
> problem as well without this special ->dirty_inode() call. It basically
> only moves the mark_inode_dirty_sync() before inode->i_state clearing.
> Because conceptually mark_inode_dirty_sync() is IMO the right function to
> call. It will take care of clearing I_DIRTY_TIME flag (because we are
> setting I_DIRTY_SYNC), it will also not touch inode->i_io_list if the inode
> is queued for sync (I_SYNC_QUEUED is set in that case). The only problem
> with calling it was that it was called *after* clearing dirty bits from
> i_state... What do you think?

I like this patch far more, fwiw.
Eric Biggers Jan. 7, 2021, 9:46 p.m. UTC | #3
On Thu, Jan 07, 2021 at 03:47:09PM +0100, Jan Kara wrote:
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index acfb55834af23..081e335cdee47 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1509,11 +1509,22 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> >  
> >  	spin_unlock(&inode->i_lock);
> >  
> > -	if (dirty & I_DIRTY_TIME)
> > -		mark_inode_dirty_sync(inode);
> >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> >  	if (dirty & ~I_DIRTY_PAGES) {
> > -		int err = write_inode(inode, wbc);
> > +		int err;
> > +
> > +		/*
> > +		 * If the inode is being written due to a lazytime timestamp
> > +		 * expiration, then the filesystem needs to be notified about it
> > +		 * so that e.g. the filesystem can update on-disk fields and
> > +		 * journal the timestamp update.  Just calling write_inode()
> > +		 * isn't enough.  Don't call mark_inode_dirty_sync(), as that
> > +		 * would put the inode back on the dirty list.
> > +		 */
> > +		if ((dirty & I_DIRTY_TIME) && inode->i_sb->s_op->dirty_inode)
> > +			inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_SYNC);
> > +
> > +		err = write_inode(inode, wbc);
> >  		if (ret == 0)
> >  			ret = err;
> >  	}
> 
> I have to say I dislike this special call of ->dirty_inode(). It works but
> it makes me wonder, didn't we forget about something or won't we forget in
> the future? Because it's very easy to miss this special case...
> 
> I think attached patch (compile-tested only) should actually fix the
> problem as well without this special ->dirty_inode() call. It basically
> only moves the mark_inode_dirty_sync() before inode->i_state clearing.
> Because conceptually mark_inode_dirty_sync() is IMO the right function to
> call. It will take care of clearing I_DIRTY_TIME flag (because we are
> setting I_DIRTY_SYNC), it will also not touch inode->i_io_list if the inode
> is queued for sync (I_SYNC_QUEUED is set in that case). The only problem
> with calling it was that it was called *after* clearing dirty bits from
> i_state... What do you think?
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

> From 80ccc6a78d1c0532f600b98884f7a64e58333485 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 7 Jan 2021 15:36:05 +0100
> Subject: [PATCH] fs: Make sure inode is clean after __writeback_single_inode()
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index acfb55834af2..b9356f470fae 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1473,22 +1473,25 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  			ret = err;
>  	}
>  
> +	/*
> +	 * If inode has dirty timestamps and we need to write them, call
> +	 * mark_inode_dirty_sync() to notify filesystem about it.
> +	 */
> +	if (inode->i_state & I_DIRTY_TIME &&
> +	    (wbc->for_sync || wbc->sync_mode == WB_SYNC_ALL ||
> +	     time_after(jiffies, inode->dirtied_time_when +
> +			dirtytime_expire_interval * HZ))) {
> +		trace_writeback_lazytime(inode);
> +		mark_inode_dirty_sync(inode);
> +	}
> +
>  	/*
>  	 * Some filesystems may redirty the inode during the writeback
>  	 * due to delalloc, clear dirty metadata flags right before
>  	 * write_inode()
>  	 */
>  	spin_lock(&inode->i_lock);
> -
>  	dirty = inode->i_state & I_DIRTY;
> -	if ((inode->i_state & I_DIRTY_TIME) &&
> -	    ((dirty & I_DIRTY_INODE) ||
> -	     wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync ||
> -	     time_after(jiffies, inode->dirtied_time_when +
> -			dirtytime_expire_interval * HZ))) {
> -		dirty |= I_DIRTY_TIME;
> -		trace_writeback_lazytime(inode);
> -	}
>  	inode->i_state &= ~dirty;
>  
>  	/*
> @@ -1509,8 +1512,6 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>  
>  	spin_unlock(&inode->i_lock);
>  
> -	if (dirty & I_DIRTY_TIME)
> -		mark_inode_dirty_sync(inode);
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & ~I_DIRTY_PAGES) {
>  		int err = write_inode(inode, wbc);

It looks like that's going to work, and it fixes the XFS bug too.

Note that if __writeback_single_inode() is called from writeback_single_inode()
(rather than writeback_sb_inodes()), then the inode might not be queued for
sync, in which case mark_inode_dirty_sync() will move it to a writeback list.

That's okay because afterwards, writeback_single_inode() will delete the inode
from any writeback list if it's been fully cleaned, right?  So clean inodes
won't get left on a writeback list.

It's confusing because there are comments in writeback_single_inode() and above
__writeback_single_inode() that say that the inode must not be moved between
writeback lists.  I take it that those comments are outdated, as they predate
I_SYNC_QUEUED being introduced by commit 5afced3bf281 ("writeback: Avoid
skipping inode writeback")?

- Eric
Christoph Hellwig Jan. 8, 2021, 8:54 a.m. UTC | #4
On Thu, Jan 07, 2021 at 01:46:37PM -0800, Eric Biggers wrote:
> It looks like that's going to work, and it fixes the XFS bug too.
> 
> Note that if __writeback_single_inode() is called from writeback_single_inode()
> (rather than writeback_sb_inodes()), then the inode might not be queued for
> sync, in which case mark_inode_dirty_sync() will move it to a writeback list.
> 
> That's okay because afterwards, writeback_single_inode() will delete the inode
> from any writeback list if it's been fully cleaned, right?  So clean inodes
> won't get left on a writeback list.
> 
> It's confusing because there are comments in writeback_single_inode() and above
> __writeback_single_inode() that say that the inode must not be moved between
> writeback lists.  I take it that those comments are outdated, as they predate
> I_SYNC_QUEUED being introduced by commit 5afced3bf281 ("writeback: Avoid
> skipping inode writeback")?

Yes.  I think we need to update the comment as well.
Christoph Hellwig Jan. 8, 2021, 9:01 a.m. UTC | #5
> +	/*
> +	 * If inode has dirty timestamps and we need to write them, call
> +	 * mark_inode_dirty_sync() to notify filesystem about it.
> +	 */
> +	if (inode->i_state & I_DIRTY_TIME &&
> +	    (wbc->for_sync || wbc->sync_mode == WB_SYNC_ALL ||
> +	     time_after(jiffies, inode->dirtied_time_when +
> +			dirtytime_expire_interval * HZ))) {

If we're touching this area, it would be nice to split this condition
into a readable helper ala:

static inline bool inode_needs_timestamp_sync(struct writeback_control *wbc,
		struct inode *inode)
{
	if (!(inode->i_state & I_DIRTY_TIME))
		return false;
	if (wbc->for_sync || wbc->sync_mode == WB_SYNC_ALL)
		return true;
	return time_after(jiffies, inode->dirtied_time_when +
			  dirtytime_expire_interval * HZ);
}
Eric Biggers Jan. 9, 2021, 5:11 p.m. UTC | #6
On Fri, Jan 08, 2021 at 10:01:33AM +0100, Christoph Hellwig wrote:
> > +	/*
> > +	 * If inode has dirty timestamps and we need to write them, call
> > +	 * mark_inode_dirty_sync() to notify filesystem about it.
> > +	 */
> > +	if (inode->i_state & I_DIRTY_TIME &&
> > +	    (wbc->for_sync || wbc->sync_mode == WB_SYNC_ALL ||
> > +	     time_after(jiffies, inode->dirtied_time_when +
> > +			dirtytime_expire_interval * HZ))) {
> 
> If we're touching this area, it would be nice to split this condition
> into a readable helper ala:
> 
> static inline bool inode_needs_timestamp_sync(struct writeback_control *wbc,
> 		struct inode *inode)
> {
> 	if (!(inode->i_state & I_DIRTY_TIME))
> 		return false;
> 	if (wbc->for_sync || wbc->sync_mode == WB_SYNC_ALL)
> 		return true;
> 	return time_after(jiffies, inode->dirtied_time_when +
> 			  dirtytime_expire_interval * HZ);
> }

I didn't end up doing this since it would only be called once, and IMO it's more
readable to keep it inlined next to the comment that explains what's going on.
Especially considering the later patch that drops the check for wbc->for_sync.

-  Eric
diff mbox series

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index acfb55834af23..081e335cdee47 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1509,11 +1509,22 @@  __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 
 	spin_unlock(&inode->i_lock);
 
-	if (dirty & I_DIRTY_TIME)
-		mark_inode_dirty_sync(inode);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & ~I_DIRTY_PAGES) {
-		int err = write_inode(inode, wbc);
+		int err;
+
+		/*
+		 * If the inode is being written due to a lazytime timestamp
+		 * expiration, then the filesystem needs to be notified about it
+		 * so that e.g. the filesystem can update on-disk fields and
+		 * journal the timestamp update.  Just calling write_inode()
+		 * isn't enough.  Don't call mark_inode_dirty_sync(), as that
+		 * would put the inode back on the dirty list.
+		 */
+		if ((dirty & I_DIRTY_TIME) && inode->i_sb->s_op->dirty_inode)
+			inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_SYNC);
+
+		err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
 	}