diff mbox

xfs: don't let suspend freeze the buffer workqueue

Message ID 20170327204611.GA4864@birch.djwong.org (mailing list archive)
State Deferred
Headers show

Commit Message

Darrick J. Wong March 27, 2017, 8:46 p.m. UTC
Suspend has this annoying behavior in XFS where it freezes workqueues in
some arbitrary order.  This is a problem because the sync call can cause
an AIL push, which may have to perform a RMW cycle on an inode cluster
buffer if that buffer has been reclaimed.  When this happens, the AIL
issues a buffer read for which the io completion ends up on the xfs_buf
workqueue.  If /that/ workqueue has already been frozen, the suspend
will timeout because we froze the workqueues in the wrong order.

It seems suspicious to be freezing IO helper threads[1], so let's just
not do that anymore.  Prior to this patch, 4.10 fails to suspend on my
laptop about 80% of the time; afterwards, it works every time.  I've not
done much suspend-and-crash testing on it though.

[1] https://lwn.net/Articles/705269/

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_super.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dave Chinner March 27, 2017, 10:40 p.m. UTC | #1
On Mon, Mar 27, 2017 at 01:46:11PM -0700, Darrick J. Wong wrote:
> Suspend has this annoying behavior in XFS where it freezes workqueues in
> some arbitrary order.  This is a problem because the sync call can cause
> an AIL push, which may have to perform a RMW cycle on an inode cluster
> buffer if that buffer has been reclaimed.  When this happens, the AIL
> issues a buffer read for which the io completion ends up on the xfs_buf
> workqueue.  If /that/ workqueue has already been frozen, the suspend
> will timeout because we froze the workqueues in the wrong order.
> 
> It seems suspicious to be freezing IO helper threads[1], so let's just
> not do that anymore.  Prior to this patch, 4.10 fails to suspend on my
> laptop about 80% of the time; afterwards, it works every time.  I've not
> done much suspend-and-crash testing on it though.

The problem with doing this is that it opens up an even worse race
condition - a memory corruption condition, and potentially a
filesystem corruption condition if not caught after resume quickly
enough.

That is, if we don't freeze the m_buf_workqueue, we end up doing
this:

freezer			aild		m_buf_workqueue
.....
freeze user processes
sync
freeze work queues
			pushes dirty metadata
			<starts IO>
freeze kernel threads
					<IO completion interrupt>
					<work queued>
create_image()
					run IO completion work

At this point, we have IO completion in XFS modifying memory state
that is being snapshotted by the freezer process. Hence we can end
up with the hibernation image having an inconsistent state such as
an inode marked clean (because it was copied after IO completion)
but the AIL still has the inode item linked into it (because it was
copied before IO completion).

The only way to avoid this race condition is to ensure that the
m_buf_workqueue cannot run work while the hibernation image is being
created. This is still just a workaround - we can lose IO
completions completely if they occur after the image creation begins
- but it removes the most common case of memory corruption that has
previously been seen during hibernation.

Unfortunately, these races are just about impossible to trigger in
test environments, so it's no surprise that you haven't "seen
problems".  The decision that I made here is that it is better for
suspend to hang and annoy people than it is for users to suffer
silent memory and filesystem corruption resulting in data loss
that has no obvious logical cause, cannot be reproduced or
debugged...

AFAICT, we can't fix this problem in XFS - we need the suspend code
to /freeze the filesystems/ to safely quiesce them because the only
choices we have right now is to choose the least worst behaviour of
either "hang during suspend" or "be silent and deadly".

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 7af5ca9..216ab89 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -842,7 +842,7 @@  xfs_init_mount_workqueues(
 	struct xfs_mount	*mp)
 {
 	mp->m_buf_workqueue = alloc_workqueue("xfs-buf/%s",
-			WQ_MEM_RECLAIM|WQ_FREEZABLE, 1, mp->m_fsname);
+			WQ_MEM_RECLAIM, 1, mp->m_fsname);
 	if (!mp->m_buf_workqueue)
 		goto out;