diff mbox

xfs: use atomic to provide buffer I/O accounting serialization

Message ID 1495477751-3742-1-git-send-email-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Brian Foster May 22, 2017, 6:29 p.m. UTC
We've had user reports of unmount hangs in xfs_wait_buftarg() that
analysis shows is due to btp->bt_io_count == -1. bt_io_count
represents the count of in-flight asynchronous buffers and thus
should always be >= 0. xfs_wait_buftarg() waits for this value to
stabilize to zero in order to ensure that all untracked (with
respect to the lru) buffers have completed I/O processing before
unmount proceeds to tear down in-core data structures.

The value of -1 implies an I/O accounting decrement race. Indeed,
the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
(where the buffer lock is no longer held) means that bp->b_flags can
be updated from an unsafe context. While a user-level reproducer is
currently not available, some intrusive hacks to run racing buffer
lookups/ioacct/releases from multiple threads was used to
successfully manufacture this problem.

Existing callers do not expect to acquire the buffer lock from
xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
this context. To close the race, replace the in-flight buffer flag
with a per-buffer atomic for tracking accounting against the
buftarg. This field resides in a hole in the existing data structure
and thus does not increase the size of xfs_buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c | 13 +++++--------
 fs/xfs/xfs_buf.h |  5 ++---
 2 files changed, 7 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig May 22, 2017, 7:05 p.m. UTC | #1
On Mon, May 22, 2017 at 02:29:11PM -0400, Brian Foster wrote:
> We've had user reports of unmount hangs in xfs_wait_buftarg() that
> analysis shows is due to btp->bt_io_count == -1. bt_io_count
> represents the count of in-flight asynchronous buffers and thus
> should always be >= 0. xfs_wait_buftarg() waits for this value to
> stabilize to zero in order to ensure that all untracked (with
> respect to the lru) buffers have completed I/O processing before
> unmount proceeds to tear down in-core data structures.
> 
> The value of -1 implies an I/O accounting decrement race. Indeed,
> the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
> (where the buffer lock is no longer held) means that bp->b_flags can
> be updated from an unsafe context. While a user-level reproducer is
> currently not available, some intrusive hacks to run racing buffer
> lookups/ioacct/releases from multiple threads was used to
> successfully manufacture this problem.
> 
> Existing callers do not expect to acquire the buffer lock from
> xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
> this context. To close the race, replace the in-flight buffer flag
> with a per-buffer atomic for tracking accounting against the
> buftarg. This field resides in a hole in the existing data structure
> and thus does not increase the size of xfs_buf.

I hate these uses of atomic_t as binary flags.  Can you use
test_and_set_bit and friends wit a bitop?  This would require
an unsigned long which an actually be larger than an atomic_t,
but it's both cleaner and provides headroom for additional atomic flags
in the future.
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster May 22, 2017, 10:04 p.m. UTC | #2
On Mon, May 22, 2017 at 12:05:10PM -0700, Christoph Hellwig wrote:
> On Mon, May 22, 2017 at 02:29:11PM -0400, Brian Foster wrote:
> > We've had user reports of unmount hangs in xfs_wait_buftarg() that
> > analysis shows is due to btp->bt_io_count == -1. bt_io_count
> > represents the count of in-flight asynchronous buffers and thus
> > should always be >= 0. xfs_wait_buftarg() waits for this value to
> > stabilize to zero in order to ensure that all untracked (with
> > respect to the lru) buffers have completed I/O processing before
> > unmount proceeds to tear down in-core data structures.
> > 
> > The value of -1 implies an I/O accounting decrement race. Indeed,
> > the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
> > (where the buffer lock is no longer held) means that bp->b_flags can
> > be updated from an unsafe context. While a user-level reproducer is
> > currently not available, some intrusive hacks to run racing buffer
> > lookups/ioacct/releases from multiple threads was used to
> > successfully manufacture this problem.
> > 
> > Existing callers do not expect to acquire the buffer lock from
> > xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
> > this context. To close the race, replace the in-flight buffer flag
> > with a per-buffer atomic for tracking accounting against the
> > buftarg. This field resides in a hole in the existing data structure
> > and thus does not increase the size of xfs_buf.
> 
> I hate these uses of atomic_t as binary flags.  Can you use
> test_and_set_bit and friends wit a bitop?  This would require
> an unsigned long which an actually be larger than an atomic_t,
> but it's both cleaner and provides headroom for additional atomic flags
> in the future.

I thought it may be a little confusing to have multiple sets of flags
for a buffer, hence the counter (even though it is logically a flag).
But I'm fine with it for now if we don't mind wasting the extra space.

Though I suppose we could also add a smaller field and use cmpxchg() to
set and clear it... thoughts?

Brian
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong May 23, 2017, 3:11 a.m. UTC | #3
On Mon, May 22, 2017 at 06:04:24PM -0400, Brian Foster wrote:
> On Mon, May 22, 2017 at 12:05:10PM -0700, Christoph Hellwig wrote:
> > On Mon, May 22, 2017 at 02:29:11PM -0400, Brian Foster wrote:
> > > We've had user reports of unmount hangs in xfs_wait_buftarg() that
> > > analysis shows is due to btp->bt_io_count == -1. bt_io_count
> > > represents the count of in-flight asynchronous buffers and thus
> > > should always be >= 0. xfs_wait_buftarg() waits for this value to
> > > stabilize to zero in order to ensure that all untracked (with
> > > respect to the lru) buffers have completed I/O processing before
> > > unmount proceeds to tear down in-core data structures.
> > > 
> > > The value of -1 implies an I/O accounting decrement race. Indeed,
> > > the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
> > > (where the buffer lock is no longer held) means that bp->b_flags can
> > > be updated from an unsafe context. While a user-level reproducer is
> > > currently not available, some intrusive hacks to run racing buffer
> > > lookups/ioacct/releases from multiple threads was used to
> > > successfully manufacture this problem.
> > > 
> > > Existing callers do not expect to acquire the buffer lock from
> > > xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
> > > this context. To close the race, replace the in-flight buffer flag
> > > with a per-buffer atomic for tracking accounting against the
> > > buftarg. This field resides in a hole in the existing data structure
> > > and thus does not increase the size of xfs_buf.

There's only a hole on 64-bit systems, btw.

> > I hate these uses of atomic_t as binary flags.  Can you use
> > test_and_set_bit and friends wit a bitop?  This would require
> > an unsigned long which an actually be larger than an atomic_t,
> > but it's both cleaner and provides headroom for additional atomic flags
> > in the future.
> 
> I thought it may be a little confusing to have multiple sets of flags
> for a buffer, hence the counter (even though it is logically a flag).

It /is/ confusing.  If you stick with a flags variable of some sort, I
think at a bare minimum there ought to be a comment explaining what this
unlocked flags field is for, and why we didn't just make b_flags an
atomicly updated flags field.  (TBH I'm wondering why not do that?  Is
it to avoid making a larger change?)

> But I'm fine with it for now if we don't mind wasting the extra space.
> 
> Though I suppose we could also add a smaller field and use cmpxchg() to
> set and clear it... thoughts?

None in particular.  I don't know that we're adding flag bits all that
quickly, and we can always change to unsigned long if we have to.

--D

> 
> Brian
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster May 23, 2017, 11:29 a.m. UTC | #4
On Mon, May 22, 2017 at 08:11:34PM -0700, Darrick J. Wong wrote:
> On Mon, May 22, 2017 at 06:04:24PM -0400, Brian Foster wrote:
> > On Mon, May 22, 2017 at 12:05:10PM -0700, Christoph Hellwig wrote:
> > > On Mon, May 22, 2017 at 02:29:11PM -0400, Brian Foster wrote:
> > > > We've had user reports of unmount hangs in xfs_wait_buftarg() that
> > > > analysis shows is due to btp->bt_io_count == -1. bt_io_count
> > > > represents the count of in-flight asynchronous buffers and thus
> > > > should always be >= 0. xfs_wait_buftarg() waits for this value to
> > > > stabilize to zero in order to ensure that all untracked (with
> > > > respect to the lru) buffers have completed I/O processing before
> > > > unmount proceeds to tear down in-core data structures.
> > > > 
> > > > The value of -1 implies an I/O accounting decrement race. Indeed,
> > > > the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele()
> > > > (where the buffer lock is no longer held) means that bp->b_flags can
> > > > be updated from an unsafe context. While a user-level reproducer is
> > > > currently not available, some intrusive hacks to run racing buffer
> > > > lookups/ioacct/releases from multiple threads was used to
> > > > successfully manufacture this problem.
> > > > 
> > > > Existing callers do not expect to acquire the buffer lock from
> > > > xfs_buf_rele(). Therefore, we can not safely update ->b_flags from
> > > > this context. To close the race, replace the in-flight buffer flag
> > > > with a per-buffer atomic for tracking accounting against the
> > > > buftarg. This field resides in a hole in the existing data structure
> > > > and thus does not increase the size of xfs_buf.
> 
> There's only a hole on 64-bit systems, btw.
> 
> > > I hate these uses of atomic_t as binary flags.  Can you use
> > > test_and_set_bit and friends wit a bitop?  This would require
> > > an unsigned long which an actually be larger than an atomic_t,
> > > but it's both cleaner and provides headroom for additional atomic flags
> > > in the future.
> > 
> > I thought it may be a little confusing to have multiple sets of flags
> > for a buffer, hence the counter (even though it is logically a flag).
> 
> It /is/ confusing.  If you stick with a flags variable of some sort, I
> think at a bare minimum there ought to be a comment explaining what this
> unlocked flags field is for, and why we didn't just make b_flags an
> atomicly updated flags field.  (TBH I'm wondering why not do that?  Is
> it to avoid making a larger change?)
> 

Sort of... As it is, b_flags has 20+ flags that are modified all over
the place in all sorts of combinations, checked in assert contexts,
etc.. On a quick pass, all of those cases appear to be protected via
buffer lock. I don't think the I/O accounting optimization flag being an
outlier (in that it tracks buffers past unlock through release to the
lru) and broken justifies converting all of the rest over to something
else.

But yes, clear documentation would be required and probably a different
flag naming scheme to makes things as clear as possible.

> > But I'm fine with it for now if we don't mind wasting the extra space.
> > 
> > Though I suppose we could also add a smaller field and use cmpxchg() to
> > set and clear it... thoughts?
> 
> None in particular.  I don't know that we're adding flag bits all that
> quickly, and we can always change to unsigned long if we have to.
> 

Indeed. I'm not a fan of making changes to support future unknown
changes. Ideally, the whole b_atomic_flags thing can wait until there's
a legitimate use case.

In fact, I think there is a much more simple option here: define 'bool
b_in_flight' and protect it with the existing ->b_lock spinlock. The
lock is already used in half the places the accounting is modified, I
don't think it's harmful to acquire in or outside of the buffer lock,
and this could always be condensed to an "atomic flag" if we ever truly
have the need to do so.

Actually, looking again at xfs_buf it occurs to me that we already have
->b_state, which is protected by the spinlock (presumably to deal with
lru operations outside of buf lock) and only has a single flag defined.
Perhaps I'll just define XFS_BSTATE_IN_FLIGHT and avoid the need for a
new field entirely.

Brian

> --D
> 
> > 
> > Brian
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index ba036c1..f0172d8 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -97,12 +97,12 @@  static inline void
 xfs_buf_ioacct_inc(
 	struct xfs_buf	*bp)
 {
-	if (bp->b_flags & (XBF_NO_IOACCT|_XBF_IN_FLIGHT))
+	if (bp->b_flags & XBF_NO_IOACCT)
 		return;
 
 	ASSERT(bp->b_flags & XBF_ASYNC);
-	bp->b_flags |= _XBF_IN_FLIGHT;
-	percpu_counter_inc(&bp->b_target->bt_io_count);
+	if (atomic_add_unless(&bp->b_in_flight, 1, 1))
+		percpu_counter_inc(&bp->b_target->bt_io_count);
 }
 
 /*
@@ -113,11 +113,8 @@  static inline void
 xfs_buf_ioacct_dec(
 	struct xfs_buf	*bp)
 {
-	if (!(bp->b_flags & _XBF_IN_FLIGHT))
-		return;
-
-	bp->b_flags &= ~_XBF_IN_FLIGHT;
-	percpu_counter_dec(&bp->b_target->bt_io_count);
+	if (atomic_dec_if_positive(&bp->b_in_flight) == 0)
+		percpu_counter_dec(&bp->b_target->bt_io_count);
 }
 
 /*
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8d1d44f..f351fc1 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -63,7 +63,6 @@  typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
-#define _XBF_IN_FLIGHT	 (1 << 25) /* I/O in flight, for accounting purposes */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -84,8 +83,7 @@  typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_PAGES,		"PAGES" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
-	{ _XBF_COMPOUND,	"COMPOUND" }, \
-	{ _XBF_IN_FLIGHT,	"IN_FLIGHT" }
+	{ _XBF_COMPOUND,	"COMPOUND" }
 
 
 /*
@@ -207,6 +205,7 @@  typedef struct xfs_buf {
 	int			b_retries;
 	unsigned long		b_first_retry_time; /* in jiffies */
 	int			b_last_error;
+	atomic_t		b_in_flight;
 
 	const struct xfs_buf_ops	*b_ops;