diff mbox series

[1/4] xfs: buffer pins need to hold a buffer reference

Message ID 20230517000449.3997582-2-david@fromorbit.com (mailing list archive)
State Accepted
Headers show
Series [1/4] xfs: buffer pins need to hold a buffer reference | expand

Commit Message

Dave Chinner May 17, 2023, 12:04 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count.  If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.

This results in a use after free that can only be triggered
in active filesystem shutdown situations.

To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 88 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 23 deletions(-)

Comments

Darrick J. Wong May 17, 2023, 1:26 a.m. UTC | #1
On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
> the buffer after we've dropped the buffer log item reference count.
> This opens a window where we can have two racing unpins for the
> buffer item (e.g. shutdown checkpoint context callback processing
> racing with journal IO iclog completion processing) and both attempt
> to access the buffer after dropping the BLI reference count.  If we
> are unlucky, the "BLI freed" context wins the race and frees the
> buffer before the "BLI still active" case checks the buffer pin
> count.
> 
> This results in a use after free that can only be triggered
> in active filesystem shutdown situations.
> 
> To fix this, we need to ensure that buffer existence extends beyond
> the BLI reference count checks and until the unpin processing is
> complete. This implies that a buffer pin operation must also take a
> buffer reference to ensure that the buffer cannot be freed until the
> buffer unpin processing is complete.
> 
> Reported-by: yangerkun <yangerkun@huawei.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

LGTM
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 88 ++++++++++++++++++++++++++++++++-----------
>  1 file changed, 65 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index df7322ed73fa..b2d211730fd2 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -452,10 +452,18 @@ xfs_buf_item_format(
>   * This is called to pin the buffer associated with the buf log item in memory
>   * so it cannot be written out.
>   *
> - * We also always take a reference to the buffer log item here so that the bli
> - * is held while the item is pinned in memory. This means that we can
> - * unconditionally drop the reference count a transaction holds when the
> - * transaction is completed.
> + * We take a reference to the buffer log item here so that the BLI life cycle
> + * extends at least until the buffer is unpinned via xfs_buf_item_unpin() and
> + * inserted into the AIL.
> + *
> + * We also need to take a reference to the buffer itself as the BLI unpin
> + * processing requires accessing the buffer after the BLI has dropped the final
> + * BLI reference. See xfs_buf_item_unpin() for an explanation.
> + * If unpins race to drop the final BLI reference and only the
> + * BLI owns a reference to the buffer, then the loser of the race can have the
> + * buffer fgreed from under it (e.g. on shutdown). Taking a buffer reference per
> + * pin count ensures the life cycle of the buffer extends for as
> + * long as we hold the buffer pin reference in xfs_buf_item_unpin().
>   */
>  STATIC void
>  xfs_buf_item_pin(
> @@ -470,13 +478,30 @@ xfs_buf_item_pin(
>  
>  	trace_xfs_buf_item_pin(bip);
>  
> +	xfs_buf_hold(bip->bli_buf);
>  	atomic_inc(&bip->bli_refcount);
>  	atomic_inc(&bip->bli_buf->b_pin_count);
>  }
>  
>  /*
> - * This is called to unpin the buffer associated with the buf log item which
> - * was previously pinned with a call to xfs_buf_item_pin().
> + * This is called to unpin the buffer associated with the buf log item which was
> + * previously pinned with a call to xfs_buf_item_pin().  We enter this function
> + * with a buffer pin count, a buffer reference and a BLI reference.
> + *
> + * We must drop the BLI reference before we unpin the buffer because the AIL
> + * doesn't acquire a BLI reference whenever it accesses it. Therefore if the
> + * refcount drops to zero, the bli could still be AIL resident and the buffer
> + * submitted for I/O at any point before we return. This can result in IO
> + * completion freeing the buffer while we are still trying to access it here.
> + * This race condition can also occur in shutdown situations where we abort and
> + * unpin buffers from contexts other that journal IO completion.
> + *
> + * Hence we have to hold a buffer reference per pin count to ensure that the
> + * buffer cannot be freed until we have finished processing the unpin operation.
> + * The reference is taken in xfs_buf_item_pin(), and we must hold it until we
> + * are done processing the buffer state. In the case of an abort (remove =
> + * true) then we re-use the current pin reference as the IO reference we hand
> + * off to IO failure handling.
>   */
>  STATIC void
>  xfs_buf_item_unpin(
> @@ -493,24 +518,18 @@ xfs_buf_item_unpin(
>  
>  	trace_xfs_buf_item_unpin(bip);
>  
> -	/*
> -	 * Drop the bli ref associated with the pin and grab the hold required
> -	 * for the I/O simulation failure in the abort case. We have to do this
> -	 * before the pin count drops because the AIL doesn't acquire a bli
> -	 * reference. Therefore if the refcount drops to zero, the bli could
> -	 * still be AIL resident and the buffer submitted for I/O (and freed on
> -	 * completion) at any point before we return. This can be removed once
> -	 * the AIL properly holds a reference on the bli.
> -	 */
>  	freed = atomic_dec_and_test(&bip->bli_refcount);
> -	if (freed && !stale && remove)
> -		xfs_buf_hold(bp);
>  	if (atomic_dec_and_test(&bp->b_pin_count))
>  		wake_up_all(&bp->b_waiters);
>  
> -	 /* nothing to do but drop the pin count if the bli is active */
> -	if (!freed)
> +	 /*
> +	  * Nothing to do but drop the buffer pin reference if the BLI is
> +	  * still active
> +	  */
> +	if (!freed) {
> +		xfs_buf_rele(bp);
>  		return;
> +	}
>  
>  	if (stale) {
>  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> @@ -522,6 +541,15 @@ xfs_buf_item_unpin(
>  
>  		trace_xfs_buf_item_unpin_stale(bip);
>  
> +		/*
> +		 * The buffer has been locked and referenced since it was marked
> +		 * stale so we own both lock and reference exclusively here. We
> +		 * do not need the pin reference any more, so drop it now so
> +		 * that we only have one reference to drop once item completion
> +		 * processing is complete.
> +		 */
> +		xfs_buf_rele(bp);
> +
>  		/*
>  		 * If we get called here because of an IO error, we may or may
>  		 * not have the item on the AIL. xfs_trans_ail_delete() will
> @@ -538,16 +566,30 @@ xfs_buf_item_unpin(
>  			ASSERT(bp->b_log_item == NULL);
>  		}
>  		xfs_buf_relse(bp);
> -	} else if (remove) {
> +		return;
> +	}
> +
> +	if (remove) {
>  		/*
> -		 * The buffer must be locked and held by the caller to simulate
> -		 * an async I/O failure. We acquired the hold for this case
> -		 * before the buffer was unpinned.
> +		 * We need to simulate an async IO failures here to ensure that
> +		 * the correct error completion is run on this buffer. This
> +		 * requires a reference to the buffer and for the buffer to be
> +		 * locked. We can safely pass ownership of the pin reference to
> +		 * the IO to ensure that nothing can free the buffer while we
> +		 * wait for the lock and then run the IO failure completion.
>  		 */
>  		xfs_buf_lock(bp);
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> +		return;
>  	}
> +
> +	/*
> +	 * BLI has no more active references - it will be moved to the AIL to
> +	 * manage the remaining BLI/buffer life cycle. There is nothing left for
> +	 * us to do here so drop the pin reference to the buffer.
> +	 */
> +	xfs_buf_rele(bp);
>  }
>  
>  STATIC uint
> -- 
> 2.40.1
>
Christoph Hellwig May 17, 2023, 12:58 p.m. UTC | #2
On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> To fix this, we need to ensure that buffer existence extends beyond
> the BLI reference count checks and until the unpin processing is
> complete. This implies that a buffer pin operation must also take a
> buffer reference to ensure that the buffer cannot be freed until the
> buffer unpin processing is complete.

Yeah.  I wonder why we haven't done this from the very beginning..

> +	 /*
> +	  * Nothing to do but drop the buffer pin reference if the BLI is
> +	  * still active
> +	  */

Nit: this block comment is indentented by an extra space.

> +	if (!freed) {
> +		xfs_buf_rele(bp);
>  		return;
> +	}
>  
>  	if (stale) {

Nit: this is the only use of the stale variable now, so we might
as well just drop it.

>  		ASSERT(bip->bli_flags & XFS_BLI_STALE);

.. which then also clearly shows this ASSERT is pointless now.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner May 17, 2023, 10:24 p.m. UTC | #3
On Wed, May 17, 2023 at 05:58:55AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> > To fix this, we need to ensure that buffer existence extends beyond
> > the BLI reference count checks and until the unpin processing is
> > complete. This implies that a buffer pin operation must also take a
> > buffer reference to ensure that the buffer cannot be freed until the
> > buffer unpin processing is complete.
> 
> Yeah.  I wonder why we haven't done this from the very beginning..

Likely because the whole BLI lifecycle is completely screwed up and
nobody has had the time to understand it fully and fix it properly.

> > +	 /*
> > +	  * Nothing to do but drop the buffer pin reference if the BLI is
> > +	  * still active
> > +	  */
> 
> Nit: this block comment is indentented by an extra space.
> 
> > +	if (!freed) {
> > +		xfs_buf_rele(bp);
> >  		return;
> > +	}
> >  
> >  	if (stale) {
> 
> Nit: this is the only use of the stale variable now, so we might
> as well just drop it.

Actually, after we've dropped the bli reference, it isn't safe to
reference the bli unless we know the buffer is stale. In this case,
we know the bli still exists because the buffer has been locked
since it was marked stale. However, for the other cases the BLI
could be freed from under us as it's reference count is zero and so
the next call to xfs_buf_item_relse() will free it no matter where
it comes from.

The reference counting around BLIs a total mess - this patch just
gets rid of one landmine but there's still plenty more in this code
that need to be untangled.

> >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> 
> .. which then also clearly shows this ASSERT is pointless now.

*nod*

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index df7322ed73fa..b2d211730fd2 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -452,10 +452,18 @@  xfs_buf_item_format(
  * This is called to pin the buffer associated with the buf log item in memory
  * so it cannot be written out.
  *
- * We also always take a reference to the buffer log item here so that the bli
- * is held while the item is pinned in memory. This means that we can
- * unconditionally drop the reference count a transaction holds when the
- * transaction is completed.
+ * We take a reference to the buffer log item here so that the BLI life cycle
+ * extends at least until the buffer is unpinned via xfs_buf_item_unpin() and
+ * inserted into the AIL.
+ *
+ * We also need to take a reference to the buffer itself as the BLI unpin
+ * processing requires accessing the buffer after the BLI has dropped the final
+ * BLI reference. See xfs_buf_item_unpin() for an explanation.
+ * If unpins race to drop the final BLI reference and only the
+ * BLI owns a reference to the buffer, then the loser of the race can have the
+ * buffer fgreed from under it (e.g. on shutdown). Taking a buffer reference per
+ * pin count ensures the life cycle of the buffer extends for as
+ * long as we hold the buffer pin reference in xfs_buf_item_unpin().
  */
 STATIC void
 xfs_buf_item_pin(
@@ -470,13 +478,30 @@  xfs_buf_item_pin(
 
 	trace_xfs_buf_item_pin(bip);
 
+	xfs_buf_hold(bip->bli_buf);
 	atomic_inc(&bip->bli_refcount);
 	atomic_inc(&bip->bli_buf->b_pin_count);
 }
 
 /*
- * This is called to unpin the buffer associated with the buf log item which
- * was previously pinned with a call to xfs_buf_item_pin().
+ * This is called to unpin the buffer associated with the buf log item which was
+ * previously pinned with a call to xfs_buf_item_pin().  We enter this function
+ * with a buffer pin count, a buffer reference and a BLI reference.
+ *
+ * We must drop the BLI reference before we unpin the buffer because the AIL
+ * doesn't acquire a BLI reference whenever it accesses it. Therefore if the
+ * refcount drops to zero, the bli could still be AIL resident and the buffer
+ * submitted for I/O at any point before we return. This can result in IO
+ * completion freeing the buffer while we are still trying to access it here.
+ * This race condition can also occur in shutdown situations where we abort and
+ * unpin buffers from contexts other that journal IO completion.
+ *
+ * Hence we have to hold a buffer reference per pin count to ensure that the
+ * buffer cannot be freed until we have finished processing the unpin operation.
+ * The reference is taken in xfs_buf_item_pin(), and we must hold it until we
+ * are done processing the buffer state. In the case of an abort (remove =
+ * true) then we re-use the current pin reference as the IO reference we hand
+ * off to IO failure handling.
  */
 STATIC void
 xfs_buf_item_unpin(
@@ -493,24 +518,18 @@  xfs_buf_item_unpin(
 
 	trace_xfs_buf_item_unpin(bip);
 
-	/*
-	 * Drop the bli ref associated with the pin and grab the hold required
-	 * for the I/O simulation failure in the abort case. We have to do this
-	 * before the pin count drops because the AIL doesn't acquire a bli
-	 * reference. Therefore if the refcount drops to zero, the bli could
-	 * still be AIL resident and the buffer submitted for I/O (and freed on
-	 * completion) at any point before we return. This can be removed once
-	 * the AIL properly holds a reference on the bli.
-	 */
 	freed = atomic_dec_and_test(&bip->bli_refcount);
-	if (freed && !stale && remove)
-		xfs_buf_hold(bp);
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 
-	 /* nothing to do but drop the pin count if the bli is active */
-	if (!freed)
+	 /*
+	  * Nothing to do but drop the buffer pin reference if the BLI is
+	  * still active
+	  */
+	if (!freed) {
+		xfs_buf_rele(bp);
 		return;
+	}
 
 	if (stale) {
 		ASSERT(bip->bli_flags & XFS_BLI_STALE);
@@ -522,6 +541,15 @@  xfs_buf_item_unpin(
 
 		trace_xfs_buf_item_unpin_stale(bip);
 
+		/*
+		 * The buffer has been locked and referenced since it was marked
+		 * stale so we own both lock and reference exclusively here. We
+		 * do not need the pin reference any more, so drop it now so
+		 * that we only have one reference to drop once item completion
+		 * processing is complete.
+		 */
+		xfs_buf_rele(bp);
+
 		/*
 		 * If we get called here because of an IO error, we may or may
 		 * not have the item on the AIL. xfs_trans_ail_delete() will
@@ -538,16 +566,30 @@  xfs_buf_item_unpin(
 			ASSERT(bp->b_log_item == NULL);
 		}
 		xfs_buf_relse(bp);
-	} else if (remove) {
+		return;
+	}
+
+	if (remove) {
 		/*
-		 * The buffer must be locked and held by the caller to simulate
-		 * an async I/O failure. We acquired the hold for this case
-		 * before the buffer was unpinned.
+		 * We need to simulate an async IO failures here to ensure that
+		 * the correct error completion is run on this buffer. This
+		 * requires a reference to the buffer and for the buffer to be
+		 * locked. We can safely pass ownership of the pin reference to
+		 * the IO to ensure that nothing can free the buffer while we
+		 * wait for the lock and then run the IO failure completion.
 		 */
 		xfs_buf_lock(bp);
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
+		return;
 	}
+
+	/*
+	 * BLI has no more active references - it will be moved to the AIL to
+	 * manage the remaining BLI/buffer life cycle. There is nothing left for
+	 * us to do here so drop the pin reference to the buffer.
+	 */
+	xfs_buf_rele(bp);
 }
 
 STATIC uint