
xfs/log: protect the logging content under xc_ctx_lock

Message ID 1572442631-4472-1-git-send-email-kernelfans@gmail.com (mailing list archive)
State New, archived
Series: xfs/log: protect the logging content under xc_ctx_lock

Commit Message

Pingfan Liu Oct. 30, 2019, 1:37 p.m. UTC
xc_cil_lock is not enough to protect the integrity of transaction logging.
Consider the following scenario:
  cpuA                                 cpuB                          cpuC

  xlog_cil_insert_format_items()

  spin_lock(&cil->xc_cil_lock)
  link transA's items to xc_cil,
     including item1
  spin_unlock(&cil->xc_cil_lock)
                                                                      xlog_cil_push() fetches transA's item under xc_cil_lock
                                       issue transB, modify item1
                                                                      xlog_write(), but now, item1 contains content from transB and we have a broken transA

Survive this race by putting the loop under the protection of xc_ctx_lock.
Meanwhile, xc_cil_lock can be dropped, since xc_ctx_lock already
serializes us against xlog_cil_insert_items().

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 fs/xfs/xfs_log_cil.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

Comments

Darrick J. Wong Oct. 30, 2019, 4:48 p.m. UTC | #1
On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
> xc_cil_lock is not enough to protect the integrity of a trans logging.
> Taking the scenario:
>   cpuA                                 cpuB                          cpuC
> 
>   xlog_cil_insert_format_items()
> 
>   spin_lock(&cil->xc_cil_lock)
>   link transA's items to xc_cil,
>      including item1
>   spin_unlock(&cil->xc_cil_lock)
>                                                                       xlog_cil_push() fetches transA's item under xc_cil_lock
>                                        issue transB, modify item1
>                                                                       xlog_write(), but now, item1 contains content from transB and we have a broken transA
> 
> Survive this race issue by putting under the protection of xc_ctx_lock.
> Meanwhile the xc_cil_lock can be dropped as xc_ctx_lock does it against
> xlog_cil_insert_items()

How did you trigger this race?  Is there a test case to reproduce, or
did you figure this out via code inspection?

--D

> 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Brian Foster <bfoster@redhat.com>
> To: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/xfs/xfs_log_cil.c | 35 +++++++++++++++++++----------------
>  1 file changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 004af09..f8df3b5 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -723,22 +723,6 @@ xlog_cil_push(
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> -	spin_lock(&cil->xc_cil_lock);
> -	while (!list_empty(&cil->xc_cil)) {
> -		struct xfs_log_item	*item;
> -
> -		item = list_first_entry(&cil->xc_cil,
> -					struct xfs_log_item, li_cil);
> -		list_del_init(&item->li_cil);
> -		if (!ctx->lv_chain)
> -			ctx->lv_chain = item->li_lv;
> -		else
> -			lv->lv_next = item->li_lv;
> -		lv = item->li_lv;
> -		item->li_lv = NULL;
> -		num_iovecs += lv->lv_niovecs;
> -	}
> -	spin_unlock(&cil->xc_cil_lock);
>  
>  	/*
>  	 * initialise the new context and attach it to the CIL. Then attach
> @@ -783,6 +767,25 @@ xlog_cil_push(
>  	up_write(&cil->xc_ctx_lock);
>  
>  	/*
> +	 * cil->xc_cil_lock around this loop can be dropped, since xc_ctx_lock
> +	 * protects us against xlog_cil_insert_items().
> +	 */
> +	while (!list_empty(&cil->xc_cil)) {
> +		struct xfs_log_item	*item;
> +
> +		item = list_first_entry(&cil->xc_cil,
> +					struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		if (!ctx->lv_chain)
> +			ctx->lv_chain = item->li_lv;
> +		else
> +			lv->lv_next = item->li_lv;
> +		lv = item->li_lv;
> +		item->li_lv = NULL;
> +		num_iovecs += lv->lv_niovecs;
> +	}
> +
> +	/*
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
>  	 * transaction header here as it is not accounted for in xlog_write().
> -- 
> 2.7.5
>
Pingfan Liu Oct. 31, 2019, 3:48 a.m. UTC | #2
On Wed, Oct 30, 2019 at 09:48:25AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
> > xc_cil_lock is not enough to protect the integrity of a trans logging.
> > Taking the scenario:
> >   cpuA                                 cpuB                          cpuC
> > 
> >   xlog_cil_insert_format_items()
> > 
> >   spin_lock(&cil->xc_cil_lock)
> >   link transA's items to xc_cil,
> >      including item1
> >   spin_unlock(&cil->xc_cil_lock)
> >                                                                       xlog_cil_push() fetches transA's item under xc_cil_lock
> >                                        issue transB, modify item1
> >                                                                       xlog_write(), but now, item1 contains content from transB and we have a broken transA
> > 
> > Survive this race issue by putting under the protection of xc_ctx_lock.
> > Meanwhile the xc_cil_lock can be dropped as xc_ctx_lock does it against
> > xlog_cil_insert_items()
> 
> How did you trigger this race?  Is there a test case to reproduce, or
> did you figure this out via code inspection?
> 
Via code inspection. The conditions to hit this bug are hard to meet:
a broken transA is written to disk, then the system fails before
transB is written. Only if this happens would recovery bring us to a
broken context.

Regards,
	Pingfan
Brian Foster Oct. 31, 2019, 11:36 a.m. UTC | #3
Dropped linux-fsdevel from cc. There's no reason to spam -fsdevel with
low level XFS patches.

On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
> xc_cil_lock is not enough to protect the integrity of a trans logging.
> Taking the scenario:
>   cpuA                                 cpuB                          cpuC
> 
>   xlog_cil_insert_format_items()
> 
>   spin_lock(&cil->xc_cil_lock)
>   link transA's items to xc_cil,
>      including item1
>   spin_unlock(&cil->xc_cil_lock)

So you commit a transaction, item1 ends up on the CIL.

>                                                                       xlog_cil_push() fetches transA's item under xc_cil_lock

xlog_cil_push() doesn't use ->xc_cil_lock, so I'm not sure what this
means. This sequence executes under ->xc_ctx_lock in write mode, which
locks out all transaction committers.

>                                        issue transB, modify item1

So presumably transB joins item1 while it is on the CIL from trans A and
commits. 

>                                                                       xlog_write(), but now, item1 contains content from transB and we have a broken transA

I'm not following how this is possible. The CIL push above, under
exclusive lock, removes each log item from ->xc_cil and pulls the log
vectors off of the log items to form the lv chain on the CIL context.
This means that the transB commit either updates the lv attached to the
log item from transA with the latest in-core version or uses the new
shadow buffer allocated in the commit path of transB. Either way is fine
because there is no guarantee of per-transaction granularity in the
on-disk log. The purpose of the on-disk log is to guarantee filesystem
consistency after a crash.

All in all, I can't really tell what problem you're describing here. If
you believe there's an issue in this code, I'd suggest either trying to
instrument it manually to reproduce a demonstrable problem and/or
providing a far more detailed description to explain it.

> 
> Survive this race issue by putting under the protection of xc_ctx_lock.
> Meanwhile the xc_cil_lock can be dropped as xc_ctx_lock does it against
> xlog_cil_insert_items()
> 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Brian Foster <bfoster@redhat.com>
> To: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---

FYI, this patch also doesn't apply to for-next. I'm guessing because
it's based on your previous patch to add the spinlock around the loop.

>  fs/xfs/xfs_log_cil.c | 35 +++++++++++++++++++----------------
>  1 file changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 004af09..f8df3b5 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -723,22 +723,6 @@ xlog_cil_push(
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> -	spin_lock(&cil->xc_cil_lock);
> -	while (!list_empty(&cil->xc_cil)) {

There's a comment just above that documents this loop that isn't
moved/modified.

> -		struct xfs_log_item	*item;
> -
> -		item = list_first_entry(&cil->xc_cil,
> -					struct xfs_log_item, li_cil);
> -		list_del_init(&item->li_cil);
> -		if (!ctx->lv_chain)
> -			ctx->lv_chain = item->li_lv;
> -		else
> -			lv->lv_next = item->li_lv;
> -		lv = item->li_lv;
> -		item->li_lv = NULL;
> -		num_iovecs += lv->lv_niovecs;
> -	}
> -	spin_unlock(&cil->xc_cil_lock);
>  
>  	/*
>  	 * initialise the new context and attach it to the CIL. Then attach
> @@ -783,6 +767,25 @@ xlog_cil_push(
>  	up_write(&cil->xc_ctx_lock);
>  
>  	/*
> +	 * cil->xc_cil_lock around this loop can be dropped, since xc_ctx_lock
> +	 * protects us against xlog_cil_insert_items().
> +	 */
> +	while (!list_empty(&cil->xc_cil)) {
> +		struct xfs_log_item	*item;
> +
> +		item = list_first_entry(&cil->xc_cil,
> +					struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		if (!ctx->lv_chain)
> +			ctx->lv_chain = item->li_lv;
> +		else
> +			lv->lv_next = item->li_lv;
> +		lv = item->li_lv;
> +		item->li_lv = NULL;
> +		num_iovecs += lv->lv_niovecs;
> +	}
> +

This places the associated loop outside of ->xc_ctx_lock, which means we
can now race modifying ->xc_cil during a CIL push and a transaction
commit. Have you tested this?

Brian

> +	/*
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
>  	 * transaction header here as it is not accounted for in xlog_write().
> -- 
> 2.7.5
>
Dave Chinner Oct. 31, 2019, 9:40 p.m. UTC | #4
On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
> xc_cil_lock is not enough to protect the integrity of a trans logging.
> Taking the scenario:
>   cpuA                                 cpuB                          cpuC
> 
>   xlog_cil_insert_format_items()
> 
>   spin_lock(&cil->xc_cil_lock)
>   link transA's items to xc_cil,
>      including item1
>   spin_unlock(&cil->xc_cil_lock)
>                                                                       xlog_cil_push() fetches transA's item under xc_cil_lock
>                                        issue transB, modify item1
>                                                                       xlog_write(), but now, item1 contains content from transB and we have a broken transA

TL;DR: 1. log vectors. 2. CIL context lock exclusion.

When CPU A formats the item during commit, it copies all the changes
into a list of log vectors, and that is attached to the log item
and the item is added to the CIL. The item is then unlocked. This is
done with the CIL context lock held excluding CIL pushes.

When CPU C pushes on the CIL, it detaches the -log vectors- from
the log item and removes the item from the CIL. This is done holding
the CIL context lock, excluding transaction commits from modifying
the CIL log vector list. It then formats the -log vectors- into the
journal by passing them to xlog_write().  It does not use log items
for this, and because the log vector list has been isolated and is
now private to the push context, we don't need to hold any locks
anymore to call xlog_write....

When CPU B modifies item1, it modifies the item and logs the new
changes to the log item. It does not modify the log vector that
might be attached to the log item from a previous change. The log
vector is only updated during transaction commit, so the changes
being made in the transaction on CPU B are private to that transaction
until they are committed, formatted into log vectors, and inserted
into the CIL under the CIL context lock.

> Survive this race issue by putting under the protection of xc_ctx_lock.
> Meanwhile the xc_cil_lock can be dropped as xc_ctx_lock does it against
> xlog_cil_insert_items()
> 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Brian Foster <bfoster@redhat.com>
> To: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  fs/xfs/xfs_log_cil.c | 35 +++++++++++++++++++----------------
>  1 file changed, 19 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 004af09..f8df3b5 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -723,22 +723,6 @@ xlog_cil_push(
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> -	spin_lock(&cil->xc_cil_lock);
> -	while (!list_empty(&cil->xc_cil)) {
> -		struct xfs_log_item	*item;
> -
> -		item = list_first_entry(&cil->xc_cil,
> -					struct xfs_log_item, li_cil);
> -		list_del_init(&item->li_cil);
> -		if (!ctx->lv_chain)
> -			ctx->lv_chain = item->li_lv;
> -		else
> -			lv->lv_next = item->li_lv;
> -		lv = item->li_lv;
> -		item->li_lv = NULL;
> -		num_iovecs += lv->lv_niovecs;
> -	}
> -	spin_unlock(&cil->xc_cil_lock);
>  
>  	/*
>  	 * initialise the new context and attach it to the CIL. Then attach
> @@ -783,6 +767,25 @@ xlog_cil_push(
>  	up_write(&cil->xc_ctx_lock);
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We don't hold the CIL context lock anymore....

>  
>  	/*
> +	 * cil->xc_cil_lock around this loop can be dropped, since xc_ctx_lock
> +	 * protects us against xlog_cil_insert_items().
> +	 */
> +	while (!list_empty(&cil->xc_cil)) {
> +		struct xfs_log_item	*item;
> +
> +		item = list_first_entry(&cil->xc_cil,
> +					struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		if (!ctx->lv_chain)
> +			ctx->lv_chain = item->li_lv;
> +		else
> +			lv->lv_next = item->li_lv;
> +		lv = item->li_lv;
> +		item->li_lv = NULL;
> +		num_iovecs += lv->lv_niovecs;
> +	}

So this is completely unserialised now. i.e. even if there was a
problem like you suggest, this modification doesn't do what you say
it does.

Cheers,

Dave.
Pingfan Liu Nov. 1, 2019, 3:39 a.m. UTC | #5
On Fri, Nov 01, 2019 at 08:40:31AM +1100, Dave Chinner wrote:
> On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
> > xc_cil_lock is not enough to protect the integrity of a trans logging.
> > Taking the scenario:
> >   cpuA                                 cpuB                          cpuC
> > 
> >   xlog_cil_insert_format_items()
> > 
> >   spin_lock(&cil->xc_cil_lock)
> >   link transA's items to xc_cil,
> >      including item1
> >   spin_unlock(&cil->xc_cil_lock)
> >                                                                       xlog_cil_push() fetches transA's item under xc_cil_lock
> >                                        issue transB, modify item1
> >                                                                       xlog_write(), but now, item1 contains content from transB and we have a broken transA
> 
> TL;DR: 1. log vectors. 2. CIL context lock exclusion.
> 
> When CPU A formats the item during commit, it copies all the changes
> into a list of log vectors, and that is attached to the log item
> and the item is added to the CIL. The item is then unlocked. This is
> done with the CIL context lock held excluding CIL pushes.
> 
> When CPU C pushes on the CIL, it detatches the -log vectors- from
> the log item and removes the item from the CIL. This is done hold
> the CIL context lock, excluding transaction commits from modifying
> the CIL log vector list. It then formats the -log vectors- into the
> journal by passing them to xlog_write().  It does not use log items
> for this, and because the log vector list has been isolated and is
> now private to the push context, we don't need to hold any locks
> anymore to call xlog_write....
Yes, I failed to realize that. The critical "item->li_lv = NULL" in
xlog_cil_push() isolates the vectors, leaving them free of new
modifications even after xc_ctx_lock is released.
[...]
> >  	 * initialise the new context and attach it to the CIL. Then attach
> > @@ -783,6 +767,25 @@ xlog_cil_push(
> >  	up_write(&cil->xc_ctx_lock);
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> We don't hold the CIL context lock anymore....
> 
I dozed off on that one and made a mistake, reversing the meaning of
up/down.

Thank you for the very patient and detailed explanation. I have a full
understanding now.

Regards,
	Pingfan
Pingfan Liu Nov. 1, 2019, 4:02 a.m. UTC | #6
On Thu, Oct 31, 2019 at 07:36:40AM -0400, Brian Foster wrote:
> Dropped linux-fsdevel from cc. There's no reason to spam -fsdevel with
> low level XFS patches.
> 
> On Wed, Oct 30, 2019 at 09:37:11PM +0800, Pingfan Liu wrote:
[...]
> 
> I'm not following how this is possible. The CIL push above, under
It turns out not to be a bug, as I explained in my reply to Dave's mail
in this thread.
> exclusive lock, removes each log item from ->xc_cil and pulls the log
> vectors off of the log items to form the lv chain on the CIL context.
> This means that the transB commit either updates the lv attached to the
> log item from transA with the latest in-core version or uses the new
> shadow buffer allocated in the commit path of transB. Either way is fine
> because there is no guarantee of per-transaction granularity in the
Yes, there is no guarantee of per-transaction granularity, but a
boundary is placed around several merged transactions. That is what
xc_ctx_lock and the private chain ctx->lv_chain guarantee.
> on-disk log. The purpose of the on-disk log is to guarantee filesystem
> consistency after a crash.
> 
> All in all, I can't really tell what problem you're describing here. If
> you believe there's an issue in this code, I'd suggest to either try and
> instrument it manually to reproduce a demonstrable problem and/or
> provide far more detailed of a description to explain it.
> 
Sorry for raising a false alarm, and thank you all for helping me
figure this out.

Regards,
	Pingfan

Patch

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 004af09..f8df3b5 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -723,22 +723,6 @@  xlog_cil_push(
 	 */
 	lv = NULL;
 	num_iovecs = 0;
-	spin_lock(&cil->xc_cil_lock);
-	while (!list_empty(&cil->xc_cil)) {
-		struct xfs_log_item	*item;
-
-		item = list_first_entry(&cil->xc_cil,
-					struct xfs_log_item, li_cil);
-		list_del_init(&item->li_cil);
-		if (!ctx->lv_chain)
-			ctx->lv_chain = item->li_lv;
-		else
-			lv->lv_next = item->li_lv;
-		lv = item->li_lv;
-		item->li_lv = NULL;
-		num_iovecs += lv->lv_niovecs;
-	}
-	spin_unlock(&cil->xc_cil_lock);
 
 	/*
 	 * initialise the new context and attach it to the CIL. Then attach
@@ -783,6 +767,25 @@  xlog_cil_push(
 	up_write(&cil->xc_ctx_lock);
 
 	/*
+	 * cil->xc_cil_lock around this loop can be dropped, since xc_ctx_lock
+	 * protects us against xlog_cil_insert_items().
+	 */
+	while (!list_empty(&cil->xc_cil)) {
+		struct xfs_log_item	*item;
+
+		item = list_first_entry(&cil->xc_cil,
+					struct xfs_log_item, li_cil);
+		list_del_init(&item->li_cil);
+		if (!ctx->lv_chain)
+			ctx->lv_chain = item->li_lv;
+		else
+			lv->lv_next = item->li_lv;
+		lv = item->li_lv;
+		item->li_lv = NULL;
+		num_iovecs += lv->lv_niovecs;
+	}
+
+	/*
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
 	 * transaction header here as it is not accounted for in xlog_write().