diff mbox series

[04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0

Message ID 1539543498-29105-5-git-send-email-jsimmons@infradead.org (mailing list archive)
State New, archived
Headers show
Series lustre: more assorted fixes for lustre 2.10 | expand

Commit Message

James Simmons Oct. 14, 2018, 6:57 p.m. UTC
From: Doug Oucharek <dougso@me.com>

There is a case in the routine ptlrpc_register_bulk() where we were
asserting if bd_nob_transferred != 0 when not resending.  There is
evidence that network errors can create a situation where
this does happen. So we should not be asserting!

This patch changes that assert to an error return code of -EIO.

Signed-off-by: Doug Oucharek <dougso@me.com>
WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
Reviewed-on: https://review.whamcloud.com/28491
Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

NeilBrown Oct. 17, 2018, 11:13 p.m. UTC | #1
On Sun, Oct 14 2018, James Simmons wrote:

> From: Doug Oucharek <dougso@me.com>
>
> There is a case in the routine ptlrpc_register_bulk() where we were
> asserting if bd_nob_transferred != 0 when not resending.  There is
> evidence that network errors can create a situation where
> this does happen. So we should not be asserting!
>
> This patch changes that assert to an error return code of -EIO.
>
> Signed-off-by: Doug Oucharek <dougso@me.com>
> WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> Reviewed-on: https://review.whamcloud.com/28491
> Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> Reviewed-by: Oleg Drokin <green@whamcloud.com>
> Signed-off-by: James Simmons <jsimmons@infradead.org>
> ---
>  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> index 27eb1c0..7e7db24 100644
> --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>  	/* cleanup the state of the bulk for it will be reused */
>  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>  		desc->bd_nob_transferred = 0;
> -	else
> -		LASSERT(desc->bd_nob_transferred == 0);
> +	else if (desc->bd_nob_transferred != 0)
> +		/* If the network failed after an RPC was sent, this condition
> +		 * could happen.  Rather than assert (was here before), return
> +		 * an EIO error.
> +		 */
> +		return -EIO;

This looks weird, and the justification is rather lame.
I wonder if this is an attempt to fix the same problem that the smp_mb()
in the previous patch was attempting to fix (and I'm not yet convinced
that either is the correct fix).

NeilBrown


>  
>  	desc->bd_failure = 0;
>  
> -- 
> 1.8.3.1
James Simmons Oct. 21, 2018, 10:44 p.m. UTC | #2
> On Sun, Oct 14 2018, James Simmons wrote:
> 
> > From: Doug Oucharek <dougso@me.com>
> >
> > There is a case in the routine ptlrpc_register_bulk() where we were
> > asserting if bd_nob_transferred != 0 when not resending.  There is
> > evidence that network errors can create a situation where
> > this does happen. So we should not be asserting!
> >
> > This patch changes that assert to an error return code of -EIO.
> >
> > Signed-off-by: Doug Oucharek <dougso@me.com>
> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> > Reviewed-on: https://review.whamcloud.com/28491
> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> > ---
> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > index 27eb1c0..7e7db24 100644
> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
> >  	/* cleanup the state of the bulk for it will be reused */
> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
> >  		desc->bd_nob_transferred = 0;
> > -	else
> > -		LASSERT(desc->bd_nob_transferred == 0);
> > +	else if (desc->bd_nob_transferred != 0)
> > +		/* If the network failed after an RPC was sent, this condition
> > +		 * could happen.  Rather than assert (was here before), return
> > +		 * an EIO error.
> > +		 */
> > +		return -EIO;
> 
> This looks weird, and the justification is rather lame.
> I wonder if this is an attempt to fix the same problem that the smp_mb()
> in the previous patch was attempting to fix (and I'm not yet convinced
> that either is the correct fix).

When the above condition happens the LASSERT ends up taking out the 
node with a panic which in turn kills the application running on the cluster.
When replaced with reporting an EIO error the node survives as well as the 
job. The job might fail at its IO but it wouldn't fail performing its work 
flow which is way more important.

> >  	desc->bd_failure = 0;
> >  
> > -- 
> > 1.8.3.1
>
NeilBrown Oct. 22, 2018, 3:26 a.m. UTC | #3
On Sun, Oct 21 2018, James Simmons wrote:

>> On Sun, Oct 14 2018, James Simmons wrote:
>> 
>> > From: Doug Oucharek <dougso@me.com>
>> >
>> > There is a case in the routine ptlrpc_register_bulk() where we were
>> > asserting if bd_nob_transferred != 0 when not resending.  There is
>> > evidence that network errors can create a situation where
>> > this does happen. So we should not be asserting!
>> >
>> > This patch changes that assert to an error return code of -EIO.
>> >
>> > Signed-off-by: Doug Oucharek <dougso@me.com>
>> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
>> > Reviewed-on: https://review.whamcloud.com/28491
>> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
>> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> > ---
>> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>> >  1 file changed, 6 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > index 27eb1c0..7e7db24 100644
>> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>> >  	/* cleanup the state of the bulk for it will be reused */
>> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>> >  		desc->bd_nob_transferred = 0;
>> > -	else
>> > -		LASSERT(desc->bd_nob_transferred == 0);
>> > +	else if (desc->bd_nob_transferred != 0)
>> > +		/* If the network failed after an RPC was sent, this condition
>> > +		 * could happen.  Rather than assert (was here before), return
>> > +		 * an EIO error.
>> > +		 */
>> > +		return -EIO;
>> 
>> This looks weird, and the justification is rather lame.
>> I wonder if this is an attempt to fix the same problem that the smp_mb()
>> in the previous patch was attempting to fix (and I'm not yet convinced
>> that either is the correct fix).
>
> When the above condition happens the LASSERT ends up taking out the 
> node with a panic which in turn kills the application running on the cluster.
> When replaced with reporting an EIO error the node survives as well as the 
> job. The job might fail at its IO but it wouldn't fail performing its work 
> flow which is way more important.

Yes, a meaningless error is better than a crash, but a proper fix is
better still.  As I said, my guess is that the memory barrier in the
previous patch might have fixed the bug, so the LASSERT can remain.

Doug: is there any chance that this might be the case?

Thanks,
NeilBrown
James Simmons Nov. 4, 2018, 9:29 p.m. UTC | #4
> >> On Sun, Oct 14 2018, James Simmons wrote:
> >> 
> >> > From: Doug Oucharek <dougso@me.com>
> >> >
> >> > There is a case in the routine ptlrpc_register_bulk() where we were
> >> > asserting if bd_nob_transferred != 0 when not resending.  There is
> >> > evidence that network errors can create a situation where
> >> > this does happen. So we should not be asserting!
> >> >
> >> > This patch changes that assert to an error return code of -EIO.
> >> >
> >> > Signed-off-by: Doug Oucharek <dougso@me.com>
> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> >> > Reviewed-on: https://review.whamcloud.com/28491
> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
> >> > Signed-off-by: James Simmons <jsimmons@infradead.org>
> >> > ---
> >> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
> >> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > index 27eb1c0..7e7db24 100644
> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
> >> >  	/* cleanup the state of the bulk for it will be reused */
> >> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
> >> >  		desc->bd_nob_transferred = 0;
> >> > -	else
> >> > -		LASSERT(desc->bd_nob_transferred == 0);
> >> > +	else if (desc->bd_nob_transferred != 0)
> >> > +		/* If the network failed after an RPC was sent, this condition
> >> > +		 * could happen.  Rather than assert (was here before), return
> >> > +		 * an EIO error.
> >> > +		 */
> >> > +		return -EIO;
> >> 
> >> This looks weird, and the justification is rather lame.
> >> I wonder if this is an attempt to fix the same problem that the smp_mb()
> >> in the previous patch was attempting to fix (and I'm not yet convinced
> >> that either is the correct fix).
> >
> > When the above condition happens the LASSERT ends up taking out the 
> > node with a panic which in turn kills the application running on the cluster.
> > When replaced with reporting an EIO error the node survives as well as the 
> > job. The job might fail at its IO but it wouldn't fail performing its work 
> > flow which is way more important.
> 
> Yes, a meaningless error is better than a crash, but a proper fix is
> better still.  As I said, my guess is that the memory barrier in the
> previous patch might have fixed the bug, so the LASSERT can remain.
> 
> Doug: is there any chance that this might be the case?

I got a hold of Doug and discussed this issue. So the answer is that the
original logs to track down the original problem no longer exist. So 
finding the original source of the problem can't be done at this point.
Would you be okay with a version of this patch with dump_stack() and
treat it as a debug patch. We really need to collect logs to figure out
the real problem. I will push a debug patch to OpenSFS branch since it
is more widely used.
NeilBrown Nov. 4, 2018, 11:59 p.m. UTC | #5
On Sun, Nov 04 2018, James Simmons wrote:

>> >> On Sun, Oct 14 2018, James Simmons wrote:
>> >> 
>> >> > From: Doug Oucharek <dougso@me.com>
>> >> >
>> >> > There is a case in the routine ptlrpc_register_bulk() where we were
>> >> > asserting if bd_nob_transferred != 0 when not resending.  There is
>> >> > evidence that network errors can create a situation where
>> >> > this does happen. So we should not be asserting!
>> >> >
>> >> > This patch changes that assert to an error return code of -EIO.
>> >> >
>> >> > Signed-off-by: Doug Oucharek <dougso@me.com>
>> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
>> >> > Reviewed-on: https://review.whamcloud.com/28491
>> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin@intel.com>
>> >> > Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
>> >> > Reviewed-by: Oleg Drokin <green@whamcloud.com>
>> >> > Signed-off-by: James Simmons <jsimmons@infradead.org>
>> >> > ---
>> >> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
>> >> >  1 file changed, 6 insertions(+), 2 deletions(-)
>> >> >
>> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > index 27eb1c0..7e7db24 100644
>> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
>> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
>> >> >  	/* cleanup the state of the bulk for it will be reused */
>> >> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
>> >> >  		desc->bd_nob_transferred = 0;
>> >> > -	else
>> >> > -		LASSERT(desc->bd_nob_transferred == 0);
>> >> > +	else if (desc->bd_nob_transferred != 0)
>> >> > +		/* If the network failed after an RPC was sent, this condition
>> >> > +		 * could happen.  Rather than assert (was here before), return
>> >> > +		 * an EIO error.
>> >> > +		 */
>> >> > +		return -EIO;
>> >> 
>> >> This looks weird, and the justification is rather lame.
>> >> I wonder if this is an attempt to fix the same problem that the smp_mb()
>> >> in the previous patch was attempting to fix (and I'm not yet convinced
>> >> that either is the correct fix).
>> >
>> > When the above condition happens the LASSERT ends up taking out the 
>> > node with a panic which in turn kills the application running on the cluster.
>> > When replaced with reporting an EIO error the node survives as well as the 
>> > job. The job might fail at its IO but it wouldn't fail performing its work 
>> > flow which is way more important.
>> 
>> Yes, a meaningless error is better than a crash, but a proper fix is
>> better still.  As I said, my guess is that the memory barrier in the
>> previous patch might have fixed the bug, so the LASSERT can remain.
>> 
>> Doug: is there any chance that this might be the case?
>
> I got a hold of Doug and discussed this issue. So the answer is that the
> original logs to track down the original problem no longer exist. So 
> finding the original source of the problem can't be done at this point.
> Would you be okay with a version of this patch with dump_stack() and
> treat it as a debug patch. We really need to collect logs to figure out
> the real problem. I will push a debug patch to OpenSFS branch since it
> is more widely used.

Yes.
   else if (WARN_ON(desc->nb_nob_transferred != 0))
       /* comment explaining what we know 8/
       return -EIO

would be perfectly appropriate.

Thanks,
NeilBrown
diff mbox series

Patch

diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
index 27eb1c0..7e7db24 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
@@ -139,8 +139,12 @@  static int ptlrpc_register_bulk(struct ptlrpc_request *req)
 	/* cleanup the state of the bulk for it will be reused */
 	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
 		desc->bd_nob_transferred = 0;
-	else
-		LASSERT(desc->bd_nob_transferred == 0);
+	else if (desc->bd_nob_transferred != 0)
+		/* If the network failed after an RPC was sent, this condition
+		 * could happen.  Rather than assert (was here before), return
+		 * an EIO error.
+		 */
+		return -EIO;
 
 	desc->bd_failure = 0;