upstream server crash

Message ID 1477322680.14828.6.camel@redhat.com
State New

Commit Message

Jeff Layton Oct. 24, 2016, 3:24 p.m. UTC
On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > 
> > > 
> > > 
> > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > 
> > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > 
> > > > 
> > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > 
> > > > > 
> > > > > 
> > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > pointers for RPC Call and Reply messages".
> > > > > 
> > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > 
> > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > little hard to reproduce so I don't completely trust my testing.
> > > > 
> > > > I've hit the same crash on a 4.9-rc1 kernel, and it reproduces reliably
> > > > for me when running the xfstests generic/013 case on a loopback-mounted
> > > > NFSv4.1 (or NFSv4.2) export, with XFS as the underlying exported fs. For
> > > > more details, please see
> > > > 
> > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > 
> > > 
> > > Looks like you landed at the same commit as Bruce, so that's probably
> > > legit. That commit is very small though. The only real change that
> > > doesn't affect the new field is this:
> > > 
> > > 
> > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > >                      req->rq_buffer,
> > >                      req->rq_callsize);
> > >         xdr_buf_init(&req->rq_rcv_buf,
> > > -                    (char *)req->rq_buffer + req->rq_callsize,
> > > +                    req->rq_rbuffer,
> > >                      req->rq_rcvsize);
> > > 
> > > 
> > > So I'm guessing this is breaking the callback channel somehow?
> > 
> > Could be the TCP backchannel code is using rq_buffer in a different
> > way than RDMA backchannel or the forward channel code.
> > 
> 
> Well, it basically allocates a page per rpc_rqst and then maps that.
> 
> One thing I notice is that this patch ensures that rq_rbuffer gets set
> up in rpc_malloc and xprt_rdma_allocate, but it looks like
> xprt_alloc_bc_req didn't get the same treatment.
> 
> I suspect that that may be the problem...
> 
In fact, maybe we just need this here? (untested and probably
whitespace damaged):

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Chuck Lever Oct. 24, 2016, 3:55 p.m. UTC | #1
> On Oct 24, 2016, at 11:24 AM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
>> On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
>>> 
>>>> 
>>>> 
>>>> On Oct 24, 2016, at 9:31 AM, Jeff Layton <jlayton@redhat.com> wrote:
>>>> 
>>>> On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
>>>>> 
>>>>> 
>>>>> On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I'm getting an intermittent crash in the nfs server as of
>>>>>> 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
>>>>>> pointers for RPC Call and Reply messages".
>>>>>> 
>>>>>> I haven't tried to understand that commit or why it would be a problem yet, I
>>>>>> don't see an obvious connection--I can take a closer look Monday.
>>>>>> 
>>>>>> Could even be that I just landed on this commit by chance, the problem is a
>>>>>> little hard to reproduce so I don't completely trust my testing.
>>>>> 
>>>>> I've hit the same crash on a 4.9-rc1 kernel, and it reproduces reliably
>>>>> for me when running the xfstests generic/013 case on a loopback-mounted
>>>>> NFSv4.1 (or NFSv4.2) export, with XFS as the underlying exported fs. For
>>>>> more details, please see
>>>>> 
>>>>> http://marc.info/?l=linux-nfs&m=147714320129362&w=2
>>>>> 
>>>> 
>>>> Looks like you landed at the same commit as Bruce, so that's probably
>>>> legit. That commit is very small though. The only real change that
>>>> doesn't affect the new field is this:
>>>> 
>>>> 
>>>> @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
>>>>                     req->rq_buffer,
>>>>                     req->rq_callsize);
>>>>        xdr_buf_init(&req->rq_rcv_buf,
>>>> -                    (char *)req->rq_buffer + req->rq_callsize,
>>>> +                    req->rq_rbuffer,
>>>>                     req->rq_rcvsize);
>>>> 
>>>> 
>>>> So I'm guessing this is breaking the callback channel somehow?
>>> 
>>> Could be the TCP backchannel code is using rq_buffer in a different
>>> way than RDMA backchannel or the forward channel code.
>>> 
>> 
>> Well, it basically allocates a page per rpc_rqst and then maps that.
>> 
>> One thing I notice is that this patch ensures that rq_rbuffer gets set
>> up in rpc_malloc and xprt_rdma_allocate, but it looks like
>> xprt_alloc_bc_req didn't get the same treatment.
>> 
>> I suspect that that may be the problem...
>> 
> In fact, maybe we just need this here? (untested and probably
> whitespace damaged):
> 
> diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> index ac701c28f44f..c561aa8ce05b 100644
> --- a/net/sunrpc/backchannel_rqst.c
> +++ b/net/sunrpc/backchannel_rqst.c
> @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
>                goto out_free;
>        }
>        req->rq_rcv_buf.len = PAGE_SIZE;
> +       req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;

That looks plausible! Basically that is needed after xdr_buf_init()
is done for a backchannel rpc_rqst's receive buffer.

net/sunrpc/xprtrdma/backchannel.c might need a similar change. I saw
crashes with generic/013 at bake-a-thon last week, but as the iommu
was involved with those, I've been looking in a different place. Will
give this a try.


>        /* Preallocate one XDR send buffer */
>        if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

--
Chuck Lever



J. Bruce Fields Oct. 24, 2016, 6:08 p.m. UTC | #2
On Mon, Oct 24, 2016 at 11:24:40AM -0400, Jeff Layton wrote:
> On Mon, 2016-10-24 at 11:19 -0400, Jeff Layton wrote:
> > On Mon, 2016-10-24 at 09:51 -0400, Chuck Lever wrote:
> > > 
> > > > 
> > > > 
> > > > On Oct 24, 2016, at 9:31 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > 
> > > > On Mon, 2016-10-24 at 11:15 +0800, Eryu Guan wrote:
> > > > > 
> > > > > 
> > > > > On Sun, Oct 23, 2016 at 02:21:15PM -0400, J. Bruce Fields wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > I'm getting an intermittent crash in the nfs server as of
> > > > > > 68778945e46f143ed7974b427a8065f69a4ce944 "SUNRPC: Separate buffer
> > > > > > pointers for RPC Call and Reply messages".
> > > > > > 
> > > > > > I haven't tried to understand that commit or why it would be a problem yet, I
> > > > > > don't see an obvious connection--I can take a closer look Monday.
> > > > > > 
> > > > > > Could even be that I just landed on this commit by chance, the problem is a
> > > > > > little hard to reproduce so I don't completely trust my testing.
> > > > > 
> > > > > I've hit the same crash on a 4.9-rc1 kernel, and it reproduces reliably
> > > > > for me when running the xfstests generic/013 case on a loopback-mounted
> > > > > NFSv4.1 (or NFSv4.2) export, with XFS as the underlying exported fs. For
> > > > > more details, please see
> > > > > 
> > > > > http://marc.info/?l=linux-nfs&m=147714320129362&w=2
> > > > > 
> > > > 
> > > > Looks like you landed at the same commit as Bruce, so that's probably
> > > > legit. That commit is very small though. The only real change that
> > > > doesn't affect the new field is this:
> > > > 
> > > > 
> > > > @@ -1766,7 +1766,7 @@ rpc_xdr_encode(struct rpc_task *task)
> > > >                      req->rq_buffer,
> > > >                      req->rq_callsize);
> > > >         xdr_buf_init(&req->rq_rcv_buf,
> > > > -                    (char *)req->rq_buffer + req->rq_callsize,
> > > > +                    req->rq_rbuffer,
> > > >                      req->rq_rcvsize);
> > > > 
> > > > 
> > > > So I'm guessing this is breaking the callback channel somehow?
> > > 
> > > Could be the TCP backchannel code is using rq_buffer in a different
> > > way than RDMA backchannel or the forward channel code.
> > > 
> > 
> > Well, it basically allocates a page per rpc_rqst and then maps that.
> > 
> > One thing I notice is that this patch ensures that rq_rbuffer gets set
> > up in rpc_malloc and xprt_rdma_allocate, but it looks like
> > xprt_alloc_bc_req didn't get the same treatment.
> > 
> > I suspect that that may be the problem...
> > 
> In fact, maybe we just need this here? (untested and probably
> whitespace damaged):

No change in results for me.

--b.
> 
> diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
> index ac701c28f44f..c561aa8ce05b 100644
> --- a/net/sunrpc/backchannel_rqst.c
> +++ b/net/sunrpc/backchannel_rqst.c
> @@ -100,6 +100,7 @@ struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
>                 goto out_free;
>         }
>         req->rq_rcv_buf.len = PAGE_SIZE;
> +       req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
>  
>         /* Preallocate one XDR send buffer */
>         if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {

Patch

diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index ac701c28f44f..c561aa8ce05b 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -100,6 +100,7 @@  struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt, gfp_t gfp_flags)
                goto out_free;
        }
        req->rq_rcv_buf.len = PAGE_SIZE;
+       req->rq_rbuffer = req->rq_rcv_buf.head[0].iov_base;
 
        /* Preallocate one XDR send buffer */
        if (xprt_alloc_xdr_buf(&req->rq_snd_buf, gfp_flags) < 0) {
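
For readers following the thread, the suspected failure mode can be modelled
in userspace. The sketch below is an illustration only, not kernel code: every
model_ name is hypothetical, and the structure keeps just the three fields
discussed above (rq_buffer, rq_rbuffer, rq_callsize). It shows that the old
expression in rpc_xdr_encode derived the receive pointer from rq_buffer, so an
allocator that never set rq_rbuffer still worked, whereas the new expression
reads rq_rbuffer directly and sees NULL after such an allocator. Whether
xprt_alloc_bc_req is the path behind the reported crash is not settled in this
thread; one tester saw no change in results with the patch applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for struct rpc_rqst. */
struct model_rqst {
	void   *rq_buffer;   /* Call (send) buffer */
	void   *rq_rbuffer;  /* Reply (receive) buffer, the new field */
	size_t  rq_callsize; /* bytes reserved for the Call */
};

/* An allocator that was updated for the new field: both pointers are set. */
int model_alloc_updated(struct model_rqst *req, size_t callsize, size_t rcvsize)
{
	char *buf = calloc(1, callsize + rcvsize);

	if (!buf)
		return -1;
	req->rq_buffer = buf;
	req->rq_rbuffer = buf + callsize;
	req->rq_callsize = callsize;
	return 0;
}

/* An allocator that was missed: rq_rbuffer keeps whatever value it had. */
int model_alloc_stale(struct model_rqst *req, size_t callsize, size_t rcvsize)
{
	char *buf = calloc(1, callsize + rcvsize);

	if (!buf)
		return -1;
	req->rq_buffer = buf;
	req->rq_callsize = callsize;
	return 0;
}

/* Receive-buffer base as computed before the commit. */
void *model_rcv_base_old(const struct model_rqst *req)
{
	return (char *)req->rq_buffer + req->rq_callsize;
}

/* Receive-buffer base as computed after the commit. */
void *model_rcv_base_new(const struct model_rqst *req)
{
	return req->rq_rbuffer;
}

/* Returns 0 if the model behaves as described, -1 on allocation failure. */
int model_demo(void)
{
	struct model_rqst updated = {0};
	struct model_rqst stale = {0};

	if (model_alloc_updated(&updated, 64, 128) != 0)
		return -1;
	if (model_alloc_stale(&stale, 64, 128) != 0) {
		free(updated.rq_buffer);
		return -1;
	}

	/* Updated allocator: old and new expressions agree. */
	assert(model_rcv_base_new(&updated) == model_rcv_base_old(&updated));

	/* Missed allocator: the old expression is usable, the new one is NULL. */
	assert(model_rcv_base_old(&stale) != NULL);
	assert(model_rcv_base_new(&stale) == NULL);

	free(updated.rq_buffer);
	free(stale.rq_buffer);
	return 0;
}
```

In this model the one-line patch corresponds to adding the missing rq_rbuffer
assignment to the second allocator. In the real code any other allocation path
that fills in a receive buffer without setting rq_rbuffer would need the same
treatment, which is the concern raised above for
net/sunrpc/xprtrdma/backchannel.c.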