[untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array

Message ID 16ea1371-84a5-c055-5b0c-fdc6d355276a@mellanox.com (mailing list archive)
State Deferred

Commit Message

Max Gurtovoy April 26, 2017, 12:25 p.m. UTC
On 4/26/2017 3:18 PM, Laurence Oberman wrote:
>
>
> ----- Original Message -----
>> From: "Laurence Oberman" <loberman@redhat.com>
>> To: "Max Gurtovoy" <maxg@mellanox.com>
>> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
>> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
>> linux-rdma@vger.kernel.org
>> Sent: Wednesday, April 26, 2017 7:47:37 AM
>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
>>
>>
>>
>> ----- Original Message -----
>>> From: "Max Gurtovoy" <maxg@mellanox.com>
>>> To: "Laurence Oberman" <loberman@redhat.com>, "Leon Romanovsky"
>>> <leonro@mellanox.com>
>>> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
>>> <dledford@redhat.com>, "Sagi Grimberg"
>>> <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
>>> linux-rdma@vger.kernel.org
>>> Sent: Wednesday, April 26, 2017 4:31:57 AM
>>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
>>> overflows the klms[] array
>>>
>>>
>>>
>>> On 4/25/2017 11:37 PM, Laurence Oberman wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Leon Romanovsky" <leonro@mellanox.com>
>>>>> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
>>>>> Cc: "Doug Ledford" <dledford@redhat.com>, "Max Gurtovoy"
>>>>> <maxg@mellanox.com>, "Sagi Grimberg" <sagi@grimberg.me>,
>>>>> "Israel Rukshin" <israelr@mellanox.com>, "Laurence Oberman"
>>>>> <loberman@redhat.com>, linux-rdma@vger.kernel.org
>>>>> Sent: Tuesday, April 25, 2017 1:58:49 PM
>>>>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
>>>>> overflows the klms[] array
>>>>>
>>>>> On Mon, Apr 24, 2017 at 03:15:28PM -0700, Bart Van Assche wrote:
>>>>>> ib_map_mr_sg() can pass an SG-list to .map_mr_sg() that is larger
>>>>>> than what fits into a single MR. .map_mr_sg() must not attempt to
>>>>>> map more SG-list elements than what fits into a single MR.
>>>>>> Hence make sure that mlx5_ib_sg_to_klms() does not write outside
>>>>>> the MR klms[] array.
>>>>>>
>>>>>> Fixes: b005d3164713 ("mlx5: Add arbitrary sg list support")
>>>>>> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
>>>>>> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
>>>>>> Cc: Sagi Grimberg <sagi@grimberg.me>
>>>>>> Cc: Leon Romanovsky <leonro@mellanox.com>
>>>>>> Cc: Israel Rukshin <israelr@mellanox.com>
>>>>>> Cc: <stable@vger.kernel.org>
>>>>>> ---
>>>>>>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>
>>>>> Bart,
>>>>>
>>>>> Thanks a lot, it indeed looks right.
>>>>> Acked-by: Leon Romanovsky <leonro@mellanox.com>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>> Hello Bart, Leon, Max and Israel.
>>>>
>>>> I cloned off Bart's tree.
>>>>
>>>> git clone https://github.com/bvanassche/linux
>>>> cd linux
>>>> git checkout block-scsi-for-next
>>>>
>>>> I checked all patches were in for this test.
>>>>
>>>> a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
>>>> dfa5a2b mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
>>>> f759c80 mlx5: Fix mlx5_ib_map_mr_sg mr lengt
>>>
>>> Hi,
>>> copying Sagi's request from a different thread:
>>>
>>> "
>>> Can you please enable srp_add_one debug:
>>>
>>> echo "func srp_add_one +p" > /sys/kernel/debug/dynamic_debug/control
>>>
>>> In addition apply the following:
>>> --
>>> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
>>> index d9c6c0ea750b..040fbc387e4f 100644
>>> --- a/drivers/infiniband/hw/mlx5/mr.c
>>> +++ b/drivers/infiniband/hw/mlx5/mr.c
>>> @@ -1403,6 +1403,8 @@ mlx5_alloc_priv_descs(struct ib_device *device,
>>>          int add_size;
>>>          int ret;
>>>
>>> +       WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
>>> +
>>>          add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
>>>
>>>          mr->descs_alloc = kzalloc(size + add_size, GFP_KERNEL);
>>>
>>> "
>>>
>>> Max.
>>>
>>>>
>>>> Built and tested the kernel.
>>>>
>>>> However this issue is not resolved :(
>>>>
>>>> [ 2707.931909] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817edca86b0
>>>> [ 2708.089806] mlx5_0:dump_cqe:262:(pid 20129): dump error cqe
>>>> [ 2708.121342] 00000000 00000000 00000000 00000000
>>>> [ 2708.147104] 00000000 00000000 00000000 00000000
>>>> [ 2708.172633] 00000000 00000000 00000000 00000000
>>>> [ 2708.198702] 00000000 0f007806 2500002a 14a527d0
>>>> [ 2732.434127] scsi host1: ib_srp: reconnect succeeded
>>>> [ 2733.048023] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9c30
>>>>
>>>> [root@localhost ~]# [ 2746.413277] mlx5_0:dump_cqe:262:(pid 15877): dump error cqe
>>>> [ 2746.443240] 00000000 00000000 00000000 00000000
>>>> [ 2746.469323] 00000000 00000000 00000000 00000000
>>>> [ 2746.495310] 00000000 00000000 00000000 00000000
>>>> [ 2746.521407] 00000000 0f007806 25000032 003c7ad0
>>>> [ 2752.445899] scsi host1: ib_srp: reconnect succeeded
>>>> [ 2752.481835] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
>>>> [ 2763.267386] mlx5_0:dump_cqe:262:(pid 15877): dump error cqe
>>>> [ 2763.297826] 00000000 00000000 00000000 00000000
>>>> [ 2763.323352] 00000000 00000000 00000000 00000000
>>>> [ 2763.348722] 00000000 00000000 00000000 00000000
>>>> [ 2763.374681] 00000000 0f007806 2500003a 00084bd0
>>>>
>>>> [root@localhost ~]# [ 2769.385203] fast_io_fail_tmo expired for SRP port-1:1 / host1.
>>>> [ 2769.415956] scsi host1: ib_srp: reconnect succeeded
>>>> [ 2769.450258] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
>>>> [ 2780.064627] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
>>>> [ 2780.093520] 00000000 00000000 00000000 00000000
>>>> [ 2780.120067] 00000000 00000000 00000000 00000000
>>>> [ 2780.145575] 00000000 00000000 00000000 00000000
>>>> [ 2780.171153] 00000000 0f007806 25000042 000833d0
>>>> [ 2785.923399] scsi host1: ib_srp: reconnect succeeded
>>>> [ 2785.957504] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
>>>> [ 2796.463426] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
>>>> [ 2796.495257] 00000000 00000000 00000000 00000000
>>>> [ 2796.521506] 00000000 00000000 00000000 00000000
>>>> [ 2796.547640] 00000000 00000000 00000000 00000000
>>>> [ 2796.573120] 00000000 0f007806 2500004a 00083bd0
>>>> [ 2802.562578] scsi host1: ib_srp: reconnect succeeded
>>>> [ 2802.596880] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
>>>>
>>>> Regards
>>>> Laurence
>>>>
>>>
>> Doing this now
>> Thanks
>> Laurence
>
> Max
>
> The patch is not correct; it does not compile:
>
> drivers/infiniband/hw/mlx5/mr.c: In function 'mlx5_alloc_priv_descs':
> drivers/infiniband/hw/mlx5/mr.c:1406:30: error: 'struct ib_device' has no member named 'attr'
>   WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
>                               ^
> ./include/asm-generic/bug.h:117:27: note: in definition of macro 'WARN_ON_ONCE'
>   int __ret_warn_once = !!(condition);   \
>
> I think you meant to give me
>
> WARN_ON_ONCE(ndescs > ib_device_attr->attr.max_fast_reg_page_list_len);
>
> Can you confirm?

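Hi Laurence,
It should be device->attrs.max_fast_reg_page_list_len.

Please check this one, which might solve the issue (on top of everything):


diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index b8f9382..063d116 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1559,7 +1559,7 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
                 mr->max_descs = ndescs;
         } else if (mr_type == IB_MR_TYPE_SG_GAPS) {
                 mr->access_mode = MLX5_MKC_ACCESS_MODE_KLMS;
-
+               MLX5_SET(mkc, mkc, translations_octword_size, ALIGN(max_num_sg + 1, 4));
                 err = mlx5_alloc_priv_descs(pd->device, mr,
                                             ndescs, sizeof(struct mlx5_klm));
                 if (err)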

thanks,
Max.

>
> Thanks
> Laurence
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
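
For illustration, the bound that the patch description asks for (never write past the MR's klms[] array) can be modeled in plain user-space C. This is only a sketch under assumptions: `struct sg_entry`, `struct klm_entry`, `struct model_mr` and `model_sg_to_klms()` are hypothetical stand-ins, not the mlx5 driver's types, and not the actual one-line change, which does not appear in this thread. The idea is that the loop stops at the MR's descriptor capacity and returns how many elements were mapped, so the caller can tell when only part of the SG list fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a scatterlist element and a KLM descriptor. */
struct sg_entry  { uint64_t addr; uint32_t len; };
struct klm_entry { uint64_t va;   uint32_t bcount; };

/* Hypothetical model of an MR that owns a fixed-size klms[] array. */
struct model_mr {
	struct klm_entry *klms;
	int max_descs;   /* capacity of klms[] */
	int ndescs;      /* entries actually filled */
	uint64_t length; /* total bytes mapped */
};

/*
 * Map at most mr->max_descs elements of sgl[] into mr->klms[].
 * Returns the number of elements mapped, which may be less than sg_nents.
 */
static int model_sg_to_klms(struct model_mr *mr, const struct sg_entry *sgl,
			    int sg_nents)
{
	int i;

	assert(mr != NULL && mr->klms != NULL && sgl != NULL);

	mr->length = 0;
	/* The "i < mr->max_descs" test is what prevents the overflow. */
	for (i = 0; i < sg_nents && i < mr->max_descs; i++) {
		mr->klms[i].va = sgl[i].addr;
		mr->klms[i].bcount = sgl[i].len;
		mr->length += sgl[i].len;
	}
	mr->ndescs = i;
	return i;
}
```

With a four-entry klms[] array and a six-element SG list, this model maps four elements and leaves the remaining two for the caller to handle, for example with a second MR.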

Comments

Laurence Oberman April 26, 2017, 1:28 p.m. UTC | #1
----- Original Message -----
> From: "Max Gurtovoy" <maxg@mellanox.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
> linux-rdma@vger.kernel.org
> Sent: Wednesday, April 26, 2017 8:25:30 AM
> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
> 
> 
> 
> On 4/26/2017 3:18 PM, Laurence Oberman wrote:
> >
> > Max
> >
> > The Patch is not correct.
> >
> > drivers/infiniband/hw/mlx5/mr.c: In function 'mlx5_alloc_priv_descs':
> > drivers/infiniband/hw/mlx5/mr.c:1406:30: error: 'struct ib_device' has no
> > member named 'attr'
> >   WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
> >                               ^
> > ./include/asm-generic/bug.h:117:27: note: in definition of macro
> > 'WARN_ON_ONCE'
> >   int __ret_warn_once = !!(condition);   \
> >
> > I think you meant to give me
> >
> > WARN_ON_ONCE(ndescs > ib_device_attr->attr.max_fast_reg_page_list_len);
> >
> > Can you confirm
> 
> Hi Laurence,
> should be device->attrs.max_fast_reg_page_list_len.
> 
> please check this one that might solve the issue (on top of everything):
> 
> 
> diff --git a/drivers/infiniband/hw/mlx5/mr.c
> b/drivers/infiniband/hw/mlx5/mr.c
> index b8f9382..063d116 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1559,7 +1559,7 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
>                  mr->max_descs = ndescs;
>          } else if (mr_type == IB_MR_TYPE_SG_GAPS) {
>                  mr->access_mode = MLX5_MKC_ACCESS_MODE_KLMS;
> -
> +               MLX5_SET(mkc, mkc, translations_octword_size,
> ALIGN(max_num_sg + 1, 4));
>                  err = mlx5_alloc_priv_descs(pd->device, mr,
>                                              ndescs, sizeof(struct
> mlx5_klm));
>                  if (err)
> 
> thanks,
> Max.
> 
> >
> > Thanks
> > Laurence
> >
> 
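
As a worked example of the arithmetic in the quoted patch, `ALIGN(max_num_sg + 1, 4)` rounds `max_num_sg + 1` up to the next multiple of 4. The sketch below reproduces that rounding in user space; `model_align()` and `model_translations_size()` are hypothetical helper names, and the sketch says nothing about how the hardware interprets `translations_octword_size`.

```c
#include <assert.h>
#include <stdint.h>

/* User-space equivalent of the kernel's ALIGN() for a power-of-two 'a'. */
static uint32_t model_align(uint32_t x, uint32_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + (a - 1)) & ~(a - 1);
}

/* Value the quoted patch would pass to MLX5_SET() for a given max_num_sg. */
static uint32_t model_translations_size(uint32_t max_num_sg)
{
	return model_align(max_num_sg + 1, 4);
}
```

For example, a `max_num_sg` of 256 gives 257, which rounds up to 260, while a `max_num_sg` of 255 gives exactly 256.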

Hello Max,

I applied the corrected WARN_ON_ONCE patch and the patch above, on top of everything else from Bart's tree as it was.

It still fails.

For a baseline, I can revert
a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS

and then test again to make sure we are starting from a good place.

Initiator log

[  280.481951] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817d9a881b8
[  301.149106] scsi host1: ib_srp: reconnect succeeded
[  301.280635] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed32f2f0
[  334.596420] scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817c592c970
[  334.599689] mlx5_1:dump_cqe:262:(pid 20): dump error cqe
[  334.599691] 00000000 00000000 00000000 00000000
[  334.599692] 00000000 00000000 00000000 00000000
[  334.599692] 00000000 00000000 00000000 00000000
[  334.599693] 00000000 0f007806 2500002d 067b48d0
[  334.599697] scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817c6e30078
[  336.117248] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
[  336.145840] 00000000 00000000 00000000 00000000
[  336.171830] 00000000 00000000 00000000 00000000
[  336.197688] 00000000 00000000 00000000 00000000
[  336.223720] 00000000 0f007806 25000032 005408d0
[  339.712706] fast_io_fail_tmo expired for SRP port-1:1 / host1.
[  341.453634] scsi host1: ib_srp: reconnect succeeded
[  341.481600] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
[  341.482145] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ecaf6970
[  341.559359] 00000000 00000000 00000000 00000000
[  341.585397] 00000000 00000000 00000000 00000000
[  341.610948] 00000000 00000000 00000000 00000000
[  341.637515] 00000000 0f007806 2500003d 000046d0
[  342.297598] sd 1:0:0:9: rejecting I/O to offline device
[  342.297936] sd 1:0:0:9: [sdg] tag#28 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.297941] sd 1:0:0:9: [sdg] tag#28 CDB: Write(10) 2a 00 00 00 40 00 00 40 00 00
[  342.297943] blk_update_request: recoverable transport error, dev sdg, sector 16384
[  342.297951] sd 1:0:0:20: [sdar] tag#5 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.297952] sd 1:0:0:20: [sdar] tag#15 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.297956] sd 1:0:0:20: [sdar] tag#5 CDB: Write(10) 2a 00 00 03 c0 00 00 40 00 00
[  342.297956] sd 1:0:0:20: [sdar] tag#15 CDB: Write(10) 2a 00 00 2c c0 00 00 40 00 00
[  342.297958] blk_update_request: recoverable transport error, dev sdar, sector 245760
[  342.297959] blk_update_request: recoverable transport error, dev sdar, sector 2932736
[  342.298119] device-mapper: multipath: Failing path 8:96.
[  342.298266] sd 1:0:0:9: [sdg] tag#29 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298268] sd 1:0:0:9: [sdg] tag#29 CDB: Write(10) 2a 00 00 00 c0 00 00 40 00 00
[  342.298269] blk_update_request: recoverable transport error, dev sdg, sector 49152
[  342.298300] device-mapper: multipath: Failing path 66:176.
[  342.298486] sd 1:0:0:20: [sdar] tag#16 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298488] sd 1:0:0:20: [sdar] tag#6 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298489] sd 1:0:0:20: [sdar] tag#16 CDB: Write(10) 2a 00 00 2d 40 00 00 40 00 00
[  342.298490] sd 1:0:0:20: [sdar] tag#6 CDB: Write(10) 2a 00 00 04 40 00 00 40 00 00
[  342.298491] blk_update_request: recoverable transport error, dev sdar, sector 2965504
[  342.298492] blk_update_request: recoverable transport error, dev sdar, sector 278528
[  342.298582] sd 1:0:0:9: [sdg] tag#30 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298584] sd 1:0:0:9: [sdg] tag#30 CDB: Write(10) 2a 00 00 01 40 00 00 40 00 00
[  342.298585] blk_update_request: recoverable transport error, dev sdg, sector 81920
[  342.298889] sd 1:0:0:9: [sdg] tag#31 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298890] sd 1:0:0:9: [sdg] tag#31 CDB: Write(10) 2a 00 00 01 c0 00 00 40 00 00
[  342.298891] blk_update_request: recoverable transport error, dev sdg, sector 114688
[  342.298981] sd 1:0:0:20: [sdar] tag#7 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.298983] sd 1:0:0:20: [sdar] tag#7 CDB: Write(10) 2a 00 00 04 c0 00 00 40 00 00
[  342.298985] blk_update_request: recoverable transport error, dev sdar, sector 311296
[  342.299004] sd 1:0:0:20: [sdar] tag#17 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
[  342.299007] sd 1:0:0:20: [sdar] tag#17 CDB: Write(10) 2a 00 00 34 c0 00 00 40 00 00
[  342.299009] blk_update_request: recoverable transport error, dev sdar, sector 3457024
[  342.356353] device-mapper: multipath: Failing path 8:64.
[  342.356489] device-mapper: multipath: Failing path 8:128.
[  342.356628] device-mapper: multipath: Failing path 8:160.
[  342.356699] device-mapper: multipath: Failing path 8:176.
[  342.356767] device-mapper: multipath: Failing path 8:240.
[  342.356834] device-mapper: multipath: Failing path 8:208.
[  342.356900] device-mapper: multipath: Failing path 65:16.
[  342.356967] device-mapper: multipath: Failing path 65:64.
[  342.357035] device-mapper: multipath: Failing path 65:96.
[  342.357103] device-mapper: multipath: Failing path 65:128.
[  342.357169] device-mapper: multipath: Failing path 65:176.
[  342.357237] device-mapper: multipath: Failing path 65:208.
[  342.357303] device-mapper: multipath: Failing path 65:224.
[  342.357371] device-mapper: multipath: Failing path 66:0.
[  342.357454] device-mapper: multipath: Failing path 66:32.
[  342.357521] device-mapper: multipath: Failing path 66:48.
[  342.357647] device-mapper: multipath: Failing path 66:80.
[  342.357714] device-mapper: multipath: Failing path 66:112.
[  342.357781] device-mapper: multipath: Failing path 66:144.
[  342.357936] device-mapper: multipath: Failing path 66:208.
[  342.358019] device-mapper: multipath: Failing path 66:240.
[  342.358115] device-mapper: multipath: Failing path 67:16.
[  342.358183] device-mapper: multipath: Failing path 67:48.
[  342.358264] device-mapper: multipath: Failing path 67:80.
[  342.358359] device-mapper: multipath: Failing path 67:128.
[  342.358442] device-mapper: multipath: Failing path 67:160.
[  342.358594] device-mapper: multipath: Failing path 67:224.
[  342.358671] device-mapper: multipath: Failing path 67:208.
[  350.157728] scsi host2: ib_srp: reconnect succeeded
[  350.189605] mlx5_1:dump_cqe:262:(pid 4756): dump error cqe
[  350.193180] mlx5_1:dump_cqe:262:(pid 1275): dump error cqe
[  350.193182] 00000000 00000000 00000000 00000000
[  350.193182] 00000000 00000000 00000000 00000000
[  350.193183] 00000000 00000000 00000000 00000000
[  350.193183] 00000000 0f007806 25000035 04f569d0
[  350.193187] scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817c6e30078
[  350.412637] 00000000 00000000 00000000 00000000
[  350.436431] 00000000 00000000 00000000 00000000
[  350.461871] 00000000 00000000 00000000 00000000
[  350.487549] 00000000 0f007806 25000032 000843d0

Target Log

These events happened after the first failures on the initiator

[ 1111.029847] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-49.
[ 1111.078815] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-48.
[ 1111.127420] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-47.
[ 1111.175801] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-46.
[ 1111.223725] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-45.
[ 1111.271957] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-44.
[ 1111.319494] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-43.
[ 1111.365795] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-42.
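
The last row of each "dump error cqe" block in the initiator log can be split into fields with a few shifts. The sketch below rests on assumptions that should be checked against the mlx5 headers: that dump_cqe prints each 32-bit word after a big-endian to CPU conversion, and that an error CQE ends with a vendor syndrome byte, a syndrome byte, a word holding the WQE opcode and QP number, a 16-bit WQE counter, a signature byte, and an op_own byte whose high nibble is the CQE opcode. `struct err_cqe_tail` and `decode_err_cqe_tail()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest from the last 16 bytes of an error CQE (assumed layout). */
struct err_cqe_tail {
	uint8_t  vendor_err_synd;
	uint8_t  syndrome;
	uint8_t  wqe_opcode;
	uint32_t qpn;
	uint16_t wqe_counter;
	uint8_t  cqe_opcode;
};

/* row[] holds the four hex words of the last dump line, left to right. */
static struct err_cqe_tail decode_err_cqe_tail(const uint32_t row[4])
{
	struct err_cqe_tail t;

	assert(row != NULL);

	t.vendor_err_synd = (row[1] >> 8) & 0xff;
	t.syndrome        = row[1] & 0xff;
	t.wqe_opcode      = row[2] >> 24;
	t.qpn             = row[2] & 0xffffff;
	t.wqe_counter     = row[3] >> 16;
	t.cqe_opcode      = (row[3] & 0xff) >> 4;
	return t;
}
```

Applied to the row `00000000 0f007806 2500002d 067b48d0`, this yields syndrome 0x06, WQE opcode 0x25 and QP number 0x2d. If the assumed layout holds, the syndrome lines up with the "memory management operation error (6)" status that ib_srp reports for the FAST REG work request.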

Laurence Oberman April 26, 2017, 1:50 p.m. UTC | #2
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Max Gurtovoy" <maxg@mellanox.com>
> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
> linux-rdma@vger.kernel.org
> Sent: Wednesday, April 26, 2017 9:28:37 AM
> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
> 
> 
> 
> ----- Original Message -----
> > From: "Max Gurtovoy" <maxg@mellanox.com>
> > To: "Laurence Oberman" <loberman@redhat.com>
> > Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche"
> > <bart.vanassche@sandisk.com>, "Doug Ledford"
> > <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin"
> > <israelr@mellanox.com>,
> > linux-rdma@vger.kernel.org
> > Sent: Wednesday, April 26, 2017 8:25:30 AM
> > Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > overflows the klms[] array
> > 
> > 
> > 
> > On 4/26/2017 3:18 PM, Laurence Oberman wrote:
> > >
> > >
> > > ----- Original Message -----
> > >> From: "Laurence Oberman" <loberman@redhat.com>
> > >> To: "Max Gurtovoy" <maxg@mellanox.com>
> > >> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche"
> > >> <bart.vanassche@sandisk.com>, "Doug Ledford"
> > >> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel
> > >> Rukshin" <israelr@mellanox.com>,
> > >> linux-rdma@vger.kernel.org
> > >> Sent: Wednesday, April 26, 2017 7:47:37 AM
> > >> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > >> overflows the klms[] array
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >>> From: "Max Gurtovoy" <maxg@mellanox.com>
> > >>> To: "Laurence Oberman" <loberman@redhat.com>, "Leon Romanovsky"
> > >>> <leonro@mellanox.com>
> > >>> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
> > >>> <dledford@redhat.com>, "Sagi Grimberg"
> > >>> <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
> > >>> linux-rdma@vger.kernel.org
> > >>> Sent: Wednesday, April 26, 2017 4:31:57 AM
> > >>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > >>> overflows the klms[] array
> > >>>
> > >>>
> > >>>
> > >>> On 4/25/2017 11:37 PM, Laurence Oberman wrote:
> > >>>>
> > >>>>
> > >>>> ----- Original Message -----
> > >>>>> From: "Leon Romanovsky" <leonro@mellanox.com>
> > >>>>> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> > >>>>> Cc: "Doug Ledford" <dledford@redhat.com>, "Max Gurtovoy"
> > >>>>> <maxg@mellanox.com>, "Sagi Grimberg" <sagi@grimberg.me>,
> > >>>>> "Israel Rukshin" <israelr@mellanox.com>, "Laurence Oberman"
> > >>>>> <loberman@redhat.com>, linux-rdma@vger.kernel.org
> > >>>>> Sent: Tuesday, April 25, 2017 1:58:49 PM
> > >>>>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > >>>>> overflows the klms[] array
> > >>>>>
> > >>>>> On Mon, Apr 24, 2017 at 03:15:28PM -0700, Bart Van Assche wrote:
> > >>>>>> ib_map_mr_sg() can pass an SG-list to .map_mr_sg() that is larger
> > >>>>>> than what fits into a single MR. .map_mr_sg() must not attempt to
> > >>>>>> map more SG-list elements than what fits into a single MR.
> > >>>>>> Hence make sure that mlx5_ib_sg_to_klms() does not write outside
> > >>>>>> the MR klms[] array.
> > >>>>>>
> > >>>>>> Fixes: b005d3164713 ("mlx5: Add arbitrary sg list support")
> > >>>>>> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> > >>>>>> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> > >>>>>> Cc: Sagi Grimberg <sagi@grimberg.me>
> > >>>>>> Cc: Leon Romanovsky <leonro@mellanox.com>
> > >>>>>> Cc: Israel Rukshin <israelr@mellanox.com>
> > >>>>>> Cc: <stable@vger.kernel.org>
> > >>>>>> ---
> > >>>>>>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
> > >>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>>>>
> > >>>>>
> > >>>>> Bart,
> > >>>>>
> > >>>>> Thanks a lot, it indeed looks right.
> > >>>>> Acked-by: Leon Romanovsky <leonro@mellanox.com>
> > >>>>>
> > >>>>> Thanks
> > >>>>>
> > >>>>
> > >>>>
> > >>>> Hello Bart, Leon, Max and Israel.
> > >>>>
> > >>>> I cloned off Barts tree.
> > >>>>
> > >>>> git clone https://github.com/bvanassche/linux
> > >>>> cd linux
> > >>>> git checkout block-scsi-for-next
> > >>>>
> > >>>> I checked all patches were in for this test.
> > >>>>
> > >>>> a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> > >>>> dfa5a2b mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[]
> > >>>> array
> > >>>> f759c80 mlx5: Fix mlx5_ib_map_mr_sg mr lengt
> > >>>
> > >>> Hi,
> > >>> copying Sagi's request from different thread:
> > >>>
> > >>> "
> > >>> Can you please enable srp_add_one debug:
> > >>>
> > >>> echo "func srp_add_one +p" > /sys/kernel/debug/dynamic_debug/control
> > >>>
> > >>> In addition apply the following:
> > >>> --
> > >>> diff --git a/drivers/infiniband/hw/mlx5/mr.c
> > >>> b/drivers/infiniband/hw/mlx5/mr.c
> > >>> index d9c6c0ea750b..040fbc387e4f 100644
> > >>> --- a/drivers/infiniband/hw/mlx5/mr.c
> > >>> +++ b/drivers/infiniband/hw/mlx5/mr.c
> > >>> @@ -1403,6 +1403,8 @@ mlx5_alloc_priv_descs(struct ib_device *device,
> > >>>          int add_size;
> > >>>          int ret;
> > >>>
> > >>> +       WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
> > >>> +
> > >>>          add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN,
> > >>>          0);
> > >>>
> > >>>          mr->descs_alloc = kzalloc(size + add_size, GFP_KERNEL);
> > >>>
> > >>> "
> > >>>
> > >>> Max.
> > >>>
> > >>>>
> > >>>> Built and tested the kernel.
> > >>>>
> > >>>> However this issue is not resolved :(
> > >>>>
> > >>>> [ 2707.931909] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > >>>> for
> > >>>> CQE ffff8817edca86b0
> > >>>> [ 2708.089806] mlx5_0:dump_cqe:262:(pid 20129): dump error cqe
> > >>>> [ 2708.121342] 00000000 00000000 00000000 00000000
> > >>>> [ 2708.147104] 00000000 00000000 00000000 00000000
> > >>>> [ 2708.172633] 00000000 00000000 00000000 00000000
> > >>>> [ 2708.198702] 00000000 0f007806 2500002a 14a527d0
> > >>>> [ 2732.434127] scsi host1: ib_srp: reconnect succeeded
> > >>>> [ 2733.048023] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9c30
> > >>>>
> > >>>> [root@localhost ~]# [ 2746.413277] mlx5_0:dump_cqe:262:(pid 15877): dump error cqe
> > >>>> [ 2746.443240] 00000000 00000000 00000000 00000000
> > >>>> [ 2746.469323] 00000000 00000000 00000000 00000000
> > >>>> [ 2746.495310] 00000000 00000000 00000000 00000000
> > >>>> [ 2746.521407] 00000000 0f007806 25000032 003c7ad0
> > >>>> [ 2752.445899] scsi host1: ib_srp: reconnect succeeded
> > >>>> [ 2752.481835] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
> > >>>> [ 2763.267386] mlx5_0:dump_cqe:262:(pid 15877): dump error cqe
> > >>>> [ 2763.297826] 00000000 00000000 00000000 00000000
> > >>>> [ 2763.323352] 00000000 00000000 00000000 00000000
> > >>>> [ 2763.348722] 00000000 00000000 00000000 00000000
> > >>>> [ 2763.374681] 00000000 0f007806 2500003a 00084bd0
> > >>>>
> > >>>> [root@localhost ~]# [ 2769.385203] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > >>>> [ 2769.415956] scsi host1: ib_srp: reconnect succeeded
> > >>>> [ 2769.450258] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
> > >>>> [ 2780.064627] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
> > >>>> [ 2780.093520] 00000000 00000000 00000000 00000000
> > >>>> [ 2780.120067] 00000000 00000000 00000000 00000000
> > >>>> [ 2780.145575] 00000000 00000000 00000000 00000000
> > >>>> [ 2780.171153] 00000000 0f007806 25000042 000833d0
> > >>>> [ 2785.923399] scsi host1: ib_srp: reconnect succeeded
> > >>>> [ 2785.957504] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
> > >>>> [ 2796.463426] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
> > >>>> [ 2796.495257] 00000000 00000000 00000000 00000000
> > >>>> [ 2796.521506] 00000000 00000000 00000000 00000000
> > >>>> [ 2796.547640] 00000000 00000000 00000000 00000000
> > >>>> [ 2796.573120] 00000000 0f007806 2500004a 00083bd0
> > >>>> [ 2802.562578] scsi host1: ib_srp: reconnect succeeded
> > >>>> [ 2802.596880] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed0a9cf0
> > >>>>
> > >>>> Regards
> > >>>> Laurence
> > >>>>
> > >>>
> > >> Doing this now
> > >> Thanks
> > >> Laurence
> > >
> > > Max
> > >
> > > The Patch is not correct.
> > >
> > > drivers/infiniband/hw/mlx5/mr.c: In function 'mlx5_alloc_priv_descs':
> > > drivers/infiniband/hw/mlx5/mr.c:1406:30: error: 'struct ib_device' has no member named 'attr'
> > >   WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
> > >                               ^
> > > ./include/asm-generic/bug.h:117:27: note: in definition of macro 'WARN_ON_ONCE'
> > >   int __ret_warn_once = !!(condition);   \
> > >
> > > I think you meant to give me
> > >
> > > WARN_ON_ONCE(ndescs > ib_device_attr->attr.max_fast_reg_page_list_len);
> > >
> > > Can you confirm?
> > 
> > Hi Laurence,
> > It should be device->attrs.max_fast_reg_page_list_len.
> > 
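For reference, the compile error quoted above comes down to a member name: as Max notes, the attribute block inside struct ib_device is called attrs, not attr. A minimal userspace sketch of the corrected check follows. The two structs are cut-down stand-ins (assumptions, not the real kernel definitions), ndescs_exceeds_limit() is a hypothetical helper, and the actual debug patch passes this condition to WARN_ON_ONCE() instead of returning it.

```c
#include <assert.h>
#include <stdbool.h>

/* Cut-down stand-ins for the kernel structs; only the field used here. */
struct ib_device_attr {
	unsigned int max_fast_reg_page_list_len;
};

struct ib_device {
	struct ib_device_attr attrs;	/* note: "attrs", not "attr" */
};

/*
 * Hypothetical helper: true when the requested descriptor count is
 * larger than the device's advertised fast-registration limit.
 */
static bool ndescs_exceeds_limit(const struct ib_device *device,
				 unsigned int ndescs)
{
	return ndescs > device->attrs.max_fast_reg_page_list_len;
}
```

With a made-up limit of 256, a request for 256 descriptors passes and a request for 257 would trip the warning.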
> > Please check this one, which might solve the issue (apply it on top of everything else):
> > 
> > 
> > diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> > index b8f9382..063d116 100644
> > --- a/drivers/infiniband/hw/mlx5/mr.c
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -1559,7 +1559,7 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
> >                  mr->max_descs = ndescs;
> >          } else if (mr_type == IB_MR_TYPE_SG_GAPS) {
> >                  mr->access_mode = MLX5_MKC_ACCESS_MODE_KLMS;
> > -
> > +               MLX5_SET(mkc, mkc, translations_octword_size, ALIGN(max_num_sg + 1, 4));
> >                  err = mlx5_alloc_priv_descs(pd->device, mr,
> >                                              ndescs, sizeof(struct mlx5_klm));
> >                  if (err)
> > 
> > thanks,
> > Max.
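To make the arithmetic in the diff above concrete: ALIGN(max_num_sg + 1, 4) adds one entry and rounds the result up to the next multiple of 4 before it is written to translations_octword_size. The sketch below reproduces that rounding in userspace. ALIGN_POW2 is a local macro that mirrors the kernel's ALIGN() for power-of-two alignments, klm_octwords() is a hypothetical helper name, and the input values are illustrative only, not taken from this thread.

```c
#include <assert.h>

/* Userspace equivalent of the kernel's ALIGN() for power-of-two alignments. */
#define ALIGN_POW2(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/*
 * Hypothetical helper mirroring the expression in the diff:
 * ALIGN(max_num_sg + 1, 4).
 */
static unsigned int klm_octwords(unsigned int max_num_sg)
{
	return ALIGN_POW2(max_num_sg + 1u, 4u);
}
```

For example, an input of 255 yields 256, while an input of 256 yields 260, because 257 is rounded up to the next multiple of 4.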
> > 
> > >
> > > Thanks
> > > Laurence
> > >
> > 
> 
> Hello Max
> 
> I have applied the corrected WARN_ON_ONCE patch and the above patch, with
> the rest as it was from Bart's tree.
> 
> Still fails.
> 
> For a baseline I can revert
> a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> 
> Then test again to make sure we are starting from a good place.
> 
> Initiator log
> 
> [  280.481951] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817d9a881b8
> [  301.149106] scsi host1: ib_srp: reconnect succeeded
> [  301.280635] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed32f2f0
> [  334.596420] scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817c592c970
> [  334.599689] mlx5_1:dump_cqe:262:(pid 20): dump error cqe
> [  334.599691] 00000000 00000000 00000000 00000000
> [  334.599692] 00000000 00000000 00000000 00000000
> [  334.599692] 00000000 00000000 00000000 00000000
> [  334.599693] 00000000 0f007806 2500002d 067b48d0
> [  334.599697] scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817c6e30078
> [  336.117248] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
> [  336.145840] 00000000 00000000 00000000 00000000
> [  336.171830] 00000000 00000000 00000000 00000000
> [  336.197688] 00000000 00000000 00000000 00000000
> [  336.223720] 00000000 0f007806 25000032 005408d0
> [  339.712706] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> [  341.453634] scsi host1: ib_srp: reconnect succeeded
> [  341.481600] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
> [  341.482145] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ecaf6970
> [  341.559359] 00000000 00000000 00000000 00000000
> [  341.585397] 00000000 00000000 00000000 00000000
> [  341.610948] 00000000 00000000 00000000 00000000
> [  341.637515] 00000000 0f007806 2500003d 000046d0
> [  342.297598] sd 1:0:0:9: rejecting I/O to offline device
> [  342.297936] sd 1:0:0:9: [sdg] tag#28 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.297941] sd 1:0:0:9: [sdg] tag#28 CDB: Write(10) 2a 00 00 00 40 00 00 40 00 00
> [  342.297943] blk_update_request: recoverable transport error, dev sdg, sector 16384
> [  342.297951] sd 1:0:0:20: [sdar] tag#5 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.297952] sd 1:0:0:20: [sdar] tag#15 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.297956] sd 1:0:0:20: [sdar] tag#5 CDB: Write(10) 2a 00 00 03 c0 00 00 40 00 00
> [  342.297956] sd 1:0:0:20: [sdar] tag#15 CDB: Write(10) 2a 00 00 2c c0 00 00 40 00 00
> [  342.297958] blk_update_request: recoverable transport error, dev sdar, sector 245760
> [  342.297959] blk_update_request: recoverable transport error, dev sdar, sector 2932736
> [  342.298119] device-mapper: multipath: Failing path 8:96.
> [  342.298266] sd 1:0:0:9: [sdg] tag#29 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298268] sd 1:0:0:9: [sdg] tag#29 CDB: Write(10) 2a 00 00 00 c0 00 00 40 00 00
> [  342.298269] blk_update_request: recoverable transport error, dev sdg, sector 49152
> [  342.298300] device-mapper: multipath: Failing path 66:176.
> [  342.298486] sd 1:0:0:20: [sdar] tag#16 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298488] sd 1:0:0:20: [sdar] tag#6 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298489] sd 1:0:0:20: [sdar] tag#16 CDB: Write(10) 2a 00 00 2d 40 00 00 40 00 00
> [  342.298490] sd 1:0:0:20: [sdar] tag#6 CDB: Write(10) 2a 00 00 04 40 00 00 40 00 00
> [  342.298491] blk_update_request: recoverable transport error, dev sdar, sector 2965504
> [  342.298492] blk_update_request: recoverable transport error, dev sdar, sector 278528
> [  342.298582] sd 1:0:0:9: [sdg] tag#30 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298584] sd 1:0:0:9: [sdg] tag#30 CDB: Write(10) 2a 00 00 01 40 00 00 40 00 00
> [  342.298585] blk_update_request: recoverable transport error, dev sdg, sector 81920
> [  342.298889] sd 1:0:0:9: [sdg] tag#31 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298890] sd 1:0:0:9: [sdg] tag#31 CDB: Write(10) 2a 00 00 01 c0 00 00 40 00 00
> [  342.298891] blk_update_request: recoverable transport error, dev sdg, sector 114688
> [  342.298981] sd 1:0:0:20: [sdar] tag#7 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.298983] sd 1:0:0:20: [sdar] tag#7 CDB: Write(10) 2a 00 00 04 c0 00 00 40 00 00
> [  342.298985] blk_update_request: recoverable transport error, dev sdar, sector 311296
> [  342.299004] sd 1:0:0:20: [sdar] tag#17 FAILED Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> [  342.299007] sd 1:0:0:20: [sdar] tag#17 CDB: Write(10) 2a 00 00 34 c0 00 00 40 00 00
> [  342.299009] blk_update_request: recoverable transport error, dev sdar, sector 3457024
> [  342.356353] device-mapper: multipath: Failing path 8:64.
> [  342.356489] device-mapper: multipath: Failing path 8:128.
> [  342.356628] device-mapper: multipath: Failing path 8:160.
> [  342.356699] device-mapper: multipath: Failing path 8:176.
> [  342.356767] device-mapper: multipath: Failing path 8:240.
> [  342.356834] device-mapper: multipath: Failing path 8:208.
> [  342.356900] device-mapper: multipath: Failing path 65:16.
> [  342.356967] device-mapper: multipath: Failing path 65:64.
> [  342.357035] device-mapper: multipath: Failing path 65:96.
> [  342.357103] device-mapper: multipath: Failing path 65:128.
> [  342.357169] device-mapper: multipath: Failing path 65:176.
> [  342.357237] device-mapper: multipath: Failing path 65:208.
> [  342.357303] device-mapper: multipath: Failing path 65:224.
> [  342.357371] device-mapper: multipath: Failing path 66:0.
> [  342.357454] device-mapper: multipath: Failing path 66:32.
> [  342.357521] device-mapper: multipath: Failing path 66:48.
> [  342.357647] device-mapper: multipath: Failing path 66:80.
> [  342.357714] device-mapper: multipath: Failing path 66:112.
> [  342.357781] device-mapper: multipath: Failing path 66:144.
> [  342.357936] device-mapper: multipath: Failing path 66:208.
> [  342.358019] device-mapper: multipath: Failing path 66:240.
> [  342.358115] device-mapper: multipath: Failing path 67:16.
> [  342.358183] device-mapper: multipath: Failing path 67:48.
> [  342.358264] device-mapper: multipath: Failing path 67:80.
> [  342.358359] device-mapper: multipath: Failing path 67:128.
> [  342.358442] device-mapper: multipath: Failing path 67:160.
> [  342.358594] device-mapper: multipath: Failing path 67:224.
> [  342.358671] device-mapper: multipath: Failing path 67:208.
> [  350.157728] scsi host2: ib_srp: reconnect succeeded
> [  350.189605] mlx5_1:dump_cqe:262:(pid 4756): dump error cqe
> [  350.193180] mlx5_1:dump_cqe:262:(pid 1275): dump error cqe
> [  350.193182] 00000000 00000000 00000000 00000000
> [  350.193182] 00000000 00000000 00000000 00000000
> [  350.193183] 00000000 00000000 00000000 00000000
> [  350.193183] 00000000 0f007806 25000035 04f569d0
> [  350.193187] scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817c6e30078
> [  350.412637] 00000000 00000000 00000000 00000000
> [  350.436431] 00000000 00000000 00000000 00000000
> [  350.461871] 00000000 00000000 00000000 00000000
> [  350.487549] 00000000 0f007806 25000032 000843d0
> 
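As a cross-check on the initiator log above, the Write(10) CDB bytes can be decoded by hand and compared with the sector numbers that blk_update_request reports. In a WRITE(10) command, byte 0 is the opcode (0x2a), bytes 2 to 5 hold the big-endian logical block address, and bytes 7 to 8 hold the big-endian transfer length in blocks. The sketch below is a small userspace decoder; struct write10 and decode_write10() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a 10-byte SCSI WRITE(10) command descriptor block. */
struct write10 {
	uint32_t lba;		/* bytes 2..5, big-endian */
	uint16_t nblocks;	/* bytes 7..8, big-endian */
};

/* Returns 0 on success, -1 if the opcode is not WRITE(10) (0x2a). */
static int decode_write10(const uint8_t cdb[10], struct write10 *out)
{
	if (cdb[0] != 0x2a)
		return -1;
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
		   ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	out->nblocks = (uint16_t)(((unsigned int)cdb[7] << 8) | cdb[8]);
	return 0;
}
```

For the tag#28 command (2a 00 00 00 40 00 00 40 00 00) this gives an LBA of 0x4000 = 16384, which matches the "sector 16384" line that follows it, and a transfer length of 0x4000 blocks. If the devices use 512-byte logical blocks (an assumption; the log does not state the block size), each failed write is 8 MiB.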
> Target Log
> 
> These events happened after the first failures on the initiator:
> 
> [ 1111.029847] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-49.
> [ 1111.078815] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-48.
> [ 1111.127420] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-47.
> [ 1111.175801] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-46.
> [ 1111.223725] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-45.
> [ 1111.271957] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-44.
> [ 1111.319494] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-43.
> [ 1111.365795] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-42.
> 
> 

Max

These are the parameters all my tests run with.
Same as always.

[root@localhost modprobe.d]# cat ib_srp.conf
options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 

I don't set prefer_fr, so it defaults to Y.

[root@localhost parameters]# cat prefer_fr
Y

I have no settings for mlx5_core, all defaults.

Thanks
Laurence

Laurence Oberman April 26, 2017, 3:10 p.m. UTC | #3
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Max Gurtovoy" <maxg@mellanox.com>
> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
> linux-rdma@vger.kernel.org
> Sent: Wednesday, April 26, 2017 9:50:25 AM
> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@redhat.com>
> > To: "Max Gurtovoy" <maxg@mellanox.com>
> > Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche"
> > <bart.vanassche@sandisk.com>, "Doug Ledford"
> > <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel Rukshin"
> > <israelr@mellanox.com>,
> > linux-rdma@vger.kernel.org
> > Sent: Wednesday, April 26, 2017 9:28:37 AM
> > Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > overflows the klms[] array
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Max Gurtovoy" <maxg@mellanox.com>
> > > To: "Laurence Oberman" <loberman@redhat.com>
> > > Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche"
> > > <bart.vanassche@sandisk.com>, "Doug Ledford"
> > > <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel
> > > Rukshin"
> > > <israelr@mellanox.com>,
> > > linux-rdma@vger.kernel.org
> > > Sent: Wednesday, April 26, 2017 8:25:30 AM
> > > Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > > overflows the klms[] array
> > > 
> > > 
> > > 
> > > On 4/26/2017 3:18 PM, Laurence Oberman wrote:
> > > >
> > > >
> > > > ----- Original Message -----
> > > >> From: "Laurence Oberman" <loberman@redhat.com>
> > > >> To: "Max Gurtovoy" <maxg@mellanox.com>
> > > >> Cc: "Leon Romanovsky" <leonro@mellanox.com>, "Bart Van Assche"
> > > >> <bart.vanassche@sandisk.com>, "Doug Ledford"
> > > >> <dledford@redhat.com>, "Sagi Grimberg" <sagi@grimberg.me>, "Israel
> > > >> Rukshin" <israelr@mellanox.com>,
> > > >> linux-rdma@vger.kernel.org
> > > >> Sent: Wednesday, April 26, 2017 7:47:37 AM
> > > >> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > > >> overflows the klms[] array
> > > >>
> > > >>
> > > >>
> > > >> ----- Original Message -----
> > > >>> From: "Max Gurtovoy" <maxg@mellanox.com>
> > > >>> To: "Laurence Oberman" <loberman@redhat.com>, "Leon Romanovsky"
> > > >>> <leonro@mellanox.com>
> > > >>> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, "Doug Ledford"
> > > >>> <dledford@redhat.com>, "Sagi Grimberg"
> > > >>> <sagi@grimberg.me>, "Israel Rukshin" <israelr@mellanox.com>,
> > > >>> linux-rdma@vger.kernel.org
> > > >>> Sent: Wednesday, April 26, 2017 4:31:57 AM
> > > >>> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > > >>> overflows the klms[] array
> > > >>>
> > > >>>
> > > >>>
> > > >>> On 4/25/2017 11:37 PM, Laurence Oberman wrote:
> > > >>>>
> > > >>>>
> > > >>>> ----- Original Message -----
> > > >>>>> From: "Leon Romanovsky" <leonro@mellanox.com>
> > > >>>>> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> > > >>>>> Cc: "Doug Ledford" <dledford@redhat.com>, "Max Gurtovoy"
> > > >>>>> <maxg@mellanox.com>, "Sagi Grimberg" <sagi@grimberg.me>,
> > > >>>>> "Israel Rukshin" <israelr@mellanox.com>, "Laurence Oberman"
> > > >>>>> <loberman@redhat.com>, linux-rdma@vger.kernel.org
> > > >>>>> Sent: Tuesday, April 25, 2017 1:58:49 PM
> > > >>>>> Subject: Re: [PATCH, untested] mlx5: Avoid that
> > > >>>>> mlx5_ib_sg_to_klms()
> > > >>>>> overflows the klms[] array
> > > >>>>>
> > > >>>>> On Mon, Apr 24, 2017 at 03:15:28PM -0700, Bart Van Assche wrote:
> > > >>>>>> ib_map_mr_sg() can pass an SG-list to .map_mr_sg() that is larger
> > > >>>>>> than what fits into a single MR. .map_mr_sg() must not attempt to
> > > >>>>>> map more SG-list elements than what fits into a single MR.
> > > >>>>>> Hence make sure that mlx5_ib_sg_to_klms() does not write outside
> > > >>>>>> the MR klms[] array.
> > > >>>>>>
> > > >>>>>> Fixes: b005d3164713 ("mlx5: Add arbitrary sg list support")
> > > >>>>>> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> > > >>>>>> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
> > > >>>>>> Cc: Sagi Grimberg <sagi@grimberg.me>
> > > >>>>>> Cc: Leon Romanovsky <leonro@mellanox.com>
> > > >>>>>> Cc: Israel Rukshin <israelr@mellanox.com>
> > > >>>>>> Cc: <stable@vger.kernel.org>
> > > >>>>>> ---
> > > >>>>>>  drivers/infiniband/hw/mlx5/mr.c | 2 +-
> > > >>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>>>>>
> > > >>>>>
> > > >>>>> Bart,
> > > >>>>>
> > > >>>>> Thanks a lot, it indeed looks right.
> > > >>>>> Acked-by: Leon Romanovsky <leonro@mellanox.com>
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>> Hello Bart, Leon, Max and Israel.
> > > >>>>
> > > >>>> I cloned off Barts tree.
> > > >>>>
> > > >>>> git clone https://github.com/bvanassche/linux
> > > >>>> cd linux
> > > >>>> git checkout block-scsi-for-next
> > > >>>>
> > > >>>> I checked all patches were in for this test.
> > > >>>>
> > > >>>> a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> > > >>>> dfa5a2b mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[]
> > > >>>> array
> > > >>>> f759c80 mlx5: Fix mlx5_ib_map_mr_sg mr lengt
> > > >>>
> > > >>> Hi,
> > > >>> copying Sagi's request from different thread:
> > > >>>
> > > >>> "
> > > >>> Can you please enable srp_add_one debug:
> > > >>>
> > > >>> echo "func srp_add_one +p" > /sys/kernel/debug/dynamic_debug/control
> > > >>>
> > > >>> In addition apply the following:
> > > >>> --
> > > >>> diff --git a/drivers/infiniband/hw/mlx5/mr.c
> > > >>> b/drivers/infiniband/hw/mlx5/mr.c
> > > >>> index d9c6c0ea750b..040fbc387e4f 100644
> > > >>> --- a/drivers/infiniband/hw/mlx5/mr.c
> > > >>> +++ b/drivers/infiniband/hw/mlx5/mr.c
> > > >>> @@ -1403,6 +1403,8 @@ mlx5_alloc_priv_descs(struct ib_device *device,
> > > >>>          int add_size;
> > > >>>          int ret;
> > > >>>
> > > >>> +       WARN_ON_ONCE(ndescs >
> > > >>> device->attr.max_fast_reg_page_list_len);
> > > >>> +
> > > >>>          add_size = max_t(int, MLX5_UMR_ALIGN -
> > > >>>          ARCH_KMALLOC_MINALIGN,
> > > >>>          0);
> > > >>>
> > > >>>          mr->descs_alloc = kzalloc(size + add_size, GFP_KERNEL);
> > > >>>
> > > >>> "
> > > >>>
> > > >>> Max.
> > > >>>
> > > >>>>
> > > >>>> Built and tested the kernel.
> > > >>>>
> > > >>>> However this issue is not resolved :(
> > > >>>>
> > > >>>> [ 2707.931909] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817edca86b0
> > > >>>> [ 2708.089806] mlx5_0:dump_cqe:262:(pid 20129): dump error cqe
> > > >>>> [ 2708.121342] 00000000 00000000 00000000 00000000
> > > >>>> [ 2708.147104] 00000000 00000000 00000000 00000000
> > > >>>> [ 2708.172633] 00000000 00000000 00000000 00000000
> > > >>>> [ 2708.198702] 00000000 0f007806 2500002a 14a527d0
> > > >>>> [ 2732.434127] scsi host1: ib_srp: reconnect succeeded
> > > >>>> [ 2733.048023] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817ed0a9c30
> > > >>>>
> > > >>>> [root@localhost ~]# [ 2746.413277] mlx5_0:dump_cqe:262:(pid 15877):
> > > >>>> dump
> > > >>>> error cqe
> > > >>>> [ 2746.443240] 00000000 00000000 00000000 00000000
> > > >>>> [ 2746.469323] 00000000 00000000 00000000 00000000
> > > >>>> [ 2746.495310] 00000000 00000000 00000000 00000000
> > > >>>> [ 2746.521407] 00000000 0f007806 25000032 003c7ad0
> > > >>>> [ 2752.445899] scsi host1: ib_srp: reconnect succeeded
> > > >>>> [ 2752.481835] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817ed0a9cf0
> > > >>>> [ 2763.267386] mlx5_0:dump_cqe:262:(pid 15877): dump error cqe
> > > >>>> [ 2763.297826] 00000000 00000000 00000000 00000000
> > > >>>> [ 2763.323352] 00000000 00000000 00000000 00000000
> > > >>>> [ 2763.348722] 00000000 00000000 00000000 00000000
> > > >>>> [ 2763.374681] 00000000 0f007806 2500003a 00084bd0
> > > >>>>
> > > >>>> [root@localhost ~]# [ 2769.385203] fast_io_fail_tmo expired for SRP
> > > >>>> port-1:1 / host1.
> > > >>>> [ 2769.415956] scsi host1: ib_srp: reconnect succeeded
> > > >>>> [ 2769.450258] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817ed0a9cf0
> > > >>>> [ 2780.064627] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
> > > >>>> [ 2780.093520] 00000000 00000000 00000000 00000000
> > > >>>> [ 2780.120067] 00000000 00000000 00000000 00000000
> > > >>>> [ 2780.145575] 00000000 00000000 00000000 00000000
> > > >>>> [ 2780.171153] 00000000 0f007806 25000042 000833d0
> > > >>>> [ 2785.923399] scsi host1: ib_srp: reconnect succeeded
> > > >>>> [ 2785.957504] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817ed0a9cf0
> > > >>>> [ 2796.463426] mlx5_0:dump_cqe:262:(pid 18771): dump error cqe
> > > >>>> [ 2796.495257] 00000000 00000000 00000000 00000000
> > > >>>> [ 2796.521506] 00000000 00000000 00000000 00000000
> > > >>>> [ 2796.547640] 00000000 00000000 00000000 00000000
> > > >>>> [ 2796.573120] 00000000 0f007806 2500004a 00083bd0
> > > >>>> [ 2802.562578] scsi host1: ib_srp: reconnect succeeded
> > > >>>> [ 2802.596880] scsi host1: ib_srp: failed RECV status WR flushed (5)
> > > >>>> for
> > > >>>> CQE ffff8817ed0a9cf0
> > > >>>>
> > > >>>> Regards
> > > >>>> Laurence
> > > >>>>
> > > >>>
> > > >> Doing this now
> > > >> Thanks
> > > >> Laurence
> > > >
> > > > Max
> > > >
> > > > The Patch is not correct.
> > > >
> > > > drivers/infiniband/hw/mlx5/mr.c: In function 'mlx5_alloc_priv_descs':
> > > > drivers/infiniband/hw/mlx5/mr.c:1406:30: error: 'struct ib_device' has
> > > > no
> > > > member named 'attr'
> > > >   WARN_ON_ONCE(ndescs > device->attr.max_fast_reg_page_list_len);
> > > >                               ^
> > > > ./include/asm-generic/bug.h:117:27: note: in definition of macro
> > > > 'WARN_ON_ONCE'
> > > >   int __ret_warn_once = !!(condition);   \
> > > >
> > > > I think you meant to give me
> > > >
> > > > WARN_ON_ONCE(ndescs > ib_device_attr->attr.max_fast_reg_page_list_len);
> > > >
> > > > Can you confirm
> > > 
> > > Hi Laurence,
> > > should be device->attrs.max_fast_reg_page_list_len.
> > > 
> > > please check this one that might solve the issue (on top of everything):
> > > 
> > > 
> > > diff --git a/drivers/infiniband/hw/mlx5/mr.c
> > > b/drivers/infiniband/hw/mlx5/mr.c
> > > index b8f9382..063d116 100644
> > > --- a/drivers/infiniband/hw/mlx5/mr.c
> > > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > > @@ -1559,7 +1559,7 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
> > >                  mr->max_descs = ndescs;
> > >          } else if (mr_type == IB_MR_TYPE_SG_GAPS) {
> > >                  mr->access_mode = MLX5_MKC_ACCESS_MODE_KLMS;
> > > -
> > > +               MLX5_SET(mkc, mkc, translations_octword_size,
> > > ALIGN(max_num_sg + 1, 4));
> > >                  err = mlx5_alloc_priv_descs(pd->device, mr,
> > >                                              ndescs, sizeof(struct
> > > mlx5_klm));
> > >                  if (err)
> > > 
> > > thanks,
> > > Max.
> > > 
> > > >
> > > > Thanks
> > > > Laurence
> > > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > 
> > Hello Max
> > 
> > I have the corrected WARN_ON_ONCE patch and the above patch as well as the
> > rest as it was from Barts tree.
> > 
> > Still fails.
> > 
> > For a baseline I can revert
> > a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> > 
> > Then test again to make sure we are starting from a good place.
> > 
> > Initiator log
> > 
> > [  280.481951] scsi host1: ib_srp: failed FAST REG status memory management
> > operation error (6) for CQE ffff8817d9a881b8
> > [  301.149106] scsi host1: ib_srp: reconnect succeeded
> > [  301.280635] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > CQE
> > ffff8817ed32f2f0
> > [  334.596420] scsi host2: ib_srp: failed RECV status WR flushed (5) for
> > CQE
> > ffff8817c592c970
> > [  334.599689] mlx5_1:dump_cqe:262:(pid 20): dump error cqe
> > [  334.599691] 00000000 00000000 00000000 00000000
> > [  334.599692] 00000000 00000000 00000000 00000000
> > [  334.599692] 00000000 00000000 00000000 00000000
> > [  334.599693] 00000000 0f007806 2500002d 067b48d0
> > [  334.599697] scsi host2: ib_srp: failed FAST REG status memory management
> > operation error (6) for CQE ffff8817c6e30078
> > [  336.117248] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
> > [  336.145840] 00000000 00000000 00000000 00000000
> > [  336.171830] 00000000 00000000 00000000 00000000
> > [  336.197688] 00000000 00000000 00000000 00000000
> > [  336.223720] 00000000 0f007806 25000032 005408d0
> > [  339.712706] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > [  341.453634] scsi host1: ib_srp: reconnect succeeded
> > [  341.481600] mlx5_0:dump_cqe:262:(pid 130): dump error cqe
> > [  341.482145] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > CQE
> > ffff8817ecaf6970
> > [  341.559359] 00000000 00000000 00000000 00000000
> > [  341.585397] 00000000 00000000 00000000 00000000
> > [  341.610948] 00000000 00000000 00000000 00000000
> > [  341.637515] 00000000 0f007806 2500003d 000046d0
> > [  342.297598] sd 1:0:0:9: rejecting I/O to offline device
> > [  342.297936] sd 1:0:0:9: [sdg] tag#28 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.297941] sd 1:0:0:9: [sdg] tag#28 CDB: Write(10) 2a 00 00 00 40 00 00
> > 40 00 00
> > [  342.297943] blk_update_request: recoverable transport error, dev sdg,
> > sector 16384
> > [  342.297951] sd 1:0:0:20: [sdar] tag#5 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.297952] sd 1:0:0:20: [sdar] tag#15 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.297956] sd 1:0:0:20: [sdar] tag#5 CDB: Write(10) 2a 00 00 03 c0 00
> > 00
> > 40 00 00
> > [  342.297956] sd 1:0:0:20: [sdar] tag#15 CDB: Write(10) 2a 00 00 2c c0 00
> > 00
> > 40 00 00
> > [  342.297958] blk_update_request: recoverable transport error, dev sdar,
> > sector 245760
> > [  342.297959] blk_update_request: recoverable transport error, dev sdar,
> > sector 2932736
> > [  342.298119] device-mapper: multipath: Failing path 8:96.
> > [  342.298266] sd 1:0:0:9: [sdg] tag#29 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298268] sd 1:0:0:9: [sdg] tag#29 CDB: Write(10) 2a 00 00 00 c0 00 00
> > 40 00 00
> > [  342.298269] blk_update_request: recoverable transport error, dev sdg,
> > sector 49152
> > [  342.298300] device-mapper: multipath: Failing path 66:176.
> > [  342.298486] sd 1:0:0:20: [sdar] tag#16 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298488] sd 1:0:0:20: [sdar] tag#6 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298489] sd 1:0:0:20: [sdar] tag#16 CDB: Write(10) 2a 00 00 2d 40 00
> > 00
> > 40 00 00
> > [  342.298490] sd 1:0:0:20: [sdar] tag#6 CDB: Write(10) 2a 00 00 04 40 00
> > 00
> > 40 00 00
> > [  342.298491] blk_update_request: recoverable transport error, dev sdar,
> > sector 2965504
> > [  342.298492] blk_update_request: recoverable transport error, dev sdar,
> > sector 278528
> > [  342.298582] sd 1:0:0:9: [sdg] tag#30 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298584] sd 1:0:0:9: [sdg] tag#30 CDB: Write(10) 2a 00 00 01 40 00 00
> > 40 00 00
> > [  342.298585] blk_update_request: recoverable transport error, dev sdg,
> > sector 81920
> > [  342.298889] sd 1:0:0:9: [sdg] tag#31 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298890] sd 1:0:0:9: [sdg] tag#31 CDB: Write(10) 2a 00 00 01 c0 00 00
> > 40 00 00
> > [  342.298891] blk_update_request: recoverable transport error, dev sdg,
> > sector 114688
> > [  342.298981] sd 1:0:0:20: [sdar] tag#7 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.298983] sd 1:0:0:20: [sdar] tag#7 CDB: Write(10) 2a 00 00 04 c0 00
> > 00
> > 40 00 00
> > [  342.298985] blk_update_request: recoverable transport error, dev sdar,
> > sector 311296
> > [  342.299004] sd 1:0:0:20: [sdar] tag#17 FAILED Result:
> > hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
> > [  342.299007] sd 1:0:0:20: [sdar] tag#17 CDB: Write(10) 2a 00 00 34 c0 00
> > 00
> > 40 00 00
> > [  342.299009] blk_update_request: recoverable transport error, dev sdar,
> > sector 3457024
> > [  342.356353] device-mapper: multipath: Failing path 8:64.
> > [  342.356489] device-mapper: multipath: Failing path 8:128.
> > [  342.356628] device-mapper: multipath: Failing path 8:160.
> > [  342.356699] device-mapper: multipath: Failing path 8:176.
> > [  342.356767] device-mapper: multipath: Failing path 8:240.
> > [  342.356834] device-mapper: multipath: Failing path 8:208.
> > [  342.356900] device-mapper: multipath: Failing path 65:16.
> > [  342.356967] device-mapper: multipath: Failing path 65:64.
> > [  342.357035] device-mapper: multipath: Failing path 65:96.
> > [  342.357103] device-mapper: multipath: Failing path 65:128.
> > [  342.357169] device-mapper: multipath: Failing path 65:176.
> > [  342.357237] device-mapper: multipath: Failing path 65:208.
> > [  342.357303] device-mapper: multipath: Failing path 65:224.
> > [  342.357371] device-mapper: multipath: Failing path 66:0.
> > [  342.357454] device-mapper: multipath: Failing path 66:32.
> > [  342.357521] device-mapper: multipath: Failing path 66:48.
> > [  342.357647] device-mapper: multipath: Failing path 66:80.
> > [  342.357714] device-mapper: multipath: Failing path 66:112.
> > [  342.357781] device-mapper: multipath: Failing path 66:144.
> > [  342.357936] device-mapper: multipath: Failing path 66:208.
> > [  342.358019] device-mapper: multipath: Failing path 66:240.
> > [  342.358115] device-mapper: multipath: Failing path 67:16.
> > [  342.358183] device-mapper: multipath: Failing path 67:48.
> > [  342.358264] device-mapper: multipath: Failing path 67:80.
> > [  342.358359] device-mapper: multipath: Failing path 67:128.
> > [  342.358442] device-mapper: multipath: Failing path 67:160.
> > [  342.358594] device-mapper: multipath: Failing path 67:224.
> > [  342.358671] device-mapper: multipath: Failing path 67:208.
> > [  350.157728] scsi host2: ib_srp: reconnect succeeded
> > [  350.189605] mlx5_1:dump_cqe:262:(pid 4756): dump error cqe
> > [  350.193180] mlx5_1:dump_cqe:262:(pid 1275): dump error cqe
> > [  350.193182] 00000000 00000000 00000000 00000000
> > [  350.193182] 00000000 00000000 00000000 00000000
> > [  350.193183] 00000000 00000000 00000000 00000000
> > [  350.193183] 00000000 0f007806 25000035 04f569d0
> > [  350.193187] scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817c6e30078
> > [  350.412637] 00000000 00000000 00000000 00000000
> > [  350.436431] 00000000 00000000 00000000 00000000
> > [  350.461871] 00000000 00000000 00000000 00000000
> > [  350.487549] 00000000 0f007806 25000032 000843d0
> > 
> > Target Log
> > 
> > These events happened after the first failures on the initiator.
> > 
> > [ 1111.029847] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-49.
> > [ 1111.078815] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-48.
> > [ 1111.127420] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-47.
> > [ 1111.175801] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-46.
> > [ 1111.223725] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-45.
> > [ 1111.271957] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-44.
> > [ 1111.319494] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-43.
> > [ 1111.365795] ib_srpt Received CM TimeWait exit for ch 0x4f6e72000390fe7c7cfe900300726ed3-42.
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Max
> 
> These are the parameters all my tests run with.
> Same as always.
> 
> [root@localhost modprobe.d]# cat ib_srp.conf
> options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048
> 
> I don't set prefer_fr, so it defaults to Y.
> 
> [root@localhost parameters]# cat prefer_fr
> Y
> 
> I have no settings for mlx5_core, all defaults.
> 
> Thanks
> Laurence
> 

Max,

After reverting a83e404 ("IB/srp: Reenable IB_MR_TYPE_SG_GAPS") on the same source tree, with everything else applied, I am stable.
So clearly we still have issues with IB_MR_TYPE_SG_GAPS.

Thanks
Laurence

Patch

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index b8f9382..063d116 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1559,7 +1559,7 @@  struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
                 mr->max_descs = ndescs;
         } else if (mr_type == IB_MR_TYPE_SG_GAPS) {
                 mr->access_mode = MLX5_MKC_ACCESS_MODE_KLMS;
-
+               MLX5_SET(mkc, mkc, translations_octword_size, ALIGN(max_num_sg + 1, 4));
                 err = mlx5_alloc_priv_descs(pd->device, mr,
                                             ndescs, sizeof(struct