diff mbox

Unexpected issues with 2 NVME initiators using the same target

Message ID 3eb5814f-14cb-2b94-adf8-335d4b2eb7e9@grimberg.me (mailing list archive)
State Not Applicable
Headers show

Commit Message

Sagi Grimberg Feb. 27, 2017, 8:33 p.m. UTC
Hey Joseph,

> In our lab we are dealing with an issue which has some of the same symptoms.  Wanted to add to the thread in case it is useful here.  We have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly connected (no switch) to a single initiator system with a matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO workload, and dmesg from both systems.
>
> We are able to provoke this problem with a variety of workloads but a high bandwidth read operation seems to cause it the most reliably, harder to produce with smaller block sizes.  For some reason the problem seems produced when we stop and restart IO - I can run the FIO workload on the initiator system for 1-2 hours without any new events in dmesg, pushing about 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and the errors and reconnect events happen reliably at that point.  Working to characterize further this week and also to see if we can produce on a smaller configuration.  Happy to provide any additional details that would be useful or try any fixes!
>
> On the initiator we see events like this:
>
> [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> [51390.065644] 00000000 00000000 00000000 00000000
> [51390.065645] 00000000 00000000 00000000 00000000
> [51390.065646] 00000000 00000000 00000000 00000000
> [51390.065648] 00000000 08007806 250003ab 02b9dcd2
> [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed with status memory management operation error (6)
> [51390.079156] nvme nvme3: reconnecting in 10 seconds
> [51400.432782] nvme nvme3: Successfully reconnected

Seems to me this is a CX4 FW issue. Mellanox can elaborate on the
vendor-specific syndromes in this output.

> On the target we see events like this:
>
> [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
> [51370.394696] 00000000 00000000 00000000 00000000
> [51370.394697] 00000000 00000000 00000000 00000000
> [51370.394699] 00000000 00000000 00000000 00000000
> [51370.394701] 00000000 00008813 080003ea 00c3b1d2

If the host is failing memory registration while the target is initiating
RDMA access, it makes sense that the target will see errors.

>
> Sometimes, but less frequently, we also will see events on the target like this as part of the problem:
>
> [21322.678571] nvmet: ctrl 1 fatal error occurred!

Again, this also makes sense, because for nvmet this is a fatal error and
we need to tear down the controller.

You can try out this patch to see if it makes the memreg issues go
away:
--
                                 qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
                                 ctrl->imm = cpu_to_be32(wr->ex.invalidate_rkey);
                                 set_linv_wr(qp, &seg, &size);
@@ -3901,7 +3901,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
                                 break;

                         case IB_WR_REG_MR:
-                               next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = MLX5_FENCE_MODE_STRONG_ORDERING;
                                 qp->sq.wr_data[idx] = IB_WR_REG_MR;
                                 ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
                                 err = set_reg_wr(qp, reg_wr(wr), &seg, &size);
--

Note that this will have a big negative performance impact on small
read workloads.

Comments

Sagi Grimberg March 6, 2017, 11:28 a.m. UTC | #1
> Hi Sagi,
>
> I think we need to add fence to the UMR wqe.
>
> so lets try this one:
>
> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a263..c38c4fa 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int
> idx, int size_16)
>
>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>  {
> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> -                    wr->send_flags & IB_SEND_FENCE))
> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>
>         if (unlikely(fence)) {

This will kill performance; isn't there another fix that can
be applied just to the retransmission flow?

> Couldn't repro that case, but I ran some initial tests in my lab (with my
> patch above) - not performance servers:
>
> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
> Connect-IB (same driver, mlx5_ib), kernel 4.10.0, fio test with 24 jobs
> and 128 iodepth.
> register_always=N
>
> Target - 1 subsystem with 1 ns (null_blk)
>
> bs   read (without/with patch)   write (without/with patch)
> --- --------------------------  ---------------------------
> 512     1019k / 1008k                 1004k / 992k
> 1k      1021k / 1013k                 1002k / 991k
> 4k      1030k / 1022k                 978k  / 969k
>
> CPU usage is 100% in both cases on the initiator side.
> Haven't seen a difference with bs = 16k.
> Not as big a drop as we would expect,

Obviously you won't see a drop without registering memory
for small I/O (register_always=N); that bypasses registration
altogether... Please retest with register_always=Y.
Sagi Grimberg March 7, 2017, 1:41 p.m. UTC | #2
> Hi,
>
> Shahar/Joseph, what is your link layer config (IB/Eth)?
> In the Eth case, have you configured PFC? If not, can you try it?
> I suspect that this is the root cause

The root cause is that the device fails FRWR in the retransmission
flow. If PFC is not on it will happen almost immediately; if it is,
it will happen at some point...

> and it might help you avoid
> this case; meanwhile we're looking for the best solution.
>
> Adding Vladimir, who will run iSER on his performance setup with the new
> fencing patch (it is not an NVMe-oF-specific issue).
> We can also run NVMe-oF later on if needed.

Thanks.
shahar.salzman March 9, 2017, 12:18 p.m. UTC | #3
Hi,

Sorry for the delay, I have been OOO for the past few days.

Indeed the underlying transport is Ethernet, and we have found that flow 
control is disabled on the switch side.
I still do not have access to the system to re-test with flow control
and 4.10, but I am working on assembling another system.

A little off-topic, but something that may help me get the other system
up faster: can I use a dual-port CX4 card as both initiator and target
(i.e. one port going up to the switch, and the other coming back into
the system) without the kernel looping back? I have a spare CX4, so if
possible I will use it to build a mini system for recreates of this sort.

Thanks,
Shahar


On 03/07/2017 03:41 PM, Sagi Grimberg wrote:
>
>> Hi,
>>
>> Shahar/Joseph, what is your link layer conf (IB/Eth) ?
>> In eth case, have you configured some PFC ? if not, can you try it ?
>> I suspect that this is the root cause
>
> The root cause is that the device fails frwr in retransmission
> flow, if PFC is not on, it will happen almost immediately, if not
> it will happen at some point...
>
>> and it might help you avoiding
>> this case, meanwhile we're looking for for the best solution.
>>
>> Adding Vladimir that will run iSER on his performance setup with the new
>> fencing patch (not an NVMEoF related issue).
>> We can run also NVMEoF later on if needed.
>
> Thanks.

Sagi Grimberg March 13, 2017, 9:43 a.m. UTC | #4
> Patched with Max's patch (block 4K):
>
> always reg           Y                N
> write              1902K            1923.3K
> read               1315K            2009K
>
> Original OFED code (block 4K)
>
> always reg           Y                N
> write              1947K           1982K
> read               1273K           1978K

First, the write comparison is redundant because
we send immediate data without memory registration.

And, I'd compare against upstream code and not OFED.

So it seems that strong fencing does not affect performance
from the ULP point of view, which is surprising...
I'd suggest comparing on nvmf and srp as well.

If this is the case, and it indeed resolves the issue, we
should move forward with it as is.
Sagi Grimberg March 28, 2017, 11:34 a.m. UTC | #5
Hey Joseph,

> Hey folks.  Apologies if this message comes through twice, but when I originally sent it the list flagged it as too large due to the dmesg log attachments, and then a coworker just told me they never saw it, so I don't think it made it through on the first attempt.
>
> Please see last note above and dmesg example attached - after more extensive testing with Max's patch we are still able to produce cqe dump errors (at a much lower frequency) as well as a new failure mode involving a crash dump.
>

This is a different issue AFAICT,

Looks like nvmet_sq_destroy() is stuck waiting for
the final reference to drop (which seems to never happen).

I'm trying to look for a code path where this may happen.
Can you tell if the backend block device completed all of
its I/O when this happens (you can check for active tags in debugfs)?
Marta Rybczynska April 10, 2017, 11:40 a.m. UTC | #6
> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>
>>
>>> -----Original Message-----
>>> From: Max Gurtovoy [mailto:maxg@mellanox.com]
>>>
>>> I think we need to add fence to the UMR wqe.
>>>
>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>> index ad8a263..c38c4fa 100644
>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>> int size_16)
>>>
>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>   {
>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>> -                    wr->send_flags & IB_SEND_FENCE))
>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>
>>>          if (unlikely(fence)) {
>>>
>>> 
>>
>> You mention these patches are only for testing.  How do we get to something
>> which can be submitted to upstream?
> 
> Yes, we need to be careful and not put the strong_fence if it's not a must.
> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
> prepare a suitable patch and check some other applications performance
> numbers.
> Thanks for the testing, you can use this patch meanwhile till we push
> the formal solution.
> 

Hello Max,
We're seeing the same issue in our setup and have been running this patch
on our system for some time already. It seems to have fixed the issue.
When a final patch is available, we can test it too.

Thanks,
Marta
Marta Rybczynska April 11, 2017, 12:47 p.m. UTC | #7
----- Mail original -----
> On 4/10/2017 2:40 PM, Marta Rybczynska wrote:
>>> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Max Gurtovoy [mailto:maxg@mellanox.com]
>>>>>
>>>>> I think we need to add fence to the UMR wqe.
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>>>> index ad8a263..c38c4fa 100644
>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>>>> int size_16)
>>>>>
>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>   {
>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>
>>>>>          if (unlikely(fence)) {
>>>>>
>>>>>
>>>>
>>>> You mention these patches are only for testing.  How do we get to something
>>>> which can be submitted to upstream?
>>>
>>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
>>> prepare a suitable patch and check some other applications performance
>>> numbers.
>>> Thanks for the testing, you can use this patch meanwhile till we push
>>> the formal solution.
>>>
>>
>> Hello Max,
>> We're seeing the same issue in our setup and we're running this patch
>> on our system for some time already. It seems to have fixed the issue.
>> When there is a final patch available, we can test it too.
>>
>> Thanks,
>> Marta
>>
> 
> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holidays in
> our country) so it will hopefully be available in the 4.11 kernel.
> If you can share which NICs you tested it on and the perf numbers you
> get (with and without the patch), that would be great.
> 

Hello Max,
That's great news. We're ready to start testing as soon as the final
version of the patch is out. We're using ConnectX-4. Unfortunately, in
our use case the patch is not a question of performance: the workload
doesn't work without it. We may think about running other workloads for
the tests.

Marta
Sagi Grimberg April 20, 2017, 10:18 a.m. UTC | #8
Max,

> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holidays in
> our country) so it will hopefully be available in the 4.11 kernel.
> If you can share which NICs you tested it on and the perf numbers you
> get (with and without the patch), that would be great.

Can we get it into 4.12 please? (We need you to send it out today.)
Sagi Grimberg April 26, 2017, 2:45 p.m. UTC | #9
>> Can we get it to 4.12 please (we need you to send it out today).
>
> Sagi,
> we are preparing a cap bit for it.
> I'll send it as soon as it is ready (otherwise we'd be adding the fence
> for devices that can manage without it - such as ConnectX-5).

Why? Can't you just get what you have in as a fix and incrementally add the
optimization for CX5?
Sagi Grimberg May 15, 2017, noon UTC | #10
> Any update here?  Would love to be able to load up the new kernels without patching them every time. :)

I would like that too,

Max, can you send a patch? or should I?
Leon Romanovsky May 15, 2017, 1:31 p.m. UTC | #11
On Mon, May 15, 2017 at 03:00:07PM +0300, Sagi Grimberg wrote:
> > Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
>
> I would like that too,
>
> Max, can you send a patch? or should I?

Sagi,

Max is doing his best to provide a patch; unfortunately, he is limited
by various architecture implications which he needs to resolve before
sending it.

Thanks

Sagi Grimberg May 15, 2017, 1:43 p.m. UTC | #12
>>> Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
>>
>> I would like that too,
>>
>> Max, can you send a patch? or should I?
>
> Sagi,
>
> Max is doing his best to provide a patch, unfortunately he is limited
> by various architecture implications which he needs to resolve before
> sending it.

Well, he already sent a patch that fixes the issue.
He said that he needs additional optimization for CX5 (which I assume
doesn't need the strong fence), but that still does not change the
fact that CX4 is broken. I asked to include his patch, fix the
existing bug, and incrementally optimize CX5.

What do you mean by architecture implications? It's broken and there is
a request from the community to fix it. Are you suggesting that it
doesn't solve the issue?

Leon Romanovsky May 15, 2017, 2:36 p.m. UTC | #13
On Mon, May 15, 2017 at 04:43:18PM +0300, Sagi Grimberg wrote:
> > > > Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
> > >
> > > I would like that too,
> > >
> > > Max, can you send a patch? or should I?
> >
> > Sagi,
> >
> > Max is doing his best to provide a patch, unfortunately he is limited
> > by various architecture implications which he needs to resolve before
> > sending it.
>
> Well, he already sent a patch that fixes the issue.
> He said that he needs additional optimization for CX5 (which I assume
> doesn't need the strong fence), but that still does not change the
> fact that CX4 is broken. I asked to include his patch, fix the
> existing bug and incrementally optimize CX5.
>
> What do you mean by architecture implications? it's broken and there is
> a request from the community to fix it. Are you suggesting that it
> doesn't solve the issue?

I understand you, and both Max and I feel the same as you. For more
than 2 months we have constantly (almost on a daily basis) asked for a
solution from the architecture group, but received different answers.
The proposals were extremely broad, ranging from needing the strong fence
for all cards to needing no strong fence at all.

Thanks

>
Christoph Hellwig May 15, 2017, 2:59 p.m. UTC | #14
On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> I understand you and both Max and me are feeling the same as you. For more
> than 2 months, we constantly (almost on daily basis) asked for a solution from
> architecture group, but received different answers. The proposals were
> extremely broad from need for strong fence for all cards to no need for
> strong fence at all.

So let's get the patch to do a strong fence everywhere now, and relax
it later where possible.

Correctness before speed.
Leon Romanovsky May 15, 2017, 5:05 p.m. UTC | #15
On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> > I understand you and both Max and me are feeling the same as you. For more
> > than 2 months, we constantly (almost on daily basis) asked for a solution from
> > architecture group, but received different answers. The proposals were
> > extremely broad from need for strong fence for all cards to no need for
> > strong fence at all.
>
> So let's get the patch to do a strong fence everywhere now, and relax
> it later where possible.
>
> Correntness before speed..

OK, please give me and Max until EOW to end this saga. One of two
options will happen: either Max will resend the original patch, or Max
will send a patch blessed by the architecture group.

Thanks
Marta Rybczynska May 17, 2017, 12:56 p.m. UTC | #16
> On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> > I understand you and both Max and me are feeling the same as you. For more
>> > than 2 months, we constantly (almost on daily basis) asked for a solution from
>> > architecture group, but received different answers. The proposals were
>> > extremely broad from need for strong fence for all cards to no need for
>> > strong fence at all.
>>
>> So let's get the patch to do a strong fence everywhere now, and relax
>> it later where possible.
>>
>> Correntness before speed..
> 
> OK, please give me and Max till EOW to stop this saga. One of the two
> options will be: Max will resend original patch, or Max will send patch
> blessed by architecture group.
> 

Good luck with this Max & Leon! It seems to be a complicated problem.
Just an idea: in our case it *seems* that the problem started appearing
after a firmware upgrade; older firmware versions do not seem to have
the same behaviour. Maybe it's a hint for you.

Thanks!
Marta
Leon Romanovsky May 18, 2017, 1:34 p.m. UTC | #17
On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> >> > I understand you and both Max and me are feeling the same as you. For more
> >> > than 2 months, we constantly (almost on daily basis) asked for a solution from
> >> > architecture group, but received different answers. The proposals were
> >> > extremely broad from need for strong fence for all cards to no need for
> >> > strong fence at all.
> >>
> >> So let's get the patch to do a strong fence everywhere now, and relax
> >> it later where possible.
> >>
> >> Correntness before speed..
> >
> > OK, please give me and Max till EOW to stop this saga. One of the two
> > options will be: Max will resend original patch, or Max will send patch
> > blessed by architecture group.
> >
>
> Good luck with this Max & Leon! It seems to be a complicated problem.
> Just an idea: in our case it *seems* that the problem started appearing
> after a firmware upgrade, older ones do not seem to have the same
> behaviour. Maybe it's a hint for you.

OK, we came to an agreement on which capability bits we should add. Max
will return to the office in the middle of next week, and we will
proceed with the submission of a proper patch once our shared code is
accepted.

In the meantime, I have put the original patch into our regression testing:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c

Thank you for your patience.

>
> Thanks!
> Marta
Robert LeBlanc June 19, 2017, 5:21 p.m. UTC | #18
I ran into this with 4.9.32 when I rebooted the target. I tested
4.12-rc6 and this particular error seems to have been resolved, but I
now get a new one on the initiator. This one doesn't seem as
impactful.

[Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
[Mon Jun 19 11:17:20 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:20 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001e7 45dd82d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection4:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001f4 004915d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:44 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 93005204 0a0001f6 004519d2
[Mon Jun 19 11:17:44 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:44 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:18:55 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 93005204 0a0001f7 01934fd2
[Mon Jun 19 11:18:55 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:18:55 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:20:25 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 93005204 0a0001f8 0274edd2
[Mon Jun 19 11:20:25 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:20:25 2017]  connection3:0: detected conn error (1011)

I'm going to try to cherry-pick the fix to 4.9.x and do some testing there.

Thanks,

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, May 18, 2017 at 7:34 AM, Leon Romanovsky <leon@kernel.org> wrote:
> On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
>> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> >> > I understand you and both Max and me are feeling the same as you. For more
>> >> > than 2 months, we constantly (almost on daily basis) asked for a solution from
>> >> > architecture group, but received different answers. The proposals were
>> >> > extremely broad from need for strong fence for all cards to no need for
>> >> > strong fence at all.
>> >>
>> >> So let's get the patch to do a strong fence everywhere now, and relax
>> >> it later where possible.
>> >>
>> >> Correntness before speed..
>> >
>> > OK, please give me and Max till EOW to stop this saga. One of the two
>> > options will be: Max will resend original patch, or Max will send patch
>> > blessed by architecture group.
>> >
>>
>> Good luck with this Max & Leon! It seems to be a complicated problem.
>> Just an idea: in our case it *seems* that the problem started appearing
>> after a firmware upgrade, older ones do not seem to have the same
>> behaviour. Maybe it's a hint for you.
>
> OK, we came to the agreement which capability bits we should add. Max
> will return to the office at the middle of the next week and we will
> proceed with the submission of proper patch once our shared code will
> be accepted.
>
> In the meantime, i put the original patch to be part of our regression.
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c
>
> Thank you for your patience.
>
>>
>> Thanks!
>> Marta
Sagi Grimberg June 20, 2017, 6:39 a.m. UTC | #19
Hi Robert,

> I ran into this with 4.9.32 when I rebooted the target. I tested
> 4.12-rc6 and this particular error seems to have been resolved, but I
> now get a new one on the initiator. This one doesn't seem as
> impactful.
> 
> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2

Max, Leon,

Care to parse this syndrome for us? ;)
Leon Romanovsky June 20, 2017, 7:46 a.m. UTC | #20
On Tue, Jun 20, 2017 at 09:39:36AM +0300, Sagi Grimberg wrote:
> Hi Robert,
>
> > I ran into this with 4.9.32 when I rebooted the target. I tested
> > 4.12-rc6 and this particular error seems to have been resolved, but I
> > now get a new one on the initiator. This one doesn't seem as
> > impactful.
> >
> > [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>
> Max, Leon,
>
> Care to parse this syndrome for us? ;)

Here is the parsed output; it says that there was an access to an mkey
which is free.

======== cqe_with_error ========
wqe_id                           : 0x0
srqn_usr_index                   : 0x0
byte_cnt                         : 0x0
hw_error_syndrome                : 0x93
hw_syndrome_type                 : 0x0
vendor_error_syndrome            : 0x52
syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
s_wqe_opcode                     : SEND (0xa)
qpn_dctn_flow_tag                : 0x1bd
wqe_counter                      : 0x45c8
signature                        : 0xe0
opcode                           : REQUESTOR_ERROR (0xd)
cqe_format                       : NO_INLINE_DATA (0x0)
owner                            : 0x0

Thanks
Sagi Grimberg June 20, 2017, 7:58 a.m. UTC | #21
>> Hi Robert,
>>
>>> I ran into this with 4.9.32 when I rebooted the target. I tested
>>> 4.12-rc6 and this particular error seems to have been resolved, but I
>>> now get a new one on the initiator. This one doesn't seem as
>>> impactful.
>>>
>>> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>>
>> Max, Leon,
>>
>> Care to parse this syndrome for us? ;)
> 
> Here the parsed output, it says that it was access to mkey which is
> free.
> 
> ======== cqe_with_error ========
> wqe_id                           : 0x0
> srqn_usr_index                   : 0x0
> byte_cnt                         : 0x0
> hw_error_syndrome                : 0x93
> hw_syndrome_type                 : 0x0
> vendor_error_syndrome            : 0x52

Can you share the check that correlates to the vendor+hw syndrome?

> syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
> s_wqe_opcode                     : SEND (0xa)

That's interesting, the opcode is a send operation. I'm assuming
that this is immediate-data write? Robert, did this happen when
you issued >4k writes to the target?
Leon Romanovsky June 20, 2017, 8:33 a.m. UTC | #22
On Tue, Jun 20, 2017 at 10:58:47AM +0300, Sagi Grimberg wrote:
>
> > > Hi Robert,
> > >
> > > > I ran into this with 4.9.32 when I rebooted the target. I tested
> > > > 4.12-rc6 and this particular error seems to have been resolved, but I
> > > > now get a new one on the initiator. This one doesn't seem as
> > > > impactful.
> > > >
> > > > [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
> > >
> > > Max, Leon,
> > >
> > > Care to parse this syndrome for us? ;)
> >
> > Here the parsed output, it says that it was access to mkey which is
> > free.
> >
> > ======== cqe_with_error ========
> > wqe_id                           : 0x0
> > srqn_usr_index                   : 0x0
> > byte_cnt                         : 0x0
> > hw_error_syndrome                : 0x93
> > hw_syndrome_type                 : 0x0
> > vendor_error_syndrome            : 0x52
>
> Can you share the check that correlates to the vendor+hw syndrome?

mkey.free == 1

>
> > syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
> > s_wqe_opcode                     : SEND (0xa)
>
> That's interesting, the opcode is a send operation. I'm assuming
> that this is immediate-data write? Robert, did this happen when
> you issued >4k writes to the target?
Sagi Grimberg June 20, 2017, 9:33 a.m. UTC | #23
>>> Here the parsed output, it says that it was access to mkey which is
>>> free.

Missed that :)

>>> ======== cqe_with_error ========
>>> wqe_id                           : 0x0
>>> srqn_usr_index                   : 0x0
>>> byte_cnt                         : 0x0
>>> hw_error_syndrome                : 0x93
>>> hw_syndrome_type                 : 0x0
>>> vendor_error_syndrome            : 0x52
>>
>> Can you share the check that correlates to the vendor+hw syndrome?
> 
> mkey.free == 1

Hmm, the way I understand it is that the HW is trying to access
(locally via send) an MR which was already invalidated.

Thinking of this further, this can happen in a case where the target
already completed the transaction, sent SEND_WITH_INVALIDATE but the
original send ack was lost somewhere causing the device to retransmit
from the MR (which was already invalidated). This is highly unlikely
though.

Shouldn't this be protected somehow by the device?
Can someone explain why the above cannot happen? Jason? Liran? Anyone?

Say the host registers MR (a) and sends (1) from that MR to a target,
the ack for send (1) gets lost, and the target issues SEND_WITH_INVALIDATE
on MR (a), which the host HCA processes; then the host HCA times out on
send (1) so it retries, but ehh, it's already invalidated.

Or, we can also have a race where we destroy all our MRs when I/O
is still running (but from the code we should be safe here).

Robert, when you rebooted the target, I assume the iSCSI ping
timeout expired and the connection teardown started, correct?
Robert LeBlanc June 20, 2017, 2:41 p.m. UTC | #24
On Tue, Jun 20, 2017 at 1:58 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>> Hi Robert,
>>>
>>>> I ran into this with 4.9.32 when I rebooted the target. I tested
>>>> 4.12-rc6 and this particular error seems to have been resolved, but I
>>>> now get a new one on the initiator. This one doesn't seem as
>>>> impactful.
>>>>
>>>> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>>>
>>>
>>> Max, Leon,
>>>
>>> Care to parse this syndrome for us? ;)
>>
>>
>> Here the parsed output, it says that it was access to mkey which is
>> free.
>>
>> ======== cqe_with_error ========
>> wqe_id                           : 0x0
>> srqn_usr_index                   : 0x0
>> byte_cnt                         : 0x0
>> hw_error_syndrome                : 0x93
>> hw_syndrome_type                 : 0x0
>> vendor_error_syndrome            : 0x52
>
>
> Can you share the check that correlates to the vendor+hw syndrome?
>
>> syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
>> s_wqe_opcode                     : SEND (0xa)
>
>
> That's interesting, the opcode is a send operation. I'm assuming
> that this is immediate-data write? Robert, did this happen when
> you issued >4k writes to the target?

I was running dd with oflag=direct, so yes.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
Robert LeBlanc June 20, 2017, 2:43 p.m. UTC | #25
On Tue, Jun 20, 2017 at 3:33 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>>> Here the parsed output, it says that it was access to mkey which is
>>>> free.
>
>
> Missed that :)
>
>>>> ======== cqe_with_error ========
>>>> wqe_id                           : 0x0
>>>> srqn_usr_index                   : 0x0
>>>> byte_cnt                         : 0x0
>>>> hw_error_syndrome                : 0x93
>>>> hw_syndrome_type                 : 0x0
>>>> vendor_error_syndrome            : 0x52
>>>
>>>
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>>
>> mkey.free == 1
>
>
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
>
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
>
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>
> Say host register MR (a) and send (1) from that MR to a target,
> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> so it retries, but ehh, its already invalidated.
>
> Or, we can also have a race where we destroy all our MRs when I/O
> is still running (but from the code we should be safe here).
>
> Robert, when you rebooted the target, I assume iscsi ping
> timeout expired and the connection teardown started correct?

I do remember that the ping timed out and the connection was torn down
according to the messages.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
Robert LeBlanc June 20, 2017, 10:58 p.m. UTC | #26
On Tue, Jun 20, 2017 at 4:31 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
>
>
> On 6/20/2017 12:33 PM, Sagi Grimberg wrote:
>>
>>
>>>>> Here the parsed output, it says that it was access to mkey which is
>>>>> free.
>>
>>
>> Missed that :)
>>
>>>>> ======== cqe_with_error ========
>>>>> wqe_id                           : 0x0
>>>>> srqn_usr_index                   : 0x0
>>>>> byte_cnt                         : 0x0
>>>>> hw_error_syndrome                : 0x93
>>>>> hw_syndrome_type                 : 0x0
>>>>> vendor_error_syndrome            : 0x52
>>>>
>>>>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>>
>>> mkey.free == 1
>>
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
>
> This might happen IMO.
> Robert, can you test this untested patch (this is not the full solution,
> just something to think about):
>
> diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
> index c538a38..e93bd40 100644
> --- a/drivers/infiniband/ulp/iser/iser_verbs.c
> +++ b/drivers/infiniband/ulp/iser/iser_verbs.c
> @@ -1079,7 +1079,7 @@ int iser_post_send(struct ib_conn *ib_conn, struct iser_tx_desc *tx_desc,
>         wr->sg_list = tx_desc->tx_sg;
>         wr->num_sge = tx_desc->num_sge;
>         wr->opcode = IB_WR_SEND;
> -       wr->send_flags = signal ? IB_SEND_SIGNALED : 0;
> +       wr->send_flags = IB_SEND_SIGNALED;
>
>         ib_ret = ib_post_send(ib_conn->qp, &tx_desc->wrs[0].send, &bad_wr);
>         if (ib_ret)
>
>
>>
>> Or, we can also have a race where we destroy all our MRs when I/O
>> is still running (but from the code we should be safe here).
>>
>> Robert, when you rebooted the target, I assume iscsi ping
>> timeout expired and the connection teardown started correct?

I still get the local protection errors with this patch. I am seeing a
ping timeout on the initiator when I reboot the target.
[Tue Jun 20 16:41:21 2017]  connection7:0: detected conn error (1011)
[Tue Jun 20 16:41:26 2017]  session7: session recovery timed out after 5 secs

Since I'm gracefully shutting down the targets in this case, shouldn't
the connection be closed gracefully by the target instead of the
initiator having to wait for ping to fail?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
Sagi Grimberg June 27, 2017, 7:16 a.m. UTC | #27
> I still get the local protection errors with this patch. I am seeing a
> ping timeout on the initiator when I reboot the target.
> [Tue Jun 20 16:41:21 2017]  connection7:0: detected conn error (1011)
> [Tue Jun 20 16:41:26 2017]  session7: session recovery timed out after 5 secs
> 

Not a big surprise, as it's not really addressing the issue...

> Since I'm gracefully shutting down the targets in this case, shouldn't
> the connection be closed gracefully by the target instead of the
> initiator having to wait for ping to fail?

Not really. Even in an orderly shutdown, the device driver's (mlx5 in
this case) shutdown sequence is triggered before ib_isert's and
fires DEVICE_REMOVAL events to all its upper-layer users (ib_isert
being one of them), which forces resource teardown (no disconnect).

We could register a shutdown handler in ib_isert, but it's not really
its responsibility as a transport driver...

It would be nice if we had targetcli daemonised as a service that
registers a shutdown notification, does an orderly removal, and saves
the existing configuration before the kernel even sees it. But that's
a different scope really...

CC'ing target-devel
diff mbox

Patch

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index ad8a2638e339..0f9a12570262 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
                                 goto out;

                         case IB_WR_LOCAL_INV:
-                               next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = MLX5_FENCE_MODE_STRONG_ORDERING;