
[v3,for-next,0/2] Fix crash due to sleepy mutex while holding lock in post_{send|recv|poll}

Message ID 1574335200-34923-1-git-send-email-liuyixian@huawei.com (mailing list archive)

Message

Yixian Liu Nov. 21, 2019, 11:19 a.m. UTC
Background:
The HiP08 RoCE hardware lacks the ability (a known hardware limitation) to
flush outstanding WQEs when a QP falls into the error state for some reason.
To work around this hardware problem, when the QP is detected to be in the
error state in the various data-path legs such as post send and post
receive [1], the flush needs to be performed by the driver.

These data-path legs may be called concurrently from various contexts,
including thread and interrupt context (e.g. by the NVMe driver). Hence,
they are protected with spin-locks; this code already exists within the
driver.

Problem:
The earlier patch [1], sent to work around the hardware limitation explained
in the background section, had a bug in the software flushing leg: it
acquired a mutex while modifying the QP state to error and while conveying
that state to the hardware through the mailbox. This made the leg sleep
while holding a spin-lock and caused a crash.

Suggested Solution:
In this patch set, we propose to defer the flushing of a QP in the error
state to a workqueue.

We do understand that this might impact recovery times, since scheduling of
the workqueue handler depends on the load of the system. To roughly mitigate
this effect, we use a Concurrency Managed Workqueue to give the worker
thread (and hence the handler) a chance to run on more than one core.
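
To make the proposal concrete, here is a minimal sketch of the deferred-flush
idea. The structure, field and function names below are illustrative
assumptions and need not match the symbols the patches actually add; the
point is only that the data-path leg, which holds a spin-lock, does nothing
more than queue a small work item, while the sleeping modify-QP/mailbox step
runs later from process context:

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <rdma/ib_verbs.h>

#include "hns_roce_device.h"

/* Hypothetical work item carrying the QP to be flushed. */
struct flush_work {
	struct hns_roce_dev *hr_dev;
	struct hns_roce_qp *hr_qp;
	struct work_struct work;
};

static void flush_work_handler(struct work_struct *work)
{
	struct flush_work *fw = container_of(work, struct flush_work, work);
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };

	/*
	 * Process context: taking the QP mutex and issuing the mailbox
	 * command are allowed to sleep here.
	 */
	hns_roce_modify_qp(&fw->hr_qp->ibqp, &attr, IB_QP_STATE, NULL);
	kfree(fw);
}

/* Called from post_send/post_recv/poll_cq with a spin-lock held: no sleeping. */
static void init_flush_work(struct hns_roce_dev *hr_dev,
			    struct hns_roce_qp *hr_qp)
{
	struct flush_work *fw = kzalloc(sizeof(*fw), GFP_ATOMIC);

	if (!fw)
		return;

	fw->hr_dev = hr_dev;
	fw->hr_qp = hr_qp;
	INIT_WORK(&fw->work, flush_work_handler);
	/* Reuse the driver's existing workqueue instead of creating a new one. */
	queue_work(hr_dev->irq_workq, &fw->work);
}

The GFP_ATOMIC allocation reflects the v3 decision below to fall back to a
dynamically allocated flush_work, since the callers hold spin-locks.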


[1] https://patchwork.kernel.org/patch/10534271/


This patch set consists of:
[Patch 001] Introduce a workqueue-based WQE flush handler
[Patch 002] Call the WQE flush handler in post {send|receive|poll}

v3 changes:
1. Fall back to dynamically allocating flush_work.

v2 changes:
1. Remove the newly created workqueue, per Jason's comment.
2. Remove dynamic allocation of flush_work, per Jason's comment.
3. Change the current single-threaded irq workqueue to a concurrency-managed
   workqueue to ensure the work is not blocked.

Yixian Liu (2):
  RDMA/hns: Add the workqueue framework for flush cqe handler
  RDMA/hns: Delayed flush cqe process with workqueue

 drivers/infiniband/hw/hns/hns_roce_device.h |  2 +
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c  | 88 +++++++++++++----------------
 drivers/infiniband/hw/hns/hns_roce_qp.c     | 43 ++++++++++++++
 3 files changed, 85 insertions(+), 48 deletions(-)

Comments

Yixian Liu Dec. 16, 2019, 12:51 p.m. UTC | #1
Hi Jason,

I want to check whether there are any further comments on this patch set.

Thanks!
Eason

Jason Gunthorpe Dec. 18, 2019, 2 p.m. UTC | #2
On Mon, Dec 16, 2019 at 08:51:03PM +0800, Liuyixian (Eason) wrote:
> Hi Jason,
> 
> I want to check whether there are any further comments on this patch set.

I still dislike it a lot.

Jason
Yixian Liu Dec. 20, 2019, 10:12 a.m. UTC | #3
On 2019/12/18 22:00, Jason Gunthorpe wrote:
> On Mon, Dec 16, 2019 at 08:51:03PM +0800, Liuyixian (Eason) wrote:
>> Hi Jason,
>>
>> I want to check whether there are any further comments on this patch set.
> 
> I still dislike it a lot.
> 

Hi Jason,

Thanks for the reply :)
Let us recall the previous discussions and check whether anything is still pending.

Comments on patch set v1:
1. Why do you need a dedicated HIGHPRI work queue?
   -- Accepted. The HIGHPRI flag was removed in v2.
2. As far as I could tell, the only thing that triggered the work to run was
   some variable which was only set in another work queue, 'hns_roce_irq_work_handle()'.
   -- Accepted. Since v2 no new workqueue is created; the existing irq
      workqueue is reused.
3. Don't do allocations at all if you can't allow them to fail.
   -- Accepted. In v2 the flush_work structure was initialized at compile time.

Comments on patch set v2:
1. It kind of looks like this can be called multiple times? It won't work
   right unless it is called exactly once.
   -- Agreed. So in v3 I proposed to fall back to the v1 approach again.
2. Why do you need more than one work in parallel for this? Once you
   start to move the HW to error that only has to happen once, surely?
   -- Explained. This question raised the key point of our solution: the flush
      operation has to be performed whenever the QP state moves to error, or
      whenever the producer index is updated for a QP already in the error state.
3. The work function does something that looks like it only has to happen
   once per QP. Why do you need to keep re-queuing this thing every time the
   user posts a WR?
   -- Explained. The root cause is that HiP08 needs the newest PI to flush the
      outstanding WQEs, so the flush must be re-queued every time the PI is
      updated (see the sketch after this list).
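
To make the re-queuing concrete, here is a minimal sketch, building on the
init_flush_work() sketch in the cover letter above; the field names are
assumptions rather than the actual driver code:

/*
 * Called from post_send/post_recv with the SQ/RQ spin-lock held, after the
 * producer index has been advanced.
 */
static void flush_if_in_error(struct hns_roce_dev *hr_dev,
			      struct hns_roce_qp *hr_qp)
{
	/*
	 * The HW flushes against the PI it is given, so while the QP stays
	 * in the error state the work is re-queued on every post to carry
	 * the newest PI.
	 */
	if (hr_qp->state == IB_QPS_ERR)
		init_flush_work(hr_dev, hr_qp);
}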

In conclusion, we accepted all the comments on v1 and replied to all three
comments on v2, but received no further response. Have I missed any of your
concerns?

I would really appreciate it if you could point out the unresolved issue.
Thanks.
Jason Gunthorpe Dec. 20, 2019, 1:37 p.m. UTC | #4
On Fri, Dec 20, 2019 at 06:12:51PM +0800, Liuyixian (Eason) wrote:
> Comments on patch set v2:
> 1. It kind of looks like this can be called multiple times? It won't work
>    right unless it is called exactly once.
> 2. Why do you need more than one work in parallel for this? Once you
>    start to move the HW to error that only has to happen once, surely?
>    -- Explained. This question raised the key point of our solution: the flush
>       operation has to be performed whenever the QP state moves to error, or
>       whenever the producer index is updated for a QP already in the error state.
> 3. The work function does something that looks like it only has to happen
>    once per QP. Why do you need to keep re-queuing this thing every time the
>    user posts a WR?
>    -- Explained. The root cause is that HiP08 needs the newest PI to flush the
>       outstanding WQEs, so the flush must be re-queued every time the PI is
>       updated.
> 
> In conclusion, we accepted all the comments on v1 and replied to all three
> comments on v2, but received no further response. Have I missed any of your
> concerns?
> 
> I would really appreciate it if you could point out the unresolved issue.

You explained that you need to push the latest PI to the HW, so I still
don't understand why it needs to be coded in such a confusing way.

Keep a note of whether the PI is being pushed; if not, start pushing it. If
it is, update the PI that will be pushed and do nothing.

Jason
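
For reference, a rough sketch of the pattern Jason describes, with assumed
names and fields (this is not the eventual v4 code): a per-QP flag records
whether a PI push is already in flight, so the posting paths only record the
newest PI and at most one work item is queued per QP at a time.

#include <linux/bitops.h>
#include <linux/workqueue.h>

#include "hns_roce_device.h"

#define HNS_ROCE_FLUSH_QUEUED	0	/* hypothetical per-QP flag bit */

/* Assumed helper that pushes the recorded PI to the HW (may sleep). */
void push_latest_pi_to_hw(struct hns_roce_qp *hr_qp);

/*
 * Assumes hns_roce_qp gains an embedded work_struct 'flush_work' (set up
 * with INIT_WORK() at QP creation) and an 'unsigned long flush_flags'.
 */
static void flush_pi_work_handler(struct work_struct *work)
{
	struct hns_roce_qp *hr_qp = container_of(work, struct hns_roce_qp,
						 flush_work);

	/*
	 * Clear the flag before pushing, so a PI update that races with the
	 * push re-arms the work and the HW eventually sees the newest PI.
	 */
	clear_bit(HNS_ROCE_FLUSH_QUEUED, &hr_qp->flush_flags);
	push_latest_pi_to_hw(hr_qp);
}

static void trigger_flush(struct hns_roce_dev *hr_dev,
			  struct hns_roce_qp *hr_qp)
{
	/* The caller has already recorded the newest PI in the QP. */
	if (!test_and_set_bit(HNS_ROCE_FLUSH_QUEUED, &hr_qp->flush_flags))
		queue_work(hr_dev->irq_workq, &hr_qp->flush_work);
}

Compared with queueing a freshly allocated work item on every post, this
keeps at most one work in flight per QP and needs no allocation on the data
path.
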
Yixian Liu Dec. 23, 2019, 1:49 p.m. UTC | #5
On 2019/12/20 21:37, Jason Gunthorpe wrote:
> You explained that you need to push the latest PI to the HW, so I still
> don't understand why it needs to be coded in such a confusing way.
> 
> Keep a note of whether the PI is being pushed; if not, start pushing it. If
> it is, update the PI that will be pushed and do nothing.
> 
> Jason

Good suggestion! I will fix it in v4.

Thanks.