[1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"

Message ID 20230922163231.2237811-1-yanjun.zhu@intel.com (mailing list archive)
State Accepted
Series [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"

Commit Message

Zhu Yanjun Sept. 22, 2023, 4:32 p.m. UTC
From: Zhu Yanjun <yanjun.zhu@linux.dev>

This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.

This commit replaces tasklets with a workqueue, but this results
in an occasional process hang in the blktests test case srp/002.
After the discussion in the link [1], this commit is reverted.

Link: https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#t [1]
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
CC: rpearsonhpe@gmail.com
CC: matsuda-daisuke@fujitsu.com
CC: bvanassche@acm.org
CC: shinichiro.kawasaki@wdc.com
---
 drivers/infiniband/sw/rxe/rxe.c      |   9 +--
 drivers/infiniband/sw/rxe/rxe_task.c | 110 ++++++++++++---------------
 drivers/infiniband/sw/rxe/rxe_task.h |   6 +-
 3 files changed, 49 insertions(+), 76 deletions(-)

Comments

Bart Van Assche Sept. 22, 2023, 4:42 p.m. UTC | #1
On 9/22/23 09:32, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
> 
> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
> 
> This commit replaces tasklets with a workqueue, but this results
> in an occasional process hang in the blktests test case srp/002.
> After the discussion in the link [1], this commit is reverted.
> 
> Link: https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#t [1]
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> CC: rpearsonhpe@gmail.com
> CC: matsuda-daisuke@fujitsu.com
> CC: bvanassche@acm.org
> CC: shinichiro.kawasaki@wdc.com

Shouldn't the Cc-list occur before the Signed-off-by tag? Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Leon Romanovsky Sept. 26, 2023, 9:43 a.m. UTC | #2
On Fri, Sep 22, 2023 at 09:42:08AM -0700, Bart Van Assche wrote:
> On 9/22/23 09:32, Zhu Yanjun wrote:
> > From: Zhu Yanjun <yanjun.zhu@linux.dev>
> > 
> > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
> > 
> > This commit replaces tasklets with a workqueue, but this results
> > in an occasional process hang in the blktests test case srp/002.
> > After the discussion in the link [1], this commit is reverted.
> > 
> > Link: https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#t [1]
> > Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> > CC: rpearsonhpe@gmail.com
> > CC: matsuda-daisuke@fujitsu.com
> > CC: bvanassche@acm.org
> > CC: shinichiro.kawasaki@wdc.com
> 
> Shouldn't the Cc-list occur before the Signed-off-by tag? Anyway:

checkpatch didn't complain, so I took it as is.

> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Thanks
Leon Romanovsky Sept. 26, 2023, 9:43 a.m. UTC | #3
On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote:
> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
> 
> This commit replaces tasklets with a workqueue, but this results
> in an occasional process hang in the blktests test case srp/002.
> After the discussion in the link [1], this commit is reverted.
> 
> 
> [...]

Applied, thanks!

[1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"
      https://git.kernel.org/rdma/rdma/c/e710c390a8f860

Best regards,
Leon Romanovsky Sept. 26, 2023, 2:06 p.m. UTC | #4
On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote:
> 
> On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote:
> > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
> > 
> > This commit replaces tasklets with a workqueue, but this results
> > in an occasional process hang in the blktests test case srp/002.
> > After the discussion in the link [1], this commit is reverted.
> > 
> > 
> > [...]
> 
> Applied, thanks!
> 
> [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"
>       https://git.kernel.org/rdma/rdma/c/e710c390a8f860

I applied this patch, but will delay it for some time in the hope that a
fix will arrive in the near future.

Thanks

> 
> Best regards,
> -- 
> Leon Romanovsky <leon@kernel.org>
>
Bart Van Assche Sept. 26, 2023, 5:05 p.m. UTC | #5
On 9/26/23 07:06, Leon Romanovsky wrote:
> On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote:
>>
>> On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote:
>>> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
>>>
>>> This commit replaces tasklets with a workqueue, but this results
>>> in an occasional process hang in the blktests test case srp/002.
>>> After the discussion in the link [1], this commit is reverted.
>>>
>>>
>>> [...]
>>
>> Applied, thanks!
>>
>> [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"
>>        https://git.kernel.org/rdma/rdma/c/e710c390a8f860
> 
> I applied this patch, but will delay it for some time in the hope that a
> fix will arrive in the near future.

Thank you Leon. With this revert applied on top of the rdma for-next branch, I
don't see the KASAN complaint anymore that I reported yesterday. I think this
is more evidence that the KASAN complaint was caused by the RXE driver.

Bart.
Bob Pearson Sept. 26, 2023, 6:34 p.m. UTC | #6
On 9/26/23 12:05, Bart Van Assche wrote:
> On 9/26/23 07:06, Leon Romanovsky wrote:
>> On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote:
>>>
>>> On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote:
>>>> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381.
>>>>
>>>> This commit replaces tasklets with a workqueue, but this results
>>>> in an occasional process hang in the blktests test case srp/002.
>>>> After the discussion in the link [1], this commit is reverted.
>>>>
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"
>>>        https://git.kernel.org/rdma/rdma/c/e710c390a8f860
>>
>> I applied this patch, but will delay it for some time in the hope that a
>> fix will arrive in the near future.
> 
> Thank you Leon. With this revert applied on top of the rdma for-next branch, I
> don't see the KASAN complaint anymore that I reported yesterday. I think this
> is more evidence that the KASAN complaint was caused by the RXE driver.
> 
> Bart.

Bart,

I am working to try to reproduce the KASAN warning. Unfortunately, so far I am not able to see it
in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies
the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002
but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that
causes the warning.

Thanks for the support,

Bob
Bart Van Assche Sept. 26, 2023, 8:24 p.m. UTC | #7
On 9/26/23 11:34, Bob Pearson wrote:
> I am working to try to reproduce the KASAN warning. Unfortunately,
> so far I am not able to see it in Ubuntu + Linus' kernel (as you 
> described) on metal. The config file is different but copies the 
> CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on 
> every iteration of srp/002 but without a KASAN warning. I am now 
> building an openSuSE VM for qemu and will see if that causes the 
> warning.

Hi Bob,

Did you try to understand the report that I shared? My conclusion from
the report is that when using tasklets rxe_completer() only runs after
rxe_requester() has finished and also that when using work queues that
rxe_completer() may run concurrently with rxe_requester(). This patch
seems to fix all issues that I ran into with the rdma_rxe workqueue
patch (I have not tried to verify the performance implications of this
patch):

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
b/drivers/infiniband/sw/rxe/rxe_task.c
index 1501120d4f52..6cd5d5a7a316 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;

  int rxe_alloc_wq(void)
  {
-       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
+       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
         if (!rxe_wq)
                 return -ENOMEM;

Thanks,

Bart.
Rain River Sept. 27, 2023, 12:08 a.m. UTC | #8
On Wed, Sep 27, 2023 at 4:37 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 9/26/23 11:34, Bob Pearson wrote:
> > I am working to try to reproduce the KASAN warning. Unfortunately,
> > so far I am not able to see it in Ubuntu + Linus' kernel (as you
> > described) on metal. The config file is different but copies the
> > CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on
> > every iteration of srp/002 but without a KASAN warning. I am now
> > building an openSuSE VM for qemu and will see if that causes the
> > warning.
>
> Hi Bob,
>
> Did you try to understand the report that I shared? My conclusion from
> the report is that when using tasklets rxe_completer() only runs after
> rxe_requester() has finished and also that when using work queues that
> rxe_completer() may run concurrently with rxe_requester(). This patch
> seems to fix all issues that I ran into with the rdma_rxe workqueue
> patch (I have not tried to verify the performance implications of this
> patch):

In the same test environment in the link
https://lore.kernel.org/all/4e7aac82-f006-aaa7-6769-d1c9691a0cec@gmail.com/T/#m3294d00f5cf3247dfdb2ea3688b1467167f72704,

RXE with a workqueue has worse performance than RXE with a tasklet.
Sometimes RXE with a workqueue cannot work well.

We need this commit in RXE.

>
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c
> b/drivers/infiniband/sw/rxe/rxe_task.c
> index 1501120d4f52..6cd5d5a7a316 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>
>   int rxe_alloc_wq(void)
>   {
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>          if (!rxe_wq)
>                  return -ENOMEM;
>
> Thanks,
>
> Bart.
Bob Pearson Sept. 27, 2023, 4:36 p.m. UTC | #9
On 9/26/23 15:24, Bart Van Assche wrote:
> On 9/26/23 11:34, Bob Pearson wrote:
>> I am working to try to reproduce the KASAN warning. Unfortunately,
>> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning.
> 
> Hi Bob,
> 
> Did you try to understand the report that I shared? 

Looking at the three stack traces from KASAN (alloc, free, and use after free)
it appears that there was an ack packet (skb) created in rxe_responder (normal) and
then passed to rxe_completer which apparently successfully processed it and then
freed the skb (also normal). Then the same skb is enqueued on the response queue
in rxe_comp_queue_pkt(). This is very strange and hard to understand. The only way
the original packet could have been 'completed' would be for it to have been first
enqueued on qp->resp_pkts by skb_queue_tail() and then dequeued after the completer
task runs by skb_dequeue(). The skb queue routines are protected by an irqsave spinlock
so they should operate atomically. In other words the completer can't get the skb until
skb_queue_tail() is finished touching the skb. So it looks like the first pass through
rxe_comp_queue_pkt() shouldn't be to blame. There is no way I can imagine that the
packet could be queued twice on the local loopback path. One strange thing in the
trace is a "? rxe_recv_mcast_pkt" which seems unlikely to be true as all the packets
are rc and hence not mcast. Not sure how to interpret this. Perhaps the stack is
corrupted from scribbles which might cause the above impossibility.
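
(For reference, a minimal sketch of the handoff described above; the
sk_buff_head pointer stands in for the qp response queue and the actual rxe
processing is elided. The generic skb queue helpers take the queue's own lock
with irqsave internally, which is why the enqueue and dequeue operate
atomically with respect to each other.)

#include <linux/skbuff.h>

/* responder side: hand a finished ack skb to the completer */
static void post_ack(struct sk_buff_head *resp_pkts, struct sk_buff *skb)
{
	skb_queue_tail(resp_pkts, skb);		/* takes resp_pkts->lock with irqsave */
}

/* completer side: drain the queue and free each skb exactly once */
static void consume_acks(struct sk_buff_head *resp_pkts)
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(resp_pkts)) != NULL) {
		/* ... process the response ... */
		kfree_skb(skb);
	}
}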

> My conclusion from
> the report is that when using tasklets rxe_completer() only runs after
> rxe_requester() has finished and also that when using work queues that
> rxe_completer() may run concurrently with rxe_requester(). 

The completer task was always intended to run in parallel with the requester
and responder tasks whether they are tasklets or workqueue items. Tasklets
tend to run sequentially but there is no reason why they can't run in
parallel. The completer task is triggered by response packets from another
process's queue pair which is asynchronous from the requester task which
generated the request packets.

For unrelated reasons I am planning to merge the requester task and completer
task into a single task because in high-scale situations with lots of QPs
it performs better and allows removing some of the locking between them.

> This patch
> seems to fix all issues that I ran into with the rdma_rxe workqueue
> patch (I have not tried to verify the performance implications of this
> patch):
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> index 1501120d4f52..6cd5d5a7a316 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
> 
>  int rxe_alloc_wq(void)
>  {
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>         if (!rxe_wq)
>                 return -ENOMEM;
> 
> Thanks,
> 
> Bart.
Bob Pearson Sept. 27, 2023, 4:51 p.m. UTC | #10
On 9/26/23 15:24, Bart Van Assche wrote:
> On 9/26/23 11:34, Bob Pearson wrote:
>> I am working to try to reproduce the KASAN warning. Unfortunately,
>> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning.
> 
> Hi Bob,
> 
> Did you try to understand the report that I shared? My conclusion from
> the report is that when using tasklets rxe_completer() only runs after
> rxe_requester() has finished and also that when using work queues that
> rxe_completer() may run concurrently with rxe_requester(). This patch
> seems to fix all issues that I ran into with the rdma_rxe workqueue
> patch (I have not tried to verify the performance implications of this
> patch):
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> index 1501120d4f52..6cd5d5a7a316 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
> 
>  int rxe_alloc_wq(void)
>  {
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>         if (!rxe_wq)
>                 return -ENOMEM;
> 
> Thanks,
> 
> Bart.

The workqueue doc says

Some users depend on the strict execution ordering of ST wq. The combination of @max_active of 1 and WQ_UNBOUND
is used to achieve this behavior. Work items on such wq are always queued to the unbound worker-pools and only
one work item can be active at any given time thus achieving the same ordering property as ST wq.

When I tried this setting I saw very low performance compared to 512. It seems that only one item at a time
can run across all the CPUs, even though the documentation also says that max_active is the number of threads per CPU.

Nevertheless this is a good hint since it seems to imply that there is a race between the requester and
completer which is certainly possible.
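
For reference, a minimal sketch of the two allocation forms being compared
here (the wrapper name is illustrative). Per the workqueue documentation,
WQ_UNBOUND together with max_active == 1 has historically behaved as an
ordered queue, and alloc_ordered_workqueue() is the explicit way to request
that behavior:

#include <linux/workqueue.h>

static struct workqueue_struct *rxe_wq;

static int rxe_alloc_wq_sketch(void)
{
	/* the setting under discussion: unbound and effectively ordered,
	 * because max_active == 1
	 */
	rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);

	/* explicit, documented form for strict one-at-a-time execution:
	 *	rxe_wq = alloc_ordered_workqueue("rxe_wq", 0);
	 */
	if (!rxe_wq)
		return -ENOMEM;
	return 0;
}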

Bob
Leon Romanovsky Oct. 1, 2023, 6:30 a.m. UTC | #11
On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote:
> On 9/26/23 15:24, Bart Van Assche wrote:
> > On 9/26/23 11:34, Bob Pearson wrote:
> >> I am working to try to reproduce the KASAN warning. Unfortunately,
> >> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning.
> > 
> > Hi Bob,
> > 
> > Did you try to understand the report that I shared? My conclusion from
> > the report is that when using tasklets rxe_completer() only runs after
> > rxe_requester() has finished and also that when using work queues that
> > rxe_completer() may run concurrently with rxe_requester(). This patch
> > seems to fix all issues that I ran into with the rdma_rxe workqueue
> > patch (I have not tried to verify the performance implications of this
> > patch):
> > 
> > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> > index 1501120d4f52..6cd5d5a7a316 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_task.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
> > 
> >  int rxe_alloc_wq(void)
> >  {
> > -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> > +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
> >         if (!rxe_wq)
> >                 return -ENOMEM;
> > 
> > Thanks,
> > 
> > Bart.

<...>

> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and
> completer which is certainly possible.

Bob, Bart

Can you please send this change as a formal patch?
As we prefer workqueue with bad performance implementation over tasklets.

Thanks

> 
> Bob
> 
>
Zhu Yanjun Oct. 4, 2023, 1 a.m. UTC | #12
On 2023/10/4 8:46, Zhu Yanjun wrote:
>
> On 2023/10/4 2:11, Leon Romanovsky wrote:
>> On Tue, Oct 03, 2023 at 11:29:42PM +0800, Zhu Yanjun wrote:
>>> On 2023/10/3 17:59, Leon Romanovsky wrote:
>>>> On Tue, Oct 03, 2023 at 04:55:40PM +0800, Zhu Yanjun wrote:
>>>>> On 2023/10/1 14:50, Leon Romanovsky wrote:
>>>>>> On Sun, Oct 1, 2023, at 09:47, Zhu Yanjun wrote:
>>>>>>> On 2023/10/1 14:39, Leon Romanovsky wrote:
>>>>>>>> On Sun, Oct 1, 2023, at 09:34, Zhu Yanjun wrote:
>>>>>>>>> On 2023/10/1 14:30, Leon Romanovsky wrote:
>>>>>>>>>> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote:
>>>>>>>>>>> On 9/26/23 15:24, Bart Van Assche wrote:
>>>>>>>>>>>> On 9/26/23 11:34, Bob Pearson wrote:
>>>>>>>>>>>>> I am working to try to reproduce the KASAN warning. 
>>>>>>>>>>>>> Unfortunately,
>>>>>>>>>>>>> so far I am not able to see it in Ubuntu + Linus' kernel 
>>>>>>>>>>>>> (as you described) on metal. The config file is different 
>>>>>>>>>>>>> but copies the CONFIG_KASAN_xxx exactly as yours. With 
>>>>>>>>>>>>> KASAN enabled it hangs on every iteration of srp/002 but 
>>>>>>>>>>>>> without a KASAN warning. I am now building an openSuSE VM 
>>>>>>>>>>>>> for qemu and will see if that causes the warning.
>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>
>>>>>>>>>>>> Did you try to understand the report that I shared? My 
>>>>>>>>>>>> conclusion from
>>>>>>>>>>>> the report is that when using tasklets rxe_completer() only 
>>>>>>>>>>>> runs after
>>>>>>>>>>>> rxe_requester() has finished and also that when using work 
>>>>>>>>>>>> queues that
>>>>>>>>>>>> rxe_completer() may run concurrently with rxe_requester(). 
>>>>>>>>>>>> This patch
>>>>>>>>>>>> seems to fix all issues that I ran into with the rdma_rxe 
>>>>>>>>>>>> workqueue
>>>>>>>>>>>> patch (I have not tried to verify the performance 
>>>>>>>>>>>> implications of this
>>>>>>>>>>>> patch):
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
>>>>>>>>>>>> b/drivers/infiniband/sw/rxe/rxe_task.c
>>>>>>>>>>>> index 1501120d4f52..6cd5d5a7a316 100644
>>>>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c
>>>>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
>>>>>>>>>>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>>>>>>>>>>>>
>>>>>>>>>>>>       int rxe_alloc_wq(void)
>>>>>>>>>>>>       {
>>>>>>>>>>>> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 
>>>>>>>>>>>> WQ_MAX_ACTIVE);
>>>>>>>>>>>> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>>>>>>>>>>>>              if (!rxe_wq)
>>>>>>>>>>>>                      return -ENOMEM;
>>>>> With this commit, a test ran for several days. A similar problem
>>>>> still occurred.
>>>>>
>>>>> The problem is very similar to the one that Bart mentioned.
>>>>>
>>>>> It is very possible that when WQ_MAX_ACTIVE is changed to 1, this
>>>>> problem is alleviated.
>>>>>
>>>>> In the following
>>>>>
>>>>> 4661 __printf(1, 4)
>>>>> 4662 struct workqueue_struct *alloc_workqueue(const char *fmt,
>>>>> 4663                                          unsigned int flags,
>>>>> 4664                                          int max_active, ...)
>>>>> 4665 {
>>>>> 4666         va_list args;
>>>>> 4667         struct workqueue_struct *wq;
>>>>> 4668         struct pool_workqueue *pwq;
>>>>> 4669
>>>>> 4670         /*
>>>>> 4671          * Unbound && max_active == 1 used to imply ordered, 
>>>>> which is
>>>>> no longer
>>>>> 4672          * the case on many machines due to per-pod pools. While
>>>>> 4673          * alloc_ordered_workqueue() is the right way to 
>>>>> create an
>>>>> ordered
>>>>> 4674          * workqueue, keep the previous behavior to avoid subtle
>>>>> breakages.
>>>>> 4675          */
>>>>> 4676         if ((flags & WQ_UNBOUND) && max_active == 1)
>>>>> <---This means that workqueue is ordered.
>>>>> 4677                 flags |= __WQ_ORDERED;
>>>>> ...
>>>>>
>>>>> Does this mean that the ordered workqueue covers up the root cause? When
>>>>> the workqueue is changed to ordered, it is difficult to reproduce this
>>>>> problem.

>>>>> Got it.
>
> Is there any way to ensure the following?
>
> if a mail does not appear on the rdma mailing list, it will not be
> reviewed?


Sorry. My bad. I used the wrong rdma mailing list.


>
>>
>>> The analysis is as below:
>>>
>>> Because a workqueue will sleep when it is preempted, sometimes the
>>> sleep time will exceed the timeout of rdma packets. As such, the rdma
>>> stack or the ULP will OOM or hang. This is why a workqueue can cause a
>>> ULP hang.
>>>
>>> But a tasklet will not sleep, so this kind of problem will not occur
>>> with a tasklet.
>>>
>>> About performance, currently an ordered workqueue can only execute at
>>> most one work item at any given time, in the queued order. So in RXE, a
>>> workqueue will not execute more jobs than a tasklet.
>> It is because of changing max_active to be 1. Once that bug will be
>> fixed, RXE will be able to spread traffic on all CPUs.
>

Sure. I agree with you.


After max_active is changed to 1, the workqueue becomes an ordered workqueue.

The ordered workqueue executes the work items one by one, possibly on
different CPUs; that is, after one work item is complete, the ordered
workqueue executes the next one, in the queued order, possibly on a
different CPU. A tasklet executes its jobs one by one on the same CPU.

So if the total number of jobs is the same, the ordered workqueue will have
the same execution time as the tasklet.

But the ordered workqueue has more scheduling overhead than the tasklet.

In total, the performance of the ordered workqueue is not good compared
with the tasklet.
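
To make the context difference behind this argument concrete, here is a
minimal, self-contained sketch (the names are illustrative, not rxe code):
a work item handler runs in process context and is allowed to sleep, while a
tasklet callback runs in atomic (softirq) context and must not sleep.

#include <linux/workqueue.h>
#include <linux/interrupt.h>
#include <linux/delay.h>

static void wq_handler(struct work_struct *work)
{
	/* process context: may be preempted and may sleep */
	msleep(10);
}

static void tl_handler(struct tasklet_struct *t)
{
	/* softirq context: must not sleep, runs to completion on one CPU */
}

static DECLARE_WORK(example_work, wq_handler);		/* run via schedule_work() */
static DECLARE_TASKLET(example_tasklet, tl_handler);	/* run via tasklet_schedule() */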

Zhu Yanjun
Zhu Yanjun Oct. 4, 2023, 3:41 a.m. UTC | #13
On 2023/9/27 4:24, Bart Van Assche wrote:
> On 9/26/23 11:34, Bob Pearson wrote:
>> I am working to try to reproduce the KASAN warning. Unfortunately,
>> so far I am not able to see it in Ubuntu + Linus' kernel (as you 
>> described) on metal. The config file is different but copies the 
>> CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on 
>> every iteration of srp/002 but without a KASAN warning. I am now 
>> building an openSuSE VM for qemu and will see if that causes the warning.
> 
> Hi Bob,
> 
> Did you try to understand the report that I shared? My conclusion from
> the report is that when using tasklets rxe_completer() only runs after
> rxe_requester() has finished and also that when using work queues that
> rxe_completer() may run concurrently with rxe_requester(). This patch
> seems to fix all issues that I ran into with the rdma_rxe workqueue
> patch (I have not tried to verify the performance implications of this
> patch):
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
> b/drivers/infiniband/sw/rxe/rxe_task.c
> index 1501120d4f52..6cd5d5a7a316 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
> 
>   int rxe_alloc_wq(void)
>   {
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>          if (!rxe_wq)
>                  return -ENOMEM;

Hi, Bart

With the above commit, I still hit a similar problem, but the problem
occurs very rarely. With the following change, so far the problem has not occurred.

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
b/drivers/infiniband/sw/rxe/rxe_task.c
index 1501120d4f52..3189c3705295 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;

  int rxe_alloc_wq(void)
  {
-       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
+       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
         if (!rxe_wq)
                 return -ENOMEM;


And with the tasklet, this problem also does not occur.

With "alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);", an 
ordered workqueue with high priority is allocated.

For the same number of work items, the ordered workqueue has the same
running time as the tasklet. But the tasklet is based on softirq, so its
scheduling overhead is less than a workqueue's. In theory, then, the tasklet's
performance should be better than the ordered workqueue's.

Best Regards,
Zhu Yanjun

> 
> Thanks,
> 
> Bart.
Bart Van Assche Oct. 4, 2023, 5:43 p.m. UTC | #14
On 10/3/23 20:41, Zhu Yanjun wrote:
> On 2023/9/27 4:24, Bart Van Assche wrote:
>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
>> b/drivers/infiniband/sw/rxe/rxe_task.c
>> index 1501120d4f52..6cd5d5a7a316 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_task.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>>
>>   int rxe_alloc_wq(void)
>>   {
>> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>>          if (!rxe_wq)
>>                  return -ENOMEM;
> 
> Hi, Bart
> 
> With the above commit, I still hit a similar problem, but the problem
> occurs very rarely. With the following change, so far the problem has not occurred.
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c 
> b/drivers/infiniband/sw/rxe/rxe_task.c
> index 1501120d4f52..3189c3705295 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
> 
>   int rxe_alloc_wq(void)
>   {
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
>          if (!rxe_wq)
>                  return -ENOMEM;
> 
> 
> And with the tasklet, this problem also does not occur.
> 
> With "alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);", an 
> ordered workqueue with high priority is allocated.
> 
> For the same number of work items, the ordered workqueue has the same
> running time as the tasklet. But the tasklet is based on softirq, so its
> scheduling overhead is less than a workqueue's. In theory, then, the tasklet's
> performance should be better than the ordered workqueue's.

Hi Zhu,

Thank you for having reported this. I'm OK with integrating the above 
change in my patch. However, code changes must be motivated. Do you 
perhaps have an explanation of why WQ_HIGHPRI makes the issue disappear 
that you observed?

Regarding tasklets: tasklets are a nightmare from the point of view of
the Linux kernel scheduler. As an example, these may make it impossible
for the scheduler to schedule real-time threads in time.

Thanks,

Bart.
Bart Van Assche Oct. 4, 2023, 5:44 p.m. UTC | #15
On 9/30/23 23:30, Leon Romanovsky wrote:
> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote:
>> On 9/26/23 15:24, Bart Van Assche wrote:
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
>>> index 1501120d4f52..6cd5d5a7a316 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>>>
>>>   int rxe_alloc_wq(void)
>>>   {
>>> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>>> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>>>          if (!rxe_wq)
>>>                  return -ENOMEM;
>>>
>>> Thanks,
>>>
>>> Bart.
> 
> <...>
> 
>> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and
>> completer which is certainly possible.
> 
> Bob, Bart
> 
> Can you please send this change as a formal patch?
> As we prefer workqueue with bad performance implementation over tasklets.

Hi Bob,

Do you perhaps have a preference for who posts the formal patch?

Thanks,

Bart.
Jason Gunthorpe Oct. 4, 2023, 6:38 p.m. UTC | #16
On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote:

> Thank you for having reported this. I'm OK with integrating the
> above change in my patch. However, code changes must be
> motivated. Do you perhaps have an explanation of why WQ_HIGHPRI
> makes the issue disappear that you observed?

I think it is clear there are locking bugs in all this, so it is not
surprising that changing the scheduling behavior can make locking bugs
hide

Jason
Bob Pearson Oct. 4, 2023, 9:16 p.m. UTC | #17
On 10/4/23 12:44, Bart Van Assche wrote:
> On 9/30/23 23:30, Leon Romanovsky wrote:
>> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote:
>>> On 9/26/23 15:24, Bart Van Assche wrote:
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
>>>> index 1501120d4f52..6cd5d5a7a316 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
>>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq;
>>>>
>>>>   int rxe_alloc_wq(void)
>>>>   {
>>>> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>>>> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
>>>>          if (!rxe_wq)
>>>>                  return -ENOMEM;
>>>>
>>>> Thanks,
>>>>
>>>> Bart.
>>
>> <...>
>>
>>> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and
>>> completer which is certainly possible.
>>
>> Bob, Bart
>>
>> Can you please send this change as a formal patch?
>> As we prefer workqueue with bad performance implementation over tasklets.
> 
> Hi Bob,
> 
> Do you perhaps have a preference for who posts the formal patch?
> 
> Thanks,
> 
> Bart.
> 

Bart,

Not really.

I have spent the past two weeks chasing this bug and don't have much to report. I have never been able to
reproduce your KASAN bug. I have found, like Zhu, that the hang is always there but the frequency changes a
lot depending on the changes applied. For example, various printk()s can increase or decrease the frequency.

I spent this morning looking at flame graphs captured during the hang which lasts about 60 seconds before
it times out and check tears down the test. It is attached to this note. There seems to be a lot of recursion
in what I assume is some attempt at error recovery. The recursion is probably in user space because the
symbols are not available to perf.

I would be worried that there may be stack overflow which could cause bad behavior.

Bob
Zhu Yanjun Oct. 5, 2023, 9:25 a.m. UTC | #18
On 2023/10/5 2:38, Jason Gunthorpe wrote:
> On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote:
> 
>> Thank you for having reported this. I'm OK with integrating the
>> above change in my patch. However, code changes must be
>> motivated. Do you perhaps have an explanation of why WQ_HIGHPRI
>> makes the issue disappear that you observed?
> 
> I think it is clear there are locking bugs in all this, so it is not
> surprising that changing the scheduling behavior can make locking bugs
> hide
> 
> Jason

With the flag WQ_HIGHPRI, an ordered workqueue with high priority is
allocated. With this workqueue, the test has now run for several days
and the problem did not appear. So in my test environment, this
problem seems fixed with the above commit.

I suspect that, without the flag WQ_HIGHPRI, the workqueue is allocated with
normal priority, and a work item in this workqueue is more likely to be
preempted than one in a high-priority workqueue.
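
For completeness, a minimal sketch of the allocation being tested here (the
wrapper name is illustrative): WQ_HIGHPRI routes the work items to the
high-priority worker pool, whose workers run at elevated priority, which is
the basis of the hypothesis above that they are less likely to be preempted
for long stretches.

#include <linux/workqueue.h>

static struct workqueue_struct *rxe_wq;

static int rxe_alloc_wq_highpri_sketch(void)
{
	/* unbound, effectively ordered (max_active == 1), high-priority pool */
	rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
	if (!rxe_wq)
		return -ENOMEM;
	return 0;
}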

Best Regards,
Zhu Yanjun
Jason Gunthorpe Oct. 5, 2023, 2:21 p.m. UTC | #19
On Thu, Oct 05, 2023 at 05:25:29PM +0800, Zhu Yanjun wrote:
> On 2023/10/5 2:38, Jason Gunthorpe wrote:
> > On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote:
> > 
> > > Thank you for having reported this. I'm OK with integrating the
> > > above change in my patch. However, code changes must be
> > > motivated. Do you perhaps have an explanation of why WQ_HIGHPRI
> > > makes the issue disappear that you observed?
> > 
> > I think it is clear there are locking bugs in all this, so it is not
> > surprising that changing the scheduling behavior can make locking bugs
> > hide
> > 
> > Jason
> 
> With the flag WQ_HIGHPRI, an ordered workqueue with high priority is
> allocated. With this workqueue, the test has now run for several days
> and the problem did not appear. So in my test environment, this
> problem seems fixed with the above commit.
>
> I suspect that, without the flag WQ_HIGHPRI, the workqueue is allocated with
> normal priority, and a work item in this workqueue is more likely to be
> preempted than one in a high-priority workqueue.

Which is why it shows there are locking problems in this code.

Jason
Bart Van Assche Oct. 5, 2023, 2:50 p.m. UTC | #20
On 10/5/23 07:21, Jason Gunthorpe wrote:
> Which is why it shows there are locking problems in this code.

Hi Jason,

Since the locking problems have not yet been root-caused, do you
agree that it is safer to revert patch "RDMA/rxe: Add workqueue
support for rxe tasks" rather than trying to fix it?

Thanks,

Bart.
Jason Gunthorpe Oct. 5, 2023, 3:56 p.m. UTC | #21
On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote:
> On 10/5/23 07:21, Jason Gunthorpe wrote:
> > Which is why it shows there are locking problems in this code.
> 
> Hi Jason,
> 
> Since the locking problems have not yet been root-caused, do you
> agree that it is safer to revert patch "RDMA/rxe: Add workqueue
> support for rxe tasks" rather than trying to fix it?

I don't think that makes the locking problems go away any more than
using a single-threaded work queue does.

Jason
Bob Pearson Oct. 6, 2023, 3:58 p.m. UTC | #22
On 10/5/23 10:56, Jason Gunthorpe wrote:
> On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote:
>> On 10/5/23 07:21, Jason Gunthorpe wrote:
>>> Which is why it shows there are locking problems in this code.
>>
>> Hi Jason,
>>
>> Since the locking problems have not yet been root-caused, do you
>> agree that it is safer to revert patch "RDMA/rxe: Add workqueue
>> support for rxe tasks" rather than trying to fix it?
> 
> I don't think that makes the locking problems go away any more than
> using a single-threaded work queue does.
> 
> Jason

This is slightly off topic but may still be relevant.
If there are locking bugs they are between the two send-side tasks,
rxe_completer and rxe_requester, which share the send queue and other
state. Bart attempts to fix this by setting max_active to 1, which
keeps these two workqueue tasks from interfering with each other.
For completely different reasons we have looked at merging these
two tasks into a single task, which it turns out improves performance,
especially in high-scale situations where it reduces the number of
cpu cores needed to complete the work. But even at low scale (1-2 QPs)
it helps because of improved caching. It turns out that if the work
is mostly sends and writes there isn't much for the completer
to do, while if it is mostly reads there isn't much for the requester
to do. So combining them doesn't hurt performance by having fewer cores
to do the work. But this also prevents the two tasks for a given QP
from running at the same time, which should eliminate the locking issues.
If no one hates the idea I can send in our patch that does this.
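
A minimal sketch of that idea follows (the merged function and its name are
hypothetical; rxe_requester() and rxe_completer() are the existing per-QP
task functions, and the return convention matches the task comments: zero
means more work remains, non-zero means the task is done):

/* sketch only; assumes the drivers/infiniband/sw/rxe context ("rxe.h") */
static int rxe_sender_sketch(struct rxe_qp *qp)
{
	int comp_ret, req_ret;

	/* running both halves from one task means they can never execute
	 * concurrently for the same QP
	 */
	comp_ret = rxe_completer(qp);
	req_ret = rxe_requester(qp);

	/* done only when both halves report no remaining work */
	return (comp_ret && req_ret) ? -EAGAIN : 0;
}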

Bob
Zhu Yanjun Oct. 7, 2023, 12:35 a.m. UTC | #23
On 2023/10/6 23:58, Bob Pearson wrote:
> On 10/5/23 10:56, Jason Gunthorpe wrote:
>> On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote:
>>> On 10/5/23 07:21, Jason Gunthorpe wrote:
>>>> Which is why it shows there are locking problems in this code.
>>> Hi Jason,
>>>
>>> Since the locking problems have not yet been root-caused, do you
>>> agree that it is safer to revert patch "RDMA/rxe: Add workqueue
>>> support for rxe tasks" rather than trying to fix it?
>> I don't think that makes the locking problems go away any more than
>> using a single-threaded work queue does.
>>
>> Jason
> This is slightly off topic but may still be relevant.
> If there are locking bugs they are between the two send-side tasks,
> rxe_completer and rxe_requester, which share the send queue and other
> state. Bart attempts to fix this by setting max_active to 1, which
> keeps these two workqueue tasks from interfering with each other.
> For completely different reasons we have looked at merging these
> two tasks into a single task, which it turns out improves performance,
> especially in high-scale situations where it reduces the number of
> cpu cores needed to complete the work. But even at low scale (1-2 QPs)
> it helps because of improved caching. It turns out that if the work
> is mostly sends and writes there isn't much for the completer
> to do, while if it is mostly reads there isn't much for the requester
> to do. So combining them doesn't hurt performance by having fewer cores
> to do the work. But this also prevents the two tasks for a given QP
> from running at the same time, which should eliminate the locking issues.
> If no one hates the idea I can send in our patch that does this.

Hi, Bob

The blktests hang problem is not fixed now. I do not think that it is a
good time to introduce this new feature.

IMHO, we should fix this blktests hang problem first. And we also need some
time to wait for the test results from the users. If this problem proves
fixed, then we have a stable RXE version for the new feature from you.

In short, we need a stable RXE version for use. Then, based on the stable
RXE version, new features are welcome.

Please let me know your advice.

Bart, please comment on this.

Jason && Leon, please comment on this.

Warm Regards,

Zhu Yanjun

>
> Bob
Zhu Yanjun Oct. 8, 2023, 4:01 p.m. UTC | #24
On 2023/10/5 22:50, Bart Van Assche wrote:
> On 10/5/23 07:21, Jason Gunthorpe wrote:
>> Which is why it shows there are locking problems in this code.
> 
> Hi Jason,
> 
> Since the locking problems have not yet been root-caused, do you
> agree that it is safer to revert patch "RDMA/rxe: Add workqueue
> support for rxe tasks" rather than trying to fix it?

Hi, Jason && Leon

I spent a lot of time on this problem. It seems that it is a very 
difficult problem.

So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue
support for rxe tasks" rather than trying to fix it? Then Bob can apply 
his new patch to a stable RXE?

Any reply is appreciated.
Warm Regards,

Zhu Yanjun

> 
> Thanks,
> 
> Bart.
Leon Romanovsky Oct. 8, 2023, 5:09 p.m. UTC | #25
On Mon, Oct 09, 2023 at 12:01:37AM +0800, Zhu Yanjun wrote:
> On 2023/10/5 22:50, Bart Van Assche wrote:
> > On 10/5/23 07:21, Jason Gunthorpe wrote:
> > > Which is why it shows there are locking problems in this code.
> > 
> > Hi Jason,
> > 
> > Since the locking problems have not yet been root-caused, do you
> > agree that it is safer to revert patch "RDMA/rxe: Add workqueue
> > support for rxe tasks" rather than trying to fix it?
> 
> Hi, Jason && Leon
> 
> I spent a lot of time on this problem. It seems that it is a very difficult
> problem.
> 
> So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue
> support for rxe tasks" rather than trying to fix it? Then Bob can apply his
> new patch to a stable RXE?

Jason, I'm fine with reverting the patch.

Thanks

> 
> Any reply is appreciated.
> Warm Regards,
> 
> Zhu Yanjun
> 
> > 
> > Thanks,
> > 
> > Bart.
>
Daisuke Matsuda (Fujitsu) Oct. 10, 2023, 4:53 a.m. UTC | #26
On Mon, Oct 9, 2023 1:02 AM Zhu Yanjun wrote:
> On 2023/10/5 22:50, Bart Van Assche wrote:
> > On 10/5/23 07:21, Jason Gunthorpe wrote:
> >> Which is why it shows there are locking problems in this code.
> >
> > Hi Jason,
> >
> > Since the locking problems have not yet been root-caused, do you
> > agree that it is safer to revert patch "RDMA/rxe: Add workqueue
> > support for rxe tasks" rather than trying to fix it?
> 
> Hi, Jason && Leon
> 
> I spent a lot of time on this problem. It seems that it is a very
> difficult problem.
> 
> So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue
> support for rxe tasks" rather than trying to fix it? Then Bob can apply
> his new patch to a stable RXE?

Cf. https://lore.kernel.org/lkml/f15b06b934aa0ace8b28dc046022e5507458eb99.1694153251.git.matsuda-daisuke@fujitsu.com/
I have ODP patches that are fully dependent on "RDMA/rxe: Add workqueue
support for rxe tasks". So I personally prefer preserving workqueue to reverting
the workqueue patch.

Each developer here has different motives and interests. I think the rxe driver should
take in new specs and new features actively so that it can be used by developers
without access to HCAs. I believe workqueue is better suited for this purpose.
Additionally, the disadvantages of tasklet are documented as follows:
https://lwn.net/Articles/830964/
However, stability is very important, so I will not insist on my opinion.

I agree it is very difficult to find the root cause of the locking problem. It cannot
be helped; we will have to somehow hide the issue for now so that it does not bother
actual users of the driver. Perhaps there are three choices for doing this.

Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
I see this is supported by Zhu, Bart and approved by Leon.

Solution 2: Serializing execution of work items
> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);

Solution 3: Merging requester and completer (not yet submitted/tested)
https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/
Not clear to me if we should call this a new feature or a fix.
If it can eliminate the hang issue, it could be an ultimate solution.

It is understandable some people do not want to wait for solution 3 to be submitted and verified.
Is there any problem if we adopt solution 2?
If so, then I agree to going with solution 1.
If not, solution 2 is better to me.

Thanks,
Daisuke Matsuda
Jason Gunthorpe Oct. 10, 2023, 4:09 p.m. UTC | #27
On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote:

> Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
> I see this is supported by Zhu, Bart and approved by Leon.
> 
> Solution 2: Serializing execution of work items
> > -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> > +       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
> 
> Solution 3: Merging requester and completer (not yet submitted/tested)
> https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/
> Not clear to me if we should call this a new feature or a fix.
> If it can eliminate the hang issue, it could be an ultimate solution.
> 
> It is understandable some people do not want to wait for solution 3 to be submitted and verified.
> Is there any problem if we adopt solution 2?
> If so, then I agree to going with solution 1.
> If not, solution 2 is better to me.

I also do not want to go backwards, I don't believe the locking is
magically correct under tasklets. 2 is painful enough to continue to
motivate people to fix this while unbreaking block tests.

I'm still puzzled why Bob can't reproduce the things Bart has seen.

Jason
Bart Van Assche Oct. 10, 2023, 9:29 p.m. UTC | #28
On 10/10/23 09:09, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> 
>> Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
>> I see this is supported by Zhu, Bart and approved by Leon.
>>
>> Solution 2: Serializing execution of work items
>>> -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
>>> +       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
>>
>> Solution 3: Merging requester and completer (not yet submitted/tested)
>> https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/
>> Not clear to me if we should call this a new feature or a fix.
>> If it can eliminate the hang issue, it could be an ultimate solution.
>>
>> It is understandable some people do not want to wait for solution 3 to be submitted and verified.
>> Is there any problem if we adopt solution 2?
>> If so, then I agree to going with solution 1.
>> If not, solution 2 is better to me.
> 
> I also do not want to go backwards, I don't believe the locking is
> magically correct under tasklets. 2 is painful enough to continue to
> motivate people to fix this while unbreaking block tests.

In my opinion (2) is not a solution. Zhu Yanjun reported test failures with
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably
made it less likely to trigger any race conditions but I don't believe that
this is sufficient as a solution.

> I'm still puzzled why Bob can't reproduce the things Bart has seen.

Is this necessary? The KASAN complaint that I reported should be more than
enough for someone who is familiar with the RXE driver to identify and fix
the root cause. I can help with testing candidate fixes.

Thanks,

Bart.
Jason Gunthorpe Oct. 11, 2023, 3:51 p.m. UTC | #29
On Tue, Oct 10, 2023 at 02:29:19PM -0700, Bart Van Assche wrote:
> On 10/10/23 09:09, Jason Gunthorpe wrote:
> > On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> > 
> > > Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks"
> > > I see this is supported by Zhu, Bart and approved by Leon.
> > > 
> > > Solution 2: Serializing execution of work items
> > > > -       rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
> > > > +       rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);
> > > 
> > > Solution 3: Merging requester and completer (not yet submitted/tested)
> > > https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/
> > > Not clear to me if we should call this a new feature or a fix.
> > > If it can eliminate the hang issue, it could be an ultimate solution.
> > > 
> > > It is understandable some people do not want to wait for solution 3 to be submitted and verified.
> > > Is there any problem if we adopt solution 2?
> > > If so, then I agree to going with solution 1.
> > > If not, solution 2 is better to me.
> > 
> > I also do not want to go backwards, I don't believe the locking is
> > magically correct under tasklets. 2 is painful enough to continue to
> > motivate people to fix this while unbreaking block tests.
> 
> In my opinion (2) is not a solution. Zhu Yanjun reported test failures with
> rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably
> made it less likely to trigger any race conditions but I don't believe that
> this is sufficient as a solution.

I've been going on the assumption that rxe has always been full of
bugs. I don't believe the work queue change added new bugs, it just
made the existing bugs easier to hit.

It is hard to be sure until someone can find out what is going wrong.

If we revert it then rxe will probably just stop development
entirely. Daisuke's ODP work will be blocked and if Bob was able to
fix it he would have done so already. Which means Bob's ongoing work
is lost too.

I *vastly* prefer we root cause and fix it properly. Rxe was finally
starting to get a reasonable set of people interested in it, I do not
want to kill that off.

Again, I'm troubled that this doesn't seem to be reproducing for other
people.

> > I'm still puzzled why Bob can't reproduce the things Bart has seen.
> 
> Is this necessary?

It is always easier to debug something you can change than to try and
guess what an oops is trying to say..

> The KASAN complaint that I reported should be more than enough for
> someone who is familiar with the RXE driver to identify and fix the
> root cause. I can help with testing candidate fixes.

Bob?

Jason
Bart Van Assche Oct. 11, 2023, 8:14 p.m. UTC | #30
On 10/11/23 08:51, Jason Gunthorpe wrote:
> If we revert it then rxe will probably just stop development
> entirely. Daisuke's ODP work will be blocked and if Bob was able to
> fix it he would have done so already. Which means Bob's ongoing work
> is lost too.

If Daisuke's work depends on the RXE changes then Daisuke may decide
to help with the RXE changes.

Introducing regressions while refactoring code is not acceptable.

I don't have enough spare time to help with the RXE driver.

Thanks,

Bart.
Jason Gunthorpe Oct. 11, 2023, 11:12 p.m. UTC | #31
On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote:
> On 10/11/23 08:51, Jason Gunthorpe wrote:
> > If we revert it then rxe will probably just stop development
> > entirely. Daisuke's ODP work will be blocked and if Bob was able to
> > fix it he would have done so already. Which means Bob's ongoing work
> > is lost too.
> 
> If Daisuke's work depends on the RXE changes then Daisuke may decide
> to help with the RXE changes.
> 
> Introducing regressions while refactoring code is not acceptable.

Generally, but I don't view rxe as a production part of the kernel so
I prefer to give time to resolve it.

> I don't have enough spare time to help with the RXE driver.

Nor I

Jason
Zhu Yanjun Oct. 12, 2023, 11:49 a.m. UTC | #32
On 2023/10/12 7:12, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote:
>> On 10/11/23 08:51, Jason Gunthorpe wrote:
>>> If we revert it then rxe will probably just stop development
>>> entirely. Daisuke's ODP work will be blocked and if Bob was able to
>>> fix it he would have done so already. Which means Bob's ongoing work
>>> is lost too.
>>
>> If Daisuke's work depends on the RXE changes then Daisuke may decide
>> to help with the RXE changes.
>>
>> Introducing regressions while refactoring code is not acceptable.
> 
> Generally, but I don't view rxe as a production part of the kernel so
> I prefer to give time to resolve it.
> 
>> I don't have enough spare time to help with the RXE driver.

commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Tue Aug 22 09:57:07 2023 -0700

     Change the default RDMA driver from rdma_rxe to siw

     Since the siw driver is more stable than the rdma_rxe driver, change
     the default into siw. See e.g.
     https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@acm.org/.

     Signed-off-by: Bart Van Assche <bvanassche@acm.org>
     Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>


> 
> Nor I
> 
> Jason
Bob Pearson Oct. 12, 2023, 3:38 p.m. UTC | #33
On 10/12/23 06:49, Zhu Yanjun wrote:
> On 2023/10/12 7:12, Jason Gunthorpe wrote:
>> On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote:
>>> On 10/11/23 08:51, Jason Gunthorpe wrote:
>>>> If we revert it then rxe will probably just stop development
>>>> entirely. Daisuke's ODP work will be blocked and if Bob was able to
>>>> fix it he would have done so already. Which means Bob's ongoing work
>>>> is lost too.
>>>
>>> If Daisuke's work depends on the RXE changes then Daisuke may decide
>>> to help with the RXE changes.
>>>
>>> Introducing regressions while refactoring code is not acceptable.
>>
>> Generally, but I don't view rxe as a production part of the kernel so
>> I prefer to give time to resolve it.
>>
>>> I don't have enough spare time to help with the RXE driver.
> 
> commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292
> Author: Bart Van Assche <bvanassche@acm.org>
> Date:   Tue Aug 22 09:57:07 2023 -0700
> 
>     Change the default RDMA driver from rdma_rxe to siw
> 
>     Since the siw driver is more stable than the rdma_rxe driver, change the
>     default into siw. See e.g.
> 
> https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@acm.org/.
> 
>     Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>     Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> 
> 
>>
>> Nor I
>>
>> Jason
> 
All,

I have spent the past several weeks working on trying to resolve this issue. The one thing I can say
for sure is that the failures or their rates are very sensitive to small timing changes. I totally agree with
Jason that the bug has always been there and most of the suggested changes are just masking or unmasking
it. I have been running under all the kernel lock checking I can enable and have not seen any warnings,
so I doubt the error is a deadlock. My suspicion remains that the root cause of the hang is loss of
a completion, or a timeout before a late completion, leading to death of the transport state machine. There
are surely other bugs in the driver and they may show up in parallel with this hang. I see the hang
consistently from 1-2% to 30-40% of the time when running srp/002 depending on various changes I have
tried but I have not been able to reproduce the KASAN bug yet. Because the hang is easy to reproduce
I have focused on that.

Bob

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 54c723a6edda..7a7e713de52d 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -212,15 +212,9 @@  static int __init rxe_module_init(void)
 {
 	int err;
 
-	err = rxe_alloc_wq();
-	if (err)
-		return err;
-
 	err = rxe_net_init();
-	if (err) {
-		rxe_destroy_wq();
+	if (err)
 		return err;
-	}
 
 	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
@@ -232,7 +226,6 @@  static void __exit rxe_module_exit(void)
 	rdma_link_unregister(&rxe_link_ops);
 	ib_unregister_driver(RDMA_DRIVER_RXE);
 	rxe_net_exit();
-	rxe_destroy_wq();
 
 	pr_info("unloaded\n");
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 1501120d4f52..fb9a6bc8e620 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -6,24 +6,8 @@ 
 
 #include "rxe.h"
 
-static struct workqueue_struct *rxe_wq;
-
-int rxe_alloc_wq(void)
-{
-	rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);
-	if (!rxe_wq)
-		return -ENOMEM;
-
-	return 0;
-}
-
-void rxe_destroy_wq(void)
-{
-	destroy_workqueue(rxe_wq);
-}
-
 /* Check if task is idle i.e. not running, not scheduled in
- * work queue and not draining. If so move to busy to
+ * tasklet queue and not draining. If so move to busy to
  * reserve a slot in do_task() by setting to busy and taking
  * a qp reference to cover the gap from now until the task finishes.
  * state will move out of busy if task returns a non zero value
@@ -37,6 +21,9 @@  static bool __reserve_if_idle(struct rxe_task *task)
 {
 	WARN_ON(rxe_read(task->qp) <= 0);
 
+	if (task->tasklet.state & BIT(TASKLET_STATE_SCHED))
+		return false;
+
 	if (task->state == TASK_STATE_IDLE) {
 		rxe_get(task->qp);
 		task->state = TASK_STATE_BUSY;
@@ -51,7 +38,7 @@  static bool __reserve_if_idle(struct rxe_task *task)
 }
 
 /* check if task is idle or drained and not currently
- * scheduled in the work queue. This routine is
+ * scheduled in the tasklet queue. This routine is
  * called by rxe_cleanup_task or rxe_disable_task to
  * see if the queue is empty.
  * Context: caller should hold task->lock.
@@ -59,7 +46,7 @@  static bool __reserve_if_idle(struct rxe_task *task)
  */
 static bool __is_done(struct rxe_task *task)
 {
-	if (work_pending(&task->work))
+	if (task->tasklet.state & BIT(TASKLET_STATE_SCHED))
 		return false;
 
 	if (task->state == TASK_STATE_IDLE ||
@@ -90,23 +77,23 @@  static bool is_done(struct rxe_task *task)
  * schedules the task. They must call __reserve_if_idle to
  * move the task to busy before calling or scheduling.
  * The task can also be moved to drained or invalid
- * by calls to rxe_cleanup_task or rxe_disable_task.
+ * by calls to rxe-cleanup_task or rxe_disable_task.
  * In that case tasks which get here are not executed but
  * just flushed. The tasks are designed to look to see if
- * there is work to do and then do part of it before returning
+ * there is work to do and do part of it before returning
  * here with a return value of zero until all the work
- * has been consumed then it returns a non-zero value.
+ * has been consumed then it retuens a non-zero value.
  * The number of times the task can be run is limited by
  * max iterations so one task cannot hold the cpu forever.
- * If the limit is hit and work remains the task is rescheduled.
  */
-static void do_task(struct rxe_task *task)
+static void do_task(struct tasklet_struct *t)
 {
+	int cont;
+	int ret;
+	struct rxe_task *task = from_tasklet(task, t, tasklet);
 	unsigned int iterations;
 	unsigned long flags;
 	int resched = 0;
-	int cont;
-	int ret;
 
 	WARN_ON(rxe_read(task->qp) <= 0);
 
@@ -128,22 +115,25 @@  static void do_task(struct rxe_task *task)
 		} while (ret == 0 && iterations-- > 0);
 
 		spin_lock_irqsave(&task->lock, flags);
-		/* we're not done yet but we ran out of iterations.
-		 * yield the cpu and reschedule the task
-		 */
-		if (!ret) {
-			task->state = TASK_STATE_IDLE;
-			resched = 1;
-			goto exit;
-		}
-
 		switch (task->state) {
 		case TASK_STATE_BUSY:
-			task->state = TASK_STATE_IDLE;
+			if (ret) {
+				task->state = TASK_STATE_IDLE;
+			} else {
+				/* This can happen if the client
+				 * can add work faster than the
+				 * tasklet can finish it.
+				 * Reschedule the tasklet and exit
+				 * the loop to give up the cpu
+				 */
+				task->state = TASK_STATE_IDLE;
+				resched = 1;
+			}
 			break;
 
-		/* someone tried to schedule the task while we
-		 * were running, keep going
+		/* someone tried to run the task since the last time we called
+		 * func, so we will call one more time regardless of the
+		 * return value
 		 */
 		case TASK_STATE_ARMED:
 			task->state = TASK_STATE_BUSY;
@@ -151,24 +141,22 @@  static void do_task(struct rxe_task *task)
 			break;
 
 		case TASK_STATE_DRAINING:
-			task->state = TASK_STATE_DRAINED;
+			if (ret)
+				task->state = TASK_STATE_DRAINED;
+			else
+				cont = 1;
 			break;
 
 		default:
 			WARN_ON(1);
-			rxe_dbg_qp(task->qp, "unexpected task state = %d",
-				   task->state);
-			task->state = TASK_STATE_IDLE;
+			rxe_info_qp(task->qp, "unexpected task state = %d", task->state);
 		}
 
-exit:
 		if (!cont) {
 			task->num_done++;
 			if (WARN_ON(task->num_done != task->num_sched))
-				rxe_dbg_qp(
-					task->qp,
-					"%ld tasks scheduled, %ld tasks done",
-					task->num_sched, task->num_done);
+				rxe_err_qp(task->qp, "%ld tasks scheduled, %ld tasks done",
+					   task->num_sched, task->num_done);
 		}
 		spin_unlock_irqrestore(&task->lock, flags);
 	} while (cont);
@@ -181,12 +169,6 @@  static void do_task(struct rxe_task *task)
 	rxe_put(task->qp);
 }
 
-/* wrapper around do_task to fix argument for work queue */
-static void do_work(struct work_struct *work)
-{
-	do_task(container_of(work, struct rxe_task, work));
-}
-
 int rxe_init_task(struct rxe_task *task, struct rxe_qp *qp,
 		  int (*func)(struct rxe_qp *))
 {
@@ -194,9 +176,11 @@  int rxe_init_task(struct rxe_task *task, struct rxe_qp *qp,
 
 	task->qp = qp;
 	task->func = func;
+
+	tasklet_setup(&task->tasklet, do_task);
+
 	task->state = TASK_STATE_IDLE;
 	spin_lock_init(&task->lock);
-	INIT_WORK(&task->work, do_work);
 
 	return 0;
 }
@@ -229,6 +213,8 @@  void rxe_cleanup_task(struct rxe_task *task)
 	while (!is_done(task))
 		cond_resched();
 
+	tasklet_kill(&task->tasklet);
+
 	spin_lock_irqsave(&task->lock, flags);
 	task->state = TASK_STATE_INVALID;
 	spin_unlock_irqrestore(&task->lock, flags);
@@ -240,7 +226,7 @@  void rxe_cleanup_task(struct rxe_task *task)
 void rxe_run_task(struct rxe_task *task)
 {
 	unsigned long flags;
-	bool run;
+	int run;
 
 	WARN_ON(rxe_read(task->qp) <= 0);
 
@@ -249,11 +235,11 @@  void rxe_run_task(struct rxe_task *task)
 	spin_unlock_irqrestore(&task->lock, flags);
 
 	if (run)
-		do_task(task);
+		do_task(&task->tasklet);
 }
 
-/* schedule the task to run later as a work queue entry.
- * the queue_work call can be called holding
+/* schedule the task to run later as a tasklet.
+ * the tasklet)schedule call can be called holding
  * the lock.
  */
 void rxe_sched_task(struct rxe_task *task)
@@ -264,7 +250,7 @@  void rxe_sched_task(struct rxe_task *task)
 
 	spin_lock_irqsave(&task->lock, flags);
 	if (__reserve_if_idle(task))
-		queue_work(rxe_wq, &task->work);
+		tasklet_schedule(&task->tasklet);
 	spin_unlock_irqrestore(&task->lock, flags);
 }
 
@@ -291,9 +277,7 @@  void rxe_disable_task(struct rxe_task *task)
 	while (!is_done(task))
 		cond_resched();
 
-	spin_lock_irqsave(&task->lock, flags);
-	task->state = TASK_STATE_DRAINED;
-	spin_unlock_irqrestore(&task->lock, flags);
+	tasklet_disable(&task->tasklet);
 }
 
 void rxe_enable_task(struct rxe_task *task)
@@ -307,7 +291,7 @@  void rxe_enable_task(struct rxe_task *task)
 		spin_unlock_irqrestore(&task->lock, flags);
 		return;
 	}
-
 	task->state = TASK_STATE_IDLE;
+	tasklet_enable(&task->tasklet);
 	spin_unlock_irqrestore(&task->lock, flags);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index a63e258b3d66..facb7c8e3729 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -22,7 +22,7 @@  enum {
  * called again.
  */
 struct rxe_task {
-	struct work_struct	work;
+	struct tasklet_struct	tasklet;
 	int			state;
 	spinlock_t		lock;
 	struct rxe_qp		*qp;
@@ -32,10 +32,6 @@  struct rxe_task {
 	long			num_done;
 };
 
-int rxe_alloc_wq(void);
-
-void rxe_destroy_wq(void);
-
 /*
  * init rxe_task structure
  *	qp  => parameter to pass to func