Message ID | 20230922163231.2237811-1-yanjun.zhu@intel.com (mailing list archive)
State | Accepted
Series | [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks"
On 9/22/23 09:32, Zhu Yanjun wrote: > From: Zhu Yanjun <yanjun.zhu@linux.dev> > > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. > > This commit replaces tasklet with workqueue. But this results > in occasionally pocess hang at the blktests test case srp/002. > After the discussion in the link[1], this commit is reverted. > > Link: https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#t [1] > Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> > CC: rpearsonhpe@gmail.com > CC: matsuda-daisuke@fujitsu.com > CC: bvanassche@acm.org > CC: shinichiro.kawasaki@wdc.com Shouldn't the Cc-list occur before the Signed-off-by tag? Anyway: Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On Fri, Sep 22, 2023 at 09:42:08AM -0700, Bart Van Assche wrote: > On 9/22/23 09:32, Zhu Yanjun wrote: > > From: Zhu Yanjun <yanjun.zhu@linux.dev> > > > > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. > > > > This commit replaces tasklet with workqueue. But this results > > in occasionally pocess hang at the blktests test case srp/002. > > After the discussion in the link[1], this commit is reverted. > > > > Link: https://lore.kernel.org/linux-rdma/e8b76fae-780a-470e-8ec4-c6b650793d10@leemhuis.info/T/#t [1] > > Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> > > CC: rpearsonhpe@gmail.com > > CC: matsuda-daisuke@fujitsu.com > > CC: bvanassche@acm.org > > CC: shinichiro.kawasaki@wdc.com > > Shouldn't the Cc-list occur before the Signed-off-by tag? Anyway: checkpatch didn't complain, so I took it as is. > > Reviewed-by: Bart Van Assche <bvanassche@acm.org> Thanks
On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote: > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. > > This commit replaces tasklet with workqueue. But this results > in occasionally pocess hang at the blktests test case srp/002. > After the discussion in the link[1], this commit is reverted. > > > [...] Applied, thanks! [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks" https://git.kernel.org/rdma/rdma/c/e710c390a8f860 Best regards,
On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote: > > On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote: > > This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. > > > > This commit replaces tasklet with workqueue. But this results > > in occasionally pocess hang at the blktests test case srp/002. > > After the discussion in the link[1], this commit is reverted. > > > > > > [...] > > Applied, thanks! > > [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks" > https://git.kernel.org/rdma/rdma/c/e710c390a8f860 I applied this patch, but will delay it for some time with a hope that fix will arrive in the near future. Thanks > > Best regards, > -- > Leon Romanovsky <leon@kernel.org> >
On 9/26/23 07:06, Leon Romanovsky wrote: > On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote: >> >> On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote: >>> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. >>> >>> This commit replaces tasklet with workqueue. But this results >>> in occasionally pocess hang at the blktests test case srp/002. >>> After the discussion in the link[1], this commit is reverted. >>> >>> >>> [...] >> >> Applied, thanks! >> >> [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks" >> https://git.kernel.org/rdma/rdma/c/e710c390a8f860 > > I applied this patch, but will delay it for some time with a hope that > fix will arrive in the near future. Thank you Leon. With this revert applied on top of the rdma for-next branch, I don't see the KASAN complaint anymore that I reported yesterday. I think this is more evidence that the KASAN complaint was caused by the RXE driver. Bart.
On 9/26/23 12:05, Bart Van Assche wrote: > On 9/26/23 07:06, Leon Romanovsky wrote: >> On Tue, Sep 26, 2023 at 12:43:57PM +0300, Leon Romanovsky wrote: >>> >>> On Fri, 22 Sep 2023 12:32:31 -0400, Zhu Yanjun wrote: >>>> This reverts commit 9b4b7c1f9f54120940e243251e2b1407767b3381. >>>> >>>> This commit replaces tasklet with workqueue. But this results >>>> in occasionally pocess hang at the blktests test case srp/002. >>>> After the discussion in the link[1], this commit is reverted. >>>> >>>> >>>> [...] >>> >>> Applied, thanks! >>> >>> [1/1] Revert "RDMA/rxe: Add workqueue support for rxe tasks" >>> https://git.kernel.org/rdma/rdma/c/e710c390a8f860 >> >> I applied this patch, but will delay it for some time with a hope that >> fix will arrive in the near future. > > Thank you Leon. With this revert applied on top of the rdma for-next branch, I > don't see the KASAN complaint anymore that I reported yesterday. I think this > is more evidence that the KASAN complaint was caused by the RXE driver. > > Bart. Bart, I am working to try to reproduce the KASAN warning. Unfortunately, so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning. Thanks for the support, Bob
On 9/26/23 11:34, Bob Pearson wrote: > I am working to try to reproduce the KASAN warning. Unfortunately, > so far I am not able to see it in Ubuntu + Linus' kernel (as you > described) on metal. The config file is different but copies the > CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on > every iteration of srp/002 but without a KASAN warning. I am now > building an openSuSE VM for qemu and will see if that causes the > warning. Hi Bob, Did you try to understand the report that I shared? My conclusion from the report is that when using tasklets rxe_completer() only runs after rxe_requester() has finished and also that when using work queues that rxe_completer() may run concurrently with rxe_requester(). This patch seems to fix all issues that I ran into with the rdma_rxe workqueue patch (I have not tried to verify the performance implications of this patch): diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c index 1501120d4f52..6cd5d5a7a316 100644 --- a/drivers/infiniband/sw/rxe/rxe_task.c +++ b/drivers/infiniband/sw/rxe/rxe_task.c @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; int rxe_alloc_wq(void) { - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); if (!rxe_wq) return -ENOMEM; Thanks, Bart.
On Wed, Sep 27, 2023 at 4:37 AM Bart Van Assche <bvanassche@acm.org> wrote: > > On 9/26/23 11:34, Bob Pearson wrote: > > I am working to try to reproduce the KASAN warning. Unfortunately, > > so far I am not able to see it in Ubuntu + Linus' kernel (as you > > described) on metal. The config file is different but copies the > > CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on > > every iteration of srp/002 but without a KASAN warning. I am now > > building an openSuSE VM for qemu and will see if that causes the > > warning. > > Hi Bob, > > Did you try to understand the report that I shared? My conclusion from > the report is that when using tasklets rxe_completer() only runs after > rxe_requester() has finished and also that when using work queues that > rxe_completer() may run concurrently with rxe_requester(). This patch > seems to fix all issues that I ran into with the rdma_rxe workqueue > patch (I have not tried to verify the performance implications of this > patch): In the same test environment in the link https://lore.kernel.org/all/4e7aac82-f006-aaa7-6769-d1c9691a0cec@gmail.com/T/#m3294d00f5cf3247dfdb2ea3688b1467167f72704, RXE with workqueue has worse performance than RXE with tasklet. Sometimes RXE with workqueue can not work well. Need this commit in RXE. > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c > b/drivers/infiniband/sw/rxe/rxe_task.c > index 1501120d4f52..6cd5d5a7a316 100644 > --- a/drivers/infiniband/sw/rxe/rxe_task.c > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > int rxe_alloc_wq(void) > { > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); > if (!rxe_wq) > return -ENOMEM; > > Thanks, > > Bart.
On 9/26/23 15:24, Bart Van Assche wrote: > On 9/26/23 11:34, Bob Pearson wrote: >> I am working to try to reproduce the KASAN warning. Unfortunately, >> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning. > > Hi Bob, > > Did you try to understand the report that I shared? Looking at the three stack traces from KASAN (alloc, free, and use after free) it appears that there was an ack packet (skb) created in rxe_responder (normal) and then passed to rxe_completer which apparently successfully processed it and then freed the skb (also normal). Then the same skb is enqueued on the response queue in rxe_comp_queue_pkt(). This is very strange and hard to understand. The only way the original packet could have been 'completed' would be for it to have been first enqueued on qp->resp_pkts by skb_queue_tail() and then dequeued after the completer task runs by skb_dequeue(). The skb queue routines are protected by an irqsave spinlock so they should operate atomically. In other words the completer can't get the skb until skb_queue_tail() is finished touching the skb. So it looks like the first pass through rxe_comp_queue_pkt() shouldn't be to blame. There is no way I can imagine that the packet could be queued twice on the local loopback path. One strange thing in the trace is a "? rxe_recv_mcast_pkt" which seems unlikely to be true as all the packets are rc and hence not mcast. Not sure how to interpret this. Perhaps the stack is corrupted from scribbles which might cause the above impossibility. > My conclusion from > the report is that when using tasklets rxe_completer() only runs after > rxe_requester() has finished and also that when using work queues that > rxe_completer() may run concurrently with rxe_requester(). The completer task was always intended to run in parallel with the requester and responder tasks whether they are tasklets or workqueue items. Tasklets tend to run sequentially but there is no reason why they can't run in parallel. The completer task is triggered by response packets from another process's queue pair which is asynchronous from the requester task which generated the request packets. For unrelated reasons I am planning to merge the requester task and completer task into a single task because in high scale situations with lots of qps it performs better and allows removing some of the locking between them. > This patch > seems to fix all issues that I ran into with the rdma_rxe workqueue > patch (I have not tried to verify the performance implications of this > patch): > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c > index 1501120d4f52..6cd5d5a7a316 100644 > --- a/drivers/infiniband/sw/rxe/rxe_task.c > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > int rxe_alloc_wq(void) > { > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); > if (!rxe_wq) > return -ENOMEM; > > Thanks, > > Bart.
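For readers following Bob's analysis, the queueing pattern he describes boils down to the generic skb queue helpers shown below. This is only a simplified sketch of the idea, not the actual rxe code; the function and parameter names are illustrative:

```c
#include <linux/skbuff.h>

/* Producer side (the responder / loopback path): hand an ack skb to
 * the completer by appending it to a per-QP response-packet queue.
 * skb_queue_tail() takes the queue's irqsave spinlock internally, so
 * the skb is fully linked before any consumer can dequeue it.
 */
static void example_queue_ack(struct sk_buff_head *resp_pkts,
			      struct sk_buff *skb)
{
	skb_queue_tail(resp_pkts, skb);
	/* ...then schedule the completer task... */
}

/* Consumer side (the completer task): detach one skb under the same
 * lock, process the completion, then free it.  A use-after-free of
 * the kind KASAN reported would require the same skb to be queued a
 * second time after this kfree_skb().
 */
static void example_drain_one_ack(struct sk_buff_head *resp_pkts)
{
	struct sk_buff *skb = skb_dequeue(resp_pkts);

	if (!skb)
		return;
	/* ...process the ack / generate the completion here... */
	kfree_skb(skb);
}
```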
On 9/26/23 15:24, Bart Van Assche wrote: > On 9/26/23 11:34, Bob Pearson wrote: >> I am working to try to reproduce the KASAN warning. Unfortunately, >> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning. > > Hi Bob, > > Did you try to understand the report that I shared? My conclusion from > the report is that when using tasklets rxe_completer() only runs after > rxe_requester() has finished and also that when using work queues that > rxe_completer() may run concurrently with rxe_requester(). This patch > seems to fix all issues that I ran into with the rdma_rxe workqueue > patch (I have not tried to verify the performance implications of this > patch): > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c > index 1501120d4f52..6cd5d5a7a316 100644 > --- a/drivers/infiniband/sw/rxe/rxe_task.c > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > int rxe_alloc_wq(void) > { > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); > if (!rxe_wq) > return -ENOMEM; > > Thanks, > > Bart. The workqueue doc says Some users depend on the strict execution ordering of ST wq. The combination of @max_active of 1 and WQ_UNBOUND is used to achieve this behavior. Work items on such wq are always queued to the unbound worker-pools and only one work item can be active at any given time thus achieving the same ordering property as ST wq. When I have tried this setting I see very low performance compared to 512. It seems that only one item at a time can run on all the CPUs even though it also says that max_active is the number of threads per cpu. Nevertheless this is a good hint since it seems to imply that there is a race between the requester and completer which is certainly possible. Bob
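For reference, the three alloc_workqueue() variants discussed in this thread differ only in how many work items from the queue may run at once. The sketch below is illustrative only (the function name is made up); alloc_ordered_workqueue() is the documented way to request the strict single-item ordering Bob quotes:

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

static int example_compare_wq_flavours(void)
{
	struct workqueue_struct *wq_parallel, *wq_serial, *wq_ordered;
	int err = 0;

	/* Original conversion: up to WQ_MAX_ACTIVE work items from this
	 * queue may run concurrently on the unbound worker pools.
	 */
	wq_parallel = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE);

	/* Bart's experiment: WQ_UNBOUND with max_active == 1, i.e. at
	 * most one work item from this queue runs at any given time,
	 * in queueing order.
	 */
	wq_serial = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);

	/* The documented way to ask for the same one-at-a-time,
	 * in-order behavior.
	 */
	wq_ordered = alloc_ordered_workqueue("rxe_wq", 0);

	if (!wq_parallel || !wq_serial || !wq_ordered)
		err = -ENOMEM;

	if (wq_ordered)
		destroy_workqueue(wq_ordered);
	if (wq_serial)
		destroy_workqueue(wq_serial);
	if (wq_parallel)
		destroy_workqueue(wq_parallel);
	return err;
}
```

As the alloc_workqueue() source quoted later in this thread notes, an unbound workqueue with max_active == 1 is still treated as ordered for backward compatibility, which is why the second and third variants behave the same here.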
On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote: > On 9/26/23 15:24, Bart Van Assche wrote: > > On 9/26/23 11:34, Bob Pearson wrote: > >> I am working to try to reproduce the KASAN warning. Unfortunately, > >> so far I am not able to see it in Ubuntu + Linus' kernel (as you described) on metal. The config file is different but copies the CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on every iteration of srp/002 but without a KASAN warning. I am now building an openSuSE VM for qemu and will see if that causes the warning. > > > > Hi Bob, > > > > Did you try to understand the report that I shared? My conclusion from > > the report is that when using tasklets rxe_completer() only runs after > > rxe_requester() has finished and also that when using work queues that > > rxe_completer() may run concurrently with rxe_requester(). This patch > > seems to fix all issues that I ran into with the rdma_rxe workqueue > > patch (I have not tried to verify the performance implications of this > > patch): > > > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c > > index 1501120d4f52..6cd5d5a7a316 100644 > > --- a/drivers/infiniband/sw/rxe/rxe_task.c > > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > > > int rxe_alloc_wq(void) > > { > > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > > + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); > > if (!rxe_wq) > > return -ENOMEM; > > > > Thanks, > > > > Bart. <...> > Nevertheless this is a good hint since it seems to imply that there is a race between the requester and > completer which is certainly possible. Bob, Bart Can you please send this change as a formal patch? As we prefer workqueue with bad performance implementation over tasklets. Thanks > > Bob > >
在 2023/10/4 8:46, Zhu Yanjun 写道: > > 在 2023/10/4 2:11, Leon Romanovsky 写道: >> On Tue, Oct 03, 2023 at 11:29:42PM +0800, Zhu Yanjun wrote: >>> 在 2023/10/3 17:59, Leon Romanovsky 写道: >>>> On Tue, Oct 03, 2023 at 04:55:40PM +0800, Zhu Yanjun wrote: >>>>> 在 2023/10/1 14:50, Leon Romanovsky 写道: >>>>>> On Sun, Oct 1, 2023, at 09:47, Zhu Yanjun wrote: >>>>>>> 在 2023/10/1 14:39, Leon Romanovsky 写道: >>>>>>>> On Sun, Oct 1, 2023, at 09:34, Zhu Yanjun wrote: >>>>>>>>> 在 2023/10/1 14:30, Leon Romanovsky 写道: >>>>>>>>>> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote: >>>>>>>>>>> On 9/26/23 15:24, Bart Van Assche wrote: >>>>>>>>>>>> On 9/26/23 11:34, Bob Pearson wrote: >>>>>>>>>>>>> I am working to try to reproduce the KASAN warning. >>>>>>>>>>>>> Unfortunately, >>>>>>>>>>>>> so far I am not able to see it in Ubuntu + Linus' kernel >>>>>>>>>>>>> (as you described) on metal. The config file is different >>>>>>>>>>>>> but copies the CONFIG_KASAN_xxx exactly as yours. With >>>>>>>>>>>>> KASAN enabled it hangs on every iteration of srp/002 but >>>>>>>>>>>>> without a KASAN warning. I am now building an openSuSE VM >>>>>>>>>>>>> for qemu and will see if that causes the warning. >>>>>>>>>>>> Hi Bob, >>>>>>>>>>>> >>>>>>>>>>>> Did you try to understand the report that I shared? My >>>>>>>>>>>> conclusion from >>>>>>>>>>>> the report is that when using tasklets rxe_completer() only >>>>>>>>>>>> runs after >>>>>>>>>>>> rxe_requester() has finished and also that when using work >>>>>>>>>>>> queues that >>>>>>>>>>>> rxe_completer() may run concurrently with rxe_requester(). >>>>>>>>>>>> This patch >>>>>>>>>>>> seems to fix all issues that I ran into with the rdma_rxe >>>>>>>>>>>> workqueue >>>>>>>>>>>> patch (I have not tried to verify the performance >>>>>>>>>>>> implications of this >>>>>>>>>>>> patch): >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c >>>>>>>>>>>> b/drivers/infiniband/sw/rxe/rxe_task.c >>>>>>>>>>>> index 1501120d4f52..6cd5d5a7a316 100644 >>>>>>>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c >>>>>>>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c >>>>>>>>>>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; >>>>>>>>>>>> >>>>>>>>>>>> int rxe_alloc_wq(void) >>>>>>>>>>>> { >>>>>>>>>>>> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, >>>>>>>>>>>> WQ_MAX_ACTIVE); >>>>>>>>>>>> + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); >>>>>>>>>>>> if (!rxe_wq) >>>>>>>>>>>> return -ENOMEM; >>>>> With this commit, a test run for several days. The similar problem >>>>> still >>>>> occurred. >>>>> >>>>> The problem is very similar with the one that Bart mentioned. >>>>> >>>>> It is very possible that WQ_MAX_ACTIVE is changed to 1, then this >>>>> problem is >>>>> alleviated. >>>>> >>>>> In the following >>>>> >>>>> 4661 __printf(1, 4) >>>>> 4662 struct workqueue_struct *alloc_workqueue(const char *fmt, >>>>> 4663 unsigned int flags, >>>>> 4664 int max_active, ...) >>>>> 4665 { >>>>> 4666 va_list args; >>>>> 4667 struct workqueue_struct *wq; >>>>> 4668 struct pool_workqueue *pwq; >>>>> 4669 >>>>> 4670 /* >>>>> 4671 * Unbound && max_active == 1 used to imply ordered, >>>>> which is >>>>> no longer >>>>> 4672 * the case on many machines due to per-pod pools. While >>>>> 4673 * alloc_ordered_workqueue() is the right way to >>>>> create an >>>>> ordered >>>>> 4674 * workqueue, keep the previous behavior to avoid subtle >>>>> breakages. >>>>> 4675 */ >>>>> 4676 if ((flags & WQ_UNBOUND) && max_active == 1) >>>>> <---This means that workqueue is ordered. 
>>>>> 4677 flags |= __WQ_ORDERED; >>>>> ... >>>>> >>>>> Do this mean that the ordered workqueue covers the root cause? When >>>>> workqueue is changed to ordered, it is difficult to reproduce this >>>>> problem. >>>>> Got it. > > Is there any way to ensure the following? > > if a mail does not appear in the rdma maillist, this mail will not be > reviewed? Sorry. My bad. I used the wrong rdma maillist. > >> >>> The analysis is as below: >>> >>> Because workqueue will sleep when it is preempted, sometimes the >>> sleep time >>> will exceed the timeout >>> >>> of rdma packets. As such, rdma stack or ULP will oom or hang. This >>> is why >>> workqueue will cause ULP hang. >>> >>> But tasklet will not sleep. So this kind of problem will not occur with >>> tasklet. >>> >>> About the performance, currently ordered workqueue can only execute >>> at most >>> one work item at any given >>> >>> time in the queued order. So in RXE, workqueue will not execute more >>> jobs >>> than tasklet. >> It is because of changing max_active to be 1. Once that bug will be >> fixed, RXE will be able to spread traffic on all CPUs. > Sure. I agree with you. After max_active is changed to 1, the workqueue is an ordered workqueue. The ordered workqueue will execute the work items one by one on different CPUs, that is, after one work item is complete, the ordered workqueue will execute another one in the queued order on different CPUs. A tasklet will execute the jobs on the same CPU one by one. So if the total job number is the same, the ordered workqueue will have the same execution time as the tasklet. But the ordered workqueue has more overhead in scheduling than the tasklet. In total, the performance of the ordered workqueue is not as good as the tasklet's. Zhu Yanjun
在 2023/9/27 4:24, Bart Van Assche 写道: > On 9/26/23 11:34, Bob Pearson wrote: >> I am working to try to reproduce the KASAN warning. Unfortunately, >> so far I am not able to see it in Ubuntu + Linus' kernel (as you >> described) on metal. The config file is different but copies the >> CONFIG_KASAN_xxx exactly as yours. With KASAN enabled it hangs on >> every iteration of srp/002 but without a KASAN warning. I am now >> building an openSuSE VM for qemu and will see if that causes the warning. > > Hi Bob, > > Did you try to understand the report that I shared? My conclusion from > the report is that when using tasklets rxe_completer() only runs after > rxe_requester() has finished and also that when using work queues that > rxe_completer() may run concurrently with rxe_requester(). This patch > seems to fix all issues that I ran into with the rdma_rxe workqueue > patch (I have not tried to verify the performance implications of this > patch): > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c > b/drivers/infiniband/sw/rxe/rxe_task.c > index 1501120d4f52..6cd5d5a7a316 100644 > --- a/drivers/infiniband/sw/rxe/rxe_task.c > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > int rxe_alloc_wq(void) > { > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); > if (!rxe_wq) > return -ENOMEM; Hi, Bart With the above commit, I still found a similar problem. But the problem occurs very rarely. With the following, to now, the problem does not occur. diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c index 1501120d4f52..3189c3705295 100644 --- a/drivers/infiniband/sw/rxe/rxe_task.c +++ b/drivers/infiniband/sw/rxe/rxe_task.c @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; int rxe_alloc_wq(void) { - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); if (!rxe_wq) return -ENOMEM; And with the tasklet, this problem also does not occur. With "alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);", an ordered workqueue with high priority is allocated. To the same number of work item, the ordered workqueue has the same runing time with the tasklet. But the tasklet is based on softirq. Its overhead on scheduling is less than workqueue. So in theory, tasklet's performance should be better than the ordered workqueue. Best Regards, Zhu Yanjun > > Thanks, > > Bart.
On 10/3/23 20:41, Zhu Yanjun wrote: > 在 2023/9/27 4:24, Bart Van Assche 写道: >> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c >> b/drivers/infiniband/sw/rxe/rxe_task.c >> index 1501120d4f52..6cd5d5a7a316 100644 >> --- a/drivers/infiniband/sw/rxe/rxe_task.c >> +++ b/drivers/infiniband/sw/rxe/rxe_task.c >> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; >> >> int rxe_alloc_wq(void) >> { >> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); >> + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); >> if (!rxe_wq) >> return -ENOMEM; > > Hi, Bart > > With the above commit, I still found a similar problem. But the problem > occurs very rarely. With the following, to now, the problem does not occur. > > diff --git a/drivers/infiniband/sw/rxe/rxe_task.c > b/drivers/infiniband/sw/rxe/rxe_task.c > index 1501120d4f52..3189c3705295 100644 > --- a/drivers/infiniband/sw/rxe/rxe_task.c > +++ b/drivers/infiniband/sw/rxe/rxe_task.c > @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; > > int rxe_alloc_wq(void) > { > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); > if (!rxe_wq) > return -ENOMEM; > > > And with the tasklet, this problem also does not occur. > > With "alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1);", an > ordered workqueue with high priority is allocated. > > To the same number of work item, the ordered workqueue has the same > runing time with the tasklet. But the tasklet is based on softirq. Its > overhead on scheduling is less than workqueue. So in theory, tasklet's > performance should be better than the ordered workqueue. Hi Zhu, Thank you for having reported this. I'm OK with integrating the above change in my patch. However, code changes must be motivated. Do you perhaps have an explanation of why WQ_HIGHPRI makes the issue disappear that you observed? Regarding tasklets: tasklets are a nightmare from the point of view of the Linux kernel scheduler. As an example, these may make it impossible for the scheduler to schedule real-time threads in time. Thanks, Bart.
On 9/30/23 23:30, Leon Romanovsky wrote: > On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote: >> On 9/26/23 15:24, Bart Van Assche wrote: >>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c >>> index 1501120d4f52..6cd5d5a7a316 100644 >>> --- a/drivers/infiniband/sw/rxe/rxe_task.c >>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c >>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; >>> >>> int rxe_alloc_wq(void) >>> { >>> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); >>> + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); >>> if (!rxe_wq) >>> return -ENOMEM; >>> >>> Thanks, >>> >>> Bart. > > <...> > >> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and >> completer which is certainly possible. > > Bob, Bart > > Can you please send this change as a formal patch? > As we prefer workqueue with bad performance implementation over tasklets. Hi Bob, Do you perhaps have a preference for who posts the formal patch? Thanks, Bart.
On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote: > Thank you for having reported this. I'm OK with integrating the > above change in my patch. However, code changes must be > motivated. Do you perhaps have an explanation of why WQ_HIGHPRI > makes the issue disappear that you observed? I think it is clear there are locking bugs in all this, so it is not surprising that changing the scheduling behavior can make locking bugs hide Jason
On 10/4/23 12:44, Bart Van Assche wrote: > On 9/30/23 23:30, Leon Romanovsky wrote: >> On Wed, Sep 27, 2023 at 11:51:12AM -0500, Bob Pearson wrote: >>> On 9/26/23 15:24, Bart Van Assche wrote: >>>> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c >>>> index 1501120d4f52..6cd5d5a7a316 100644 >>>> --- a/drivers/infiniband/sw/rxe/rxe_task.c >>>> +++ b/drivers/infiniband/sw/rxe/rxe_task.c >>>> @@ -10,7 +10,7 @@ static struct workqueue_struct *rxe_wq; >>>> >>>> int rxe_alloc_wq(void) >>>> { >>>> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); >>>> + rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1); >>>> if (!rxe_wq) >>>> return -ENOMEM; >>>> >>>> Thanks, >>>> >>>> Bart. >> >> <...> >> >>> Nevertheless this is a good hint since it seems to imply that there is a race between the requester and >>> completer which is certainly possible. >> >> Bob, Bart >> >> Can you please send this change as a formal patch? >> As we prefer workqueue with bad performance implementation over tasklets. > > Hi Bob, > > Do you perhaps have a preference for who posts the formal patch? > > Thanks, > > Bart. > Bart, Not really. I have spent the past two weeks chasing this bug and don't have much to report. I have never been able to reproduce your kasan bug. I have found like Zhu that the hang is always there but the frequency changes a lot depending on changes. For example various printk's can increase or decrease the frequency. I spent this morning looking at flame graphs captured during the hang which lasts about 60 seconds before it times out and check tears down the test. It is attached to this note. There seems to be a lot of recursion in what I assume is some attempt at error recovery. The recursion is probably in user space because the symbols are not available to perf. I would be worried that there may be stack overflow which could cause bad behavior. Bob
On 2023/10/5 2:38, Jason Gunthorpe wrote: > On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote: > >> Thank you for having reported this. I'm OK with integrating the >> above change in my patch. However, code changes must be >> motivated. Do you perhaps have an explanation of why WQ_HIGHPRI >> makes the issue disappear that you observed? > > I think it is clear there are locking bugs in all this, so it is not > surprising that changing the scheduling behavior can make locking bugs > hide > > Jason With the flag WQ_HIGHPRI, an ordered workqueue with high priority is allocated. With this workqueue, the test has now run for several days and the problem did not appear. So in my test environment, this problem seems fixed with the above commit. I suspect that without the flag WQ_HIGHPRI the workqueue is allocated with normal priority, and a work item in this workqueue is more likely to be preempted than in a high priority work queue. Best Regards, Zhu Yanjun
On Thu, Oct 05, 2023 at 05:25:29PM +0800, Zhu Yanjun wrote: > 在 2023/10/5 2:38, Jason Gunthorpe 写道: > > On Wed, Oct 04, 2023 at 10:43:20AM -0700, Bart Van Assche wrote: > > > > > Thank you for having reported this. I'm OK with integrating the > > > above change in my patch. However, code changes must be > > > motivated. Do you perhaps have an explanation of why WQ_HIGHPRI > > > makes the issue disappear that you observed? > > > > I think it is clear there are locking bugs in all this, so it is not > > surprising that changing the scheduling behavior can make locking bugs > > hide > > > > Jason > > With the flag WQ_HIGHPRI, an ordered workqueue with high priority is > allocated. With this workqueue, to now, the test has run for several days. > And the problem did not appear. So to my test environment, this problem > seems fixed with the above commit. > > I doubt, without the flag WQ_HIGHPRI, the workqueue is allocated with normal > priority. The work item in this workqueue is more likely preempted than high > priority work queue. Which is why it shows there are locking problems in this code. Jason
On 10/5/23 07:21, Jason Gunthorpe wrote:
> Which is why it shows there are locking problems in this code.
Hi Jason,
Since the locking problems have not yet been root-caused, do you
agree that it is safer to revert patch "RDMA/rxe: Add workqueue
support for rxe tasks" rather than trying to fix it?
Thanks,
Bart.
On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote: > On 10/5/23 07:21, Jason Gunthorpe wrote: > > Which is why it shows there are locking problems in this code. > > Hi Jason, > > Since the locking problems have not yet been root-caused, do you > agree that it is safer to revert patch "RDMA/rxe: Add workqueue > support for rxe tasks" rather than trying to fix it? I don't think that makes the locking problems go away any more than using a single threaded work queue? Jason
On 10/5/23 10:56, Jason Gunthorpe wrote: > On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote: >> On 10/5/23 07:21, Jason Gunthorpe wrote: >>> Which is why it shows there are locking problems in this code. >> >> Hi Jason, >> >> Since the locking problems have not yet been root-caused, do you >> agree that it is safer to revert patch "RDMA/rxe: Add workqueue >> support for rxe tasks" rather than trying to fix it? > > I don't think that makes the locking problems go away any more that > using a single threaded work queue? > > Jason This is slightly off topic but may still be relevant. If there are locking bugs they are between the two send side tasks rxe_completer and rxe_requester which share the send queue and other state. Bart attempts to fix this by setting max_active to 1 which limits the ability of these two work queue tasks to interfere. For completely different reasons we have looked at merging these two tasks into a single task which it turns out improves performance, especially in high scale situations where it reduces the number of cpu cores needed to complete work. But even at low scale (1-2 QPs) it helps because of improved caching. It turns out that if the work is mostly sends and writes there isn't much for the completer to do, while if it is mostly reads there isn't much for the requester to do. So combining them doesn't hurt performance by having fewer cores to do the work. But this also prevents the two tasks for a given QP from running at the same time, which should eliminate locking issues. If no one hates the idea I can send in our patch that does this. Bob
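To make the merging idea concrete, a combined send-side handler could look roughly like the sketch below. This is purely illustrative and is not Bob's actual patch; it assumes the driver's existing per-QP rxe_requester() and rxe_completer() entry points and the return convention visible in do_task() further down this page (0 means "more work pending", non-zero means "idle for now"):

```c
/* Hypothetical combined send-side task for drivers/infiniband/sw/rxe.
 * Running both halves from a single task means they can never execute
 * concurrently for the same QP, removing the need for locking between
 * the requester and the completer.
 */
#include "rxe.h"
#include "rxe_loc.h"

static int example_rxe_sender(struct rxe_qp *qp)
{
	int comp_ret, req_ret;

	/* Retire acked WQEs first so the send queue has room... */
	comp_ret = rxe_completer(qp);
	/* ...then issue new request packets. */
	req_ret = rxe_requester(qp);

	/* Ask to be run again if either half still has work queued. */
	if (comp_ret == 0 || req_ret == 0)
		return 0;

	return -EAGAIN;	/* any non-zero value means "no more work for now" */
}
```

Whether the completer or the requester half should run first, and how such a merged task interacts with the responder task, would of course be part of reviewing an actual patch.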
在 2023/10/6 23:58, Bob Pearson 写道: > On 10/5/23 10:56, Jason Gunthorpe wrote: >> On Thu, Oct 05, 2023 at 07:50:28AM -0700, Bart Van Assche wrote: >>> On 10/5/23 07:21, Jason Gunthorpe wrote: >>>> Which is why it shows there are locking problems in this code. >>> Hi Jason, >>> >>> Since the locking problems have not yet been root-caused, do you >>> agree that it is safer to revert patch "RDMA/rxe: Add workqueue >>> support for rxe tasks" rather than trying to fix it? >> I don't think that makes the locking problems go away any more that >> using a single threaded work queue? >> >> Jason > This is slightly off topic but may still be relevant. > If there are locking bugs they are between the two send side tasks > rxe_completer and rxe_requester which share the send queue and other > state. Bart attempts to fix this by setting max_active to 1 which > limits the ability of these two work queue tasks from interfering. > For completely different reasons we have looked at merging these > two tasks into a single task which it turns out improves performance, > especially in high scale situations where it reduces the number of > cpu cores needed to complete work. But even at low scale (1-2 QPs) > it helps because of improved caching. It turns out that if the work > is mostly sends and writes that there isn't much for the completer > to do while if it is mostly reads there isn't much for the requester > to do. So combining them doesn't hurt performance by having fewer cores > to do the work. But this also prevents the two tasks for a given QP > to run at the same time which should eliminate locking issues. > If no one hates the idea I can send in our patch that does this. Hi, Bob The blktest hang problem is not fixed now. I do not think that it is a good time to introduce this new feature. IMHO, we should fix this blktest hang problem. And we also need some time to wait for the test results from the users. If this problem proves fixed. Then we have a stable RXE version for the new feature from you. In short, we need a stable RXE version for use. Then based on the stable RXE version, new features are welcomed. Please let me know your advice. Please Bart comment on this. Please Jason && Leon comment on this. Warm Regards, Zhu Yanjun > > Bob
在 2023/10/5 22:50, Bart Van Assche 写道: > On 10/5/23 07:21, Jason Gunthorpe wrote: >> Which is why it shows there are locking problems in this code. > > Hi Jason, > > Since the locking problems have not yet been root-caused, do you > agree that it is safer to revert patch "RDMA/rxe: Add workqueue > support for rxe tasks" rather than trying to fix it? Hi, Jason && Leon I spent a lot of time on this problem. It seems that it is a very difficult problem. So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue support for rxe tasks" rather than trying to fix it? Then Bob can apply his new patch to a stable RXE? Any reply is appreciated. Warm Regards, Zhu Yanjun > > Thanks, > > Bart.
On Mon, Oct 09, 2023 at 12:01:37AM +0800, Zhu Yanjun wrote: > 在 2023/10/5 22:50, Bart Van Assche 写道: > > On 10/5/23 07:21, Jason Gunthorpe wrote: > > > Which is why it shows there are locking problems in this code. > > > > Hi Jason, > > > > Since the locking problems have not yet been root-caused, do you > > agree that it is safer to revert patch "RDMA/rxe: Add workqueue > > support for rxe tasks" rather than trying to fix it? > > Hi, Jason && Leon > > I spent a lot of time on this problem. It seems that it is a very difficult > problem. > > So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue > support for rxe tasks" rather than trying to fix it? Then Bob can apply his > new patch to a stable RXE? Jason, I'm fine with reverting patch. Thanks > > Any reply is appreciated. > Warm Regards, > > Zhu Yanjun > > > > > Thanks, > > > > Bart. >
On Mon, Oct 9, 2023 1:02 AM Zhu Yanjun wrote: > 在 2023/10/5 22:50, Bart Van Assche 写道: > > On 10/5/23 07:21, Jason Gunthorpe wrote: > >> Which is why it shows there are locking problems in this code. > > > > Hi Jason, > > > > Since the locking problems have not yet been root-caused, do you > > agree that it is safer to revert patch "RDMA/rxe: Add workqueue > > support for rxe tasks" rather than trying to fix it? > > Hi, Jason && Leon > > I spent a lot of time on this problem. It seems that it is a very > difficult problem. > > So I agree with Bart. Can we revert patch "RDMA/rxe: Add workqueue > support for rxe tasks" rather than trying to fix it? Then Bob can apply > his new patch to a stable RXE? Cf. https://lore.kernel.org/lkml/f15b06b934aa0ace8b28dc046022e5507458eb99.1694153251.git.matsuda-daisuke@fujitsu.com/ I have ODP patches that is fully dependent on "RDMA/rxe: Add workqueue support for rxe tasks". So I personally prefer preserving workqueue to reverting the workqueue patch. Each developer here has different motive and interest. I think the rxe driver should take in new specs and new features actively so that it can be used by developers without access to HCAs. I believe workqueue is better suited for this purpose. Additionally, the disadvantages of tasklet are documented as follows: https://lwn.net/Articles/830964/ However, stability is very important, so I will not insist on my opinion. I agree it is very difficult to find the root cause of the locking problem. It cannot be helped that we will somehow hide the issue for now so that it will not bother actual users of the driver. Perhaps, there are three choices to do this. Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks" I see this is supported by Zhu, Bart and approved by Leon. Solution 2: Serializing execution of work items > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); Solution 3: Merging requester and completer (not yet submitted/tested) https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/ Not clear to me if we should call this a new feature or a fix. If it can eliminate the hang issue, it could be an ultimate solution. It is understandable some people do not want to wait for solution 3 to be submitted and verified. Is there any problem if we adopt solution 2? If so, then I agree to going with solution 1. If not, solution 2 is better to me. Thanks, Daisuke Matsuda
On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote: > Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks" > I see this is supported by Zhu, Bart and approved by Leon. > > Solution 2: Serializing execution of work items > > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > > + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); > > Solution 3: Merging requester and completer (not yet submitted/tested) > https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/ > Not clear to me if we should call this a new feature or a fix. > If it can eliminate the hang issue, it could be an ultimate solution. > > It is understandable some people do not want to wait for solution 3 to be submitted and verified. > Is there any problem if we adopt solution 2? > If so, then I agree to going with solution 1. > If not, solution 2 is better to me. I also do not want to go backwards, I don't believe the locking is magically correct under tasklets. 2 is painful enough to continue to motivate people to fix this while unbreaking block tests. I'm still puzzled why Bob can't reproduce the things Bart has seen. Jason
On 10/10/23 09:09, Jason Gunthorpe wrote: > On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote: > >> Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks" >> I see this is supported by Zhu, Bart and approved by Leon. >> >> Solution 2: Serializing execution of work items >>> - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); >>> + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); >> >> Solution 3: Merging requester and completer (not yet submitted/tested) >> https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/ >> Not clear to me if we should call this a new feature or a fix. >> If it can eliminate the hang issue, it could be an ultimate solution. >> >> It is understandable some people do not want to wait for solution 3 to be submitted and verified. >> Is there any problem if we adopt solution 2? >> If so, then I agree to going with solution 1. >> If not, solution 2 is better to me. > > I also do not want to go backwards, I don't believe the locking is > magically correct under tasklets. 2 is painful enough to continue to > motivate people to fix this while unbreaking block tests. In my opinion (2) is not a solution. Zhu Yanjun reported test failures with rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably made it less likely to trigger any race conditions but I don't believe that this is sufficient as a solution. > I'm still puzzled why Bob can't reproduce the things Bart has seen. Is this necessary? The KASAN complaint that I reported should be more than enough for someone who is familiar with the RXE driver to identify and fix the root cause. I can help with testing candidate fixes. Thanks, Bart.
On Tue, Oct 10, 2023 at 02:29:19PM -0700, Bart Van Assche wrote: > On 10/10/23 09:09, Jason Gunthorpe wrote: > > On Tue, Oct 10, 2023 at 04:53:55AM +0000, Daisuke Matsuda (Fujitsu) wrote: > > > > > Solution 1: Reverting "RDMA/rxe: Add workqueue support for rxe tasks" > > > I see this is supported by Zhu, Bart and approved by Leon. > > > > > > Solution 2: Serializing execution of work items > > > > - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); > > > > + rxe_wq = alloc_workqueue("rxe_wq", WQ_HIGHPRI | WQ_UNBOUND, 1); > > > > > > Solution 3: Merging requester and completer (not yet submitted/tested) > > > https://lore.kernel.org/all/93c8ad67-f008-4352-8887-099723c2f4ec@gmail.com/ > > > Not clear to me if we should call this a new feature or a fix. > > > If it can eliminate the hang issue, it could be an ultimate solution. > > > > > > It is understandable some people do not want to wait for solution 3 to be submitted and verified. > > > Is there any problem if we adopt solution 2? > > > If so, then I agree to going with solution 1. > > > If not, solution 2 is better to me. > > > > I also do not want to go backwards, I don't believe the locking is > > magically correct under tasklets. 2 is painful enough to continue to > > motivate people to fix this while unbreaking block tests. > > In my opinion (2) is not a solution. Zhu Yanjun reported test failures with > rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1). Adding WQ_HIGHPRI probably > made it less likely to trigger any race conditions but I don't believe that > this is sufficient as a solution. I've been going on the assumption that rxe has always been full of bugs. I don't believe the work queue change added new bugs, it just made the existing bugs easier to hit. It is hard to be sure until someone can find out what is going wrong. If we revert it then rxe will probably just stop development entirely. Daisuke's ODP work will be blocked and if Bob was able to fix it he would have done so already. Which means Bob's ongoing work is lost too. I *vastly* prefer we root cause and fix it properly. Rxe was finally starting to get a reasonable set of people interested in it, I do not want to kill that off. Again, I'm troubled that this doesn't seem to be reproducing for other people. > > I'm still puzzled why Bob can't reproduce the things Bart has seen. > > Is this necessary? It is always easier to debug something you can change than to try and guess what an oops is trying to say.. > The KASAN complaint that I reported should be more than enough for > someone who is familiar with the RXE driver to identify and fix the > root cause. I can help with testing candidate fixes. Bob? Jason
On 10/11/23 08:51, Jason Gunthorpe wrote: > If we revert it then rxe will probably just stop development > entirely. Daisuke's ODP work will be blocked and if Bob was able to > fix it he would have done so already. Which mean's Bobs ongoing work > is lost too. If Daisuke's work depends on the RXE changes then Daisuke may decide to help with the RXE changes. Introducing regressions while refactoring code is not acceptable. I don't have enough spare time to help with the RXE driver. Thanks, Bart.
On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote: > On 10/11/23 08:51, Jason Gunthorpe wrote: > > If we revert it then rxe will probably just stop development > > entirely. Daisuke's ODP work will be blocked and if Bob was able to > > fix it he would have done so already. Which mean's Bobs ongoing work > > is lost too. > > If Daisuke's work depends on the RXE changes then Daisuke may decide > to help with the RXE changes. > > Introducing regressions while refactoring code is not acceptable. Generally, but I don't view rxe as a production part of the kernel so I prefer to give time to resolve it. > I don't have enough spare time to help with the RXE driver. Nor I Jason
在 2023/10/12 7:12, Jason Gunthorpe 写道: > On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote: >> On 10/11/23 08:51, Jason Gunthorpe wrote: >>> If we revert it then rxe will probably just stop development >>> entirely. Daisuke's ODP work will be blocked and if Bob was able to >>> fix it he would have done so already. Which mean's Bobs ongoing work >>> is lost too. >> >> If Daisuke's work depends on the RXE changes then Daisuke may decide >> to help with the RXE changes. >> >> Introducing regressions while refactoring code is not acceptable. > > Generally, but I don't view rxe as a production part of the kernel so > I prefer to give time to resolve it. > >> I don't have enough spare time to help with the RXE driver. commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292 Author: Bart Van Assche <bvanassche@acm.org> Date: Tue Aug 22 09:57:07 2023 -0700 Change the default RDMA driver from rdma_rxe to siw Since the siw driver is more stable than the rdma_rxe driver, change the default into siw. See e.g. https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@acm.org/. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > Nor I > > Jason
On 10/12/23 06:49, Zhu Yanjun wrote: > 在 2023/10/12 7:12, Jason Gunthorpe 写道: >> On Wed, Oct 11, 2023 at 01:14:16PM -0700, Bart Van Assche wrote: >>> On 10/11/23 08:51, Jason Gunthorpe wrote: >>>> If we revert it then rxe will probably just stop development >>>> entirely. Daisuke's ODP work will be blocked and if Bob was able to >>>> fix it he would have done so already. Which mean's Bobs ongoing work >>>> is lost too. >>> >>> If Daisuke's work depends on the RXE changes then Daisuke may decide >>> to help with the RXE changes. >>> >>> Introducing regressions while refactoring code is not acceptable. >> >> Generally, but I don't view rxe as a production part of the kernel so >> I prefer to give time to resolve it. >> >>> I don't have enough spare time to help with the RXE driver. > > commit 11ab7cc7ee32d6c3e16ac74c34c4bbdbf8f99292 > Author: Bart Van Assche <bvanassche@acm.org> > Date: Tue Aug 22 09:57:07 2023 -0700 > > Change the default RDMA driver from rdma_rxe to siw > > Since the siw driver is more stable than the rdma_rxe driver, change the > default into siw. See e.g. > > https://lore.kernel.org/all/c3d1a966-b9b0-d015-38ec-86270b5045fc@acm.org/. > > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > >> >> Nor I >> >> Jason > All, I have spent the past several weeks working on trying to resolve this issue. The one thing I can say for sure is that the failures or their rates are very sensitive to small timing changes. I totally agree Jason that the bug has always been there and most of the suggested changes are just masking or unmasking it. I have been running under all the kernel lock checking I can set and have not seen any warnings so I doubt the error is a deadlock. My suspicion remains that the root cause of the hang is loss of a completion or a timeout before a late completion leading to the transport state machine death. There are surely other bugs in the driver and they may show up in parallel with this hang. I see the hang consistently from 1-2% to 30-40% of the time when running srp/002 depending on various changes I have tried but I have not been able to reproduce the KASAN bug yet. Because the hang is easy to reproduce I have focused on that. Bob
diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c index 54c723a6edda..7a7e713de52d 100644 --- a/drivers/infiniband/sw/rxe/rxe.c +++ b/drivers/infiniband/sw/rxe/rxe.c @@ -212,15 +212,9 @@ static int __init rxe_module_init(void) { int err; - err = rxe_alloc_wq(); - if (err) - return err; - err = rxe_net_init(); - if (err) { - rxe_destroy_wq(); + if (err) return err; - } rdma_link_register(&rxe_link_ops); pr_info("loaded\n"); @@ -232,7 +226,6 @@ static void __exit rxe_module_exit(void) rdma_link_unregister(&rxe_link_ops); ib_unregister_driver(RDMA_DRIVER_RXE); rxe_net_exit(); - rxe_destroy_wq(); pr_info("unloaded\n"); } diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c index 1501120d4f52..fb9a6bc8e620 100644 --- a/drivers/infiniband/sw/rxe/rxe_task.c +++ b/drivers/infiniband/sw/rxe/rxe_task.c @@ -6,24 +6,8 @@ #include "rxe.h" -static struct workqueue_struct *rxe_wq; - -int rxe_alloc_wq(void) -{ - rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, WQ_MAX_ACTIVE); - if (!rxe_wq) - return -ENOMEM; - - return 0; -} - -void rxe_destroy_wq(void) -{ - destroy_workqueue(rxe_wq); -} - /* Check if task is idle i.e. not running, not scheduled in - * work queue and not draining. If so move to busy to + * tasklet queue and not draining. If so move to busy to * reserve a slot in do_task() by setting to busy and taking * a qp reference to cover the gap from now until the task finishes. * state will move out of busy if task returns a non zero value @@ -37,6 +21,9 @@ static bool __reserve_if_idle(struct rxe_task *task) { WARN_ON(rxe_read(task->qp) <= 0); + if (task->tasklet.state & BIT(TASKLET_STATE_SCHED)) + return false; + if (task->state == TASK_STATE_IDLE) { rxe_get(task->qp); task->state = TASK_STATE_BUSY; @@ -51,7 +38,7 @@ static bool __reserve_if_idle(struct rxe_task *task) } /* check if task is idle or drained and not currently - * scheduled in the work queue. This routine is + * scheduled in the tasklet queue. This routine is * called by rxe_cleanup_task or rxe_disable_task to * see if the queue is empty. * Context: caller should hold task->lock. @@ -59,7 +46,7 @@ static bool __reserve_if_idle(struct rxe_task *task) */ static bool __is_done(struct rxe_task *task) { - if (work_pending(&task->work)) + if (task->tasklet.state & BIT(TASKLET_STATE_SCHED)) return false; if (task->state == TASK_STATE_IDLE || @@ -90,23 +77,23 @@ static bool is_done(struct rxe_task *task) * schedules the task. They must call __reserve_if_idle to * move the task to busy before calling or scheduling. * The task can also be moved to drained or invalid - * by calls to rxe_cleanup_task or rxe_disable_task. + * by calls to rxe-cleanup_task or rxe_disable_task. * In that case tasks which get here are not executed but * just flushed. The tasks are designed to look to see if - * there is work to do and then do part of it before returning + * there is work to do and do part of it before returning * here with a return value of zero until all the work - * has been consumed then it returns a non-zero value. + * has been consumed then it retuens a non-zero value. * The number of times the task can be run is limited by * max iterations so one task cannot hold the cpu forever. - * If the limit is hit and work remains the task is rescheduled. 
*/ -static void do_task(struct rxe_task *task) +static void do_task(struct tasklet_struct *t) { + int cont; + int ret; + struct rxe_task *task = from_tasklet(task, t, tasklet); unsigned int iterations; unsigned long flags; int resched = 0; - int cont; - int ret; WARN_ON(rxe_read(task->qp) <= 0); @@ -128,22 +115,25 @@ static void do_task(struct rxe_task *task) } while (ret == 0 && iterations-- > 0); spin_lock_irqsave(&task->lock, flags); - /* we're not done yet but we ran out of iterations. - * yield the cpu and reschedule the task - */ - if (!ret) { - task->state = TASK_STATE_IDLE; - resched = 1; - goto exit; - } - switch (task->state) { case TASK_STATE_BUSY: - task->state = TASK_STATE_IDLE; + if (ret) { + task->state = TASK_STATE_IDLE; + } else { + /* This can happen if the client + * can add work faster than the + * tasklet can finish it. + * Reschedule the tasklet and exit + * the loop to give up the cpu + */ + task->state = TASK_STATE_IDLE; + resched = 1; + } break; - /* someone tried to schedule the task while we - * were running, keep going + /* someone tried to run the task since the last time we called + * func, so we will call one more time regardless of the + * return value */ case TASK_STATE_ARMED: task->state = TASK_STATE_BUSY; @@ -151,24 +141,22 @@ static void do_task(struct rxe_task *task) break; case TASK_STATE_DRAINING: - task->state = TASK_STATE_DRAINED; + if (ret) + task->state = TASK_STATE_DRAINED; + else + cont = 1; break; default: WARN_ON(1); - rxe_dbg_qp(task->qp, "unexpected task state = %d", - task->state); - task->state = TASK_STATE_IDLE; + rxe_info_qp(task->qp, "unexpected task state = %d", task->state); } -exit: if (!cont) { task->num_done++; if (WARN_ON(task->num_done != task->num_sched)) - rxe_dbg_qp( - task->qp, - "%ld tasks scheduled, %ld tasks done", - task->num_sched, task->num_done); + rxe_err_qp(task->qp, "%ld tasks scheduled, %ld tasks done", + task->num_sched, task->num_done); } spin_unlock_irqrestore(&task->lock, flags); } while (cont); @@ -181,12 +169,6 @@ static void do_task(struct rxe_task *task) rxe_put(task->qp); } -/* wrapper around do_task to fix argument for work queue */ -static void do_work(struct work_struct *work) -{ - do_task(container_of(work, struct rxe_task, work)); -} - int rxe_init_task(struct rxe_task *task, struct rxe_qp *qp, int (*func)(struct rxe_qp *)) { @@ -194,9 +176,11 @@ int rxe_init_task(struct rxe_task *task, struct rxe_qp *qp, task->qp = qp; task->func = func; + + tasklet_setup(&task->tasklet, do_task); + task->state = TASK_STATE_IDLE; spin_lock_init(&task->lock); - INIT_WORK(&task->work, do_work); return 0; } @@ -229,6 +213,8 @@ void rxe_cleanup_task(struct rxe_task *task) while (!is_done(task)) cond_resched(); + tasklet_kill(&task->tasklet); + spin_lock_irqsave(&task->lock, flags); task->state = TASK_STATE_INVALID; spin_unlock_irqrestore(&task->lock, flags); @@ -240,7 +226,7 @@ void rxe_cleanup_task(struct rxe_task *task) void rxe_run_task(struct rxe_task *task) { unsigned long flags; - bool run; + int run; WARN_ON(rxe_read(task->qp) <= 0); @@ -249,11 +235,11 @@ void rxe_run_task(struct rxe_task *task) spin_unlock_irqrestore(&task->lock, flags); if (run) - do_task(task); + do_task(&task->tasklet); } -/* schedule the task to run later as a work queue entry. - * the queue_work call can be called holding +/* schedule the task to run later as a tasklet. + * the tasklet)schedule call can be called holding * the lock. 
*/ void rxe_sched_task(struct rxe_task *task) @@ -264,7 +250,7 @@ void rxe_sched_task(struct rxe_task *task) spin_lock_irqsave(&task->lock, flags); if (__reserve_if_idle(task)) - queue_work(rxe_wq, &task->work); + tasklet_schedule(&task->tasklet); spin_unlock_irqrestore(&task->lock, flags); } @@ -291,9 +277,7 @@ void rxe_disable_task(struct rxe_task *task) while (!is_done(task)) cond_resched(); - spin_lock_irqsave(&task->lock, flags); - task->state = TASK_STATE_DRAINED; - spin_unlock_irqrestore(&task->lock, flags); + tasklet_disable(&task->tasklet); } void rxe_enable_task(struct rxe_task *task) @@ -307,7 +291,7 @@ void rxe_enable_task(struct rxe_task *task) spin_unlock_irqrestore(&task->lock, flags); return; } - task->state = TASK_STATE_IDLE; + tasklet_enable(&task->tasklet); spin_unlock_irqrestore(&task->lock, flags); } diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h index a63e258b3d66..facb7c8e3729 100644 --- a/drivers/infiniband/sw/rxe/rxe_task.h +++ b/drivers/infiniband/sw/rxe/rxe_task.h @@ -22,7 +22,7 @@ enum { * called again. */ struct rxe_task { - struct work_struct work; + struct tasklet_struct tasklet; int state; spinlock_t lock; struct rxe_qp *qp; @@ -32,10 +32,6 @@ struct rxe_task { long num_done; }; -int rxe_alloc_wq(void); - -void rxe_destroy_wq(void); - /* * init rxe_task structure * qp => parameter to pass to func