[for-next,v3,00/13] Implement work queues for rdma_rxe

Message ID	20221029031009.64467-1-rpearsonhpe@gmail.com (mailing list archive)
Headers	show Return-Path: <linux-rdma-owner@kernel.org> From: Bob Pearson <rpearsonhpe@gmail.com> To: jgg@nvidia.com, leon@kernel.org, zyjzyj2000@gmail.com, jhack@hpe.com, linux-rdma@vger.kernel.org Cc: Bob Pearson <rpearsonhpe@gmail.com> Subject: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Date: Fri, 28 Oct 2022 22:09:57 -0500 Message-Id: <20221029031009.64467-1-rpearsonhpe@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Implement work queues for rdma_rxe \| expand [for-next,v3,00/13] Implement work queues for rdma_rxe [for-next,v3,01/13] RDMA/rxe: Make task interface pluggable [for-next,v3,02/13] RDMA/rxe: Split rxe_drain_resp_pkts() [for-next,v3,03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c [for-next,v3,04/13] RDMA/rxe: Handle qp error in rxe_resp.c [v3,05/13] RDMA/rxe: Cleanup comp tasks in rxe_qp.c [for-next,v3,06/13] RDMA/rxe: Remove __rxe_do_task() [for-next,v3,07/13] RDMA/rxe: Make tasks schedule each other [for-next,v3,08/13] RDMA/rxe: Implement disable/enable_task() [for-next,v3,09/13] RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE [for-next,v3,10/13] RDMA/rxe: Replace task->destroyed by task state INVALID. [for-next,v3,11/13] RDMA/rxe: Add workqueue support for tasks [for-next,v3,12/13] RDMA/rxe: Make WORKQUEUE default for RC tasks [for-next,v3,13/13] RDMA/rxe: Remove tasklets from rxe_task.c

Bob Pearson Oct. 29, 2022, 3:09 a.m. UTC

This patch series implements work queues as an alternative for
the main tasklets in the rdma_rxe driver. The patch series starts
with a patch that makes the internal API for task execution pluggable
and implements an inline and a tasklet based set of functions.
The remaining patches cleanup the qp reset and error code in the
three tasklets and modify the locking logic to prevent making
multiple calls to the tasklet scheduling routine. After
this preparation the work queue equivalent set of functions is
added and the tasklet version is dropped. 

The advantages of the work queue version of deferred task execution
is mainly that the work queue variant has much better scalability
and overall performance than the tasklet variant.  The perftest
microbenchmarks in local loopback mode (not a very realistic test
case) can reach approximately 100Gb/sec with work queues compared to
about 16Gb/sec for tasklets.

This version of the patch series drops the tasklet version as an option
but keeps the option of switching between the workqueue and inline
versions.

This patch series is derived from an earlier patch set developed by
Ian Ziemba at HPE which is used in some Lustre storage clients attached
to Lustre servers with hard RoCE v2 NICs.

It is based on the current version of wip/jgg-for-next.

v3:
Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
The v3 version drops the first few patches which have already been accepted
in for-next. It also drops the last patch of the v2 version which
introduced module parameters to select between the task interfaces. It also
drops the tasklet version entirely. It fixes a minor error caught by
the kernel test robot <lkp@intel.com> with a missing static declaration.

v2:
The v2 version of the patch set has some minor changes that address
comments from Leon Romanovsky regarding locking of the valid parameter
and the setup parameters for alloc_workqueue. It also has one
additional cleanup patch.

Bob Pearson (13):
  RDMA/rxe: Make task interface pluggable
  RDMA/rxe: Split rxe_drain_resp_pkts()
  RDMA/rxe: Simplify reset state handling in rxe_resp.c
  RDMA/rxe: Handle qp error in rxe_resp.c
  RDMA/rxe: Cleanup comp tasks in rxe_qp.c
  RDMA/rxe: Remove __rxe_do_task()
  RDMA/rxe: Make tasks schedule each other
  RDMA/rxe: Implement disable/enable_task()
  RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
  RDMA/rxe: Replace task->destroyed by task state INVALID.
  RDMA/rxe: Add workqueue support for tasks
  RDMA/rxe: Make WORKQUEUE default for RC tasks
  RDMA/rxe: Remove tasklets from rxe_task.c

 drivers/infiniband/sw/rxe/rxe.c      |   9 +-
 drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
 drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
 drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
 drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
 drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
 7 files changed, 329 insertions(+), 172 deletions(-)


base-commit: 692373d186205dfb1b56f35f22702412d94d9420

Daisuke Matsuda (Fujitsu) Nov. 2, 2022, 10:17 a.m. UTC | #1

On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> This patch series implements work queues as an alternative for
> the main tasklets in the rdma_rxe driver. The patch series starts
> with a patch that makes the internal API for task execution pluggable
> and implements an inline and a tasklet based set of functions.
> The remaining patches cleanup the qp reset and error code in the
> three tasklets and modify the locking logic to prevent making
> multiple calls to the tasklet scheduling routine. After
> this preparation the work queue equivalent set of functions is
> added and the tasklet version is dropped.

Thank you for posting the 3rd series.
It looks fine at a glance, but now I am concerned about problems
that can be potentially caused by concurrency.

> 
> The advantages of the work queue version of deferred task execution
> is mainly that the work queue variant has much better scalability
> and overall performance than the tasklet variant.  The perftest
> microbenchmarks in local loopback mode (not a very realistic test
> case) can reach approximately 100Gb/sec with work queues compared to
> about 16Gb/sec for tasklets.

As you wrote, the advantage of work queue version is that the number works
that can run parallelly scales with the number of logical CPUs. However, the
dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
designed for serial execution on tasklet, so we must not rely on them functioning
properly on parallel execution.

There could be 3 problems, which stem from the fact that works are not necessarily
executed in the same order the packets are received. Works are enqueued to worker
pools on each CPU, and each CPU respectively schedules the works, so the ordering
of works among CPUs is not guaranteed.

[1]
On UC/UD connections, responder does not check the psn of inbound packets,
so the payloads can be copied to MRs without checking the order. If there are
works that write to overlapping memory locations, they can potentially cause
data corruption depending the order.

[2]
On RC connections, responder checks the psn, and drops the packet if it is not
the expected one. Requester can retransmit the request in this case, so the order
seems to be guaranteed for RC.

However, responder updates the next expected psn (qp->resp.psn) BEFORE
replying an ACK packet. If the work is preempted soon after storing the next psn,
another work on another CPU can potentially reply another ACK packet earlier.
This behaviour is against the spec.
Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"

[3]
Again on RC connections, the next expected psn (qp->resp.psn) can be
loaded and stored at the same time from different threads. It seems we
have to use a synchronization method, perhaps like READ_ONCE() and
WRITE_ONCE() macros, to prevent loading an old value. This one is just an
example; there can be other variables that need similar consideration.

All the problems above can be solved by making the work queue single-
threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
for alloc_workqueue(), but this should be the last resort since this spoils
the performance benefit of work queue.

I am not sure what we can do with [1] right now.
For [2] and [3], we could just move the update of psn later than the ack reply,
and use *_ONCE() macros for shared variables.

Thanks,
Daisuke

> 
> This version of the patch series drops the tasklet version as an option
> but keeps the option of switching between the workqueue and inline
> versions.
> 
> This patch series is derived from an earlier patch set developed by
> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> to Lustre servers with hard RoCE v2 NICs.
> 
> It is based on the current version of wip/jgg-for-next.
> 
> v3:
> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> The v3 version drops the first few patches which have already been accepted
> in for-next. It also drops the last patch of the v2 version which
> introduced module parameters to select between the task interfaces. It also
> drops the tasklet version entirely. It fixes a minor error caught by
> the kernel test robot <lkp@intel.com> with a missing static declaration.
> 
> v2:
> The v2 version of the patch set has some minor changes that address
> comments from Leon Romanovsky regarding locking of the valid parameter
> and the setup parameters for alloc_workqueue. It also has one
> additional cleanup patch.
> 
> Bob Pearson (13):
>   RDMA/rxe: Make task interface pluggable
>   RDMA/rxe: Split rxe_drain_resp_pkts()
>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>   RDMA/rxe: Handle qp error in rxe_resp.c
>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>   RDMA/rxe: Remove __rxe_do_task()
>   RDMA/rxe: Make tasks schedule each other
>   RDMA/rxe: Implement disable/enable_task()
>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>   RDMA/rxe: Add workqueue support for tasks
>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>   RDMA/rxe: Remove tasklets from rxe_task.c
> 
>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>  7 files changed, 329 insertions(+), 172 deletions(-)
> 
> 
> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> --
> 2.34.1

Bob Pearson Nov. 2, 2022, 11:20 a.m. UTC | #2

On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>> This patch series implements work queues as an alternative for
>> the main tasklets in the rdma_rxe driver. The patch series starts
>> with a patch that makes the internal API for task execution pluggable
>> and implements an inline and a tasklet based set of functions.
>> The remaining patches cleanup the qp reset and error code in the
>> three tasklets and modify the locking logic to prevent making
>> multiple calls to the tasklet scheduling routine. After
>> this preparation the work queue equivalent set of functions is
>> added and the tasklet version is dropped.
> 
> Thank you for posting the 3rd series.
> It looks fine at a glance, but now I am concerned about problems
> that can be potentially caused by concurrency.
> 
>>
>> The advantages of the work queue version of deferred task execution
>> is mainly that the work queue variant has much better scalability
>> and overall performance than the tasklet variant.  The perftest
>> microbenchmarks in local loopback mode (not a very realistic test
>> case) can reach approximately 100Gb/sec with work queues compared to
>> about 16Gb/sec for tasklets.
> 
> As you wrote, the advantage of work queue version is that the number works
> that can run parallelly scales with the number of logical CPUs. However, the
> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> designed for serial execution on tasklet, so we must not rely on them functioning
> properly on parallel execution.

Work queues are serial for each separate work task just like tasklets. There isn't
a problem here. The tasklets for different tasks can run in parallel but tend to
do so less than work queue tasks. The reason is that tasklets are scheduled by
default on the same cpu as the thread that scheduled it while work queues are scheduled
by the kernel scheduler and get spread around.
> 
> There could be 3 problems, which stem from the fact that works are not necessarily
> executed in the same order the packets are received. Works are enqueued to worker
> pools on each CPU, and each CPU respectively schedules the works, so the ordering
> of works among CPUs is not guaranteed.
> 
> [1]
> On UC/UD connections, responder does not check the psn of inbound packets,
> so the payloads can be copied to MRs without checking the order. If there are
> works that write to overlapping memory locations, they can potentially cause
> data corruption depending the order.
> 
> [2]
> On RC connections, responder checks the psn, and drops the packet if it is not
> the expected one. Requester can retransmit the request in this case, so the order
> seems to be guaranteed for RC.
> 
> However, responder updates the next expected psn (qp->resp.psn) BEFORE
> replying an ACK packet. If the work is preempted soon after storing the next psn,
> another work on another CPU can potentially reply another ACK packet earlier.
> This behaviour is against the spec.
> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> 
> [3]
> Again on RC connections, the next expected psn (qp->resp.psn) can be
> loaded and stored at the same time from different threads. It seems we
> have to use a synchronization method, perhaps like READ_ONCE() and
> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> example; there can be other variables that need similar consideration.
> 
> 
> All the problems above can be solved by making the work queue single-
> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> for alloc_workqueue(), but this should be the last resort since this spoils
> the performance benefit of work queue.
> 
> I am not sure what we can do with [1] right now.
> For [2] and [3], we could just move the update of psn later than the ack reply,
> and use *_ONCE() macros for shared variables.
> 
> Thanks,
> Daisuke
> 
>>
>> This version of the patch series drops the tasklet version as an option
>> but keeps the option of switching between the workqueue and inline
>> versions.
>>
>> This patch series is derived from an earlier patch set developed by
>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>> to Lustre servers with hard RoCE v2 NICs.
>>
>> It is based on the current version of wip/jgg-for-next.
>>
>> v3:
>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>> The v3 version drops the first few patches which have already been accepted
>> in for-next. It also drops the last patch of the v2 version which
>> introduced module parameters to select between the task interfaces. It also
>> drops the tasklet version entirely. It fixes a minor error caught by
>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>
>> v2:
>> The v2 version of the patch set has some minor changes that address
>> comments from Leon Romanovsky regarding locking of the valid parameter
>> and the setup parameters for alloc_workqueue. It also has one
>> additional cleanup patch.
>>
>> Bob Pearson (13):
>>   RDMA/rxe: Make task interface pluggable
>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>   RDMA/rxe: Remove __rxe_do_task()
>>   RDMA/rxe: Make tasks schedule each other
>>   RDMA/rxe: Implement disable/enable_task()
>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>   RDMA/rxe: Add workqueue support for tasks
>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>
>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>
>>
>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>> --
>> 2.34.1
>

Daisuke Matsuda (Fujitsu) Nov. 4, 2022, 4:59 a.m. UTC | #3

On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> > On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >> This patch series implements work queues as an alternative for
> >> the main tasklets in the rdma_rxe driver. The patch series starts
> >> with a patch that makes the internal API for task execution pluggable
> >> and implements an inline and a tasklet based set of functions.
> >> The remaining patches cleanup the qp reset and error code in the
> >> three tasklets and modify the locking logic to prevent making
> >> multiple calls to the tasklet scheduling routine. After
> >> this preparation the work queue equivalent set of functions is
> >> added and the tasklet version is dropped.
> >
> > Thank you for posting the 3rd series.
> > It looks fine at a glance, but now I am concerned about problems
> > that can be potentially caused by concurrency.
> >
> >>
> >> The advantages of the work queue version of deferred task execution
> >> is mainly that the work queue variant has much better scalability
> >> and overall performance than the tasklet variant.  The perftest
> >> microbenchmarks in local loopback mode (not a very realistic test
> >> case) can reach approximately 100Gb/sec with work queues compared to
> >> about 16Gb/sec for tasklets.
> >
> > As you wrote, the advantage of work queue version is that the number works
> > that can run parallelly scales with the number of logical CPUs. However, the
> > dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> > designed for serial execution on tasklet, so we must not rely on them functioning
> > properly on parallel execution.
> 
> Work queues are serial for each separate work task just like tasklets. There isn't
> a problem here. The tasklets for different tasks can run in parallel but tend to
> do so less than work queue tasks. The reason is that tasklets are scheduled by
> default on the same cpu as the thread that scheduled it while work queues are scheduled
> by the kernel scheduler and get spread around.

=====
rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
=====
You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
the system scheduler, but each work is still enqueued to worker pools of each CPU
and thus bound to the CPU the issuer is running on. It seems the behaviour you
expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
on any CPU at the cost of cache utilization.

Two of the same tasklets never run concurrently on two different processors by nature,
but that is not the case with work queues. If two softirqs running on different CPUs
enqueue responder works at almost the same time, it is possible that they are dispatched
and run on the different CPUs at the same time. I mean the problems may arise in such
a situation.

Please let me know if I missed anything. I referred to the following document.
The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
run serially.
cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html

Thanks,
Daisuke

> >
> > There could be 3 problems, which stem from the fact that works are not necessarily
> > executed in the same order the packets are received. Works are enqueued to worker
> > pools on each CPU, and each CPU respectively schedules the works, so the ordering
> > of works among CPUs is not guaranteed.
> >
> > [1]
> > On UC/UD connections, responder does not check the psn of inbound packets,
> > so the payloads can be copied to MRs without checking the order. If there are
> > works that write to overlapping memory locations, they can potentially cause
> > data corruption depending the order.
> >
> > [2]
> > On RC connections, responder checks the psn, and drops the packet if it is not
> > the expected one. Requester can retransmit the request in this case, so the order
> > seems to be guaranteed for RC.
> >
> > However, responder updates the next expected psn (qp->resp.psn) BEFORE
> > replying an ACK packet. If the work is preempted soon after storing the next psn,
> > another work on another CPU can potentially reply another ACK packet earlier.
> > This behaviour is against the spec.
> > Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> >
> > [3]
> > Again on RC connections, the next expected psn (qp->resp.psn) can be
> > loaded and stored at the same time from different threads. It seems we
> > have to use a synchronization method, perhaps like READ_ONCE() and
> > WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> > example; there can be other variables that need similar consideration.
> >
> >
> > All the problems above can be solved by making the work queue single-
> > threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> > for alloc_workqueue(), but this should be the last resort since this spoils
> > the performance benefit of work queue.
> >
> > I am not sure what we can do with [1] right now.
> > For [2] and [3], we could just move the update of psn later than the ack reply,
> > and use *_ONCE() macros for shared variables.
> >
> > Thanks,
> > Daisuke
> >
> >>
> >> This version of the patch series drops the tasklet version as an option
> >> but keeps the option of switching between the workqueue and inline
> >> versions.
> >>
> >> This patch series is derived from an earlier patch set developed by
> >> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >> to Lustre servers with hard RoCE v2 NICs.
> >>
> >> It is based on the current version of wip/jgg-for-next.
> >>
> >> v3:
> >> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> >> The v3 version drops the first few patches which have already been accepted
> >> in for-next. It also drops the last patch of the v2 version which
> >> introduced module parameters to select between the task interfaces. It also
> >> drops the tasklet version entirely. It fixes a minor error caught by
> >> the kernel test robot <lkp@intel.com> with a missing static declaration.
> >>
> >> v2:
> >> The v2 version of the patch set has some minor changes that address
> >> comments from Leon Romanovsky regarding locking of the valid parameter
> >> and the setup parameters for alloc_workqueue. It also has one
> >> additional cleanup patch.
> >>
> >> Bob Pearson (13):
> >>   RDMA/rxe: Make task interface pluggable
> >>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>   RDMA/rxe: Remove __rxe_do_task()
> >>   RDMA/rxe: Make tasks schedule each other
> >>   RDMA/rxe: Implement disable/enable_task()
> >>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>   RDMA/rxe: Add workqueue support for tasks
> >>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>
> >>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>
> >>
> >> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >> --
> >> 2.34.1
> >

Bob Pearson Nov. 5, 2022, 9:15 p.m. UTC | #4

On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>> This patch series implements work queues as an alternative for
>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>> with a patch that makes the internal API for task execution pluggable
>>>> and implements an inline and a tasklet based set of functions.
>>>> The remaining patches cleanup the qp reset and error code in the
>>>> three tasklets and modify the locking logic to prevent making
>>>> multiple calls to the tasklet scheduling routine. After
>>>> this preparation the work queue equivalent set of functions is
>>>> added and the tasklet version is dropped.
>>>
>>> Thank you for posting the 3rd series.
>>> It looks fine at a glance, but now I am concerned about problems
>>> that can be potentially caused by concurrency.
>>>
>>>>
>>>> The advantages of the work queue version of deferred task execution
>>>> is mainly that the work queue variant has much better scalability
>>>> and overall performance than the tasklet variant.  The perftest
>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>> about 16Gb/sec for tasklets.
>>>
>>> As you wrote, the advantage of work queue version is that the number works
>>> that can run parallelly scales with the number of logical CPUs. However, the
>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
>>> designed for serial execution on tasklet, so we must not rely on them functioning
>>> properly on parallel execution.
>>
>> Work queues are serial for each separate work task just like tasklets. There isn't
>> a problem here. The tasklets for different tasks can run in parallel but tend to
>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>> default on the same cpu as the thread that scheduled it while work queues are scheduled
>> by the kernel scheduler and get spread around.
> 
> =====
> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> =====
> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> the system scheduler, but each work is still enqueued to worker pools of each CPU
> and thus bound to the CPU the issuer is running on. It seems the behaviour you
> expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
> on any CPU at the cost of cache utilization.
> 
> Two of the same tasklets never run concurrently on two different processors by nature,
> but that is not the case with work queues. If two softirqs running on different CPUs
> enqueue responder works at almost the same time, it is possible that they are dispatched
> and run on the different CPUs at the same time. I mean the problems may arise in such
> a situation.
> 
> Please let me know if I missed anything. I referred to the following document.
> The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
> run serially.
> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
> 
> Thanks,
> Daisuke
> 
According to this:

    Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
    after a work item gets queued:

        The work function hasn’t been changed.

        No one queues the work item to another workqueue.

        The work item hasn’t been reinitiated.

    In other words, if the above conditions hold, the work item is guaranteed to be executed by at
    most one worker system-wide at any given time.

    Note that requeuing the work item (to the same queue) in the self function doesn’t break these
    conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
    inside a work function.

I should be OK. Each work item checks the state under lock before scheduling the item and
if it is free moves it to busy and then schedules it. Only one instance of a work item
at a time should be running.

I only know what I see from running top. It seems that the work items do get spread out over
time on the cpus.

The CPU_INTENSIVE is certainly correct for our application which will run all the cpus at
100% for extended periods of time. We are benchmarking storage with IOR.

Bob

>>>
>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>> executed in the same order the packets are received. Works are enqueued to worker
>>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
>>> of works among CPUs is not guaranteed.
>>>
>>> [1]
>>> On UC/UD connections, responder does not check the psn of inbound packets,
>>> so the payloads can be copied to MRs without checking the order. If there are
>>> works that write to overlapping memory locations, they can potentially cause
>>> data corruption depending the order.
>>>
>>> [2]
>>> On RC connections, responder checks the psn, and drops the packet if it is not
>>> the expected one. Requester can retransmit the request in this case, so the order
>>> seems to be guaranteed for RC.
>>>
>>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
>>> replying an ACK packet. If the work is preempted soon after storing the next psn,
>>> another work on another CPU can potentially reply another ACK packet earlier.
>>> This behaviour is against the spec.
>>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
>>>
>>> [3]
>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>> loaded and stored at the same time from different threads. It seems we
>>> have to use a synchronization method, perhaps like READ_ONCE() and
>>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
>>> example; there can be other variables that need similar consideration.
>>>
>>>
>>> All the problems above can be solved by making the work queue single-
>>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
>>> for alloc_workqueue(), but this should be the last resort since this spoils
>>> the performance benefit of work queue.
>>>
>>> I am not sure what we can do with [1] right now.
>>> For [2] and [3], we could just move the update of psn later than the ack reply,
>>> and use *_ONCE() macros for shared variables.
>>>
>>> Thanks,
>>> Daisuke
>>>
>>>>
>>>> This version of the patch series drops the tasklet version as an option
>>>> but keeps the option of switching between the workqueue and inline
>>>> versions.
>>>>
>>>> This patch series is derived from an earlier patch set developed by
>>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>
>>>> It is based on the current version of wip/jgg-for-next.
>>>>
>>>> v3:
>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>>>> The v3 version drops the first few patches which have already been accepted
>>>> in for-next. It also drops the last patch of the v2 version which
>>>> introduced module parameters to select between the task interfaces. It also
>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>>>
>>>> v2:
>>>> The v2 version of the patch set has some minor changes that address
>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>> and the setup parameters for alloc_workqueue. It also has one
>>>> additional cleanup patch.
>>>>
>>>> Bob Pearson (13):
>>>>   RDMA/rxe: Make task interface pluggable
>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>   RDMA/rxe: Make tasks schedule each other
>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>
>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>
>>>>
>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>> --
>>>> 2.34.1
>>>
>

Daisuke Matsuda (Fujitsu) Nov. 7, 2022, 8:21 a.m. UTC | #5

On Sun, Nov 6, 2022 6:15 AM Bob Pearson
> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> > On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> >> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> >>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >>>> This patch series implements work queues as an alternative for
> >>>> the main tasklets in the rdma_rxe driver. The patch series starts
> >>>> with a patch that makes the internal API for task execution pluggable
> >>>> and implements an inline and a tasklet based set of functions.
> >>>> The remaining patches cleanup the qp reset and error code in the
> >>>> three tasklets and modify the locking logic to prevent making
> >>>> multiple calls to the tasklet scheduling routine. After
> >>>> this preparation the work queue equivalent set of functions is
> >>>> added and the tasklet version is dropped.
> >>>
> >>> Thank you for posting the 3rd series.
> >>> It looks fine at a glance, but now I am concerned about problems
> >>> that can be potentially caused by concurrency.
> >>>
> >>>>
> >>>> The advantages of the work queue version of deferred task execution
> >>>> is mainly that the work queue variant has much better scalability
> >>>> and overall performance than the tasklet variant.  The perftest
> >>>> microbenchmarks in local loopback mode (not a very realistic test
> >>>> case) can reach approximately 100Gb/sec with work queues compared to
> >>>> about 16Gb/sec for tasklets.
> >>>
> >>> As you wrote, the advantage of work queue version is that the number works
> >>> that can run parallelly scales with the number of logical CPUs. However, the
> >>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> >>> designed for serial execution on tasklet, so we must not rely on them functioning
> >>> properly on parallel execution.
> >>
> >> Work queues are serial for each separate work task just like tasklets. There isn't
> >> a problem here. The tasklets for different tasks can run in parallel but tend to
> >> do so less than work queue tasks. The reason is that tasklets are scheduled by
> >> default on the same cpu as the thread that scheduled it while work queues are scheduled
> >> by the kernel scheduler and get spread around.
> >
> > =====
> > rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> > =====
> > You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> > the system scheduler, but each work is still enqueued to worker pools of each CPU
> > and thus bound to the CPU the issuer is running on. It seems the behaviour you
> > expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
> > on any CPU at the cost of cache utilization.
> >
> > Two of the same tasklets never run concurrently on two different processors by nature,
> > but that is not the case with work queues. If two softirqs running on different CPUs
> > enqueue responder works at almost the same time, it is possible that they are dispatched
> > and run on the different CPUs at the same time. I mean the problems may arise in such
> > a situation.
> >
> > Please let me know if I missed anything. I referred to the following document.
> > The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
> > run serially.
> > cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
> >
> > Thanks,
> > Daisuke
> >
> According to this:
> 
>     Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>     after a work item gets queued:
> 
>         The work function hasn’t been changed.
> 
>         No one queues the work item to another workqueue.
> 
>         The work item hasn’t been reinitiated.
> 
>     In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>     most one worker system-wide at any given time.
> 
>     Note that requeuing the work item (to the same queue) in the self function doesn’t break these
>     conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
>     inside a work function.
> 
> I should be OK. Each work item checks the state under lock before scheduling the item and
> if it is free moves it to busy and then schedules it. Only one instance of a work item
> at a time should be running.

Thank you for the explanation.
Per-qp work items should meet the three conditions. That is what I have missing.
Now I see. You are correct.

> 
> I only know what I see from running top. It seems that the work items do get spread out over
> time on the cpus.

It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
workers in the same way. This is not stated explicitly in the document.

> 
> The CPU_INTENSIVE is certainly correct for our application which will run all the cpus at
> 100% for extended periods of time. We are benchmarking storage with IOR.

It is OK with me. I have not come up with any situations where the CPU_INTENSIVE
flag bothers other rxe users.

Thanks,
Daisuke

> 
> Bob
> 
> >>>
> >>> There could be 3 problems, which stem from the fact that works are not necessarily
> >>> executed in the same order the packets are received. Works are enqueued to worker
> >>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
> >>> of works among CPUs is not guaranteed.
> >>>
> >>> [1]
> >>> On UC/UD connections, responder does not check the psn of inbound packets,
> >>> so the payloads can be copied to MRs without checking the order. If there are
> >>> works that write to overlapping memory locations, they can potentially cause
> >>> data corruption depending the order.
> >>>
> >>> [2]
> >>> On RC connections, responder checks the psn, and drops the packet if it is not
> >>> the expected one. Requester can retransmit the request in this case, so the order
> >>> seems to be guaranteed for RC.
> >>>
> >>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
> >>> replying an ACK packet. If the work is preempted soon after storing the next psn,
> >>> another work on another CPU can potentially reply another ACK packet earlier.
> >>> This behaviour is against the spec.
> >>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> >>>
> >>> [3]
> >>> Again on RC connections, the next expected psn (qp->resp.psn) can be
> >>> loaded and stored at the same time from different threads. It seems we
> >>> have to use a synchronization method, perhaps like READ_ONCE() and
> >>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> >>> example; there can be other variables that need similar consideration.
> >>>
> >>>
> >>> All the problems above can be solved by making the work queue single-
> >>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> >>> for alloc_workqueue(), but this should be the last resort since this spoils
> >>> the performance benefit of work queue.
> >>>
> >>> I am not sure what we can do with [1] right now.
> >>> For [2] and [3], we could just move the update of psn later than the ack reply,
> >>> and use *_ONCE() macros for shared variables.
> >>>
> >>> Thanks,
> >>> Daisuke
> >>>
> >>>>
> >>>> This version of the patch series drops the tasklet version as an option
> >>>> but keeps the option of switching between the workqueue and inline
> >>>> versions.
> >>>>
> >>>> This patch series is derived from an earlier patch set developed by
> >>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >>>> to Lustre servers with hard RoCE v2 NICs.
> >>>>
> >>>> It is based on the current version of wip/jgg-for-next.
> >>>>
> >>>> v3:
> >>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> >>>> The v3 version drops the first few patches which have already been accepted
> >>>> in for-next. It also drops the last patch of the v2 version which
> >>>> introduced module parameters to select between the task interfaces. It also
> >>>> drops the tasklet version entirely. It fixes a minor error caught by
> >>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
> >>>>
> >>>> v2:
> >>>> The v2 version of the patch set has some minor changes that address
> >>>> comments from Leon Romanovsky regarding locking of the valid parameter
> >>>> and the setup parameters for alloc_workqueue. It also has one
> >>>> additional cleanup patch.
> >>>>
> >>>> Bob Pearson (13):
> >>>>   RDMA/rxe: Make task interface pluggable
> >>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>>>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>>>   RDMA/rxe: Remove __rxe_do_task()
> >>>>   RDMA/rxe: Make tasks schedule each other
> >>>>   RDMA/rxe: Implement disable/enable_task()
> >>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>>>   RDMA/rxe: Add workqueue support for tasks
> >>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>>>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>>>
> >>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>>>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>>>
> >>>>
> >>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >>>> --
> >>>> 2.34.1
> >>>
> >

Bob Pearson Nov. 7, 2022, 4:06 p.m. UTC | #6

On 11/7/22 02:21, Daisuke Matsuda (Fujitsu) wrote:
> On Sun, Nov 6, 2022 6:15 AM Bob Pearson
>> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
>>> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>>>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>>>> This patch series implements work queues as an alternative for
>>>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>>>> with a patch that makes the internal API for task execution pluggable
>>>>>> and implements an inline and a tasklet based set of functions.
>>>>>> The remaining patches cleanup the qp reset and error code in the
>>>>>> three tasklets and modify the locking logic to prevent making
>>>>>> multiple calls to the tasklet scheduling routine. After
>>>>>> this preparation the work queue equivalent set of functions is
>>>>>> added and the tasklet version is dropped.
>>>>>
>>>>> Thank you for posting the 3rd series.
>>>>> It looks fine at a glance, but now I am concerned about problems
>>>>> that can be potentially caused by concurrency.
>>>>>
>>>>>>
>>>>>> The advantages of the work queue version of deferred task execution
>>>>>> is mainly that the work queue variant has much better scalability
>>>>>> and overall performance than the tasklet variant.  The perftest
>>>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>>>> about 16Gb/sec for tasklets.
>>>>>
>>>>> As you wrote, the advantage of work queue version is that the number works
>>>>> that can run parallelly scales with the number of logical CPUs. However, the
>>>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
>>>>> designed for serial execution on tasklet, so we must not rely on them functioning
>>>>> properly on parallel execution.
>>>>
>>>> Work queues are serial for each separate work task just like tasklets. There isn't
>>>> a problem here. The tasklets for different tasks can run in parallel but tend to
>>>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>>>> default on the same cpu as the thread that scheduled it while work queues are scheduled
>>>> by the kernel scheduler and get spread around.
>>>
>>> =====
>>> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
>>> =====
>>> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
>>> the system scheduler, but each work is still enqueued to worker pools of each CPU
>>> and thus bound to the CPU the issuer is running on. It seems the behaviour you
>>> expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
>>> on any CPU at the cost of cache utilization.
>>>
>>> Two of the same tasklets never run concurrently on two different processors by nature,
>>> but that is not the case with work queues. If two softirqs running on different CPUs
>>> enqueue responder works at almost the same time, it is possible that they are dispatched
>>> and run on the different CPUs at the same time. I mean the problems may arise in such
>>> a situation.
>>>
>>> Please let me know if I missed anything. I referred to the following document.
>>> The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
>>> run serially.
>>> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
>>>
>>> Thanks,
>>> Daisuke
>>>
>> According to this:
>>
>>     Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>>     after a work item gets queued:
>>
>>         The work function hasn’t been changed.
>>
>>         No one queues the work item to another workqueue.
>>
>>         The work item hasn’t been reinitiated.
>>
>>     In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>>     most one worker system-wide at any given time.
>>
>>     Note that requeuing the work item (to the same queue) in the self function doesn’t break these
>>     conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
>>     inside a work function.
>>
>> I should be OK. Each work item checks the state under lock before scheduling the item and
>> if it is free moves it to busy and then schedules it. Only one instance of a work item
>> at a time should be running.
> 
> Thank you for the explanation.
> Per-qp work items should meet the three conditions. That is what I have missing.
> Now I see. You are correct.
> 
>>
>> I only know what I see from running top. It seems that the work items do get spread out over
>> time on the cpus.
> 
> It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
> workers in the same way. This is not stated explicitly in the document.
> 
>>
>> The CPU_INTENSIVE is certainly correct for our application which will run all the cpus at
>> 100% for extended periods of time. We are benchmarking storage with IOR.
> 
> It is OK with me. I have not come up with any situations where the CPU_INTENSIVE
> flag bothers other rxe users.
> 
> Thanks,
> Daisuke
> 
>>
>> Bob
>>
>>>>>
>>>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>>>> executed in the same order the packets are received. Works are enqueued to worker
>>>>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
>>>>> of works among CPUs is not guaranteed.
>>>>>
>>>>> [1]
>>>>> On UC/UD connections, responder does not check the psn of inbound packets,
>>>>> so the payloads can be copied to MRs without checking the order. If there are
>>>>> works that write to overlapping memory locations, they can potentially cause
>>>>> data corruption depending the order.
>>>>>
>>>>> [2]
>>>>> On RC connections, responder checks the psn, and drops the packet if it is not
>>>>> the expected one. Requester can retransmit the request in this case, so the order
>>>>> seems to be guaranteed for RC.
>>>>>
>>>>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
>>>>> replying an ACK packet. If the work is preempted soon after storing the next psn,
>>>>> another work on another CPU can potentially reply another ACK packet earlier.
>>>>> This behaviour is against the spec.
>>>>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
>>>>>
>>>>> [3]
>>>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>>>> loaded and stored at the same time from different threads. It seems we
>>>>> have to use a synchronization method, perhaps like READ_ONCE() and
>>>>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
>>>>> example; there can be other variables that need similar consideration.
>>>>>
>>>>>
>>>>> All the problems above can be solved by making the work queue single-
>>>>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
>>>>> for alloc_workqueue(), but this should be the last resort since this spoils
>>>>> the performance benefit of work queue.
>>>>>
>>>>> I am not sure what we can do with [1] right now.
>>>>> For [2] and [3], we could just move the update of psn later than the ack reply,
>>>>> and use *_ONCE() macros for shared variables.
>>>>>
>>>>> Thanks,
>>>>> Daisuke

Thank you for taking the time to review this.

Bob
>>>>>
>>>>>>
>>>>>> This version of the patch series drops the tasklet version as an option
>>>>>> but keeps the option of switching between the workqueue and inline
>>>>>> versions.
>>>>>>
>>>>>> This patch series is derived from an earlier patch set developed by
>>>>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>>>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>>>
>>>>>> It is based on the current version of wip/jgg-for-next.
>>>>>>
>>>>>> v3:
>>>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>>>>>> The v3 version drops the first few patches which have already been accepted
>>>>>> in for-next. It also drops the last patch of the v2 version which
>>>>>> introduced module parameters to select between the task interfaces. It also
>>>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>>>>>
>>>>>> v2:
>>>>>> The v2 version of the patch set has some minor changes that address
>>>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>>>> and the setup parameters for alloc_workqueue. It also has one
>>>>>> additional cleanup patch.
>>>>>>
>>>>>> Bob Pearson (13):
>>>>>>   RDMA/rxe: Make task interface pluggable
>>>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>>>   RDMA/rxe: Make tasks schedule each other
>>>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>>>
>>>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>>>> --
>>>>>> 2.34.1
>>>>>
>>>
>

Daisuke Matsuda (Fujitsu) Nov. 18, 2022, 5:02 a.m. UTC | #7

On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:

<...>

> Bob Pearson (13):
>   RDMA/rxe: Make task interface pluggable
>   RDMA/rxe: Split rxe_drain_resp_pkts()
>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>   RDMA/rxe: Handle qp error in rxe_resp.c
>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>   RDMA/rxe: Remove __rxe_do_task()
>   RDMA/rxe: Make tasks schedule each other
>   RDMA/rxe: Implement disable/enable_task()
>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>   RDMA/rxe: Add workqueue support for tasks
>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>   RDMA/rxe: Remove tasklets from rxe_task.c

Hello Bob,

I have found a soft lockup issue reproducible with rdma-core testcases.
It does not happen on the latest for-next tree but with this patch series,
so we need to fix the issue before getting it merged.

I did the test on a VM with 8 CPUs. I fetched the latest rdma-core and built it,
and executed the following command inside 'rdma-core' directory:
# while true; do ./build/bin/run_tests.py -v --dev rxe_ens6 --gid 1; sleep 2; done
(Please specify your 'dev' and 'gid' appropriately.)

Before 10 minutes passed, my console freezed and became unresponsive,
showing the test progress below:
=====
test_create_ah (tests.test_addr.AHTest)
Test ibv_create_ah. ... ok
test_create_ah_roce (tests.test_addr.AHTest)
Verify that AH can't be created without GRH in RoCE ... ok
test_destroy_ah (tests.test_addr.AHTest)
Test ibv_destroy_ah. ... ok
test_atomic_cmp_and_swap (tests.test_atomic.AtomicTest) ... ok
test_atomic_fetch_and_add (tests.test_atomic.AtomicTest) ... ok
test_atomic_invalid_lkey (tests.test_atomic.AtomicTest) ...
=====
Note this does not always happen. You may have to wait for some minutes.

Here is the backtrace:
=====
[ 1212.135650] watchdog: BUG: soft lockup - CPU#3 stuck for 1017s! [python3:3428]
[ 1212.138144] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm rdma_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core rfkill sunrpc intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass joydev nd_pmem virtio_balloon dax_pmem nd_btt i2c_piix4 pcspkr drm xfs libcrc32c sd_mod t10_pi sr_mod crc64_rocksoft_generic cdrom crc64_rocksoft crc64 sg nd_e820 ata_generic libnvdimm crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix virtio_net libata ghash_clmulni_intel e1000 net_failover sha512_ssse3 failover virtio_console serio_raw dm_mirror dm_region_hash dm_log dm_mod fuse
[ 1212.147152] CPU: 3 PID: 3428 Comm: python3 Kdump: loaded Tainted: G             L     6.1.0-rc1+ #29
[ 1212.148425] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1212.149754] RIP: 0010:__local_bh_enable_ip+0x26/0x70
[ 1212.150464] Code: 00 00 66 90 0f 1f 44 00 00 65 8b 05 c4 7e d2 4e a9 00 00 0f 00 75 31 83 ee 01 f7 de 65 01 35 b1 7e d2 4e 65 8b 05 aa 7e d2 4e <a9> 00 ff ff 00 74 1b 65 ff 0d 9c 7e d2 4e 65 8b 05 95 7e d2 4e 85
[ 1212.153081] RSP: 0018:ffffaf0b8054baf8 EFLAGS: 00000203
[ 1212.153822] RAX: 0000000000000001 RBX: ffff8d2189171450 RCX: 00000000ffffffff
[ 1212.154823] RDX: 0000000000000001 RSI: 00000000fffffe00 RDI: ffffffffc0b51baa
[ 1212.155826] RBP: ffff8d2189171474 R08: 0000011a38ad4004 R09: 0000000000000101
[ 1212.156862] R10: ffffffffb2c06100 R11: 0000000000000000 R12: 0000000000000000
[ 1212.157860] R13: ffff8d218df5eda0 R14: ffff8d2181058328 R15: 0000000000000000
[ 1212.158858] FS:  00007f8cec55f740(0000) GS:ffff8d23b5cc0000(0000) knlGS:0000000000000000
[ 1212.159989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1212.160837] CR2: 00007f8ceaa15024 CR3: 0000000108c2c002 CR4: 0000000000060ee0
[ 1212.161841] Call Trace:
[ 1212.162205]  <TASK>
[ 1212.162522]  work_cleanup+0x3a/0x40 [rdma_rxe]
[ 1212.163172]  rxe_qp_do_cleanup+0x54/0x1e0 [rdma_rxe]
[ 1212.163888]  execute_in_process_context+0x23/0x70
[ 1212.164575]  __rxe_cleanup+0xc6/0x170 [rdma_rxe]
[ 1212.165245]  rxe_destroy_qp+0x28/0x40 [rdma_rxe]
[ 1212.165909]  ib_destroy_qp_user+0x90/0x1b0 [ib_core]
[ 1212.166646]  uverbs_free_qp+0x35/0x90 [ib_uverbs]
[ 1212.167333]  destroy_hw_idr_uobject+0x1e/0x50 [ib_uverbs]
[ 1212.168103]  uverbs_destroy_uobject+0x37/0x1c0 [ib_uverbs]
[ 1212.168899]  uobj_destroy+0x3c/0x80 [ib_uverbs]
[ 1212.169532]  ib_uverbs_run_method+0x203/0x320 [ib_uverbs]
[ 1212.170412]  ? uverbs_free_qp+0x90/0x90 [ib_uverbs]
[ 1212.171151]  ib_uverbs_cmd_verbs+0x172/0x220 [ib_uverbs]
[ 1212.171912]  ? free_unref_page_commit+0x7e/0x170
[ 1212.172583]  ? xa_destroy+0x82/0x110
[ 1212.173104]  ? kvfree_call_rcu+0x27d/0x310
[ 1212.173692]  ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
[ 1212.174481]  ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs]
[ 1212.175182]  __x64_sys_ioctl+0x8a/0xc0
[ 1212.175725]  do_syscall_64+0x3b/0x90
[ 1212.176243]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1212.176973] RIP: 0033:0x7f8cebe3ec6b
[ 1212.177489] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[ 1212.180072] RSP: 002b:00007ffee64927a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1212.181150] RAX: ffffffffffffffda RBX: 00007ffee64928d8 RCX: 00007f8cebe3ec6b
[ 1212.182131] RDX: 00007ffee64928c0 RSI: 00000000c0181b01 RDI: 0000000000000003
[ 1212.183099] RBP: 00007ffee64928a0 R08: 000056424f8c0c70 R09: 000000000002ae30
[ 1212.184064] R10: 00007f8cead75e10 R11: 0000000000000246 R12: 00007ffee649287c
[ 1212.185061] R13: 0000000000000022 R14: 000056424f8c0560 R15: 00007f8cebb903d0
[ 1212.186031]  </TASK>

Message from syslogd@c9ibremote at Nov 17 17:03:12 ...
 kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 1017s! [python3:3428]
=====
It seems that ibv_destroy_qp(3) was issued from userspace, right?
I am not very familiar with uverbs ;(

The process got stuck at the do-while loop below:
=====
rxe_task.c
---
/* busy wait until any previous tasks are done */
static void cleanup_task(struct rxe_task *task)
{
        bool busy;

        do {
                spin_lock_bh(&task->lock);
                busy = (task->state == TASK_STATE_BUSY ||
                        task->state == TASK_STATE_ARMED);
                if (!busy)
                        task->state = TASK_STATE_INVALID;
                spin_unlock_bh(&task->lock);
        } while (busy);
}
=====
Typically task->state is 0 (i.e. "TASK_STATE_IDLE") when we reach here,
but in the infinite loop, task->state was constantly 1 (i.e. "TASK_STATE_BUSY").

IMO, the bottom halves completed their works leaving task->state "TASK_STATE_BUSY",
and ibv_destroy_qp(3) issued from userspace thereafter got stuck,
but I am not sure how this counter-intuitive state transition occurs.

Do you have any idea why this happens and how to fix this?
I cannot take enough time to inspect this issue further right now,
but I would update you if I find anything helpful to fix this.

Thanks,
Daisuke

> 
>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>  7 files changed, 329 insertions(+), 172 deletions(-)
> 
> 
> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> --
> 2.34.1

[for-next,v3,00/13] Implement work queues for rdma_rxe

Message

Comments