[net-next] net/mlx5: Make ASO poll CQ usable in atomic context

Message ID: d941bb44798b938705ca6944d8ee02c97af7ddd5.1664017836.git.leonro@nvidia.com
State: Deferred
Delegated to: Netdev Maintainers
Series: [net-next] net/mlx5: Make ASO poll CQ usable in atomic context

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/cc_maintainers warning 9 maintainers not CCed: roid@nvidia.com linux-rdma@vger.kernel.org ehakim@nvidia.com jianbol@nvidia.com borisp@nvidia.com raeds@nvidia.com lariel@nvidia.com liorna@nvidia.com ozsh@nvidia.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 63 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Leon Romanovsky Sept. 24, 2022, 11:17 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Poll CQ functions shouldn't sleep, as they are called in atomic context.
The following splat appears once mlx5_aso_poll_cq() is used in such a
flow.

 BUG: scheduling while atomic: swapper/17/0/0x00000100
 Modules linked in: sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core fuse [last unloaded: mlx5_core]
 CPU: 17 PID: 0 Comm: swapper/17 Tainted: G        W          6.0.0-rc2+ #13
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x34/0x44
  __schedule_bug.cold+0x47/0x53
  __schedule+0x4b6/0x670
  ? hrtimer_start_range_ns+0x28d/0x360
  schedule+0x50/0x90
  schedule_hrtimeout_range_clock+0x98/0x120
  ? __hrtimer_init+0xb0/0xb0
  usleep_range_state+0x60/0x90
  mlx5_aso_poll_cq+0xad/0x190 [mlx5_core]
  mlx5e_ipsec_aso_update_curlft+0x81/0xb0 [mlx5_core]
  xfrm_timer_handler+0x6b/0x360
  ? xfrm_find_acq_byseq+0x50/0x50
  __hrtimer_run_queues+0x139/0x290
  hrtimer_run_softirq+0x7d/0xe0
  __do_softirq+0xc7/0x272
  irq_exit_rcu+0x87/0xb0
  sysvec_apic_timer_interrupt+0x72/0x90
  </IRQ>
  <TASK>
  asm_sysvec_apic_timer_interrupt+0x16/0x20
 RIP: 0010:default_idle+0x18/0x20
 Code: ae 7d ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 8b 05 b5 30 0d 01 85 c0 7e 07 0f 00 2d 0a e3 53 00 fb f4 <c3> 0f 1f 80 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 80 ad 01 00
 RSP: 0018:ffff888100883ee0 EFLAGS: 00000242
 RAX: 0000000000000001 RBX: ffff888100849580 RCX: 4000000000000000
 RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000008863c
 RBP: 0000000000000011 R08: 00000064e6977fa9 R09: 0000000000000001
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  default_idle_call+0x37/0xb0
  do_idle+0x1cd/0x1e0
  cpu_startup_entry+0x19/0x20
  start_secondary+0xfe/0x120
  secondary_startup_64_no_verify+0xcd/0xdb
  </TASK>
 softirq: huh, entered softirq 8 HRTIMER 00000000a97c08cb with preempt_count 00000100, exited with 00000000?

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
Hi,

I kept this patch in my IPsec tree for a long time and planned to submit
it together with the mlx5 IPsec series, as no users other than TC used
this API. Unfortunately, MACsec has started to use this poll CQ API too
and repeated the same mistake.

I have no idea what the "With newer FW, the wait for the first ASO WQE..."
comment in TC means; the sentence is magic to me, so I left it as is.
---
 drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c  |  6 +++++-
 .../net/ethernet/mellanox/mlx5/core/en_accel/macsec.c  |  4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c      | 10 +---------
 drivers/net/ethernet/mellanox/mlx5/core/lib/aso.h      |  2 +-
 4 files changed, 9 insertions(+), 13 deletions(-)
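
The root cause is generic: usleep_range() may call schedule(), which is
forbidden in softirq context, and the splat's call chain
(xfrm_timer_handler -> mlx5e_ipsec_aso_update_curlft -> mlx5_aso_poll_cq)
runs from a soft hrtimer. A minimal sketch of the offending pattern,
with hypothetical names rather than the actual driver code:

    #include <linux/hrtimer.h>
    #include <linux/delay.h>

    /* Sketch: a soft hrtimer callback runs in softirq (atomic) context;
     * any sleeping primitive there triggers "scheduling while atomic".
     * Hypothetical names; this is not the driver code.
     */
    static enum hrtimer_restart demo_timer_fn(struct hrtimer *t)
    {
            /* ... post an ASO WQE, then wait for its CQE ... */
            usleep_range(2, 10); /* BUG: may schedule() in atomic context */
            return HRTIMER_NORESTART;
    }

Dropping the sleep from mlx5_aso_poll_cq() makes the poll safe in such a
context and moves any waiting policy out to callers that may sleep.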

Comments

Saeed Mahameed Sept. 24, 2022, 5:24 p.m. UTC | #1
On 24 Sep 14:17, Leon Romanovsky wrote:
>From: Leon Romanovsky <leonro@nvidia.com>
>
>Poll CQ functions shouldn't sleep, as they are called in atomic context.
>The following splat appears once mlx5_aso_poll_cq() is used in such a
>flow.
>
> BUG: scheduling while atomic: swapper/17/0/0x00000100

...

> 	/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
>-	err = mlx5_aso_poll_cq(aso, true, 10);
>+	expires = jiffies + msecs_to_jiffies(10);
>+	do {
>+		err = mlx5_aso_poll_cq(aso, true);
>+	} while (err && time_is_after_jiffies(expires));
> 	mutex_unlock(&flow_meters->aso_lock);
         ^^^^
Busy polling won't work; this mutex is held and the context can sleep anyway.
Let's discuss internally and solve this by design.
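
For the sleepable TC path, the wait could indeed be reintroduced at the
call site on top of the new non-sleeping poll. A sketch only, not part
of this patch (aso_poll_sleeping is a hypothetical helper name):

    #include <linux/delay.h>
    #include <linux/jiffies.h>

    /* Sketch: sleepable bounded wait for a caller that holds a mutex
     * (process context); mlx5_aso_poll_cq() returns -ETIMEDOUT while
     * no CQE is available, per the lib/aso.c hunk below.
     */
    static int aso_poll_sleeping(struct mlx5_aso *aso, bool with_data,
                                 u32 timeout_ms)
    {
            unsigned long expires = jiffies + msecs_to_jiffies(timeout_ms);
            int err;

            do {
                    err = mlx5_aso_poll_cq(aso, with_data);
                    if (err != -ETIMEDOUT)
                            return err; /* CQE reaped, or a real error */
                    usleep_range(2, 10); /* allowed: we may sleep here */
            } while (time_is_after_jiffies(expires));

            return -ETIMEDOUT;
    }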
Leon Romanovsky Sept. 24, 2022, 6:51 p.m. UTC | #2
On Sat, Sep 24, 2022 at 10:24:25AM -0700, Saeed Mahameed wrote:
> On 24 Sep 14:17, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Poll CQ functions shouldn't sleep, as they are called in atomic context.
> > The following splat appears once mlx5_aso_poll_cq() is used in such a
> > flow.
> > 
> > BUG: scheduling while atomic: swapper/17/0/0x00000100
> 
> ...
> 
> > 	/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
> > -	err = mlx5_aso_poll_cq(aso, true, 10);
> > +	expires = jiffies + msecs_to_jiffies(10);
> > +	do {
> > +		err = mlx5_aso_poll_cq(aso, true);
> > +	} while (err && time_is_after_jiffies(expires));
> > 	mutex_unlock(&flow_meters->aso_lock);
>         ^^^^
> Busy polling won't work; this mutex is held and the context can sleep anyway.
> Let's discuss internally and solve this by design.

This is TC code; it doesn't need atomic context and has had a mutex + sleep
from the beginning.

My change removes the busy loop from mlx5_aso_poll_cq() for the IPsec path,
so why do you plan to change the design?

Thanks
Saeed Mahameed Sept. 24, 2022, 8:11 p.m. UTC | #3
On 24 Sep 21:51, Leon Romanovsky wrote:
>On Sat, Sep 24, 2022 at 10:24:25AM -0700, Saeed Mahameed wrote:
>> On 24 Sep 14:17, Leon Romanovsky wrote:
>> > From: Leon Romanovsky <leonro@nvidia.com>
>> >
>> > Poll CQ functions shouldn't sleep, as they are called in atomic context.
>> > The following splat appears once mlx5_aso_poll_cq() is used in such a
>> > flow.
>> >
>> > BUG: scheduling while atomic: swapper/17/0/0x00000100
>>
>> ...
>>
>> > 	/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
>> > -	err = mlx5_aso_poll_cq(aso, true, 10);
>> > +	expires = jiffies + msecs_to_jiffies(10);
>> > +	do {
>> > +		err = mlx5_aso_poll_cq(aso, true);
>> > +	} while (err && time_is_after_jiffies(expires));
>> > 	mutex_unlock(&flow_meters->aso_lock);
>>         ^^^^
>> Busy polling won't work; this mutex is held and the context can sleep anyway.
>> Let's discuss internally and solve this by design.
>
>This is TC code; it doesn't need atomic context and has had a mutex + sleep
>from the beginning.
>

Then let's bring back the sleep for the TC use case; we don't need to burn
the CPU!
But the change still has a bigger effect on other ASO use cases, see below.

>My change removes the busy loop from mlx5_aso_poll_cq() for the IPsec path,
>so why do you plan to change the design?
>

A typical use case for ASO is:

post_aso_wqe();
poll_aso_cqe();

The HW needs time to consume and execute the WQE and then generate the CQE.
This is why the driver needs to wait a bit when polling for the CQE;
your change will break other users. Let's discuss internally.

>Thanks
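
With the two-argument API, that flow keeps its grace period, but the
bounded wait moves to the call site, exactly as the meter.c hunk in the
patch below does. A sketch (aso_post_and_wait is a hypothetical helper
name):

    #include <linux/jiffies.h>

    /* Sketch: post the WQE, then give the HW a bounded window to
     * produce the CQE, without using any sleeping primitive.
     */
    static int aso_post_and_wait(struct mlx5_aso *aso,
                                 struct mlx5_wqe_ctrl_seg *ctrl,
                                 u32 timeout_ms)
    {
            unsigned long expires = jiffies + msecs_to_jiffies(timeout_ms);
            int err;

            mlx5_aso_post_wqe(aso, true, ctrl);

            do {
                    err = mlx5_aso_poll_cq(aso, true);
            } while (err && time_is_after_jiffies(expires));

            return err;
    }

An atomic caller can use this as-is; a sleepable caller can add a
usleep_range() inside the loop instead of spinning.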
Leon Romanovsky Sept. 25, 2022, 6:01 a.m. UTC | #4
On Sat, Sep 24, 2022 at 01:11:31PM -0700, Saeed Mahameed wrote:
> On 24 Sep 21:51, Leon Romanovsky wrote:
> > On Sat, Sep 24, 2022 at 10:24:25AM -0700, Saeed Mahameed wrote:
> > > On 24 Sep 14:17, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Poll CQ functions shouldn't sleep, as they are called in atomic context.
> > > > The following splat appears once mlx5_aso_poll_cq() is used in such a
> > > > flow.
> > > >
> > > > BUG: scheduling while atomic: swapper/17/0/0x00000100
> > > 
> > > ...
> > > 
> > > > 	/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
> > > > -	err = mlx5_aso_poll_cq(aso, true, 10);
> > > > +	expires = jiffies + msecs_to_jiffies(10);
> > > > +	do {
> > > > +		err = mlx5_aso_poll_cq(aso, true);
> > > > +	} while (err && time_is_after_jiffies(expires));
> > > > 	mutex_unlock(&flow_meters->aso_lock);
> > >         ^^^^
> > > Busy polling won't work; this mutex is held and the context can sleep anyway.
> > > Let's discuss internally and solve this by design.
> > 
> > This is TC code; it doesn't need atomic context and has had a mutex + sleep
> > from the beginning.
> > 
> 
> Then let's bring back the sleep for the TC use case; we don't need to burn
> the CPU!

Again, this is exactly how TC was implemented. The difference is that
before, it burnt cycles inside poll_cq, and now it does so inside the TC code.

> But the change still has a bigger effect on other ASO use cases, see below.

I have serious doubts about it.

> 
> > My change removes the busy loop from mlx5_aso_poll_cq() for the IPsec path,
> > so why do you plan to change the design?
> > 
> 
> A typical use case for ASO is:
> 
> post_aso_wqe();
> poll_aso_cqe();
> 
> The HW needs time to consume and execute the WQE and then generate the CQE.
> This is why the driver needs to wait a bit when polling for the CQE;
> your change will break other users. Let's discuss internally.

IPsec was always implemented without any time delays, and I'm sure that
MACsec doesn't need them either; it is probably copy/paste.

More generally, post_wqe->poll_cq is a very basic paradigm in RDMA, and
delays were never needed.

But if you insist, let's discuss internally.

Thanks

> 
> > Thanks
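
For reference, the post->poll paradigm Leon cites works the same way
with RDMA verbs: ib_poll_cq() is non-blocking and merely reports how
many completions were reaped, so the caller chooses whether to spin,
sleep, or wait for a completion event. A sketch, not mlx5 code
(wait_one_completion is a hypothetical helper name):

    #include <rdma/ib_verbs.h>

    /* Sketch: classic RDMA poll loop; ib_poll_cq() never sleeps, it
     * returns the number of completions written to wc (<0 on error).
     */
    static int wait_one_completion(struct ib_cq *cq, struct ib_wc *wc)
    {
            int n;

            do {
                    n = ib_poll_cq(cq, 1, wc); /* non-blocking */
            } while (n == 0);

            return n < 0 ? n : wc->status; /* IB_WC_SUCCESS == 0 */
    }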

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
index a53e205f4a89..b9011356299e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c
@@ -115,6 +115,7 @@  mlx5e_tc_meter_modify(struct mlx5_core_dev *mdev,
 	struct mlx5e_flow_meters *flow_meters;
 	u8 cir_man, cir_exp, cbs_man, cbs_exp;
 	struct mlx5_aso_wqe *aso_wqe;
+	unsigned long expires;
 	struct mlx5_aso *aso;
 	u64 rate, burst;
 	u8 ds_cnt;
@@ -187,7 +188,10 @@  mlx5e_tc_meter_modify(struct mlx5_core_dev *mdev,
 	mlx5_aso_post_wqe(aso, true, &aso_wqe->ctrl);
 
 	/* With newer FW, the wait for the first ASO WQE is more than 2us, put the wait 10ms. */
-	err = mlx5_aso_poll_cq(aso, true, 10);
+	expires = jiffies + msecs_to_jiffies(10);
+	do {
+		err = mlx5_aso_poll_cq(aso, true);
+	} while (err && time_is_after_jiffies(expires));
 	mutex_unlock(&flow_meters->aso_lock);
 
 	return err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index a13169723153..31e1973c3740 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -1441,7 +1441,7 @@  static int macsec_aso_set_arm_event(struct mlx5_core_dev *mdev, struct mlx5e_mac
 			   MLX5_ACCESS_ASO_OPC_MOD_MACSEC);
 	macsec_aso_build_ctrl(aso, &aso_wqe->aso_ctrl, in);
 	mlx5_aso_post_wqe(maso, false, &aso_wqe->ctrl);
-	err = mlx5_aso_poll_cq(maso, false, 10);
+	err = mlx5_aso_poll_cq(maso, false);
 	mutex_unlock(&aso->aso_lock);
 
 	return err;
@@ -1466,7 +1466,7 @@  static int macsec_aso_query(struct mlx5_core_dev *mdev, struct mlx5e_macsec *mac
 	macsec_aso_build_wqe_ctrl_seg(aso, &aso_wqe->aso_ctrl, NULL);
 
 	mlx5_aso_post_wqe(maso, false, &aso_wqe->ctrl);
-	err = mlx5_aso_poll_cq(maso, false, 10);
+	err = mlx5_aso_poll_cq(maso, false);
 	if (err)
 		goto err_out;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
index 21e14507ff5c..baa8092f335e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
@@ -381,20 +381,12 @@  void mlx5_aso_post_wqe(struct mlx5_aso *aso, bool with_data,
 	WRITE_ONCE(doorbell_cseg, NULL);
 }
 
-int mlx5_aso_poll_cq(struct mlx5_aso *aso, bool with_data, u32 interval_ms)
+int mlx5_aso_poll_cq(struct mlx5_aso *aso, bool with_data)
 {
 	struct mlx5_aso_cq *cq = &aso->cq;
 	struct mlx5_cqe64 *cqe;
-	unsigned long expires;
 
 	cqe = mlx5_cqwq_get_cqe(&cq->wq);
-
-	expires = jiffies + msecs_to_jiffies(interval_ms);
-	while (!cqe && time_is_after_jiffies(expires)) {
-		usleep_range(2, 10);
-		cqe = mlx5_cqwq_get_cqe(&cq->wq);
-	}
-
 	if (!cqe)
 		return -ETIMEDOUT;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.h
index d854e01d7fc5..2d40dcf9d42e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.h
@@ -83,7 +83,7 @@  void mlx5_aso_build_wqe(struct mlx5_aso *aso, u8 ds_cnt,
 			u32 obj_id, u32 opc_mode);
 void mlx5_aso_post_wqe(struct mlx5_aso *aso, bool with_data,
 		       struct mlx5_wqe_ctrl_seg *doorbell_cseg);
-int mlx5_aso_poll_cq(struct mlx5_aso *aso, bool with_data, u32 interval_ms);
+int mlx5_aso_poll_cq(struct mlx5_aso *aso, bool with_data);
 
 struct mlx5_aso *mlx5_aso_create(struct mlx5_core_dev *mdev, u32 pdn);
 void mlx5_aso_destroy(struct mlx5_aso *aso);