
[02/13] wifi: mwifiex: Use default @max_active for workqueues

Message ID: 20230509015032.3768622-3-tj@kernel.org (mailing list archive)
State: Not Applicable
Delegated to: Kalle Valo
Series: None

Commit Message

Tejun Heo May 9, 2023, 1:50 a.m. UTC
These workqueues only host a single work item and thus don't need an explicit
concurrency limit. Let's use the default @max_active. This doesn't cost
anything and clearly expresses that @max_active doesn't matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Amitkumar Karwar <amitkarwar@gmail.com>
Cc: Ganapathi Bhat <ganapathi017@gmail.com>
Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
Cc: Xinming Hu <huxinming820@gmail.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/wireless/marvell/mwifiex/cfg80211.c | 4 ++--
 drivers/net/wireless/marvell/mwifiex/main.c     | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)
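
For context, a minimal sketch of the pattern this patch applies -- the queue
and function names here are illustrative, not from the driver. Passing 0 as
@max_active lets alloc_workqueue() fall back to its internal default
(WQ_DFL_ACTIVE), which is the natural choice when a queue only ever hosts
one work item:

    #include <linux/workqueue.h>

    static struct workqueue_struct *example_wq;

    static int example_init(void)
    {
            /* 0 => use the workqueue default instead of an explicit limit */
            example_wq = alloc_workqueue("example_wq",
                                         WQ_HIGHPRI | WQ_MEM_RECLAIM |
                                         WQ_UNBOUND, 0);
            if (!example_wq)
                    return -ENOMEM;
            return 0;
    }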

Comments

Kalle Valo May 10, 2023, 8:45 a.m. UTC | #1
Tejun Heo <tj@kernel.org> wrote:

> These workqueues only host a single work item and thus don't need an explicit
> concurrency limit. Let's use the default @max_active. This doesn't cost
> anything and clearly expresses that @max_active doesn't matter.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Amitkumar Karwar <amitkarwar@gmail.com>
> Cc: Ganapathi Bhat <ganapathi017@gmail.com>
> Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
> Cc: Xinming Hu <huxinming820@gmail.com>
> Cc: Kalle Valo <kvalo@kernel.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: linux-wireless@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org

I didn't review the patch but I assume it's ok. Feel free to take it via your
tree:

Acked-by: Kalle Valo <kvalo@kernel.org>

Patch set to Not Applicable.
Brian Norris May 10, 2023, 6:09 p.m. UTC | #2
On Mon, May 08, 2023 at 03:50:21PM -1000, Tejun Heo wrote:
> These workqueues only host a single work item and thus don't need an explicit
> concurrency limit. Let's use the default @max_active. This doesn't cost
> anything and clearly expresses that @max_active doesn't matter.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Amitkumar Karwar <amitkarwar@gmail.com>
> Cc: Ganapathi Bhat <ganapathi017@gmail.com>
> Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
> Cc: Xinming Hu <huxinming820@gmail.com>
> Cc: Kalle Valo <kvalo@kernel.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: linux-wireless@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org

Reviewed-by: Brian Norris <briannorris@chromium.org>

I'll admit, the workqueue documentation sounds a bit like "max_active ==
1 + WQ_UNBOUND" is what we want ("one work item [...] active at any
given time"), but that's more of my misunderstanding than anything --
each work item can only be active in a single context at any given time,
so that note is talking about distinct (i.e., more than 1) work items.
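
To make the distinction concrete, a hedged sketch (all identifiers invented
for the example): @max_active never matters for a queue hosting a single work
item, since a work item can't run concurrently with itself, while ordering of
distinct items is what alloc_ordered_workqueue() provides explicitly:

    #include <linux/workqueue.h>

    static struct workqueue_struct *single_wq;  /* hosts one work item */
    static struct workqueue_struct *ordered_wq; /* serializes many items */

    static int example_setup(void)
    {
            /* default max_active is fine: only one item is ever queued */
            single_wq = alloc_workqueue("single_wq", WQ_UNBOUND, 0);

            /* distinct items must run one at a time, in queueing order */
            ordered_wq = alloc_ordered_workqueue("ordered_wq", 0);

            if (!single_wq || !ordered_wq)
                    goto err;
            return 0;

    err:
            if (single_wq)
                    destroy_workqueue(single_wq);
            if (ordered_wq)
                    destroy_workqueue(ordered_wq);
            return -ENOMEM;
    }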

While I'm here: we're still debugging what's affecting WiFi performance
on some of our WiFi systems, but it's possible I'll be turning some of
these into struct kthread_worker instead. We can cross that bridge
(including potential conflicts) if/when we come to it though.
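
Since a conversion is mentioned above, here's a rough sketch of the
kthread_worker shape under discussion -- all identifiers below are
placeholders for illustration, not actual mwifiex code. A dedicated kthread
gives the scheduler a single entity whose load it can track:

    #include <linux/kthread.h>

    static struct kthread_worker *rx_worker;
    static struct kthread_work rx_work;

    static void example_rx_fn(struct kthread_work *work)
    {
            /* drain the driver's rx queue here */
    }

    static int example_rx_init(void)
    {
            rx_worker = kthread_create_worker(0, "example_rx");
            if (IS_ERR(rx_worker))
                    return PTR_ERR(rx_worker);

            kthread_init_work(&rx_work, example_rx_fn);
            return 0;
    }

    /* in the rx path, instead of queue_work(): */
    static void example_rx_kick(void)
    {
            kthread_queue_work(rx_worker, &rx_work);
    }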

Thanks,
Brian
Tejun Heo May 10, 2023, 6:16 p.m. UTC | #3
Hello,

On Wed, May 10, 2023 at 11:09:55AM -0700, Brian Norris wrote:
> I'll admit, the workqueue documentation sounds a bit like "max_active ==
> 1 + WQ_UNBOUND" is what we want ("one work item [...] active at any
> given time"), but that's more of my misunderstanding than anything --
> each work item can only be active in a single context at any given time,
> so that note is talking about distinct (i.e., more than 1) work items.

Yeah, a future patch is gonna change the semantics a bit and I'll update the
doc to be clearer.

> While I'm here: we're still debugging what's affecting WiFi performance
> on some of our WiFi systems, but it's possible I'll be turning some of
> these into struct kthread_worker instead. We can cross that bridge
> (including potential conflicts) if/when we come to it though.

Can you elaborate on the performance problem you're seeing? I'm working on a
major update for workqueue to improve its locality behavior, so if you're
experiencing issues on CPUs w/ multiple L3 caches, it'd be a good test case.

Thanks.
Brian Norris May 10, 2023, 6:57 p.m. UTC | #4
Hi,

On Wed, May 10, 2023 at 08:16:00AM -1000, Tejun Heo wrote:
> > While I'm here: we're still debugging what's affecting WiFi performance
> > on some of our WiFi systems, but it's possible I'll be turning some of
> > these into struct kthread_worker instead. We can cross that bridge
> > (including potential conflicts) if/when we come to it though.
> 
> Can you elaborate on the performance problem you're seeing? I'm working on a
> major update for workqueue to improve its locality behavior, so if you're
> experiencing issues on CPUs w/ multiple L3 caches, it'd be a good test case.

Sure!

Test case: iperf TCP RX (i.e., hits "MWIFIEX_RX_WORK_QUEUE" a lot) at
some of the higher (VHT 80 MHz) data rates.

Hardware: Mediatek MT8173 2xA53 (little) + 2xA72 (big) CPU
(I'm not familiar with its cache details)
+
Marvell SD8897 SDIO WiFi (mwifiex_sdio)

We're looking at a major regression from our 4.19 kernel to a 5.15
kernel (yeah, that's downstream reality). So far, we've found that
performance is:

(1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
work queue to one CPU (doesn't really matter which CPU, as long as it's
not the one loaded with IRQ(?) work)

(2) moderately better if we pin the CPU frequency (e.g., "performance"
cpufreq governor instead of "schedutil")

(3) moderately better (not quite as good as (2)) if we switch to a
kthread_worker and don't pin anything.

We tried (2) because we saw a lot more CPU migration on kernel 5.15
(work moves across all 4 CPUs throughout the run; on kernel 4.19 it
mostly switched between 2 CPUs).

We tried (3) suspecting some kind of EAS issue (instead of distributing
our workload onto 4 different kworkers, our work (and therefore our load
calculation) is mostly confined to a single kernel thread). But it still
seems like our issues are more than "just" EAS / cpufreq issues, since
(2) and (3) aren't as good as (1).
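
For reference, a sketch of what experiment (1) looks like in code -- only the
WQ_SYSFS flag is new relative to the driver's existing alloc_workqueue() call
shown in the patch below, and the sysfs path in the comment is the standard
location for WQ_SYSFS queues:

    adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
                                            WQ_HIGHPRI | WQ_MEM_RECLAIM |
                                            WQ_UNBOUND | WQ_SYSFS, 0);

    /*
     * Then pin it to, e.g., CPU2 from userspace:
     *   echo 4 > /sys/devices/virtual/workqueue/MWIFIEX_RX_WORK_QUEUE/cpumask
     */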

NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
range.

So we're still investigating a few other areas, but it does seem like
"locality" (in some sense of the word) is relevant. We'd probably be
open to testing any patches you have, although it's likely we'd have the
easiest time if we can port those to 5.15. We're constantly working on
getting good upstream support for Chromebook chips, but ARM SoC reality
is that it still varies a lot as to how much works upstream on any given
system.

Thanks,
Brian
Tejun Heo May 10, 2023, 7:19 p.m. UTC | #5
Hello,

On Wed, May 10, 2023 at 11:57:41AM -0700, Brian Norris wrote:
> Test case: iperf TCP RX (i.e., hits "MWIFIEX_RX_WORK_QUEUE" a lot) at
> some of the higher (VHT 80 MHz) data rates.
> 
> Hardware: Mediatek MT8173 2xA53 (little) + 2xA72 (big) CPU
> (I'm not familiar with its cache details)
> +
> Marvell SD8897 SDIO WiFi (mwifiex_sdio)

Yeah, we've had multiple similar cases on what I think are similar
configurations, which is why I'm working on improving workqueue locality.

> We're looking at a major regression from our 4.19 kernel to a 5.15
> kernel (yeah, that's downstream reality). So far, we've found that
> performance is:

That's curious. 4.19 is old but I scanned the history and there's nothing
which can cause that kind of perf regression for unbound workqueues between
4.19 and 5.15.

> (1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
> work queue to one CPU (doesn't really matter which CPU, as long as it's
> not the one loaded with IRQ(?) work)
> 
> (2) moderately better if we pin the CPU frequency (e.g., "performance"
> cpufreq governor instead of "schedutil")
> 
> (3) moderately better (not quite as good as (2)) if we switch to a
> kthread_worker and don't pin anything.

Hmm... so it's not just workqueue.

> We tried (2) because we saw a lot more CPU migration on kernel 5.15
> (work moves across all 4 CPUs throughout the run; on kernel 4.19 it
> mostly switched between 2 CPUs).

Workqueue can contribute to this but it seems more likely that scheduling
changes are also part of the story.

> We tried (3) suspecting some kind of EAS issue (instead of distributing
> our workload onto 4 different kworkers, our work (and therefore our load
> calculation) is mostly confined to a single kernel thread). But it still
> seems like our issues are more than "just" EAS / cpufreq issues, since
> (2) and (3) aren't as good as (1).
> 
> NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
> range.
> 
> So we're still investigating a few other areas, but it does seem like
> "locality" (in some sense of the word) is relevant. We'd probably be
> open to testing any patches you have, although it's likely we'd have the
> easiest time if we can port those to 5.15. We're constantly working on
> getting good upstream support for Chromebook chips, but ARM SoC reality
> is that it still varies a lot as to how much works upstream on any given
> system.

I should be able to post the patchset later today or tomorrow. It comes with
sysfs knobs to control affinity scopes and strictness, so hopefully you'll
be able to find a configuration that works without too much difficulty.

Thanks.
Brian Norris May 10, 2023, 7:50 p.m. UTC | #6
On Wed, May 10, 2023 at 09:19:20AM -1000, Tejun Heo wrote:
> On Wed, May 10, 2023 at 11:57:41AM -0700, Brian Norris wrote:
> > (1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
> > work queue to one CPU (doesn't really matter which CPU, as long as it's
> > not the one loaded with IRQ(?) work)
> > 
> > (2) moderately better if we pin the CPU frequency (e.g., "performance"
> > cpufreq governor instead of "schedutil")
> > 
> > (3) moderately better (not quite as good as (2)) if we switch to a
> > kthread_worker and don't pin anything.
> 
> Hmm... so it's not just workqueue.

Right. And not just cpufreq either.

> > We tried (2) because we saw a lot more CPU migration on kernel 5.15
> > (work moves across all 4 CPUs throughout the run; on kernel 4.19 it
> > mostly switched between 2 CPUs).
> 
> Workqueue can contribute to this but it seems more likely that scheduling
> changes are also part of the story.

Yeah, that's one theory. And in that vein, that's one reason we might
consider switching to a kthread_worker anyway, even if that doesn't
solve all of the regression -- schedutil relies on per-entity load
calculations to make decisions, and workqueues don't help the scheduler
understand that load when it's spread across N CPUs (workers). A dedicated
kthread would better represent our workload to the scheduler.

(Threaded NAPI -- mwifiex doesn't support NAPI -- takes a similar
approach, as it has its own thread per NAPI context.)
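
For completeness, a sketch of the threaded-NAPI knob for drivers that do use
NAPI -- mwifiex doesn't, so this is purely illustrative. dev_set_threaded()
gives each NAPI instance its own kthread instead of running it in softirq
context:

    #include <linux/netdevice.h>

    static int example_enable_threaded_napi(struct net_device *ndev)
    {
            /* also togglable at runtime: echo 1 > /sys/class/net/<dev>/threaded */
            return dev_set_threaded(ndev, true);
    }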

> > We tried (3) suspecting some kind of EAS issue (instead of distributing
> > our workload onto 4 different kworkers, our work (and therefore our load
> > calculation) is mostly confined to a single kernel thread). But it still
> > seems like our issues are more than "just" EAS / cpufreq issues, since
> > (2) and (3) aren't as good as (1).
> > 
> > NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
> > range.
> > 
> > So we're still investigating a few other areas, but it does seem like
> > "locality" (in some sense of the word) is relevant. We'd probably be
> > open to testing any patches you have, although it's likely we'd have the
> > easiest time if we can port those to 5.15. We're constantly working on
> > getting good upstream support for Chromebook chips, but ARM SoC reality
> > is that it still varies a lot as to how much works upstream on any given
> > system.
> 
> I should be able to post the patchset later today or tomorrow. It comes with
> sysfs knobs to control affinity scopes and strictness, so hopefully you'll
> be able to find a configuration that works without too much difficulty.

Great!

Brian
Tejun Heo May 19, 2023, 12:36 a.m. UTC | #7
On Mon, May 08, 2023 at 03:50:21PM -1000, Tejun Heo wrote:
> These workqueues only host a single work item and thus don't need an explicit
> concurrency limit. Let's use the default @max_active. This doesn't cost
> anything and clearly expresses that @max_active doesn't matter.

Applied to wq/for-6.5-cleanup-ordered.

Thanks.

Patch

diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index bcd564dc3554..5337ee4b6f10 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -3127,7 +3127,7 @@  struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
 	priv->dfs_cac_workqueue = alloc_workqueue("MWIFIEX_DFS_CAC%s",
 						  WQ_HIGHPRI |
 						  WQ_MEM_RECLAIM |
-						  WQ_UNBOUND, 1, name);
+						  WQ_UNBOUND, 0, name);
 	if (!priv->dfs_cac_workqueue) {
 		mwifiex_dbg(adapter, ERROR, "cannot alloc DFS CAC queue\n");
 		ret = -ENOMEM;
@@ -3138,7 +3138,7 @@  struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
 
 	priv->dfs_chan_sw_workqueue = alloc_workqueue("MWIFIEX_DFS_CHSW%s",
 						      WQ_HIGHPRI | WQ_UNBOUND |
-						      WQ_MEM_RECLAIM, 1, name);
+						      WQ_MEM_RECLAIM, 0, name);
 	if (!priv->dfs_chan_sw_workqueue) {
 		mwifiex_dbg(adapter, ERROR, "cannot alloc DFS channel sw queue\n");
 		ret = -ENOMEM;
diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
index ea22a08e6c08..1cd9d20cca16 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1547,7 +1547,7 @@  mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
 
 	adapter->workqueue =
 		alloc_workqueue("MWIFIEX_WORK_QUEUE",
-				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!adapter->workqueue)
 		goto err_kmalloc;
 
@@ -1557,7 +1557,7 @@  mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
 		adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
 							WQ_HIGHPRI |
 							WQ_MEM_RECLAIM |
-							WQ_UNBOUND, 1);
+							WQ_UNBOUND, 0);
 		if (!adapter->rx_workqueue)
 			goto err_kmalloc;
 		INIT_WORK(&adapter->rx_work, mwifiex_rx_work_queue);
@@ -1702,7 +1702,7 @@  mwifiex_add_card(void *card, struct completion *fw_done,
 
 	adapter->workqueue =
 		alloc_workqueue("MWIFIEX_WORK_QUEUE",
-				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!adapter->workqueue)
 		goto err_kmalloc;
 
@@ -1712,7 +1712,7 @@  mwifiex_add_card(void *card, struct completion *fw_done,
 		adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
 							WQ_HIGHPRI |
 							WQ_MEM_RECLAIM |
-							WQ_UNBOUND, 1);
+							WQ_UNBOUND, 0);
 		if (!adapter->rx_workqueue)
 			goto err_kmalloc;