Message ID | 20230509015032.3768622-3-tj@kernel.org (mailing list archive)
---|---
State | Not Applicable
Delegated to: | Kalle Valo
Series | None
Tejun Heo <tj@kernel.org> wrote:

> These workqueues only host a single work item and thus don't need an
> explicit concurrency limit. Let's use the default @max_active. This
> doesn't cost anything and clearly expresses that @max_active doesn't
> matter.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Amitkumar Karwar <amitkarwar@gmail.com>
> Cc: Ganapathi Bhat <ganapathi017@gmail.com>
> Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
> Cc: Xinming Hu <huxinming820@gmail.com>
> Cc: Kalle Valo <kvalo@kernel.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: linux-wireless@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org

I didn't review the patch but I assume it's ok. Feel free to take it via
your tree:

Acked-by: Kalle Valo <kvalo@kernel.org>

Patch set to Not Applicable.
On Mon, May 08, 2023 at 03:50:21PM -1000, Tejun Heo wrote:
> These workqueues only host a single work item and thus don't need an
> explicit concurrency limit. Let's use the default @max_active. This
> doesn't cost anything and clearly expresses that @max_active doesn't
> matter.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Amitkumar Karwar <amitkarwar@gmail.com>
> Cc: Ganapathi Bhat <ganapathi017@gmail.com>
> Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
> Cc: Xinming Hu <huxinming820@gmail.com>
> Cc: Kalle Valo <kvalo@kernel.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: linux-wireless@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org

Reviewed-by: Brian Norris <briannorris@chromium.org>

I'll admit, the workqueue documentation sounds a bit like "max_active ==
1 + WQ_UNBOUND" is what we want ("one work item [...] active at any
given time"), but that's more a misunderstanding on my part than
anything -- each work item can only be active in a single context at
any given time, so that note is talking about distinct (i.e., more than
one) work items.

While I'm here: we're still debugging what's affecting WiFi performance
on some of our systems, but it's possible I'll be turning some of these
into struct kthread_worker instead. We can cross that bridge (including
potential conflicts) if/when we come to it, though.

Thanks,
Brian
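[To make that distinction concrete, here is a minimal sketch -- illustrative
only, not code from the patch; the "demo" queue and a_fn/b_fn names are made
up. @max_active caps how many *distinct* work items may execute concurrently;
a single work item never runs in more than one context regardless of the
limit, so a queue hosting one work item gains nothing from max_active == 1.]

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *wq;
static struct work_struct work_a, work_b;

static void a_fn(struct work_struct *work) { /* ... */ }
static void b_fn(struct work_struct *work) { /* ... */ }

static int __init demo_init(void)
{
	/* 0 selects the default @max_active (WQ_DFL_ACTIVE) */
	wq = alloc_workqueue("demo", WQ_UNBOUND, 0);
	if (!wq)
		return -ENOMEM;

	INIT_WORK(&work_a, a_fn);
	INIT_WORK(&work_b, b_fn);

	/* distinct items: these may run concurrently under the default limit */
	queue_work(wq, &work_a);
	queue_work(wq, &work_b);

	/* work_a itself still never executes in two contexts at once */
	queue_work(wq, &work_a);
	return 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");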
Hello,

On Wed, May 10, 2023 at 11:09:55AM -0700, Brian Norris wrote:
> I'll admit, the workqueue documentation sounds a bit like "max_active ==
> 1 + WQ_UNBOUND" is what we want ("one work item [...] active at any
> given time"), but that's more a misunderstanding on my part than
> anything -- each work item can only be active in a single context at
> any given time, so that note is talking about distinct (i.e., more than
> one) work items.

Yeah, a future patch is gonna change the semantics a bit and I'll update
the doc to be clearer.

> While I'm here: we're still debugging what's affecting WiFi performance
> on some of our systems, but it's possible I'll be turning some of these
> into struct kthread_worker instead. We can cross that bridge (including
> potential conflicts) if/when we come to it, though.

Can you elaborate on the performance problem you're seeing? I'm working
on a major update for workqueue to improve its locality behavior, so if
you're experiencing issues on CPUs w/ multiple L3 caches, it'd be a good
test case.

Thanks.
Hi,

On Wed, May 10, 2023 at 08:16:00AM -1000, Tejun Heo wrote:
> > While I'm here: we're still debugging what's affecting WiFi performance
> > on some of our systems, but it's possible I'll be turning some of these
> > into struct kthread_worker instead. We can cross that bridge (including
> > potential conflicts) if/when we come to it, though.
>
> Can you elaborate on the performance problem you're seeing? I'm working
> on a major update for workqueue to improve its locality behavior, so if
> you're experiencing issues on CPUs w/ multiple L3 caches, it'd be a good
> test case.

Sure!

Test case: iperf TCP RX (i.e., hits "MWIFIEX_RX_WORK_QUEUE" a lot) at
some of the higher (VHT 80 MHz) data rates.

Hardware: Mediatek MT8173 2xA53 (little) + 2xA72 (big) CPU
          (I'm not familiar with its cache details)
          +
          Marvell SD8897 SDIO WiFi (mwifiex_sdio)

We're looking at a major regression from our 4.19 kernel to a 5.15
kernel (yeah, that's downstream reality). So far, we've found that
performance is:

(1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
work queue to one CPU (it doesn't really matter which CPU, as long as
it's not the one loaded with IRQ(?) work)

(2) moderately better if we pin the CPU frequency (e.g., "performance"
cpufreq governor instead of "schedutil")

(3) moderately better (not quite as good as (2)) if we switch to a
kthread_worker and don't pin anything.

We tried (2) because we saw a lot more CPU migration on kernel 5.15
(work moves across all 4 CPUs throughout the run; on kernel 4.19 it
mostly switched between 2 CPUs).

We tried (3) suspecting some kind of EAS issue (instead of distributing
our workload onto 4 different kworkers, our work (and therefore our load
calculation) is mostly confined to a single kernel thread). But it still
seems like our issues are more than "just" EAS / cpufreq issues, since
(2) and (3) aren't as good as (1).

NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
range.

So we're still investigating a few other areas, but it does seem like
"locality" (in some sense of the word) is relevant. We'd probably be
open to testing any patches you have, although it's likely we'd have the
easiest time if we can port those to 5.15. We're constantly working on
getting good upstream support for Chromebook chips, but ARM SoC reality
is that it still varies a lot as to how much works upstream on any given
system.

Thanks,
Brian
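[For reference, a sketch of what experiment (1) presumably looks like -- a
reconstruction, not code from the thread: re-create the RX queue with
WQ_SYSFS so its cpumask becomes tunable, then pin it from userspace.]

	adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
						WQ_HIGHPRI | WQ_MEM_RECLAIM |
						WQ_UNBOUND | WQ_SYSFS, 1);

	/*
	 * The queue then shows up in sysfs; pinning it to, say, CPU 2
	 * (mask bit 2 = 0x4) is done from userspace:
	 *   echo 4 > /sys/bus/workqueue/devices/MWIFIEX_RX_WORK_QUEUE/cpumask
	 */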
Hello,

On Wed, May 10, 2023 at 11:57:41AM -0700, Brian Norris wrote:
> Test case: iperf TCP RX (i.e., hits "MWIFIEX_RX_WORK_QUEUE" a lot) at
> some of the higher (VHT 80 MHz) data rates.
>
> Hardware: Mediatek MT8173 2xA53 (little) + 2xA72 (big) CPU
>           (I'm not familiar with its cache details)
>           +
>           Marvell SD8897 SDIO WiFi (mwifiex_sdio)

Yeah, we've had multiple similar cases on what I think are similar
configurations, which is why I'm working on improving workqueue
locality.

> We're looking at a major regression from our 4.19 kernel to a 5.15
> kernel (yeah, that's downstream reality). So far, we've found that
> performance is:

That's curious. 4.19 is old, but I scanned the history and there's
nothing that could cause that kind of perf regression for unbound
workqueues between 4.19 and 5.15.

> (1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
> work queue to one CPU (it doesn't really matter which CPU, as long as
> it's not the one loaded with IRQ(?) work)
>
> (2) moderately better if we pin the CPU frequency (e.g., "performance"
> cpufreq governor instead of "schedutil")
>
> (3) moderately better (not quite as good as (2)) if we switch to a
> kthread_worker and don't pin anything.

Hmm... so it's not just workqueue.

> We tried (2) because we saw a lot more CPU migration on kernel 5.15
> (work moves across all 4 CPUs throughout the run; on kernel 4.19 it
> mostly switched between 2 CPUs).

Workqueue can contribute to this, but it seems more likely that
scheduling changes are also part of the story.

> We tried (3) suspecting some kind of EAS issue (instead of distributing
> our workload onto 4 different kworkers, our work (and therefore our load
> calculation) is mostly confined to a single kernel thread). But it still
> seems like our issues are more than "just" EAS / cpufreq issues, since
> (2) and (3) aren't as good as (1).
>
> NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
> range.
>
> So we're still investigating a few other areas, but it does seem like
> "locality" (in some sense of the word) is relevant. We'd probably be
> open to testing any patches you have, although it's likely we'd have the
> easiest time if we can port those to 5.15. We're constantly working on
> getting good upstream support for Chromebook chips, but ARM SoC reality
> is that it still varies a lot as to how much works upstream on any given
> system.

I should be able to post the patchset later today or tomorrow. It comes
with sysfs knobs to control affinity scopes and strictness, so hopefully
you should be able to find the configuration that works without too much
difficulty.

Thanks.
On Wed, May 10, 2023 at 09:19:20AM -1000, Tejun Heo wrote:
> On Wed, May 10, 2023 at 11:57:41AM -0700, Brian Norris wrote:
> > (1) much better (nearly the same as 4.19) if we add WQ_SYSFS and pin the
> > work queue to one CPU (it doesn't really matter which CPU, as long as
> > it's not the one loaded with IRQ(?) work)
> >
> > (2) moderately better if we pin the CPU frequency (e.g., "performance"
> > cpufreq governor instead of "schedutil")
> >
> > (3) moderately better (not quite as good as (2)) if we switch to a
> > kthread_worker and don't pin anything.
>
> Hmm... so it's not just workqueue.

Right. And not just cpufreq either.

> > We tried (2) because we saw a lot more CPU migration on kernel 5.15
> > (work moves across all 4 CPUs throughout the run; on kernel 4.19 it
> > mostly switched between 2 CPUs).
>
> Workqueue can contribute to this, but it seems more likely that
> scheduling changes are also part of the story.

Yeah, that's one theory. And in that vein, that's one reason we might
consider switching to a kthread_worker anyway, even if that doesn't
solve all of the regression -- schedutil relies on per-entity load
calculations to make decisions, and workqueues don't help the scheduler
understand that load when it's spread across N CPUs (workers). A
dedicated kthread would better represent our workload to the scheduler.
(Threaded NAPI -- mwifiex doesn't support NAPI -- takes a similar
approach, as it has its own thread per NAPI context.)

> > We tried (3) suspecting some kind of EAS issue (instead of distributing
> > our workload onto 4 different kworkers, our work (and therefore our load
> > calculation) is mostly confined to a single kernel thread). But it still
> > seems like our issues are more than "just" EAS / cpufreq issues, since
> > (2) and (3) aren't as good as (1).
> >
> > NB: there weren't many relevant mwifiex or MTK-SDIO changes in this
> > range.
> >
> > So we're still investigating a few other areas, but it does seem like
> > "locality" (in some sense of the word) is relevant. We'd probably be
> > open to testing any patches you have, although it's likely we'd have the
> > easiest time if we can port those to 5.15.
>
> I should be able to post the patchset later today or tomorrow. It comes
> with sysfs knobs to control affinity scopes and strictness, so hopefully
> you should be able to find the configuration that works without too much
> difficulty.

Great!

Brian
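[For the record, a rough sketch of the kthread_worker conversion being
considered -- hypothetical code, not from any posted patch; only the
kthread_* API is real, the mwifiex_* names here are assumed.]

#include <linux/kthread.h>

static struct kthread_worker *rx_worker;
static struct kthread_work rx_kwork;

static void mwifiex_rx_kwork_fn(struct kthread_work *work)
{
	/* drain the RX queue, as mwifiex_rx_work_queue() does today */
}

static int mwifiex_rx_worker_init(void)
{
	/*
	 * One dedicated thread: the RX load is attributed to a single
	 * sched entity, which schedutil/EAS can track.
	 */
	rx_worker = kthread_create_worker(0, "mwifiex_rx");
	if (IS_ERR(rx_worker))
		return PTR_ERR(rx_worker);

	kthread_init_work(&rx_kwork, mwifiex_rx_kwork_fn);
	return 0;
}

/* in the RX path, instead of queue_work(adapter->rx_workqueue, ...): */
static void mwifiex_queue_rx(void)
{
	kthread_queue_work(rx_worker, &rx_kwork);
}

static void mwifiex_rx_worker_exit(void)
{
	kthread_destroy_worker(rx_worker);
}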
On Mon, May 08, 2023 at 03:50:21PM -1000, Tejun Heo wrote:
> These workqueues only host a single work item and thus don't need an
> explicit concurrency limit. Let's use the default @max_active. This
> doesn't cost anything and clearly expresses that @max_active doesn't
> matter.

Applied to wq/for-6.5-cleanup-ordered. Thanks.
diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index bcd564dc3554..5337ee4b6f10 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -3127,7 +3127,7 @@ struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
 	priv->dfs_cac_workqueue = alloc_workqueue("MWIFIEX_DFS_CAC%s",
 						  WQ_HIGHPRI |
 						  WQ_MEM_RECLAIM |
-						  WQ_UNBOUND, 1, name);
+						  WQ_UNBOUND, 0, name);
 	if (!priv->dfs_cac_workqueue) {
 		mwifiex_dbg(adapter, ERROR, "cannot alloc DFS CAC queue\n");
 		ret = -ENOMEM;
@@ -3138,7 +3138,7 @@ struct wireless_dev *mwifiex_add_virtual_intf(struct wiphy *wiphy,
 	priv->dfs_chan_sw_workqueue = alloc_workqueue("MWIFIEX_DFS_CHSW%s",
 						      WQ_HIGHPRI | WQ_UNBOUND |
-						      WQ_MEM_RECLAIM, 1, name);
+						      WQ_MEM_RECLAIM, 0, name);
 	if (!priv->dfs_chan_sw_workqueue) {
 		mwifiex_dbg(adapter, ERROR,
 			    "cannot alloc DFS channel sw queue\n");
 		ret = -ENOMEM;
diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c
index ea22a08e6c08..1cd9d20cca16 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1547,7 +1547,7 @@ mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
 
 	adapter->workqueue =
 		alloc_workqueue("MWIFIEX_WORK_QUEUE",
-				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!adapter->workqueue)
 		goto err_kmalloc;
 
@@ -1557,7 +1557,7 @@ mwifiex_reinit_sw(struct mwifiex_adapter *adapter)
 		adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
 							WQ_HIGHPRI |
 							WQ_MEM_RECLAIM |
-							WQ_UNBOUND, 1);
+							WQ_UNBOUND, 0);
 		if (!adapter->rx_workqueue)
 			goto err_kmalloc;
 		INIT_WORK(&adapter->rx_work, mwifiex_rx_work_queue);
@@ -1702,7 +1702,7 @@ mwifiex_add_card(void *card, struct completion *fw_done,
 
 	adapter->workqueue =
 		alloc_workqueue("MWIFIEX_WORK_QUEUE",
-				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+				WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
 	if (!adapter->workqueue)
 		goto err_kmalloc;
 
@@ -1712,7 +1712,7 @@ mwifiex_add_card(void *card, struct completion *fw_done,
 		adapter->rx_workqueue = alloc_workqueue("MWIFIEX_RX_WORK_QUEUE",
 							WQ_HIGHPRI |
 							WQ_MEM_RECLAIM |
-							WQ_UNBOUND, 1);
+							WQ_UNBOUND, 0);
 		if (!adapter->rx_workqueue)
 			goto err_kmalloc;
 		INIT_WORK(&adapter->rx_work, mwifiex_rx_work_queue);
These workqueues only host a single work item and thus don't need an
explicit concurrency limit. Let's use the default @max_active. This
doesn't cost anything and clearly expresses that @max_active doesn't
matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Amitkumar Karwar <amitkarwar@gmail.com>
Cc: Ganapathi Bhat <ganapathi017@gmail.com>
Cc: Sharvari Harisangam <sharvari.harisangam@nxp.com>
Cc: Xinming Hu <huxinming820@gmail.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/wireless/marvell/mwifiex/cfg80211.c | 4 ++--
 drivers/net/wireless/marvell/mwifiex/main.c     | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)