[[PATCH,v2,iwl-next] v2 3/4] idpf: convert workqueues to unbound

Message ID 20240826181032.3042222-4-manojvishy@google.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 16 this patch: 16
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 3 blamed authors not CCed: madhu.chittim@intel.com willemb@google.com pavan.kumar.linga@intel.com; 5 maintainers not CCed: pabeni@redhat.com kuba@kernel.org madhu.chittim@intel.com willemb@google.com pavan.kumar.linga@intel.com
netdev/build_clang success Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 16 this patch: 16
netdev/checkpatch warning WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")'
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-08-27--03-00 (tests: 713)

Commit Message

Manoj Vishwanathan Aug. 26, 2024, 6:10 p.m. UTC
From: Marco Leogrande <leogrande@google.com>

When a workqueue is created with `WQ_UNBOUND`, its work items are
served by special worker-pools, whose host workers are not bound to
any specific CPU. In the default configuration (i.e. when
`queue_delayed_work` and friends do not specify which CPU to run the
work item on), `WQ_UNBOUND` allows the work item to be executed on any
CPU in the same node of the CPU it was enqueued on. While this
solution potentially sacrifices locality, it avoids contention with
other processes that might dominate the CPU time of the processor the
work item was scheduled on.

This is not just a theoretical problem: in a praticular scenario
misconfigured process was hogging most of the time from CPU0, leaving
less than 0.5% of its CPU time to the kworker. The IDPF workqueues
that were using the kworker on CPU0 suffered large completion delays
as a result, causing performance degradation, timeouts and eventual
system crash.

Tested:

* I have also run a manual test to gauge the performance
  improvement. The test consists of an antagonist process
  (`./stress --cpu 2`) consuming as much of CPU 0 as possible. This
  process is run under `taskset 01` to bind it to CPU0, and its
  priority is changed with `chrt -pQ 9900 10000 ${pid}` and
  `renice -n -20 ${pid}` after start.

  Then, the IDPF driver is forced to prefer CPU0 by editing all calls
  to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0.

  Finally, `ktraces` for the workqueue events are collected.

  Without the current patch, the antagonist process can force
  arbitrary delays between `workqueue_queue_work` and
  `workqueue_execute_start`, that in my tests were as high as
  `30ms`. With the current patch applied, the workqueue can be
  migrated to another unloaded CPU in the same node, and, keeping
  everything else equal, the maximum delay I could see was `6us`.

Fixes: 0fe45467a1041 (idpf: add create vport and netdev configuration)
Signed-off-by: Marco Leogrande <leogrande@google.com>
Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>
---
 drivers/net/ethernet/intel/idpf/idpf_main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)
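
The delay measurement described under "Tested:" above can be sketched in userspace C as follows. This is only a minimal illustration, not the tooling used for the test: `struct wq_event`, `enum wq_event_kind` and `wq_max_start_delay_ns()` are hypothetical names, and the events are assumed to have already been extracted from the trace into a time-ordered array keyed by the work item's address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record of one workqueue trace event. */
enum wq_event_kind {
	WQ_EV_QUEUE_WORK,    /* models a workqueue_queue_work event */
	WQ_EV_EXECUTE_START, /* models a workqueue_execute_start event */
};

struct wq_event {
	enum wq_event_kind kind;
	uint64_t work;  /* address of the work item, used to pair events */
	uint64_t ts_ns; /* event timestamp in nanoseconds */
};

/*
 * Return the largest queue-to-start delay, in nanoseconds, found in a
 * time-ordered array of events. Each execute_start event is paired with
 * the nearest earlier event for the same work item, and only counted if
 * that earlier event is a queue_work. Returns 0 if no pair is found.
 */
uint64_t wq_max_start_delay_ns(const struct wq_event *ev, size_t n)
{
	uint64_t max_delay = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].kind != WQ_EV_EXECUTE_START)
			continue;

		/* Walk backwards to the nearest event for this work item. */
		for (size_t j = i; j-- > 0; ) {
			if (ev[j].work != ev[i].work)
				continue;
			if (ev[j].kind == WQ_EV_QUEUE_WORK) {
				/* The input is assumed to be time-ordered. */
				assert(ev[i].ts_ns >= ev[j].ts_ns);
				uint64_t delay = ev[i].ts_ns - ev[j].ts_ns;

				if (delay > max_delay)
					max_delay = delay;
			}
			break;
		}
	}

	return max_delay;
}
```

Running this once over a trace captured without the patch and once over a trace captured with it would give the two maxima that the commit message compares.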

Comments

Jacob Keller Aug. 28, 2024, 10:02 p.m. UTC | #1
On 8/26/2024 11:10 AM, Manoj Vishwanathan wrote:
> From: Marco Leogrande <leogrande@google.com>
> 
> When a workqueue is created with `WQ_UNBOUND`, its work items are
> served by special worker-pools, whose host workers are not bound to
> any specific CPU. In the default configuration (i.e. when
> `queue_delayed_work` and friends do not specify which CPU to run the
> work item on), `WQ_UNBOUND` allows the work item to be executed on any
> CPU in the same node of the CPU it was enqueued on. While this
> solution potentially sacrifices locality, it avoids contention with
> other processes that might dominate the CPU time of the processor the
> work item was scheduled on.
> 
> This is not just a theoretical problem: in a praticular scenario

Nit: s/praticular/particular/

> misconfigured process was hogging most of the time from CPU0, leaving
> less than 0.5% of its CPU time to the kworker. The IDPF workqueues
> that were using the kworker on CPU0 suffered large completion delays
> as a result, causing performance degradation, timeouts and eventual
> system crash.
> 

Curious how the delay could result in a full system crash. That seems
like some other concurrency issue. I guess something like a Tx timeout
could happen though.

> Tested:
> 
> * I have also run a manual test to gauge the performance
>   improvement. The test consists of an antagonist process
>   (`./stress --cpu 2`) consuming as much of CPU 0 as possible. This
>   process is run under `taskset 01` to bind it to CPU0, and its
>   priority is changed with `chrt -pQ 9900 10000 ${pid}` and
>   `renice -n -20 ${pid}` after start.
> 
>   Then, the IDPF driver is forced to prefer CPU0 by editing all calls
>   to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0.
> 
>   Finally, `ktraces` for the workqueue events are collected.
> 
>   Without the current patch, the antagonist process can force
>   arbitrary delays between `workqueue_queue_work` and
>   `workqueue_execute_start`, that in my tests were as high as
>   `30ms`. With the current patch applied, the workqueue can be
>   migrated to another unloaded CPU in the same node, and, keeping
>   everything else equal, the maximum delay I could see was `6us`.
> 

Hmm. I don't have a direct issue with using WQ_UNBOUND, and I can't
think of any reason these work queue tasks *need* to be CPU bound.

I do feel like there may be other solutions to managing the tasks on the
system such that this isn't necessary.

However, if using WQ_UNBOUND solves these problems and is simpler in
that system administrators are less likely to screw things up, I think
it's a net positive.

I do not know if there are any other side effects of WQ_UNBOUND, so take
this with a grain of salt:
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>


> Fixes: 0fe45467a1041 (idpf: add create vport and netdev configuration)
> Signed-off-by: Marco Leogrande <leogrande@google.com>
> Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>
> ---
>  drivers/net/ethernet/intel/idpf/idpf_main.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_main.c b/drivers/net/ethernet/intel/idpf/idpf_main.c
> index db476b3314c8..dfd56fc5ff65 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_main.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_main.c
> @@ -174,7 +174,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	pci_set_master(pdev);
>  	pci_set_drvdata(pdev, adapter);
>  
> -	adapter->init_wq = alloc_workqueue("%s-%s-init", 0, 0,
> +	adapter->init_wq = alloc_workqueue("%s-%s-init",
> +					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>  					   dev_driver_string(dev),
>  					   dev_name(dev));
>  	if (!adapter->init_wq) {
> @@ -183,7 +184,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		goto err_free;
>  	}
>  
> -	adapter->serv_wq = alloc_workqueue("%s-%s-service", 0, 0,
> +	adapter->serv_wq = alloc_workqueue("%s-%s-service",
> +					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>  					   dev_driver_string(dev),
>  					   dev_name(dev));
>  	if (!adapter->serv_wq) {
> @@ -192,7 +194,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		goto err_serv_wq_alloc;
>  	}
>  
> -	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx", 0, 0,
> +	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx",
> +					  WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>  					  dev_driver_string(dev),
>  					  dev_name(dev));
>  	if (!adapter->mbx_wq) {
> @@ -201,7 +204,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		goto err_mbx_wq_alloc;
>  	}
>  
> -	adapter->stats_wq = alloc_workqueue("%s-%s-stats", 0, 0,
> +	adapter->stats_wq = alloc_workqueue("%s-%s-stats",
> +					    WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>  					    dev_driver_string(dev),
>  					    dev_name(dev));
>  	if (!adapter->stats_wq) {
> @@ -210,7 +214,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		goto err_stats_wq_alloc;
>  	}
>  
> -	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event", 0, 0,
> +	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event",
> +					       WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>  					       dev_driver_string(dev),
>  					       dev_name(dev));
>  	if (!adapter->vc_event_wq) {

This seems like quite a lot of work queues for a driver :D

Pavan Kumar Linga Aug. 29, 2024, 4:02 p.m. UTC | #2
On 8/26/2024 11:10 AM, Manoj Vishwanathan wrote:
> From: Marco Leogrande <leogrande@google.com>
> 
> When a workqueue is created with `WQ_UNBOUND`, its work items are
> served by special worker-pools, whose host workers are not bound to
> any specific CPU. In the default configuration (i.e. when
> `queue_delayed_work` and friends do not specify which CPU to run the
> work item on), `WQ_UNBOUND` allows the work item to be executed on any
> CPU in the same node of the CPU it was enqueued on. While this
> solution potentially sacrifices locality, it avoids contention with
> other processes that might dominate the CPU time of the processor the
> work item was scheduled on.
> 
> This is not just a theoretical problem: in a praticular scenario
> misconfigured process was hogging most of the time from CPU0, leaving
> less than 0.5% of its CPU time to the kworker. The IDPF workqueues
> that were using the kworker on CPU0 suffered large completion delays
> as a result, causing performance degradation, timeouts and eventual
> system crash.
> 
> Tested:
> 
> * I have also run a manual test to gauge the performance
>    improvement. The test consists of an antagonist process
>    (`./stress --cpu 2`) consuming as much of CPU 0 as possible. This
>    process is run under `taskset 01` to bind it to CPU0, and its
>    priority is changed with `chrt -pQ 9900 10000 ${pid}` and
>    `renice -n -20 ${pid}` after start.
> 
>    Then, the IDPF driver is forced to prefer CPU0 by editing all calls
>    to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0.
> 
>    Finally, `ktraces` for the workqueue events are collected.
> 
>    Without the current patch, the antagonist process can force
>    arbitrary delays between `workqueue_queue_work` and
>    `workqueue_execute_start`, that in my tests were as high as
>    `30ms`. With the current patch applied, the workqueue can be
>    migrated to another unloaded CPU in the same node, and, keeping
>    everything else equal, the maximum delay I could see was `6us`.
> 
> Fixes: 0fe45467a1041 (idpf: add create vport and netdev configuration)
> Signed-off-by: Marco Leogrande <leogrande@google.com>
> Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>

Except for the nit (s/praticular/particular/) that Jake mentioned, the
changes look good to me.

Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>

> ---
>   drivers/net/ethernet/intel/idpf/idpf_main.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_main.c b/drivers/net/ethernet/intel/idpf/idpf_main.c
> index db476b3314c8..dfd56fc5ff65 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_main.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_main.c
> @@ -174,7 +174,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   	pci_set_master(pdev);
>   	pci_set_drvdata(pdev, adapter);
>   
> -	adapter->init_wq = alloc_workqueue("%s-%s-init", 0, 0,
> +	adapter->init_wq = alloc_workqueue("%s-%s-init",
> +					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>   					   dev_driver_string(dev),
>   					   dev_name(dev));
>   	if (!adapter->init_wq) {
> @@ -183,7 +184,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   		goto err_free;
>   	}
>   
> -	adapter->serv_wq = alloc_workqueue("%s-%s-service", 0, 0,
> +	adapter->serv_wq = alloc_workqueue("%s-%s-service",
> +					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>   					   dev_driver_string(dev),
>   					   dev_name(dev));
>   	if (!adapter->serv_wq) {
> @@ -192,7 +194,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   		goto err_serv_wq_alloc;
>   	}
>   
> -	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx", 0, 0,
> +	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx",
> +					  WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>   					  dev_driver_string(dev),
>   					  dev_name(dev));
>   	if (!adapter->mbx_wq) {
> @@ -201,7 +204,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   		goto err_mbx_wq_alloc;
>   	}
>   
> -	adapter->stats_wq = alloc_workqueue("%s-%s-stats", 0, 0,
> +	adapter->stats_wq = alloc_workqueue("%s-%s-stats",
> +					    WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>   					    dev_driver_string(dev),
>   					    dev_name(dev));
>   	if (!adapter->stats_wq) {
> @@ -210,7 +214,8 @@ static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>   		goto err_stats_wq_alloc;
>   	}
>   
> -	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event", 0, 0,
> +	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event",
> +					       WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
>   					       dev_driver_string(dev),
>   					       dev_name(dev));
>   	if (!adapter->vc_event_wq) {

Patch

diff --git a/drivers/net/ethernet/intel/idpf/idpf_main.c b/drivers/net/ethernet/intel/idpf/idpf_main.c
index db476b3314c8..dfd56fc5ff65 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_main.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_main.c
@@ -174,7 +174,8 @@  static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	pci_set_master(pdev);
 	pci_set_drvdata(pdev, adapter);
 
-	adapter->init_wq = alloc_workqueue("%s-%s-init", 0, 0,
+	adapter->init_wq = alloc_workqueue("%s-%s-init",
+					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
 					   dev_driver_string(dev),
 					   dev_name(dev));
 	if (!adapter->init_wq) {
@@ -183,7 +184,8 @@  static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_free;
 	}
 
-	adapter->serv_wq = alloc_workqueue("%s-%s-service", 0, 0,
+	adapter->serv_wq = alloc_workqueue("%s-%s-service",
+					   WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
 					   dev_driver_string(dev),
 					   dev_name(dev));
 	if (!adapter->serv_wq) {
@@ -192,7 +194,8 @@  static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_serv_wq_alloc;
 	}
 
-	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx", 0, 0,
+	adapter->mbx_wq = alloc_workqueue("%s-%s-mbx",
+					  WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
 					  dev_driver_string(dev),
 					  dev_name(dev));
 	if (!adapter->mbx_wq) {
@@ -201,7 +204,8 @@  static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_mbx_wq_alloc;
 	}
 
-	adapter->stats_wq = alloc_workqueue("%s-%s-stats", 0, 0,
+	adapter->stats_wq = alloc_workqueue("%s-%s-stats",
+					    WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
 					    dev_driver_string(dev),
 					    dev_name(dev));
 	if (!adapter->stats_wq) {
@@ -210,7 +214,8 @@  static int idpf_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto err_stats_wq_alloc;
 	}
 
-	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event", 0, 0,
+	adapter->vc_event_wq = alloc_workqueue("%s-%s-vc_event",
+					       WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
 					       dev_driver_string(dev),
 					       dev_name(dev));
 	if (!adapter->vc_event_wq) {
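
The placement behaviour that the commit message attributes to `WQ_UNBOUND` can be pictured with a toy userspace model. This is a sketch under stated assumptions, not how the kernel selects a CPU: in the kernel the scheduler places unbound workers within the CPUs their pool allows. `struct toy_cpu` and `toy_pick_cpu()` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one CPU in the toy model. */
struct toy_cpu {
	int node;    /* NUMA node the CPU belongs to */
	bool hogged; /* true if another task monopolises this CPU */
};

/*
 * Pick the CPU on which a work item runs in the toy model.
 *
 * Bound queue:   always the CPU the item was queued on.
 * Unbound queue: the queueing CPU if it is free, otherwise the first
 *                free CPU in the same node. If every CPU in the node
 *                is hogged, fall back to the queueing CPU.
 */
int toy_pick_cpu(const struct toy_cpu *cpus, size_t ncpus,
		 size_t queued_on, bool unbound)
{
	assert(queued_on < ncpus);

	if (!unbound || !cpus[queued_on].hogged)
		return (int)queued_on;

	for (size_t i = 0; i < ncpus; i++) {
		if (cpus[i].node == cpus[queued_on].node && !cpus[i].hogged)
			return (int)i;
	}

	return (int)queued_on;
}
```

With CPU 0 hogged and CPU 1 free in the same node, a bound item queued on CPU 0 stays on CPU 0, while an unbound item moves to CPU 1. That is the situation the antagonist test in the commit message sets up.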