diff mbox series

[net-next] ibmvnic: Toggle between queue types in affinity mapping

Message ID 20230123221727.30423-1-nnac123@linux.ibm.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net-next] ibmvnic: Toggle between queue types in affinity mapping

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 11 maintainers not CCed: edumazet@google.com mpe@ellerman.id.au christophe.leroy@csgroup.eu davem@davemloft.net danymadden@us.ibm.com linuxppc-dev@lists.ozlabs.org kuba@kernel.org tlfalcon@linux.ibm.com pabeni@redhat.com ricklind@linux.ibm.com npiggin@gmail.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 46 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Nick Child Jan. 23, 2023, 10:17 p.m. UTC
Previously, ibmvnic IRQs were assigned to CPU numbers by assigning all
the IRQs for transmit queues, then all the IRQs for receive queues. With
multi-threaded processors, in a heavy RX or TX environment, physical
cores would either be overloaded or underutilized (due to the IRQ
assignment algorithm). This approach is sub-optimal because IRQs for the
same queue type (RX or TX) would be bound to adjacent CPU numbers,
meaning they were more likely to be contending for the same core.

For example, in a system with 64 CPUs and 32 queues, the IRQs would
be bound to CPUs in the following pattern:

IRQ type |  CPU number
-----------------------
TX0	 |	0-1
TX1	 |	2-3
<etc>
RX0	 |	32-33
RX1	 |	34-35
<etc>

Observe that with SMT-8, the first 4 TX queues would be sharing the
same core.

A more optimal algorithm would balance the number of RX and TX IRQs
across the physical cores. Therefore, to increase performance, distribute
RX and TX IRQs across cores by alternating between assigning IRQs for RX
and TX queues to CPUs.
On a system with 64 CPUs and 32 queues, this results in the following
pattern (binding is done in reverse order for readable code):

IRQ type |  CPU number
-----------------------
TX15	 |	0-1
RX15	 |	2-3
TX14	 |	4-5
RX14	 |	6-7
<etc>

Observe that with SMT-8, there is an equal distribution of RX and TX
IRQs per core. In the above case, each core handles 2 TX and 2 RX IRQs.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Comments

Jakub Kicinski Jan. 25, 2023, 2:39 a.m. UTC | #1
On Mon, 23 Jan 2023 16:17:27 -0600 Nick Child wrote:
> A more optimal algorithm would balance the number RX and TX IRQ's across
> the physical cores. Therefore, to increase performance, distribute RX and
> TX IRQs across cores by alternating between assigning IRQs for RX and TX
> queues to CPUs.
> With a system with 64 CPUs and 32 queues, this results in the following
> pattern (binding is done in reverse order for readable code):
> 
> IRQ type |  CPU number
> -----------------------
> TX15	 |	0-1
> RX15	 |	2-3
> TX14	 |	4-5
> RX14	 |	6-7

Seems sensible but why did you invert the order? To save LoC?
Nick Child Jan. 25, 2023, 4:55 p.m. UTC | #2
On 1/24/23 20:39, Jakub Kicinski wrote:
> On Mon, 23 Jan 2023 16:17:27 -0600 Nick Child wrote:
>> A more optimal algorithm would balance the number RX and TX IRQ's across
>> the physical cores. Therefore, to increase performance, distribute RX and
>> TX IRQs across cores by alternating between assigning IRQs for RX and TX
>> queues to CPUs.
>> With a system with 64 CPUs and 32 queues, this results in the following
>> pattern (binding is done in reverse order for readable code):
>>
>> IRQ type |  CPU number
>> -----------------------
>> TX15	 |	0-1
>> RX15	 |	2-3
>> TX14	 |	4-5
>> RX14	 |	6-7
> 
> Seems sensible but why did you invert the order? To save LoC?

Thanks for checking this out Jakub.

Correct, the effect on performance is the same and IMO the algorithm
is more readable. Less so about minimizing lines and more about
making the code understandable for the next dev.
Jakub Kicinski Jan. 25, 2023, 6:14 p.m. UTC | #3
On Wed, 25 Jan 2023 10:55:20 -0600 Nick Child wrote:
> On 1/24/23 20:39, Jakub Kicinski wrote:
> > On Mon, 23 Jan 2023 16:17:27 -0600 Nick Child wrote:  
> >> A more optimal algorithm would balance the number RX and TX IRQ's across
> >> the physical cores. Therefore, to increase performance, distribute RX and
> >> TX IRQs across cores by alternating between assigning IRQs for RX and TX
> >> queues to CPUs.
> >> With a system with 64 CPUs and 32 queues, this results in the following
> >> pattern (binding is done in reverse order for readable code):
> >>
> >> IRQ type |  CPU number
> >> -----------------------
> >> TX15	 |	0-1
> >> RX15	 |	2-3
> >> TX14	 |	4-5
> >> RX14	 |	6-7  
> > 
> > Seems sensible but why did you invert the order? To save LoC?  
> 
> Thanks for checking this out Jakub.
> 
> Correct, the effect on performance is the same and IMO the algorithm
> is more readable. Less so about minimizing lines and more about
> making the code understandable for the next dev.

I spend way too much time explaining IRQ pinning to developers at my
"day job" :( Stuff like threaded NAPI means that more and more people
interact with it. So I think having a more easily understandable mapping
is worth the extra complexity in the driver. By which I mean:

Tx0 -> 0-1
Rx0 -> 2-3
Tx1 -> 4-5

IOW  Qn  -> n*4+is_rx*2 - n*4+is_rx*2+1
Rick Lindsley Jan. 25, 2023, 7:16 p.m. UTC | #4
On 1/24/23 18:39, Jakub Kicinski wrote:

> Seems sensible but why did you invert the order? To save LoC?

Proc zero is often the default vector for other interrupts. If we're going to diddle with the IRQs for performance, it would make sense to me to steer around proc 0.

Rick

Patch

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index e19a6bb3f444..314a72cef592 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -254,6 +254,7 @@  static void ibmvnic_set_affinity(struct ibmvnic_adapter *adapter)
 	int num_txqs = adapter->num_active_tx_scrqs;
 	int total_queues, stride, stragglers, i;
 	unsigned int num_cpu, cpu;
+	bool is_rx_queue;
 	int rc = 0;
 
 	netdev_dbg(adapter->netdev, "%s: Setting irq affinity hints", __func__);
@@ -273,14 +274,22 @@  static void ibmvnic_set_affinity(struct ibmvnic_adapter *adapter)
 	/* next available cpu to assign irq to */
 	cpu = cpumask_next(-1, cpu_online_mask);
 
-	for (i = 0; i < num_txqs; i++) {
-		queue = txqs[i];
+	for (i = 0; i < total_queues; i++) {
+		is_rx_queue = false;
+		/* balance core load by alternating rx and tx assignments */
+		if ((i % 2 == 1 && num_rxqs > 0) || num_txqs == 0) {
+			queue = rxqs[--num_rxqs];
+			is_rx_queue = true;
+		} else {
+			queue = txqs[--num_txqs];
+		}
+
 		rc = ibmvnic_set_queue_affinity(queue, &cpu, &stragglers,
 						stride);
 		if (rc)
 			goto out;
 
-		if (!queue)
+		if (!queue || is_rx_queue)
 			continue;
 
 		rc = __netif_set_xps_queue(adapter->netdev,
@@ -291,14 +300,6 @@  static void ibmvnic_set_affinity(struct ibmvnic_adapter *adapter)
 				    __func__, i, rc);
 	}
 
-	for (i = 0; i < num_rxqs; i++) {
-		queue = rxqs[i];
-		rc = ibmvnic_set_queue_affinity(queue, &cpu, &stragglers,
-						stride);
-		if (rc)
-			goto out;
-	}
-
 out:
 	if (rc) {
 		netdev_warn(adapter->netdev,