
[net-next,RFC,0/5] Configuring NAPI instance for a queue

Message ID 171234737780.5075.5717254021446469741.stgit@anambiarhost.jf.intel.com (mailing list archive)

Message

Nambiar, Amritha April 5, 2024, 8:09 p.m. UTC
Support user configurability of the queue<->NAPI association. The netdev-genl
interface is extended with a 'queue-set' command. Currently, the set
command enables associating a NAPI ID with a queue, but it can also be extended
to support configuring other attributes. To set the NAPI attribute, the
command requires the queue identifiers and the ID of the NAPI instance that
the queue is to be associated with.

Previous discussion:
https://lore.kernel.org/netdev/32e32635-ca75-99b8-2285-1d87a29b6d89@intel.com/

Example:
$ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
{'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}

The queue<->NAPI association is achieved by mapping the queue to the
corresponding interrupt vector. This interface thereby exposes
configurability of the queue<->vector mapping.

The interface is generic such that any Tx or Rx queue(s) can be associated
with a NAPI instance, so it is possible to have multiple queues (Tx queues
only, Rx queues only, or both Tx and Rx queues) on a NAPI/vector. This is a
step away from the usual 1:1 queue<->vector model, although the queue-pair
association can still be retained by configuring the Tx[n]-Rx[n] queue pair
to the same NAPI instance.

The use case for configuring NAPI pollers for queues this way from userspace
is to limit the number of pollers (by sharing multiple queues on a vector).
This frees up more cores for application threads by reducing the number of
cores needed for poller threads.
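
For illustration, here is a rough Python sketch (not part of this series) that
drives the queue-set command via cli.py to collapse several Rx queues onto a
single NAPI/poller; the ifindex, queue IDs and NAPI ID are placeholders:

#!/usr/bin/env python3
# Sketch only: share several Rx queues on one NAPI instance by issuing
# the queue-set command shown below once per queue. The ifindex, queue
# IDs and NAPI ID are placeholders for a real setup.
import json
import subprocess

IFINDEX = 12              # placeholder interface index
RX_QUEUES = [0, 1, 2, 3]  # Rx queues that should share one poller
TARGET_NAPI = 595         # placeholder NAPI ID to attach the queues to

def queue_set(ifindex, qid, qtype, napi_id):
    """Associate queue (qid, qtype) with napi_id via the queue-set command."""
    req = {"ifindex": ifindex, "id": qid, "type": qtype, "napi-id": napi_id}
    out = subprocess.check_output(
        ["./cli.py", "--spec", "netdev.yaml", "--do", "queue-set",
         "--json", json.dumps(req)])
    return out.decode().strip()

for q in RX_QUEUES:
    print(queue_set(IFINDEX, q, 0, TARGET_NAPI))  # type 0 == rx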

Patches 1-2 introduce the kernel interface hooks to set the NAPI ID attribute
for a queue.
Patches 3-4 prepare the ice driver for device configuration. The queue<->vector
mapping configured from userspace is not reflected in the HW unless the
queue is reset. Currently, the driver supports resetting only the entire VSI or
a queue pair. Resetting the VSI for each queue configuration is overkill,
so add ice driver support to reset a single vector. Also, add support to handle
vectors that become unused due to the dynamic nature of this configurability.
Patch 5 implements support for queue<->NAPI configuration in the ice driver.

Testing Hints:
1. Move an Rx queue from its current NAPI to a different NAPI during traffic
   (a rough sketch of such a test follows this list)
2. Move an Rx queue back to its initial NAPI during traffic
3. Move all Rx queues of NAPI-A to NAPI-B during traffic
4. Repeat the tests above for Tx queues
5. Move all Tx and Rx queues of a NAPI to a new NAPI during traffic
6. Move a queue to an unused NAPI during traffic
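
A rough sketch of hint 1 (not part of this series), using the existing
queue-get command to verify the move; the ifindex, queue ID and NAPI ID are
placeholders, and traffic is assumed to be generated separately:

#!/usr/bin/env python3
# Rough sketch for testing hint 1: move an Rx queue to a different NAPI
# and verify the association via queue-get. The ifindex, queue ID and
# NAPI ID below are placeholders; cli.py prints Python literals (as in
# the example output above), hence ast.literal_eval.
import ast
import json
import subprocess

CLI = ["./cli.py", "--spec", "netdev.yaml"]
IFINDEX = 12
RXQ = 0
NEW_NAPI = 596   # placeholder: a different NAPI ID on the same device

def do(cmd, args):
    out = subprocess.check_output(CLI + ["--do", cmd, "--json", json.dumps(args)])
    out = out.decode().strip()
    return ast.literal_eval(out) if out else None

before = do("queue-get", {"ifindex": IFINDEX, "id": RXQ, "type": 0})
do("queue-set", {"ifindex": IFINDEX, "id": RXQ, "type": 0, "napi-id": NEW_NAPI})
after = do("queue-get", {"ifindex": IFINDEX, "id": RXQ, "type": 0})

assert after["napi-id"] == NEW_NAPI, (before, after)
print(f"rx queue {RXQ}: napi {before.get('napi-id')} -> {after['napi-id']}")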

---

Amritha Nambiar (5):
      netdev-genl: spec: Extend netdev netlink spec in YAML for queue-set
      netdev-genl: Add netlink framework functions for queue-set NAPI
      ice: Add support to enable/disable a vector
      ice: Handle unused vectors dynamically
      ice: Add driver support for ndo_queue_set_napi


 Documentation/netlink/specs/netdev.yaml   |   20 +
 drivers/net/ethernet/intel/ice/ice.h      |   13 +
 drivers/net/ethernet/intel/ice/ice_lib.c  |  667 +++++++++++++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_lib.h  |   12 +
 drivers/net/ethernet/intel/ice/ice_main.c |   15 -
 drivers/net/ethernet/intel/ice/ice_xsk.c  |   34 -
 include/linux/netdevice.h                 |    7 
 include/uapi/linux/netdev.h               |    1 
 net/core/netdev-genl-gen.c                |   15 +
 net/core/netdev-genl-gen.h                |    1 
 net/core/netdev-genl.c                    |  120 +++++
 tools/include/uapi/linux/netdev.h         |    1 
 12 files changed, 845 insertions(+), 61 deletions(-)

--

Comments

Jakub Kicinski April 9, 2024, 11:21 p.m. UTC | #1
On Fri, 05 Apr 2024 13:09:28 -0700 Amritha Nambiar wrote:
> $ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
> {'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}

NAPI ID is not stable. What happens if you set the ID, bring the
device down and up again? I think we need to make NAPI IDs stable.

What happens if you change the channel count? Do we lose the config?
We try never to lose explicit user config. I think for simplicity
we should store the config in the core / common code.

How does the user know whether queue <> NAPI association is based
on driver defaults or explicit configuration? I think I mentioned
this in earlier discussions but the configuration may need to be
detached from the existing objects (for one thing they may not exist 
at all when the device is down).

Last but not least your driver patch implements the start/stop steps
of the "queue API" I think we should pull that out into the core.

Also the tests now exist - take a look at the sample one in
tools/testing/selftests/drivers/net/stats.py
Would be great to have all future netdev family extensions accompanied
by tests which can run both on real HW and netdevsim.
Nambiar, Amritha April 11, 2024, 10:46 p.m. UTC | #2
On 4/9/2024 4:21 PM, Jakub Kicinski wrote:
> On Fri, 05 Apr 2024 13:09:28 -0700 Amritha Nambiar wrote:
>> $ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
>> {'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}
> 
> NAPI ID is not stable. What happens if you set the ID, bring the
> device down and up again? I think we need to make NAPI IDs stable.
> 

I tried this (device down/up and check NAPIs) on both bnxt and intel/ice.
On bnxt: new NAPI IDs are created sequentially once the device is brought
up after being taken down.
On ice: the NAPI IDs are stable and remain the same once the device is
brought up after being taken down.

In the case of ice, device down/up executes napi_disable/napi_enable. The
NAPI IDs are not lost because netif_napi_del is not called at IFF_DOWN. On
IFF_DOWN, the IRQ associations with the OS are freed, but the resources
allocated for the vectors, and hence the NAPIs for the vectors, persist
(unless unload/reconfig).

> What happens if you change the channel count? Do we lose the config?
> We try never to lose explicit user config. I think for simplicity
> we should store the config in the core / common code.
> 

Yes, we lose the config when re-configuring channels. The reconfig
path involves freeing the vectors and reallocating them based on the new
channel config, so for the NAPIs associated with the vectors,
netif_napi_del and netif_napi_add execute, creating new NAPI IDs
sequentially.

Wouldn't losing the explicit user config make sense in this case? By
changing the channel count, the user has updated the queue layout, the
queue<>vector mappings, etc., so I think the previous configs from set
queue<>NAPI should be overwritten by the new config from set-channel.

> How does the user know whether queue <> NAPI association is based
> on driver defaults or explicit configuration?

I am not sure of this. ethtool shows pre-set defaults and current 
settings, but in this case, it is tricky :(

> I think I mentioned
> this in earlier discussions but the configuration may need to be
> detached from the existing objects (for one thing they may not exist
> at all when the device is down).
> 

Yes, we did have that discussion about detaching queues from NAPI. But 
I am not sure how to accomplish that. Any thoughts on what other 
possible object could be used for the configuration?
WRT ice, when the device is down, the queues are listed and exist as 
inactive queues, the NAPI IDs exist, and the IRQ associations with the OS 
are freed.

> Last but not least your driver patch implements the start/stop steps
> of the "queue API" I think we should pull that out into the core.
> 

Agreed, it would be good to have these steps in the core, but I think the 
challenge is that we would still end up with a lot of code in the driver 
as well, due to all the hardware-centric bits in it.

> Also the tests now exist - take a look at the sample one in
> tools/testing/selftests/drivers/net/stats.py
> Would be great to have all future netdev family extensions accompanied
> by tests which can run both on real HW and netdevsim.

Okay, I will write tests for the new extensions here.
Jakub Kicinski April 12, 2024, 1:47 a.m. UTC | #3
On Thu, 11 Apr 2024 15:46:45 -0700 Nambiar, Amritha wrote:
> On 4/9/2024 4:21 PM, Jakub Kicinski wrote:
> > On Fri, 05 Apr 2024 13:09:28 -0700 Amritha Nambiar wrote:  
> >> $ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
> >> {'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}  
> > 
> > NAPI ID is not stable. What happens if you set the ID, bring the
> > device down and up again? I think we need to make NAPI IDs stable.
> 
> I tried this (device down/up and check NAPIs) on both bnxt and intel/ice.
> On bnxt: New NAPI IDs are created sequentially once the device is up 
> after turning down.
> On ice: The NAPI IDs are stable and remains the same once the device is 
> up after turning down.
> 
> In case of ice, device down/up executes napi_disable/napi_enable. The 
> NAPI IDs are not lost as netif_napi_del is not called at IFF_DOWN. On 
> IFF_DOWN, the IRQs associations with the OS are freed, but the resources 
> allocated for the vectors and hence the NAPIs for the vectors persists 
> (unless unload/reconfig).

SG! So let's just make sure we cover that in tests.

> > What happens if you change the channel count? Do we lose the config?
> > We try never to lose explicit user config. I think for simplicity
> > we should store the config in the core / common code.
> 
> Yes, we lose the config in case of re-configuring channels. The reconfig 
> path involves freeing the vectors and reallocating based on the new 
> channel config, so, for the NAPIs associated with the vectors, 
> netif_napi_del and netif_napi_add executes creating new NAPI IDs 
> sequentially.
> 
> Wouldn't losing the explicit user config make sense in this case? By 
> changing the channel count, the user has updated the queue layout, the 
> queue<>vector mappings etc., so I think, the previous configs from set 
> queue<>NAPI should be overwritten with the new config from set-channel.

We do prevent indirection table from being reset on channel count
change. I think same logic applies here..

> > How does the user know whether queue <> NAPI association is based
> > on driver defaults or explicit configuration?  
> 
> I am not sure of this. ethtool shows pre-set defaults and current 
> settings, but in this case, it is tricky :(

Can you say more about the use case for moving the queues around?
If you just want to have fewer NAPI vectors and more queues, but
don't care about exact mapping - we could probably come up with
a simpler API, no? Are the queues stack queues or also AF_XDP?

> > I think I mentioned
> > this in earlier discussions but the configuration may need to be
> > detached from the existing objects (for one thing they may not exist
> > at all when the device is down).
> 
> Yes, we did have that discussion about detaching queues from NAPI. But, 
> I am not sure how to accomplish that. Any thoughts on what other 
> possible object can be used for the configuration?

We could stick to the queue as the object perhaps. The "target NAPI"
would just be part of the config passed to the alloc/start callbacks.

> WRT ice, when the device is down, the queues are listed and exists as 
> inactive queues, NAPI IDs exists, IRQs associations with the OS are freed.
> 
> > Last but not least your driver patch implements the start/stop steps
> > of the "queue API" I think we should pull that out into the core.
> 
> Agree, it would be good to have these steps in the core, but I think the 
> challenge is that we would still end up with a lot of code in the driver 
> as well, due to all the hardware-centric bits in it.

For one feature I think adding code in the core is not beneficial.
But we have multiple adjacent needs, so when we add up your work,
zero copy, page pool config, maybe queue alloc.. hopefully the code
in the core will be net positive.

> > Also the tests now exist - take a look at the sample one in
> > tools/testing/selftests/drivers/net/stats.py
> > Would be great to have all future netdev family extensions accompanied
> > by tests which can run both on real HW and netdevsim.  
> 
> Okay, I will write tests for the new extensions here.
Nambiar, Amritha April 18, 2024, 9:23 p.m. UTC | #4
On 4/11/2024 6:47 PM, Jakub Kicinski wrote:
> On Thu, 11 Apr 2024 15:46:45 -0700 Nambiar, Amritha wrote:
>> On 4/9/2024 4:21 PM, Jakub Kicinski wrote:
>>> On Fri, 05 Apr 2024 13:09:28 -0700 Amritha Nambiar wrote:
>>>> $ ./cli.py --spec netdev.yaml --do queue-set  --json='{"ifindex": 12, "id": 0, "type": 0, "napi-id": 595}'
>>>> {'id': 0, 'ifindex': 12, 'napi-id': 595, 'type': 'rx'}
>>>
>>> NAPI ID is not stable. What happens if you set the ID, bring the
>>> device down and up again? I think we need to make NAPI IDs stable.
>>
>> I tried this (device down/up and check NAPIs) on both bnxt and intel/ice.
>> On bnxt: New NAPI IDs are created sequentially once the device is up
>> after turning down.
>> On ice: The NAPI IDs are stable and remains the same once the device is
>> up after turning down.
>>
>> In case of ice, device down/up executes napi_disable/napi_enable. The
>> NAPI IDs are not lost as netif_napi_del is not called at IFF_DOWN. On
>> IFF_DOWN, the IRQs associations with the OS are freed, but the resources
>> allocated for the vectors and hence the NAPIs for the vectors persists
>> (unless unload/reconfig).
> 
> SG! So let's just make sure we cover that in tests.
> 
>>> What happens if you change the channel count? Do we lose the config?
>>> We try never to lose explicit user config. I think for simplicity
>>> we should store the config in the core / common code.
>>
>> Yes, we lose the config in case of re-configuring channels. The reconfig
>> path involves freeing the vectors and reallocating based on the new
>> channel config, so, for the NAPIs associated with the vectors,
>> netif_napi_del and netif_napi_add executes creating new NAPI IDs
>> sequentially.
>>
>> Wouldn't losing the explicit user config make sense in this case? By
>> changing the channel count, the user has updated the queue layout, the
>> queue<>vector mappings etc., so I think, the previous configs from set
>> queue<>NAPI should be overwritten with the new config from set-channel.
> 
> We do prevent indirection table from being reset on channel count
> change. I think same logic applies here..
> 

Okay. I tried this on bnxt (this may be outside the scope and secondary, but 
hopefully the additional information helps).
It looks like bnxt differentiates whether the indirection table was based on 
driver defaults or on explicit user configuration. If the indirection table 
came from driver defaults, then changing the channel count to fewer queues is 
allowed. If it was based on explicit user configuration, changing the channel 
count to fewer queues is not allowed, as the indirection table might then 
point to inactive queues. So the RSS user configuration is preserved by 
blocking new channel configurations that do not align with it.
Applying the same logic here would mean that changing the channel count to 
fewer queues than the 'default queue for the last user-configured NAPI ID' 
would have to be prevented. This becomes difficult to track unless the 
pre-set default queue <> NAPI configs are maintained.

>>> How does the user know whether queue <> NAPI association is based
>>> on driver defaults or explicit configuration?
>>
>> I am not sure of this. ethtool shows pre-set defaults and current
>> settings, but in this case, it is tricky :(
> 
> Can you say more about the use case for moving the queues around?
> If you just want to have fewer NAPI vectors and more queues, but
> don't care about exact mapping - we could probably come up with
> a simpler API, no? Are the queues stack queues or also AF_XDP?
> 

I'll try to explain. The goal is to have fewer NAPI pollers. The number 
of NAPI pollers is the same as the number of active NAPIs (one kthread per 
NAPI). It is possible to limit the number of pollers by mapping 
multiple queues onto an interrupt vector (fewer vectors, more queues) 
implicitly in the driver. But we are looking for a more granular 
approach: in our case, the queues are grouped into 
queue-groups/rss-contexts. We would like to reduce the number of pollers 
within certain selected queue-groups/rss-contexts (not all of the 
queue-groups), hence the need for configurability.
This would benefit our hyper-threading use case, where a single physical 
core can be used for both network and application processing. If the 
NAPI-to-queue association is known, we can pin the NAPI thread to a 
logical core and the application thread to the corresponding sibling 
logical core.

The queues are stack queues, not AF_XDP.
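
For reference, a minimal sketch of the pinning step described above (not part
of this series), assuming threaded NAPI is enabled via
/sys/class/net/<dev>/threaded so the poller runs as a kthread named
napi/<dev>-<napi_id>; the device name, NAPI ID and CPU numbers are
placeholders:

#!/usr/bin/env python3
# Sketch of the hyper-threading pinning described above. Assumes threaded
# NAPI is enabled (echo 1 > /sys/class/net/<dev>/threaded), so the poller
# runs as a kthread named "napi/<dev>-<napi_id>". The device, NAPI ID and
# CPU numbers are placeholders; the sibling core would normally be read
# from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
import os
import subprocess

DEV = "eth0"     # placeholder device
NAPI_ID = 595    # NAPI ID the queues were associated with
NAPI_CPU = 2     # logical core for the NAPI poller kthread
APP_CPU = 10     # sibling logical core for the application thread

# Pin the NAPI poller kthread to NAPI_CPU.
pid = subprocess.check_output(["pgrep", "-x", f"napi/{DEV}-{NAPI_ID}"])
subprocess.check_call(["taskset", "-pc", str(NAPI_CPU), pid.decode().strip()])

# Pin the application (this process, for illustration) to the sibling core.
os.sched_setaffinity(0, {APP_CPU})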

>>> I think I mentioned
>>> this in earlier discussions but the configuration may need to be
>>> detached from the existing objects (for one thing they may not exist
>>> at all when the device is down).
>>
>> Yes, we did have that discussion about detaching queues from NAPI. But,
>> I am not sure how to accomplish that. Any thoughts on what other
>> possible object can be used for the configuration?
> 
> We could stick to the queue as the object perhaps. The "target NAPI"
> would just be part of the config passed to the alloc/start callbacks.
> 

Okay.

>> WRT ice, when the device is down, the queues are listed and exists as
>> inactive queues, NAPI IDs exists, IRQs associations with the OS are freed.
>>
>>> Last but not least your driver patch implements the start/stop steps
>>> of the "queue API" I think we should pull that out into the core.
>>
>> Agree, it would be good to have these steps in the core, but I think the
>> challenge is that we would still end up with a lot of code in the driver
>> as well, due to all the hardware-centric bits in it.
> 
> For one feature I think adding code in the core is not beneficial.
> But we have multiple adjacent needs, so when we add up your work,
> zero copy, page pool config, maybe queue alloc.. hopefully the code
> in the core will be net positive.
> 
>>> Also the tests now exist - take a look at the sample one in
>>> tools/testing/selftests/drivers/net/stats.py
>>> Would be great to have all future netdev family extensions accompanied
>>> by tests which can run both on real HW and netdevsim.
>>
>> Okay, I will write tests for the new extensions here.
Jakub Kicinski April 19, 2024, 1:37 a.m. UTC | #5
On Thu, 18 Apr 2024 14:23:03 -0700 Nambiar, Amritha wrote:
> >> I am not sure of this. ethtool shows pre-set defaults and current
> >> settings, but in this case, it is tricky :(  
> > 
> > Can you say more about the use case for moving the queues around?
> > If you just want to have fewer NAPI vectors and more queues, but
> > don't care about exact mapping - we could probably come up with
> > a simpler API, no? Are the queues stack queues or also AF_XDP?
> 
> I'll try to explain. The goal is to have fewer NAPI pollers. The number 
> of NAPI pollers is the same as the number of active NAPIs (kthread per 
> NAPI). It is possible to limit the number of pollers by mapping 
> multiples queues on an interrupt vector (fewer vectors, more queues) 
> implicitly in the driver. But, we are looking for a more granular 
> approach, in our case, the queues are grouped into 
> queue-groups/rss-contexts. We would like to reduce the number of pollers 
> within certain selected queue-groups/rss-contexts (not all the 
> queue-groups), hence need the configurability.
> This would benefit our hyper-threading use case, where a single physical 
> core can be used for both network and application processing. If the 
> NAPI to queue association is known, we can pin the NAPI thread to the 
> logical core and the application thread to the corresponding sibling 
> logical core.

Could you provide a more detailed example of a desired configuration? 
I'm not sure I'm getting the point.

What's the point of having multiple queues if they end up in the same
NAPI?