
[net-next,5/5] netdev-genl: Support setting per-NAPI config values

Message ID 20240829131214.169977-6-jdamato@fastly.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series Add support for per-NAPI config via netlink

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl fail Tree is dirty after regen; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 54 this patch: 54
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang fail Errors and warnings before: 120 this patch: 105
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 4063 this patch: 3900
netdev/checkpatch warning WARNING: line length of 94 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 12 this patch: 12
netdev/source_inline success Was 0 now: 0

Commit Message

Joe Damato Aug. 29, 2024, 1:12 p.m. UTC
Add support to set per-NAPI defer_hard_irqs and gro_flush_timeout.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Martin Karsten <mkarsten@uwaterloo.ca>
Tested-by: Martin Karsten <mkarsten@uwaterloo.ca>
---
 Documentation/netlink/specs/netdev.yaml | 11 ++++++
 include/uapi/linux/netdev.h             |  1 +
 net/core/netdev-genl-gen.c              | 14 ++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  | 45 +++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  1 +
 6 files changed, 73 insertions(+)

Comments

Jakub Kicinski Aug. 29, 2024, 10:31 p.m. UTC | #1
On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:
> +	napi = napi_by_id(napi_id);
> +	if (napi)
> +		err = netdev_nl_napi_set_config(napi, info);
> +	else
> +		err = -EINVAL;

if (napi) {
...
} else {
	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
	err = -ENOENT;
}
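
For reference, a minimal sketch of the set doit with that change folded
in, following the shape of the existing napi-get doit (the handler and
helper names, netdev_nl_napi_set_doit / netdev_nl_napi_set_config, are
assumed from the patch context rather than quoted from it):

/* Sketch only; names assumed from the patch context. */
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
{
        struct napi_struct *napi;
        u32 napi_id;
        int err;

        if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
                return -EINVAL;

        napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);

        rtnl_lock();

        napi = napi_by_id(napi_id);
        if (napi) {
                err = netdev_nl_napi_set_config(napi, info);
        } else {
                NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
                err = -ENOENT;
        }

        rtnl_unlock();

        return err;
}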

> +      doc: Set configurable NAPI instance settings.

We should pause and think here how configuring NAPI params should
behave. NAPI instances are ephemeral, if you close and open the
device (or for some drivers change any BPF or ethtool setting)
the NAPIs may get wiped and recreated, discarding all configuration.

This is not how the sysfs API behaves, the sysfs settings on the device
survive close. It's (weirdly?) also not how queues behave, because we
have struct netdev{_rx,}_queue to store stuff persistently. Even tho
you'd think queues are as ephemeral as NAPIs if not more.

I guess we can either document this, and move on (which may be fine,
you have more practical experience than me). Or we can add an internal
concept of a "channel" (which perhaps maybe if you squint is what
ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
net_device and store such config there. For simplicity of matching
config to NAPIs we can assume drivers add NAPI instances in order. 
If driver wants to do something more fancy we can add a variant of
netif_napi_add() which specifies the channel/storage to use.

Thoughts? I may be overly sensitive to the ephemeral thing, maybe
I work with unfortunate drivers...
Joe Damato Aug. 30, 2024, 10:43 a.m. UTC | #2
On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:
> > +	napi = napi_by_id(napi_id);
> > +	if (napi)
> > +		err = netdev_nl_napi_set_config(napi, info);
> > +	else
> > +		err = -EINVAL;
> 
> if (napi) {
> ...
> } else {
> 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> 	err = -ENOENT;
> }

Thanks, I'll make that change in the v2.

Should I send a Fixes for the same pattern in
netdev_nl_napi_get_doit ?
 
> > +      doc: Set configurable NAPI instance settings.
> 
> We should pause and think here how configuring NAPI params should
> behave. NAPI instances are ephemeral, if you close and open the
> device (or for some drivers change any BPF or ethtool setting)
> the NAPIs may get wiped and recreated, discarding all configuration.
> 
> This is not how the sysfs API behaves, the sysfs settings on the device
> survive close. It's (weirdly?) also not how queues behave, because we
> have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> you'd think queues are as ephemeral as NAPIs if not more.
> 
> I guess we can either document this, and move on (which may be fine,
> you have more practical experience than me). Or we can add an internal
> concept of a "channel" (which perhaps maybe if you squint is what
> ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> net_device and store such config there. For simplicity of matching
> config to NAPIs we can assume drivers add NAPI instances in order. 
> If driver wants to do something more fancy we can add a variant of
> netif_napi_add() which specifies the channel/storage to use.
> 
> Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> I work with unfortunate drivers...

Thanks for pointing this out. I think this is an important case to
consider. Here's how I'm thinking about it.

There are two cases:

1) sysfs setting is used by existing/legacy apps: If the NAPIs are
discarded and recreated, the code I added to netif_napi_add_weight
in patches 1 and 3 should take care of that case, preserving how sysfs
works today, I believe. I think we are good on this case?

2) apps using netlink to set various custom settings. This seems
like a case where a future extension can be made to add a notifier
for NAPI changes (like the netdevice notifier?).

If you think this is a good idea, then we'd do something like:
  1. Document that the NAPI settings are wiped when NAPIs are wiped
  2. In the future (not part of this series) a NAPI notifier is
     added
  3. User apps can then listen for NAPI create/delete events
     and update settings when a NAPI is created. It would be
     helpful, I think, for user apps to know about NAPI
     create/delete events in general because it means NAPI IDs are
     changing.

One could argue:

  When wiping/recreating a NAPI for an existing HW queue, that HW
  queue gets a new NAPI ID associated with it. User apps operating
  at this level probably care about NAPI IDs changing (as it affects
  epoll busy poll). Since the settings in this series are per-NAPI
  (and not per HW queue), the argument could be that user apps need
  to setup NAPIs when they are created and settings do not persist
  between NAPIs with different IDs even if associated with the same
  HW queue.

Admittedly, from the perspective of a user it would be nice if a new
NAPI created for an existing HW queue retained the previous
settings so that I, as the user, can do less work.

But, what happens if a HW queue is destroyed and recreated? Will any
HW settings be retained? And does that have any influence on what we
do in software? See below.

This part of your message:

> we can assume drivers add NAPI instances in order. If driver wants
> to do something more fancy we can add a variant of
> netif_napi_add() which specifies the channel/storage to use.

assuming drivers will "do a thing", so to speak, makes me uneasy.

I started to wonder: how do drivers handle per-queue HW IRQ coalesce
settings when queue counts increase? It's a different, but adjacent
problem, I think.

I tried a couple experiments on mlx5 and got very strange results
suitable for their own thread and I didn't want to get this thread
too far off track.

I think you have much more practical experience when it comes to
dealing with drivers, so I am happy to follow your lead on this one,
but assuming drivers will "do a thing" seems mildly scary to me with
limited driver experience.

My two goals with this series are:
  1. Make it possible to set these values per NAPI
  2. Unblock the IRQ suspension series by threading the suspend
     parameter through the code path carved in this series

So, I'm happy to proceed with this series as you prefer whether
that's documentation or "napi_storage"; I think you are probably the
best person to answer this question :)
Jakub Kicinski Aug. 30, 2024, 9:22 p.m. UTC | #3
On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > +	napi = napi_by_id(napi_id);
> > > +	if (napi)
> > > +		err = netdev_nl_napi_set_config(napi, info);
> > > +	else
> > > +		err = -EINVAL;  
> > 
> > if (napi) {
> > ...
> > } else {
> > 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> > 	err = -ENOENT;
> > }  
> 
> Thanks, I'll make that change in the v2.
> 
> Should I send a Fixes for the same pattern in
> netdev_nl_napi_get_doit ?

SG, standalone patch is good, FWIW, no need to add to the series.

> > > +      doc: Set configurable NAPI instance settings.  
> > 
> > We should pause and think here how configuring NAPI params should
> > behave. NAPI instances are ephemeral, if you close and open the
> > device (or for some drivers change any BPF or ethtool setting)
> > the NAPIs may get wiped and recreated, discarding all configuration.
> > 
> > This is not how the sysfs API behaves, the sysfs settings on the device
> > survive close. It's (weirdly?) also not how queues behave, because we
> > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > you'd think queues are as ephemeral as NAPIs if not more.
> > 
> > I guess we can either document this, and move on (which may be fine,
> > you have more practical experience than me). Or we can add an internal
> > concept of a "channel" (which perhaps maybe if you squint is what
> > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > net_device and store such config there. For simplicity of matching
> > config to NAPIs we can assume drivers add NAPI instances in order. 
> > If driver wants to do something more fancy we can add a variant of
> > netif_napi_add() which specifies the channel/storage to use.
> > 
> > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > I work with unfortunate drivers...  
> 
> Thanks for pointing this out. I think this is an important case to
> consider. Here's how I'm thinking about it.
> 
> There are two cases:
> 
> 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> discarded and recreated, the code I added to netif_napi_add_weight
> in patch 1 and 3 should take care of that case preserving how sysfs
> works today, I believe. I think we are good on this case ?

Agreed.

> 2) apps using netlink to set various custom settings. This seems
> like a case where a future extension can be made to add a notifier
> for NAPI changes (like the netdevice notifier?).

Yes, the notifier may help, but it's a bit of a stop gap / fallback.

> If you think this is a good idea, then we'd do something like:
>   1. Document that the NAPI settings are wiped when NAPIs are wiped
>   2. In the future (not part of this series) a NAPI notifier is
>      added
>   3. User apps can then listen for NAPI create/delete events
>      and update settings when a NAPI is created. It would be
>      helpful, I think, for user apps to know about NAPI
>      create/delete events in general because it means NAPI IDs are
>      changing.
> 
> One could argue:
> 
>   When wiping/recreating a NAPI for an existing HW queue, that HW
>   queue gets a new NAPI ID associated with it. User apps operating
>   at this level probably care about NAPI IDs changing (as it affects
>   epoll busy poll). Since the settings in this series are per-NAPI
>   (and not per HW queue), the argument could be that user apps need
>   to setup NAPIs when they are created and settings do not persist
>   between NAPIs with different IDs even if associated with the same
>   HW queue.

IDK if the fact that NAPI ID gets replaced was intentional in the first
place. I would venture a guess that the person who added the IDs was
working with NICs which have stable NAPI instances once the device is
opened. This is, unfortunately, not universally the case.

I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
on an open device. Closer we get to queue API the more dynamic the whole
setup will become (read: the more often reconfigurations will happen).

> Admittedly, from the perspective of a user it would be nice if a new
> NAPI created for an existing HW queue retained the previous
> settings so that I, as the user, can do less work.
> 
> But, what happens if a HW queue is destroyed and recreated? Will any
> HW settings be retained? And does that have any influence on what we
> do in software? See below.

Up to the driver, today. But settings we store in queue structs in 
the core are not wiped.

> This part of your message:
> 
> > we can assume drivers add NAPI instances in order. If driver wants
> > to do something more fancy we can add a variant of
> > netif_napi_add() which specifies the channel/storage to use.  
> 
> assuming drivers will "do a thing", so to speak, makes me uneasy.

Yeah.. :(

> I started to wonder: how do drivers handle per-queue HW IRQ coalesce
> settings when queue counts increase? It's a different, but adjacent
> problem, I think.
> 
> I tried a couple experiments on mlx5 and got very strange results
> suitable for their own thread and I didn't want to get this thread
> too far off track.

Yes, but ethtool is an old shallow API from the times when semantics
were simpler. It's precisely this mess which we try to avoid by storing
more of the config in the core, in a consistent fashion.

> I think you have much more practical experience when it comes to
> dealing with drivers, so I am happy to follow your lead on this one,
> but assuming drivers will "do a thing" seems mildly scary to me with
> limited driver experience.
> 
> My two goals with this series are:
>   1. Make it possible to set these values per NAPI
>   2. Unblock the IRQ suspension series by threading the suspend
>      parameter through the code path carved in this series
> 
> So, I'm happy to proceed with this series as you prefer whether
> that's documentation or "napi_storage"; I think you are probably the
> best person to answer this question :)

How do you feel about making this configuration opt-in / require driver
changes? What I'm thinking is that having the new "netif_napi_add()"
variant (or perhaps extending netif_napi_set_irq()) take an extra
"index" parameter would make the whole thing much simpler.

Index would basically be an integer 0..n, where n is the number of
IRQs configured for the driver. The index of a NAPI instance would
likely match the queue ID of the queue the NAPI serves.

We can then allocate an array of "napi_configs" in net_device -
like we allocate queues, the array size would be max(num_rx_queue,
num_tx_queues). We just need to store a couple of ints so it will
be tiny compared to queue structs, anyway.
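
Concretely, the storage could look something like this (rough sketch
only; struct napi_config, dev->napi_config and the helper name are
made-up names based on this discussion, not existing code):

struct napi_config {
        unsigned long gro_flush_timeout;
        u32 defer_hard_irqs;
};

/* Called from alloc_netdev_mqs() with its txqs/rxqs arguments,
 * mirroring how the queue arrays are sized there.
 */
static int netif_alloc_napi_config(struct net_device *dev,
                                   unsigned int txqs, unsigned int rxqs)
{
        dev->napi_config = kvcalloc(max(txqs, rxqs),
                                    sizeof(*dev->napi_config), GFP_KERNEL);
        return dev->napi_config ? 0 : -ENOMEM;
}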

The NAPI_SET netlink op can then work based on NAPI index rather 
than the ephemeral NAPI ID. It can apply the config to all live
NAPI instances with that index (of which there really should only 
be one, unless driver is mid-reconfiguration somehow but even that
won't cause issues, we can give multiple instances the same settings)
and also store the user config in the array in net_device.
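
A sketch of that set path (napi->index and dev->napi_config are the
assumed names from the storage sketch above; the per-NAPI fields are
the ones added earlier in this series):

/* Store the config at the NAPI index and push it to any live NAPI
 * instance currently carrying that index.
 */
static void napi_set_config_by_index(struct net_device *dev,
                                     unsigned int index,
                                     const struct napi_config *cfg)
{
        struct napi_struct *napi;

        dev->napi_config[index] = *cfg;

        list_for_each_entry(napi, &dev->napi_list, dev_list) {
                if (napi->index != index)
                        continue;
                napi->defer_hard_irqs = cfg->defer_hard_irqs;
                napi->gro_flush_timeout = cfg->gro_flush_timeout;
        }
}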

When a new NAPI instance is associated with a NAPI index it should get
all the config associated with that index applied.

Thoughts? Does that make sense, and if so do you think it's an
over-complication?
Joe Damato Aug. 31, 2024, 5:27 p.m. UTC | #4
On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > +	napi = napi_by_id(napi_id);
> > > > +	if (napi)
> > > > +		err = netdev_nl_napi_set_config(napi, info);
> > > > +	else
> > > > +		err = -EINVAL;  
> > > 
> > > if (napi) {
> > > ...
> > > } else {
> > > 	NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID])
> > > 	err = -ENOENT;
> > > }  
> > 
> > Thanks, I'll make that change in the v2.
> > 
> > Should I send a Fixes for the same pattern in
> > netdev_nl_napi_get_doit ?
> 
> SG, standalone patch is good, FWIW, no need to add to the series.

Done. TBH: couldn't tell if it was a fix for net or a net-next
thing.
 
> > > > +      doc: Set configurable NAPI instance settings.  
> > > 
> > > We should pause and think here how configuring NAPI params should
> > > behave. NAPI instances are ephemeral, if you close and open the
> > > device (or for some drivers change any BPF or ethtool setting)
> > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > 
> > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > survive close. It's (weirdly?) also not how queues behave, because we
> > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > you'd think queues are as ephemeral as NAPIs if not more.
> > > 
> > > I guess we can either document this, and move on (which may be fine,
> > > you have more practical experience than me). Or we can add an internal
> > > concept of a "channel" (which perhaps maybe if you squint is what
> > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > net_device and store such config there. For simplicity of matching
> > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > If driver wants to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> > > 
> > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > I work with unfortunate drivers...  
> > 
> > Thanks for pointing this out. I think this is an important case to
> > consider. Here's how I'm thinking about it.
> > 
> > There are two cases:
> > 
> > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > discarded and recreated, the code I added to netif_napi_add_weight
> > in patch 1 and 3 should take care of that case preserving how sysfs
> > works today, I believe. I think we are good on this case ?
> 
> Agreed.
> 
> > 2) apps using netlink to set various custom settings. This seems
> > like a case where a future extension can be made to add a notifier
> > for NAPI changes (like the netdevice notifier?).
> 
> Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> 
> > If you think this is a good idea, then we'd do something like:
> >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> >   2. In the future (not part of this series) a NAPI notifier is
> >      added
> >   3. User apps can then listen for NAPI create/delete events
> >      and update settings when a NAPI is created. It would be
> >      helpful, I think, for user apps to know about NAPI
> >      create/delete events in general because it means NAPI IDs are
> >      changing.
> > 
> > One could argue:
> > 
> >   When wiping/recreating a NAPI for an existing HW queue, that HW
> >   queue gets a new NAPI ID associated with it. User apps operating
> >   at this level probably care about NAPI IDs changing (as it affects
> >   epoll busy poll). Since the settings in this series are per-NAPI
> >   (and not per HW queue), the argument could be that user apps need
> >   to setup NAPIs when they are created and settings do not persist
> >   between NAPIs with different IDs even if associated with the same
> >   HW queue.
> 
> IDK if the fact that NAPI ID gets replaced was intentional in the first
> place. I would venture a guess that the person who added the IDs was
> working with NICs which have stable NAPI instances once the device is
> opened. This is, unfortunately, not universally the case.
> 
> I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> on an open device. Closer we get to queue API the more dynamic the whole
> setup will become (read: the more often reconfigurations will happen).
> 
> > Admittedly, from the perspective of a user it would be nice if a new
> > NAPI created for an existing HW queue retained the previous
> > settings so that I, as the user, can do less work.
> > 
> > But, what happens if a HW queue is destroyed and recreated? Will any
> > HW settings be retained? And does that have any influence on what we
> > do in software? See below.
> 
> Up to the driver, today. But settings we store in queue structs in 
> the core are not wiped.
> 
> > This part of your message:
> > 
> > > we can assume drivers add NAPI instances in order. If driver wants
> > > to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.  
> > 
> > assuming drivers will "do a thing", so to speak, makes me uneasy.
> 
> Yeah.. :(
> 
> > I started to wonder: how do drivers handle per-queue HW IRQ coalesce
> > settings when queue counts increase? It's a different, but adjacent
> > problem, I think.
> > 
> > I tried a couple experiments on mlx5 and got very strange results
> > suitable for their own thread and I didn't want to get this thread
> > too far off track.
> 
> Yes, but ethtool is an old shallow API from the times when semantics
> were simpler. It's precisely this mess which we try to avoid by storing
> more of the config in the core, in a consistent fashion.
> 
> > I think you have much more practical experience when it comes to
> > dealing with drivers, so I am happy to follow your lead on this one,
> > but assuming drivers will "do a thing" seems mildly scary to me with
> > limited driver experience.
> > 
> > My two goals with this series are:
> >   1. Make it possible to set these values per NAPI
> >   2. Unblock the IRQ suspension series by threading the suspend
> >      parameter through the code path carved in this series
> > 
> > So, I'm happy to proceed with this series as you prefer whether
> > that's documentation or "napi_storage"; I think you are probably the
> > best person to answer this question :)
> 
> How do you feel about making this configuration opt-in / require driver
> changes? What I'm thinking is that having the new "netif_napi_add()"
> variant (or perhaps extending netif_napi_set_irq()) to take an extra
> "index" parameter would make the whole thing much simpler.

I think if we are going to go this way, then opt-in is probably the
way to go. This series would include the necessary changes for mlx5,
in that case (because that's what I have access to) so that the new
variant has a user?

> Index would basically be an integer 0..n, where n is the number of
> IRQs configured for the driver. The index of a NAPI instance would
> likely match the queue ID of the queue the NAPI serves.
> 
> We can then allocate an array of "napi_configs" in net_device -
> like we allocate queues, the array size would be max(num_rx_queue,
> num_tx_queues). We just need to store a couple of ints so it will
> be tiny compared to queue structs, anyway.

I assume napi_storage exists for both combined RX/TX NAPIs (i.e.
drivers that multiplex RX/TX on a single NAPI like mlx5) as well
as drivers which create NAPIs that are RX or TX-only, right?

If so, it seems like we'd either need to:
  - Do something more complicated when computing how much NAPI
    storage to make, or
  - Provide a different path for drivers which don't multiplex and
    create some number of (for example) TX-only NAPIs ?

I guess I'm just imagining a weird case where a driver has 8 RX
queues but 64 TX queues. max of that is 64, so we'd be missing 8
napi_storage ?

Sorry, I'm probably just missing something about the implementation
details you summarized above.

> The NAPI_SET netlink op can then work based on NAPI index rather 
> than the ephemeral NAPI ID. It can apply the config to all live
> NAPI instances with that index (of which there really should only 
> be one, unless driver is mid-reconfiguration somehow but even that
> won't cause issues, we can give multiple instances the same settings)
> and also store the user config in the array in net_device.

I understand what you are proposing. I suppose napi-get could be
extended to include the NAPI index, too?

Then users could map queues to NAPI indexes to queues (via NAPI ID)?

> When new NAPI instance is associate with a NAPI index it should get
> all the config associated with that index applied.
> 
> Thoughts? Does that makes sense, and if so do you think it's an
> over-complication?

It feels a bit tricky, to me, as it seems there are some edge cases
to be careful with (queue count change). I could probably give the
implementation a try and see where I end up.

Having these settings per-NAPI would be really useful and being able
to support IRQ suspension would be useful, too.

I think being thoughtful about how we get there is important; I'm a
little wary of getting side tracked, but I trust your judgement and
if you think this is worth exploring I'll think on it some more.

- Joe
Joe Damato Sept. 2, 2024, 4:56 p.m. UTC | #5
On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > +      doc: Set configurable NAPI instance settings.  
> > > 
> > > We should pause and think here how configuring NAPI params should
> > > behave. NAPI instances are ephemeral, if you close and open the
> > > device (or for some drivers change any BPF or ethtool setting)
> > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > 
> > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > survive close. It's (weirdly?) also not how queues behave, because we
> > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > you'd think queues are as ephemeral as NAPIs if not more.
> > > 
> > > I guess we can either document this, and move on (which may be fine,
> > > you have more practical experience than me). Or we can add an internal
> > > concept of a "channel" (which perhaps maybe if you squint is what
> > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > net_device and store such config there. For simplicity of matching
> > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > If driver wants to do something more fancy we can add a variant of
> > > netif_napi_add() which specifies the channel/storage to use.
> > > 
> > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > I work with unfortunate drivers...  
> > 
> > Thanks for pointing this out. I think this is an important case to
> > consider. Here's how I'm thinking about it.
> > 
> > There are two cases:
> > 
> > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > discarded and recreated, the code I added to netif_napi_add_weight
> > in patch 1 and 3 should take care of that case preserving how sysfs
> > works today, I believe. I think we are good on this case ?
> 
> Agreed.
> 
> > 2) apps using netlink to set various custom settings. This seems
> > like a case where a future extension can be made to add a notifier
> > for NAPI changes (like the netdevice notifier?).
> 
> Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> 
> > If you think this is a good idea, then we'd do something like:
> >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> >   2. In the future (not part of this series) a NAPI notifier is
> >      added
> >   3. User apps can then listen for NAPI create/delete events
> >      and update settings when a NAPI is created. It would be
> >      helpful, I think, for user apps to know about NAPI
> >      create/delete events in general because it means NAPI IDs are
> >      changing.
> > 
> > One could argue:
> > 
> >   When wiping/recreating a NAPI for an existing HW queue, that HW
> >   queue gets a new NAPI ID associated with it. User apps operating
> >   at this level probably care about NAPI IDs changing (as it affects
> >   epoll busy poll). Since the settings in this series are per-NAPI
> >   (and not per HW queue), the argument could be that user apps need
> >   to setup NAPIs when they are created and settings do not persist
> >   between NAPIs with different IDs even if associated with the same
> >   HW queue.
> 
> IDK if the fact that NAPI ID gets replaced was intentional in the first
> place. I would venture a guess that the person who added the IDs was
> working with NICs which have stable NAPI instances once the device is
> opened. This is, unfortunately, not universally the case.
> 
> I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> on an open device. Closer we get to queue API the more dynamic the whole
> setup will become (read: the more often reconfigurations will happen).
>

[...]

> > I think you have much more practical experience when it comes to
> > dealing with drivers, so I am happy to follow your lead on this one,
> > but assuming drivers will "do a thing" seems mildly scary to me with
> > limited driver experience.
> > 
> > My two goals with this series are:
> >   1. Make it possible to set these values per NAPI
> >   2. Unblock the IRQ suspension series by threading the suspend
> >      parameter through the code path carved in this series
> > 
> > So, I'm happy to proceed with this series as you prefer whether
> > that's documentation or "napi_storage"; I think you are probably the
> > best person to answer this question :)
> 
> How do you feel about making this configuration opt-in / require driver
> changes? What I'm thinking is that having the new "netif_napi_add()"
> variant (or perhaps extending netif_napi_set_irq()) to take an extra
> "index" parameter would make the whole thing much simpler.

What about extending netif_queue_set_napi instead? That function
takes a napi and a queue index.

Locally I kind of hacked up something simple that:
  - Allocates napi_storage in net_device in alloc_netdev_mqs
  - Modifies netif_queue_set_napi to:
     if (napi)
       napi->storage = dev->napi_storage[queue_index];

I think I'm still missing the bit about the
max(rx_queues,tx_queues), though :(
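
Roughly, that hack amounts to the following (untested sketch;
napi->storage and dev->napi_storage are names invented for this
experiment, and the rest just follows the shape of the existing
helper):

void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
                          enum netdev_queue_type type,
                          struct napi_struct *napi)
{
        struct netdev_rx_queue *rxq;
        struct netdev_queue *txq;

        if (WARN_ON_ONCE(napi && !napi->dev))
                return;
        if (dev->reg_state >= NETREG_REGISTERED)
                ASSERT_RTNL();

        /* experimental addition: hand the NAPI its per-index storage */
        if (napi)
                napi->storage = &dev->napi_storage[queue_index];

        switch (type) {
        case NETDEV_QUEUE_TYPE_RX:
                rxq = __netif_get_rx_queue(dev, queue_index);
                rxq->napi = napi;
                return;
        case NETDEV_QUEUE_TYPE_TX:
                txq = netdev_get_tx_queue(dev, queue_index);
                txq->napi = napi;
                return;
        default:
                return;
        }
}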

> Index would basically be an integer 0..n, where n is the number of
> IRQs configured for the driver. The index of a NAPI instance would
> likely match the queue ID of the queue the NAPI serves.

Hmmm. I'm hesitant about the "number of IRQs" part. What if there
are NAPIs for which no IRQ is allocated ~someday~ ?

It seems like (I could totally be wrong) that netif_queue_set_napi
can be called and work and create the association even without an
IRQ allocated.

I guess the issue is mostly the queue index question above: combined
rx/tx vs drivers having different numbers of rx and tx queues.

> We can then allocate an array of "napi_configs" in net_device -
> like we allocate queues, the array size would be max(num_rx_queue,
> num_tx_queues). We just need to store a couple of ints so it will
> be tiny compared to queue structs, anyway.
> 
> The NAPI_SET netlink op can then work based on NAPI index rather 
> than the ephemeral NAPI ID. It can apply the config to all live
> NAPI instances with that index (of which there really should only 
> be one, unless driver is mid-reconfiguration somehow but even that
> won't cause issues, we can give multiple instances the same settings)
> and also store the user config in the array in net_device.
> 
> When new NAPI instance is associate with a NAPI index it should get
> all the config associated with that index applied.
> 
> Thoughts? Does that makes sense, and if so do you think it's an
> over-complication?

I think what you are proposing seems fine; I'm just working out the
implementation details and making sure I understand before sending
another revision.
Jakub Kicinski Sept. 3, 2024, 12:49 a.m. UTC | #6
On Sat, 31 Aug 2024 18:27:45 +0100 Joe Damato wrote:
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.  
> 
> I think if we are going to go this way, then opt-in is probably the
> way to go. This series would include the necessary changes for mlx5,
> in that case (because that's what I have access to) so that the new
> variant has a user?

SG! FWIW for bnxt the "id" is struct bnxt_napi::index (I haven't looked
at bnxt before writing the suggestion :))

> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.
> > 
> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.  
> 
> I assume napi_storage exists for both combined RX/TX NAPIs (i.e.
> drivers that multiplex RX/TX on a single NAPI like mlx5) as well
> as drivers which create NAPIs that are RX or TX-only, right?

Hm.

> If so, it seems like we'd either need to:
>   - Do something more complicated when computing how much NAPI
>     storage to make, or
>   - Provide a different path for drivers which don't multiplex and
>     create some number of (for example) TX-only NAPIs ?
> 
> I guess I'm just imagining a weird case where a driver has 8 RX
> queues but 64 TX queues. max of that is 64, so we'd be missing 8
> napi_storage ?
> 
> Sorry, I'm probably just missing something about the implementation
> details you summarized above.

I wouldn't worry about it. We can add a variant of alloc_netdev_mqs()
later which takes the NAPI count explicitly. For now we can simply
assume max(rx, tx) is good enough, and maybe add a WARN_ON_ONCE() to 
the set function to catch drivers which need something more complicated.

Modern NICs have far more queues than IRQs (~NAPIs).

> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.  
> 
> I understand what you are proposing. I suppose napi-get could be
> extended to include the NAPI index, too?

Yup!

> Then users could map queues to NAPI indexes to queues (via NAPI ID)?

Yes.

> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?  
> 
> It feels a bit tricky, to me, as it seems there are some edge cases
> to be careful with (queue count change). I could probably give the
> implementation a try and see where I end up.
> 
> Having these settings per-NAPI would be really useful and being able
> to support IRQ suspension would be useful, too.
> 
> I think being thoughtful about how we get there is important; I'm a
> little wary of getting side tracked, but I trust your judgement and
> if you think this is worth exploring I'll think on it some more.

I understand, we can abandon it if the implementation drags out due to
various nitpicks and back-and-forths. But I don't expect much of that.

Jakub Kicinski Sept. 3, 2024, 1:02 a.m. UTC | #7
On Mon, 2 Sep 2024 18:56:07 +0200 Joe Damato wrote:
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.  
> 
> What about extending netif_queue_set_napi instead? That function
> takes a napi and a queue index.

This may become problematic, since there may be more queues than NAPIs.
Either with multiple combined queues mapped to a single NAPI, or having
separate Rx/Tx NAPIs.

No strong preference on which function we modify (netif_napi_add vs the
IRQ setting helper) but I think we need to take the index as an
explicit param, rather than try to guess it based on another ID. 
The queue ID will match 95% of the time, today. But Intel was
interested in "remapping" queues. And we all think about a queue API...
So using queue ID is going to cause problems down the road.

> Locally I kinda of hacked up something simple that:
>   - Allocates napi_storage in net_device in alloc_netdev_mqs
>   - Modifies netif_queue_set_napi to:
>      if (napi)
>        napi->storage = dev->napi_storage[queue_index];
> 
> I think I'm still missing the bit about the
> max(rx_queues,tx_queues), though :(

You mean whether it's enough to do

	napi_cnt = max(txqs, rxqs)

? Or some other question?

AFAIU mlx5 can only have as many NAPIs as it has IRQs (256?).
Look at ip -d link to see how many queues it has.
We could do txqs + rxqs to be safe, if you prefer, but it will 
waste a bit of memory.

> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.  
> 
> Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> are NAPIs for which no IRQ is allocated ~someday~ ?

A device NAPI without an IRQ? Whoever does something of such silliness
will have to deal with consequences. I think it's unlikely.

> It seems like (I could totally be wrong) that netif_queue_set_napi
> can be called and work and create the association even without an
> IRQ allocated.
> 
> I guess the issue is mostly the queue index question above: combined
> rx/tx vs drivers having different numbers of rx and tx queues.

Yes, and in the future the ability to allocate more queues than NAPIs.
netif_queue_set_napi() was expected to cover such cases.

> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.
> > 
> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.
> > 
> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?  
> 
> I think what you are proposing seems fine; I'm just working out the
> implementation details and making sure I understand before sending
> another revision.

If you get stuck feel free to send a link to a GH or post RFC.
I'm adding extra asks, so whatever form of review helps.. :)
Samiullah Khawaja Sept. 3, 2024, 7:04 p.m. UTC | #8
This is great Joe.

On Mon, Sep 2, 2024 at 6:02 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 2 Sep 2024 18:56:07 +0200 Joe Damato wrote:
> > > How do you feel about making this configuration opt-in / require driver
> > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > "index" parameter would make the whole thing much simpler.
> >
> > What about extending netif_queue_set_napi instead? That function
> > takes a napi and a queue index.
>
> This may become problematic, since there may be more queues than NAPIs.
> Either with multiple combined queues mapped to a single NAPI, or having
> separate Rx/Tx NAPIs.
>
> No strong preference on which function we modify (netif_napi_add vs the
> IRQ setting helper) but I think we need to take the index as an
> explicit param, rather than try to guess it based on another ID.
> The queue ID will match 95% of the time, today. But Intel was
> interested in "remapping" queues. And we all think about a queue API...
> So using queue ID is going to cause problems down the road.
Do we need a queue-to-napi association to set/persist napi
configurations? Can a new index param be added to netif_napi_add
to persist the configurations in napi_storage? I guess the problem
would be the size of napi_storage.

Also wondering if for some use cases persistence would be problematic
when the napis are recreated, since the new napi instances might not
represent the same context? For example, if I resize the dev from 16
rx/tx to 8 rx/tx queues, the napi index that was used by a TX queue
now polls an RX queue.
>
> > Locally I kinda of hacked up something simple that:
> >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> >   - Modifies netif_queue_set_napi to:
> >      if (napi)
> >        napi->storage = dev->napi_storage[queue_index];
> >
> > I think I'm still missing the bit about the
> > max(rx_queues,tx_queues), though :(
>
> You mean whether it's enough to do
>
>         napi_cnt = max(txqs, rxqs)
>
> ? Or some other question?
>
> AFAIU mlx5 can only have as many NAPIs as it has IRQs (256?).
> Look at ip -d link to see how many queues it has.
> We could do txqs + rxqs to be safe, if you prefer, but it will
> waste a bit of memory.
>
> > > Index would basically be an integer 0..n, where n is the number of
> > > IRQs configured for the driver. The index of a NAPI instance would
> > > likely match the queue ID of the queue the NAPI serves.
> >
> > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > are NAPIs for which no IRQ is allocated ~someday~ ?
>
> A device NAPI without an IRQ? Whoever does something of such silliness
> will have to deal with consequences. I think it's unlikely.
>
> > It seems like (I could totally be wrong) that netif_queue_set_napi
> > can be called and work and create the association even without an
> > IRQ allocated.
> >
> > I guess the issue is mostly the queue index question above: combined
> > rx/tx vs drivers having different numbers of rx and tx queues.
>
> Yes, and in the future the ability to allocate more queues than NAPIs.
> netif_queue_set_napi() was expected to cover such cases.
>
> > > We can then allocate an array of "napi_configs" in net_device -
> > > like we allocate queues, the array size would be max(num_rx_queue,
> > > num_tx_queues). We just need to store a couple of ints so it will
> > > be tiny compared to queue structs, anyway.
> > >
> > > The NAPI_SET netlink op can then work based on NAPI index rather
> > > than the ephemeral NAPI ID. It can apply the config to all live
> > > NAPI instances with that index (of which there really should only
> > > be one, unless driver is mid-reconfiguration somehow but even that
> > > won't cause issues, we can give multiple instances the same settings)
> > > and also store the user config in the array in net_device.
> > >
> > > When new NAPI instance is associate with a NAPI index it should get
> > > all the config associated with that index applied.
> > >
> > > Thoughts? Does that makes sense, and if so do you think it's an
> > > over-complication?
> >
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.
>
> If you get stuck feel free to send a link to a GH or post RFC.
> I'm adding extra asks so whether form of review helps.. :)
Jakub Kicinski Sept. 3, 2024, 7:40 p.m. UTC | #9
On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> Do we need a queue to napi association to set/persist napi
> configurations? 

I'm afraid zero-copy schemes will make multiple queues per NAPI more
and more common, so pretending the NAPI params (related to polling)
are per queue will soon become highly problematic.

> Can a new index param be added to the netif_napi_add
> and persist the configurations in napi_storage.

That'd be my (weak) preference.

> I guess the problem would be the size of napi_storage.

I don't think so, we're talking about 16B per NAPI, 
struct netdev_queue is 320B, struct netdev_rx_queue is 192B. 
NAPI storage is a rounding error next to those :S

> Also wondering if for some use case persistence would be problematic
> when the napis are recreated, since the new napi instances might not
> represent the same context? For example If I resize the dev from 16
> rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> now polls RX queue.

We can clear the config when NAPI is activated (ethtool -L /
set-channels). That seems like a good idea.

The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
When^w If we can dynamically create Rx queues one day, a NAPI may 
start out as a Tx NAPI and become a combined one when Rx queue is 
added to it.

Maybe it's enough to document how rings are distributed to NAPIs?

The first set of NAPIs should get allocated to the combined channels,
then the remaining rx- and tx-only NAPIs should be interleaved,
starting with rx?

Example, asymmetric config: combined + some extra tx:

    combined        tx
 [0..#combined-1] [#combined..#combined+#tx-1]

Split rx / tx - interleave:

 [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...

This would limit the churn when changing channel counts.
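
For illustration, the convention above could be captured in a helper
like this (purely hypothetical, not an existing API; "ring" is the
ring number within its type, with rings 0..combined-1 being the
combined channels):

/* Illustrative only: maps a ring to a NAPI index, combined channels
 * first, then remaining rx-/tx-only NAPIs interleaved starting with
 * rx, and whichever type is left over continuing contiguously.
 */
static unsigned int napi_index_for_ring(unsigned int combined,
                                        unsigned int rx_only,
                                        unsigned int tx_only,
                                        bool is_rx, unsigned int ring)
{
        unsigned int i, interleaved;

        /* Combined channels take the first indices, index == ring. */
        if (ring < combined)
                return ring;

        i = ring - combined;            /* position within the rx/tx-only block */
        interleaved = min(rx_only, tx_only);
        if (i < interleaved)
                return combined + 2 * i + (is_rx ? 0 : 1);

        /* Leftover rings of the more numerous type are contiguous. */
        return combined + 2 * interleaved + (i - interleaved);
}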
Samiullah Khawaja Sept. 3, 2024, 9:58 p.m. UTC | #10
On Tue, Sep 3, 2024 at 12:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > Do we need a queue to napi association to set/persist napi
> > configurations?
>
> I'm afraid zero-copy schemes will make multiple queues per NAPI more
> and more common, so pretending the NAPI params (related to polling)
> are per queue will soon become highly problematic.
Agreed.
>
> > Can a new index param be added to the netif_napi_add
> > and persist the configurations in napi_storage.
>
> That'd be my (weak) preference.
>
> > I guess the problem would be the size of napi_storage.
>
> I don't think so, we're talking about 16B per NAPI,
> struct netdev_queue is 320B, struct netdev_rx_queue is 192B.
> NAPI storage is rounding error next to those :S
Oh, I am sorry, I was actually referring to the problem of figuring
out the count of the napi_storage array.
>
> > Also wondering if for some use case persistence would be problematic
> > when the napis are recreated, since the new napi instances might not
> > represent the same context? For example If I resize the dev from 16
> > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > now polls RX queue.
>
> We can clear the config when NAPI is activated (ethtool -L /
> set-channels). That seems like a good idea.
That sounds good.
>
> The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
> When^w If we can dynamically create Rx queues one day, a NAPI may
> start out as a Tx NAPI and become a combined one when Rx queue is
> added to it.
>
> Maybe it's enough to document how rings are distributed to NAPIs?
>
> First set of NAPIs should get allocated to the combined channels,
> then for remaining rx- and tx-only NAPIs they should be interleaved
> starting with rx?
>
> Example, asymmetric config: combined + some extra tx:
>
>     combined        tx
>  [0..#combined-1] [#combined..#combined+#tx-1]
>
> Split rx / tx - interleave:
>
>  [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
>
> This would limit the churn when changing channel counts.
I think this is good. The queue-get dump netlink does provide details
of all the queues in a dev. It also provides a napi-id if the driver
has set it (only a few drivers set this). So basically a busy poll
application would look at the queue type and apply configurations to
the relevant napi based on the documentation above (if napi-id is not
set on the queue)?
Stanislav Fomichev Sept. 4, 2024, 11:40 p.m. UTC | #11
On 09/02, Joe Damato wrote:
> On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > +      doc: Set configurable NAPI instance settings.  
> > > > 
> > > > We should pause and think here how configuring NAPI params should
> > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > device (or for some drivers change any BPF or ethtool setting)
> > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > 
> > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > 
> > > > I guess we can either document this, and move on (which may be fine,
> > > > you have more practical experience than me). Or we can add an internal
> > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > net_device and store such config there. For simplicity of matching
> > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > If driver wants to do something more fancy we can add a variant of
> > > > netif_napi_add() which specifies the channel/storage to use.
> > > > 
> > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > I work with unfortunate drivers...  
> > > 
> > > Thanks for pointing this out. I think this is an important case to
> > > consider. Here's how I'm thinking about it.
> > > 
> > > There are two cases:
> > > 
> > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > discarded and recreated, the code I added to netif_napi_add_weight
> > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > works today, I believe. I think we are good on this case ?
> > 
> > Agreed.
> > 
> > > 2) apps using netlink to set various custom settings. This seems
> > > like a case where a future extension can be made to add a notifier
> > > for NAPI changes (like the netdevice notifier?).
> > 
> > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > 
> > > If you think this is a good idea, then we'd do something like:
> > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > >   2. In the future (not part of this series) a NAPI notifier is
> > >      added
> > >   3. User apps can then listen for NAPI create/delete events
> > >      and update settings when a NAPI is created. It would be
> > >      helpful, I think, for user apps to know about NAPI
> > >      create/delete events in general because it means NAPI IDs are
> > >      changing.
> > > 
> > > One could argue:
> > > 
> > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > >   queue gets a new NAPI ID associated with it. User apps operating
> > >   at this level probably care about NAPI IDs changing (as it affects
> > >   epoll busy poll). Since the settings in this series are per-NAPI
> > >   (and not per HW queue), the argument could be that user apps need
> > >   to setup NAPIs when they are created and settings do not persist
> > >   between NAPIs with different IDs even if associated with the same
> > >   HW queue.
> > 
> > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > place. I would venture a guess that the person who added the IDs was
> > working with NICs which have stable NAPI instances once the device is
> > opened. This is, unfortunately, not universally the case.
> > 
> > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > on an open device. Closer we get to queue API the more dynamic the whole
> > setup will become (read: the more often reconfigurations will happen).
> >
> 
> [...]
> 
> > > I think you have much more practical experience when it comes to
> > > dealing with drivers, so I am happy to follow your lead on this one,
> > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > limited driver experience.
> > > 
> > > My two goals with this series are:
> > >   1. Make it possible to set these values per NAPI
> > >   2. Unblock the IRQ suspension series by threading the suspend
> > >      parameter through the code path carved in this series
> > > 
> > > So, I'm happy to proceed with this series as you prefer whether
> > > that's documentation or "napi_storage"; I think you are probably the
> > > best person to answer this question :)
> > 
> > How do you feel about making this configuration opt-in / require driver
> > changes? What I'm thinking is that having the new "netif_napi_add()"
> > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > "index" parameter would make the whole thing much simpler.
> 
> What about extending netif_queue_set_napi instead? That function
> takes a napi and a queue index.
> 
> Locally I kinda of hacked up something simple that:
>   - Allocates napi_storage in net_device in alloc_netdev_mqs
>   - Modifies netif_queue_set_napi to:
>      if (napi)
>        napi->storage = dev->napi_storage[queue_index];
> 
> I think I'm still missing the bit about the
> max(rx_queues,tx_queues), though :(
> 
> > Index would basically be an integer 0..n, where n is the number of
> > IRQs configured for the driver. The index of a NAPI instance would
> > likely match the queue ID of the queue the NAPI serves.
> 
> Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> are NAPIs for which no IRQ is allocated ~someday~ ?
> 
> It seems like (I could totally be wrong) that netif_queue_set_napi
> can be called and work and create the association even without an
> IRQ allocated.
> 
> I guess the issue is mostly the queue index question above: combined
> rx/tx vs drivers having different numbers of rx and tx queues.
> 
> > We can then allocate an array of "napi_configs" in net_device -
> > like we allocate queues, the array size would be max(num_rx_queue,
> > num_tx_queues). We just need to store a couple of ints so it will
> > be tiny compared to queue structs, anyway.
> > 
> > The NAPI_SET netlink op can then work based on NAPI index rather 
> > than the ephemeral NAPI ID. It can apply the config to all live
> > NAPI instances with that index (of which there really should only 
> > be one, unless driver is mid-reconfiguration somehow but even that
> > won't cause issues, we can give multiple instances the same settings)
> > and also store the user config in the array in net_device.
> > 
> > When new NAPI instance is associate with a NAPI index it should get
> > all the config associated with that index applied.
> > 
> > Thoughts? Does that makes sense, and if so do you think it's an
> > over-complication?
> 
> I think what you are proposing seems fine; I'm just working out the
> implementation details and making sure I understand before sending
> another revision.

What if instead of an extra storage index in UAPI, we make napi_id persistent?
Then we can keep using napi_id as a user-facing number for the configuration.

Having a stable napi_id would also be super useful for the epoll setup so you
don't have to match old/invalid ids to the new ones on device reset.

In the code, we can keep the same idea with napi_storage in netdev and
ask drivers to provide storage id, but keep that id internal.

The only complication with that is napi_hash_add/napi_hash_del that
happen in netif_napi_add_weight. So for the devices that allocate
new napi before removing the old ones (most devices?), we'd have to add
some new netif_napi_takeover(old_napi, new_napi) to remove the
old napi_id from the hash and reuse it in the new one.

So for mlx5, the flow would look like the following:

- mlx5e_safe_switch_params
  - mlx5e_open_channels
    - netif_napi_add(new_napi)
      - adds napi with 'ephemeral' napi id
  - mlx5e_switch_priv_channels
    - mlx5e_deactivate_priv_channels
      - napi_disable(old_napi)
      - netif_napi_del(old_napi) - this frees the old napi_id
  - mlx5e_activate_priv_channels
    - mlx5e_activate_channels
      - mlx5e_activate_channel
        - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
          - if napi is not hashed - safe to reuse?
        - napi_enable

This is a bit ugly because we still have random napi ids during reset, but
it is not super complicated implementation-wise. We can eventually improve
the above by splitting netif_napi_add_weight into two steps: allocate and
activate (to do the napi_id allocation & hashing). Thoughts?
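
To make this concrete, here is a rough sketch of the storage-based variant
(everything below - struct napi_storage, the storage_id argument and
netif_napi_takeover() itself - is hypothetical, purely to illustrate the
idea, and glosses over the hashing/RCU details):

/* Hypothetical per-slot storage kept in struct net_device. */
struct napi_storage {
        unsigned int napi_id;   /* stable, user-visible id (0 = unused) */
};

/* Sketch only: reuse the stable napi_id recorded in the storage slot
 * instead of the 'ephemeral' id allocated by netif_napi_add_weight().
 */
static void netif_napi_takeover(struct net_device *dev,
                                struct napi_struct *napi,
                                unsigned int storage_id)
{
        struct napi_storage *ns = &dev->napi_storage[storage_id];

        if (ns->napi_id && !napi_by_id(ns->napi_id)) {
                /* Old instance is gone and unhashed: take over its id. */
                napi_hash_del(napi);
                napi->napi_id = ns->napi_id;
                napi_hash_add(napi);
        } else {
                /* First use of this slot: remember the id we got. */
                ns->napi_id = napi->napi_id;
        }
}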
Jakub Kicinski Sept. 4, 2024, 11:54 p.m. UTC | #12
On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.  
> 
> What if instead of an extra storage index in UAPI, we make napi_id persistent?
> Then we can keep using napi_id as a user-facing number for the configuration.
> 
> Having a stable napi_id would also be super useful for the epoll setup so you
> don't have to match old/invalid ids to the new ones on device reset.

That'd be nice. Initially I thought we had some drivers with multiple
instances of NAPI enabled for a single "index", but I don't see such
drivers now.

> In the code, we can keep the same idea with napi_storage in netdev and
> ask drivers to provide storage id, but keep that id internal.
> 
> The only complication with that is napi_hash_add/napi_hash_del that
> happen in netif_napi_add_weight. So for the devices that allocate
> new napi before removing the old ones (most devices?), we'd have to add
> some new netif_napi_takeover(old_napi, new_napi) to remove the
> old napi_id from the hash and reuse it in the new one.
> 
> So for mlx5, the flow would look like the following:
> 
> - mlx5e_safe_switch_params
>   - mlx5e_open_channels
>     - netif_napi_add(new_napi)
>       - adds napi with 'ephemeral' napi id
>   - mlx5e_switch_priv_channels
>     - mlx5e_deactivate_priv_channels
>       - napi_disable(old_napi)
>       - netif_napi_del(old_napi) - this frees the old napi_id
>   - mlx5e_activate_priv_channels
>     - mlx5e_activate_channels
>       - mlx5e_activate_channel
>         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> 	  - if napi is not hashed - safe to reuse?
> 	- napi_enable
> 
> This is a bit ugly because we still have random napi ids during reset, but
> is not super complicated implementation-wise. We can eventually improve
> the above by splitting netif_napi_add_weight into two steps: allocate and
> activate (to do the napi_id allocation & hashing). Thoughts?

The "takeover" would be problematic for drivers which free old NAPI
before allocating new one (bnxt?). But splitting the two steps sounds
pretty clean. We can add a helper to mark NAPI as "driver will
explicitly list/hash later", and have the driver call a new helper
which takes storage ID and lists the NAPI in the hash.
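
Rough driver-side sketch of that two-step flow (struct my_channel,
my_poll() and both netif_* helpers below are made-up names, purely
illustrative - nothing like this exists yet):

/* Hypothetical channel struct and poll callback for the example. */
struct my_channel {
        struct napi_struct napi;
};

static int my_poll(struct napi_struct *napi, int budget);

static int my_open_channel(struct net_device *netdev, struct my_channel *ch,
                           unsigned int storage_id)
{
        /* Step 1: add the NAPI but skip hashing, so no 'ephemeral'
         * napi_id is allocated at netif_napi_add() time.
         */
        netif_napi_add_no_hash(netdev, &ch->napi, my_poll);

        /* Step 2: once we know which storage slot this NAPI serves,
         * list/hash it; the core reuses the stable napi_id recorded in
         * netdev->napi_storage[storage_id], or allocates one on first use.
         */
        netif_napi_hash_with_storage(netdev, &ch->napi, storage_id);

        napi_enable(&ch->napi);
        return 0;
}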
Joe Damato Sept. 5, 2024, 9:20 a.m. UTC | #13
On Tue, Sep 03, 2024 at 02:58:14PM -0700, Samiullah Khawaja wrote:
> On Tue, Sep 3, 2024 at 12:40 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > > Do we need a queue to napi association to set/persist napi
> > > configurations?
> >
> > I'm afraid zero-copy schemes will make multiple queues per NAPI more
> > and more common, so pretending the NAPI params (related to polling)
> > are pre queue will soon become highly problematic.
> Agreed.
> >
> > > Can a new index param be added to the netif_napi_add
> > > and persist the configurations in napi_storage.
> >
> > That'd be my (weak) preference.
> >
> > > I guess the problem would be the size of napi_storage.
> >
> > I don't think so, we're talking about 16B per NAPI,
> > struct netdev_queue is 320B, struct netdev_rx_queue is 192B.
> > NAPI storage is rounding error next to those :S
> Oh, I am sorry I was actually referring to the problem of figuring out
> the count of the napi_storage array.
> >
> > > Also wondering if for some use case persistence would be problematic
> > > when the napis are recreated, since the new napi instances might not
> > > represent the same context? For example If I resize the dev from 16
> > > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > > now polls RX queue.
> >
> > We can clear the config when NAPI is activated (ethtool -L /
> > set-channels). That seems like a good idea.
> That sounds good.
> >
> > The distinction between Rx and Tx NAPIs is a bit more tricky, tho.
> > When^w If we can dynamically create Rx queues one day, a NAPI may
> > start out as a Tx NAPI and become a combined one when Rx queue is
> > added to it.
> >
> > Maybe it's enough to document how rings are distributed to NAPIs?
> >
> > First set of NAPIs should get allocated to the combined channels,
> > then for remaining rx- and tx-only NAPIs they should be interleaved
> > starting with rx?
> >
> > Example, asymmetric config: combined + some extra tx:
> >
> >     combined        tx
> >  [0..#combined-1] [#combined..#combined+#tx-1]
> >
> > Split rx / tx - interleave:
> >
> >  [0 rx0] [1 tx0] [2 rx1] [3 tx1] [4 rx2] [5 tx2] ...
> >
> > This would limit the churn when changing channel counts.
> I think this is good. The queue-get dump netlink does provide details
> of all the queues in a dev. It also provides a napi-id if the driver
> has set it (only few drivers set this).

This is true, but there are several drivers that do set it, and IMHO
extending existing drivers to support this can be done. I have been
adding "nits" in driver reviews for new drivers, asking the author(s)
to consider adding support for the API.

Not sure which driver you are using, but I can help you add support
for the API if it is needed.

> So basically a busy poll application would look at the queue type
> and apply configurations on the relevant napi based on the
> documentation above (if napi-id is not set on the queue)?

That was my plan for my user app based on the conversation so far.
At startup, the app reads some config with a list of ifindexes it will
bind to for incoming connections, then gets the NAPI IDs via netlink
and sets the per-NAPI params via netlink as well.

Haven't implemented this yet in the user app, but that's the
direction I am planning to go with this all.
Joe Damato Sept. 5, 2024, 9:30 a.m. UTC | #14
On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> On 09/02, Joe Damato wrote:
> > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > 
> > > > > We should pause and think here how configuring NAPI params should
> > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > 
> > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > 
> > > > > I guess we can either document this, and move on (which may be fine,
> > > > > you have more practical experience than me). Or we can add an internal
> > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > net_device and store such config there. For simplicity of matching
> > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > If driver wants to do something more fancy we can add a variant of
> > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > 
> > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > I work with unfortunate drivers...  
> > > > 
> > > > Thanks for pointing this out. I think this is an important case to
> > > > consider. Here's how I'm thinking about it.
> > > > 
> > > > There are two cases:
> > > > 
> > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > works today, I believe. I think we are good on this case ?
> > > 
> > > Agreed.
> > > 
> > > > 2) apps using netlink to set various custom settings. This seems
> > > > like a case where a future extension can be made to add a notifier
> > > > for NAPI changes (like the netdevice notifier?).
> > > 
> > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > 
> > > > If you think this is a good idea, then we'd do something like:
> > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > >   2. In the future (not part of this series) a NAPI notifier is
> > > >      added
> > > >   3. User apps can then listen for NAPI create/delete events
> > > >      and update settings when a NAPI is created. It would be
> > > >      helpful, I think, for user apps to know about NAPI
> > > >      create/delete events in general because it means NAPI IDs are
> > > >      changing.
> > > > 
> > > > One could argue:
> > > > 
> > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > >   at this level probably care about NAPI IDs changing (as it affects
> > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > >   (and not per HW queue), the argument could be that user apps need
> > > >   to setup NAPIs when they are created and settings do not persist
> > > >   between NAPIs with different IDs even if associated with the same
> > > >   HW queue.
> > > 
> > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > place. I would venture a guess that the person who added the IDs was
> > > working with NICs which have stable NAPI instances once the device is
> > > opened. This is, unfortunately, not universally the case.
> > > 
> > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > on an open device. Closer we get to queue API the more dynamic the whole
> > > setup will become (read: the more often reconfigurations will happen).
> > >
> > 
> > [...]
> > 
> > > > I think you have much more practical experience when it comes to
> > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > limited driver experience.
> > > > 
> > > > My two goals with this series are:
> > > >   1. Make it possible to set these values per NAPI
> > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > >      parameter through the code path carved in this series
> > > > 
> > > > So, I'm happy to proceed with this series as you prefer whether
> > > > that's documentation or "napi_storage"; I think you are probably the
> > > > best person to answer this question :)
> > > 
> > > How do you feel about making this configuration opt-in / require driver
> > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > "index" parameter would make the whole thing much simpler.
> > 
> > What about extending netif_queue_set_napi instead? That function
> > takes a napi and a queue index.
> > 
> > Locally I kinda of hacked up something simple that:
> >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> >   - Modifies netif_queue_set_napi to:
> >      if (napi)
> >        napi->storage = dev->napi_storage[queue_index];
> > 
> > I think I'm still missing the bit about the
> > max(rx_queues,tx_queues), though :(
> > 
> > > Index would basically be an integer 0..n, where n is the number of
> > > IRQs configured for the driver. The index of a NAPI instance would
> > > likely match the queue ID of the queue the NAPI serves.
> > 
> > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > are NAPIs for which no IRQ is allocated ~someday~ ?
> > 
> > It seems like (I could totally be wrong) that netif_queue_set_napi
> > can be called and work and create the association even without an
> > IRQ allocated.
> > 
> > I guess the issue is mostly the queue index question above: combined
> > rx/tx vs drivers having different numbers of rx and tx queues.
> > 
> > > We can then allocate an array of "napi_configs" in net_device -
> > > like we allocate queues, the array size would be max(num_rx_queue,
> > > num_tx_queues). We just need to store a couple of ints so it will
> > > be tiny compared to queue structs, anyway.
> > > 
> > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > than the ephemeral NAPI ID. It can apply the config to all live
> > > NAPI instances with that index (of which there really should only 
> > > be one, unless driver is mid-reconfiguration somehow but even that
> > > won't cause issues, we can give multiple instances the same settings)
> > > and also store the user config in the array in net_device.
> > > 
> > > When new NAPI instance is associate with a NAPI index it should get
> > > all the config associated with that index applied.
> > > 
> > > Thoughts? Does that makes sense, and if so do you think it's an
> > > over-complication?
> > 
> > I think what you are proposing seems fine; I'm just working out the
> > implementation details and making sure I understand before sending
> > another revision.
> 
> What if instead of an extra storage index in UAPI, we make napi_id persistent?
> Then we can keep using napi_id as a user-facing number for the configuration.
> 
> Having a stable napi_id would also be super useful for the epoll setup so you
> don't have to match old/invalid ids to the new ones on device reset.

Up to now, for prototyping purposes, the way I've been dealing with this
is using an SO_ATTACH_REUSEPORT_CBPF program like this:

struct sock_filter code[] = {
    /* A = skb->queue_mapping */
    { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
    /* A = A % n */
    { BPF_ALU | BPF_MOD, 0, 0, n },
    /* return A */
    { BPF_RET | BPF_A, 0, 0, 0 },
};

with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
even if the NAPI IDs change the filter still distributes connections from the
same "queue_mapping" to the same thread.

Since epoll busy poll is based on NAPI ID (and not queue_mapping), this
will probably cause some issue if the NAPI ID changes, because the NAPI
ID associated with the epoll context will suddenly change, meaning the
"old" NAPI won't be busy polled. This might be fine, because if that
happens the old NAPI is being disabled anyway?

At any rate the user program doesn't "need" to do anything when the NAPI ID
changes... unless it has a more complicated ebpf program that relies on NAPI ID
;)

> In the code, we can keep the same idea with napi_storage in netdev and
> ask drivers to provide storage id, but keep that id internal.
> 
> The only complication with that is napi_hash_add/napi_hash_del that
> happen in netif_napi_add_weight. So for the devices that allocate
> new napi before removing the old ones (most devices?), we'd have to add
> some new netif_napi_takeover(old_napi, new_napi) to remove the
> old napi_id from the hash and reuse it in the new one.
> 
> So for mlx5, the flow would look like the following:
> 
> - mlx5e_safe_switch_params
>   - mlx5e_open_channels
>     - netif_napi_add(new_napi)
>       - adds napi with 'ephemeral' napi id
>   - mlx5e_switch_priv_channels
>     - mlx5e_deactivate_priv_channels
>       - napi_disable(old_napi)
>       - netif_napi_del(old_napi) - this frees the old napi_id
>   - mlx5e_activate_priv_channels
>     - mlx5e_activate_channels
>       - mlx5e_activate_channel
>         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> 	  - if napi is not hashed - safe to reuse?
> 	- napi_enable
> 
> This is a bit ugly because we still have random napi ids during reset, but
> is not super complicated implementation-wise. We can eventually improve
> the above by splitting netif_napi_add_weight into two steps: allocate and
> activate (to do the napi_id allocation & hashing). Thoughts?
Joe Damato Sept. 5, 2024, 9:32 a.m. UTC | #15
On Wed, Sep 04, 2024 at 04:54:17PM -0700, Jakub Kicinski wrote:
> On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.  
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> that'd be nice, initially I thought that we have some drivers that have
> multiple instances of NAPI enabled for a single "index", but I don't
> see such drivers now.
> 
> > In the code, we can keep the same idea with napi_storage in netdev and
> > ask drivers to provide storage id, but keep that id internal.
> > 
> > The only complication with that is napi_hash_add/napi_hash_del that
> > happen in netif_napi_add_weight. So for the devices that allocate
> > new napi before removing the old ones (most devices?), we'd have to add
> > some new netif_napi_takeover(old_napi, new_napi) to remove the
> > old napi_id from the hash and reuse it in the new one.
> > 
> > So for mlx5, the flow would look like the following:
> > 
> > - mlx5e_safe_switch_params
> >   - mlx5e_open_channels
> >     - netif_napi_add(new_napi)
> >       - adds napi with 'ephemeral' napi id
> >   - mlx5e_switch_priv_channels
> >     - mlx5e_deactivate_priv_channels
> >       - napi_disable(old_napi)
> >       - netif_napi_del(old_napi) - this frees the old napi_id
> >   - mlx5e_activate_priv_channels
> >     - mlx5e_activate_channels
> >       - mlx5e_activate_channel
> >         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> > 	  - if napi is not hashed - safe to reuse?
> > 	- napi_enable
> > 
> > This is a bit ugly because we still have random napi ids during reset, but
> > is not super complicated implementation-wise. We can eventually improve
> > the above by splitting netif_napi_add_weight into two steps: allocate and
> > activate (to do the napi_id allocation & hashing). Thoughts?
> 
> The "takeover" would be problematic for drivers which free old NAPI
> before allocating new one (bnxt?). But splitting the two steps sounds
> pretty clean. We can add a helper to mark NAPI as "driver will
> explicitly list/hash later", and have the driver call a new helper
> which takes storage ID and lists the NAPI in the hash.

Hm... I thought I had an idea of how to write this up, but I think
maybe I've been thinking about it wrong.

Whatever I land on, I'll send first as an RFC to make sure I'm
following all the feedback that has come in. I definitely want to
get this right.

Sorry for the slow responses; I am technically on PTO for a bit
before LPC :)
Stanislav Fomichev Sept. 5, 2024, 4:56 p.m. UTC | #16
On 09/05, Joe Damato wrote:
> On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> > On 09/02, Joe Damato wrote:
> > > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > > 
> > > > > > We should pause and think here how configuring NAPI params should
> > > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > > 
> > > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > > 
> > > > > > I guess we can either document this, and move on (which may be fine,
> > > > > > you have more practical experience than me). Or we can add an internal
> > > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > > net_device and store such config there. For simplicity of matching
> > > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > > If driver wants to do something more fancy we can add a variant of
> > > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > > 
> > > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > > I work with unfortunate drivers...  
> > > > > 
> > > > > Thanks for pointing this out. I think this is an important case to
> > > > > consider. Here's how I'm thinking about it.
> > > > > 
> > > > > There are two cases:
> > > > > 
> > > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > > works today, I believe. I think we are good on this case ?
> > > > 
> > > > Agreed.
> > > > 
> > > > > 2) apps using netlink to set various custom settings. This seems
> > > > > like a case where a future extension can be made to add a notifier
> > > > > for NAPI changes (like the netdevice notifier?).
> > > > 
> > > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > > 
> > > > > If you think this is a good idea, then we'd do something like:
> > > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > > >   2. In the future (not part of this series) a NAPI notifier is
> > > > >      added
> > > > >   3. User apps can then listen for NAPI create/delete events
> > > > >      and update settings when a NAPI is created. It would be
> > > > >      helpful, I think, for user apps to know about NAPI
> > > > >      create/delete events in general because it means NAPI IDs are
> > > > >      changing.
> > > > > 
> > > > > One could argue:
> > > > > 
> > > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > > >   at this level probably care about NAPI IDs changing (as it affects
> > > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > > >   (and not per HW queue), the argument could be that user apps need
> > > > >   to setup NAPIs when they are created and settings do not persist
> > > > >   between NAPIs with different IDs even if associated with the same
> > > > >   HW queue.
> > > > 
> > > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > > place. I would venture a guess that the person who added the IDs was
> > > > working with NICs which have stable NAPI instances once the device is
> > > > opened. This is, unfortunately, not universally the case.
> > > > 
> > > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > > on an open device. Closer we get to queue API the more dynamic the whole
> > > > setup will become (read: the more often reconfigurations will happen).
> > > >
> > > 
> > > [...]
> > > 
> > > > > I think you have much more practical experience when it comes to
> > > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > > limited driver experience.
> > > > > 
> > > > > My two goals with this series are:
> > > > >   1. Make it possible to set these values per NAPI
> > > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > > >      parameter through the code path carved in this series
> > > > > 
> > > > > So, I'm happy to proceed with this series as you prefer whether
> > > > > that's documentation or "napi_storage"; I think you are probably the
> > > > > best person to answer this question :)
> > > > 
> > > > How do you feel about making this configuration opt-in / require driver
> > > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > > "index" parameter would make the whole thing much simpler.
> > > 
> > > What about extending netif_queue_set_napi instead? That function
> > > takes a napi and a queue index.
> > > 
> > > Locally I kinda of hacked up something simple that:
> > >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> > >   - Modifies netif_queue_set_napi to:
> > >      if (napi)
> > >        napi->storage = dev->napi_storage[queue_index];
> > > 
> > > I think I'm still missing the bit about the
> > > max(rx_queues,tx_queues), though :(
> > > 
> > > > Index would basically be an integer 0..n, where n is the number of
> > > > IRQs configured for the driver. The index of a NAPI instance would
> > > > likely match the queue ID of the queue the NAPI serves.
> > > 
> > > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > > are NAPIs for which no IRQ is allocated ~someday~ ?
> > > 
> > > It seems like (I could totally be wrong) that netif_queue_set_napi
> > > can be called and work and create the association even without an
> > > IRQ allocated.
> > > 
> > > I guess the issue is mostly the queue index question above: combined
> > > rx/tx vs drivers having different numbers of rx and tx queues.
> > > 
> > > > We can then allocate an array of "napi_configs" in net_device -
> > > > like we allocate queues, the array size would be max(num_rx_queue,
> > > > num_tx_queues). We just need to store a couple of ints so it will
> > > > be tiny compared to queue structs, anyway.
> > > > 
> > > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > > than the ephemeral NAPI ID. It can apply the config to all live
> > > > NAPI instances with that index (of which there really should only 
> > > > be one, unless driver is mid-reconfiguration somehow but even that
> > > > won't cause issues, we can give multiple instances the same settings)
> > > > and also store the user config in the array in net_device.
> > > > 
> > > > When new NAPI instance is associate with a NAPI index it should get
> > > > all the config associated with that index applied.
> > > > 
> > > > Thoughts? Does that makes sense, and if so do you think it's an
> > > > over-complication?
> > > 
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> Up to now for prototyping purposes: the way I've been dealing with this is
> using a SO_ATTACH_REUSEPORT_CBPF program like this:
> 
> struct sock_filter code[] = {
>     /* A = skb->queue_mapping */
>     { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
>     /* A = A % n */
>     { BPF_ALU | BPF_MOD, 0, 0, n },
>     /* return A */
>     { BPF_RET | BPF_A, 0, 0, 0 },
> };
> 
> with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
> even if the NAPI IDs change the filter still distributes connections from the
> same "queue_mapping" to the same thread.
> 
> Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
> probably cause some issue if the NAPI ID changes because the NAPI ID associated
> with the epoll context will suddenly change meaning the "old" NAPI won't be
> busy polled. This might be fine because if that happens the old NAPI is being
> disabled anyway?
> 
> At any rate the user program doesn't "need" to do anything when the NAPI ID
> changes... unless it has a more complicated ebpf program that relies on NAPI ID
> ;)

Ah, you went a more creative route. We were doing SO_INCOMING_NAPI_ID on
a single listener and manually adding fds to the appropriate epoll.
But regardless of the method, having a stable napi_id is still good
to have, and hopefully it's not a lot of work compared to exposing an
extra napi storage id to userspace.
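
Roughly, the per-connection part of that setup looks like this (sketch
only; epfd_for_napi() stands in for whatever napi_id -> epoll-fd mapping
the application keeps):

#include <sys/epoll.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56
#endif

/* Placeholder: app-specific mapping from napi_id to the epoll instance
 * (and thread) that busy polls that NAPI.
 */
int epfd_for_napi(unsigned int napi_id);

static int route_connection(int conn_fd)
{
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);

        /* Which NAPI did this connection arrive on? */
        if (getsockopt(conn_fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                       &napi_id, &len))
                return -1;

        /* Add the fd to the epoll of the thread handling this NAPI. */
        return epoll_ctl(epfd_for_napi(napi_id), EPOLL_CTL_ADD, conn_fd, &ev);
}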
Joe Damato Sept. 5, 2024, 5:05 p.m. UTC | #17
On Thu, Sep 05, 2024 at 09:56:43AM -0700, Stanislav Fomichev wrote:
> On 09/05, Joe Damato wrote:
> > On Wed, Sep 04, 2024 at 04:40:41PM -0700, Stanislav Fomichev wrote:
> > > On 09/02, Joe Damato wrote:
> > > > On Fri, Aug 30, 2024 at 02:22:35PM -0700, Jakub Kicinski wrote:
> > > > > On Fri, 30 Aug 2024 11:43:00 +0100 Joe Damato wrote:
> > > > > > On Thu, Aug 29, 2024 at 03:31:05PM -0700, Jakub Kicinski wrote:
> > > > > > > On Thu, 29 Aug 2024 13:12:01 +0000 Joe Damato wrote:  
> > > > > > > > +      doc: Set configurable NAPI instance settings.  
> > > > > > > 
> > > > > > > We should pause and think here how configuring NAPI params should
> > > > > > > behave. NAPI instances are ephemeral, if you close and open the
> > > > > > > device (or for some drivers change any BPF or ethtool setting)
> > > > > > > the NAPIs may get wiped and recreated, discarding all configuration.
> > > > > > > 
> > > > > > > This is not how the sysfs API behaves, the sysfs settings on the device
> > > > > > > survive close. It's (weirdly?) also not how queues behave, because we
> > > > > > > have struct netdev{_rx,}_queue to store stuff persistently. Even tho
> > > > > > > you'd think queues are as ephemeral as NAPIs if not more.
> > > > > > > 
> > > > > > > I guess we can either document this, and move on (which may be fine,
> > > > > > > you have more practical experience than me). Or we can add an internal
> > > > > > > concept of a "channel" (which perhaps maybe if you squint is what
> > > > > > > ethtool -l calls NAPIs?) or just "napi_storage" as an array inside
> > > > > > > net_device and store such config there. For simplicity of matching
> > > > > > > config to NAPIs we can assume drivers add NAPI instances in order. 
> > > > > > > If driver wants to do something more fancy we can add a variant of
> > > > > > > netif_napi_add() which specifies the channel/storage to use.
> > > > > > > 
> > > > > > > Thoughts? I may be overly sensitive to the ephemeral thing, maybe
> > > > > > > I work with unfortunate drivers...  
> > > > > > 
> > > > > > Thanks for pointing this out. I think this is an important case to
> > > > > > consider. Here's how I'm thinking about it.
> > > > > > 
> > > > > > There are two cases:
> > > > > > 
> > > > > > 1) sysfs setting is used by existing/legacy apps: If the NAPIs are
> > > > > > discarded and recreated, the code I added to netif_napi_add_weight
> > > > > > in patch 1 and 3 should take care of that case preserving how sysfs
> > > > > > works today, I believe. I think we are good on this case ?
> > > > > 
> > > > > Agreed.
> > > > > 
> > > > > > 2) apps using netlink to set various custom settings. This seems
> > > > > > like a case where a future extension can be made to add a notifier
> > > > > > for NAPI changes (like the netdevice notifier?).
> > > > > 
> > > > > Yes, the notifier may help, but it's a bit of a stop gap / fallback.
> > > > > 
> > > > > > If you think this is a good idea, then we'd do something like:
> > > > > >   1. Document that the NAPI settings are wiped when NAPIs are wiped
> > > > > >   2. In the future (not part of this series) a NAPI notifier is
> > > > > >      added
> > > > > >   3. User apps can then listen for NAPI create/delete events
> > > > > >      and update settings when a NAPI is created. It would be
> > > > > >      helpful, I think, for user apps to know about NAPI
> > > > > >      create/delete events in general because it means NAPI IDs are
> > > > > >      changing.
> > > > > > 
> > > > > > One could argue:
> > > > > > 
> > > > > >   When wiping/recreating a NAPI for an existing HW queue, that HW
> > > > > >   queue gets a new NAPI ID associated with it. User apps operating
> > > > > >   at this level probably care about NAPI IDs changing (as it affects
> > > > > >   epoll busy poll). Since the settings in this series are per-NAPI
> > > > > >   (and not per HW queue), the argument could be that user apps need
> > > > > >   to setup NAPIs when they are created and settings do not persist
> > > > > >   between NAPIs with different IDs even if associated with the same
> > > > > >   HW queue.
> > > > > 
> > > > > IDK if the fact that NAPI ID gets replaced was intentional in the first
> > > > > place. I would venture a guess that the person who added the IDs was
> > > > > working with NICs which have stable NAPI instances once the device is
> > > > > opened. This is, unfortunately, not universally the case.
> > > > > 
> > > > > I just poked at bnxt, mlx5 and fbnic and all of them reallocate NAPIs
> > > > > on an open device. Closer we get to queue API the more dynamic the whole
> > > > > setup will become (read: the more often reconfigurations will happen).
> > > > >
> > > > 
> > > > [...]
> > > > 
> > > > > > I think you have much more practical experience when it comes to
> > > > > > dealing with drivers, so I am happy to follow your lead on this one,
> > > > > > but assuming drivers will "do a thing" seems mildly scary to me with
> > > > > > limited driver experience.
> > > > > > 
> > > > > > My two goals with this series are:
> > > > > >   1. Make it possible to set these values per NAPI
> > > > > >   2. Unblock the IRQ suspension series by threading the suspend
> > > > > >      parameter through the code path carved in this series
> > > > > > 
> > > > > > So, I'm happy to proceed with this series as you prefer whether
> > > > > > that's documentation or "napi_storage"; I think you are probably the
> > > > > > best person to answer this question :)
> > > > > 
> > > > > How do you feel about making this configuration opt-in / require driver
> > > > > changes? What I'm thinking is that having the new "netif_napi_add()"
> > > > > variant (or perhaps extending netif_napi_set_irq()) to take an extra
> > > > > "index" parameter would make the whole thing much simpler.
> > > > 
> > > > What about extending netif_queue_set_napi instead? That function
> > > > takes a napi and a queue index.
> > > > 
> > > > Locally I kinda of hacked up something simple that:
> > > >   - Allocates napi_storage in net_device in alloc_netdev_mqs
> > > >   - Modifies netif_queue_set_napi to:
> > > >      if (napi)
> > > >        napi->storage = dev->napi_storage[queue_index];
> > > > 
> > > > I think I'm still missing the bit about the
> > > > max(rx_queues,tx_queues), though :(
> > > > 
> > > > > Index would basically be an integer 0..n, where n is the number of
> > > > > IRQs configured for the driver. The index of a NAPI instance would
> > > > > likely match the queue ID of the queue the NAPI serves.
> > > > 
> > > > Hmmm. I'm hesitant about the "number of IRQs" part. What if there
> > > > are NAPIs for which no IRQ is allocated ~someday~ ?
> > > > 
> > > > It seems like (I could totally be wrong) that netif_queue_set_napi
> > > > can be called and work and create the association even without an
> > > > IRQ allocated.
> > > > 
> > > > I guess the issue is mostly the queue index question above: combined
> > > > rx/tx vs drivers having different numbers of rx and tx queues.
> > > > 
> > > > > We can then allocate an array of "napi_configs" in net_device -
> > > > > like we allocate queues, the array size would be max(num_rx_queue,
> > > > > num_tx_queues). We just need to store a couple of ints so it will
> > > > > be tiny compared to queue structs, anyway.
> > > > > 
> > > > > The NAPI_SET netlink op can then work based on NAPI index rather 
> > > > > than the ephemeral NAPI ID. It can apply the config to all live
> > > > > NAPI instances with that index (of which there really should only 
> > > > > be one, unless driver is mid-reconfiguration somehow but even that
> > > > > won't cause issues, we can give multiple instances the same settings)
> > > > > and also store the user config in the array in net_device.
> > > > > 
> > > > > When new NAPI instance is associate with a NAPI index it should get
> > > > > all the config associated with that index applied.
> > > > > 
> > > > > Thoughts? Does that makes sense, and if so do you think it's an
> > > > > over-complication?
> > > > 
> > > > I think what you are proposing seems fine; I'm just working out the
> > > > implementation details and making sure I understand before sending
> > > > another revision.
> > > 
> > > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > > Then we can keep using napi_id as a user-facing number for the configuration.
> > > 
> > > Having a stable napi_id would also be super useful for the epoll setup so you
> > > don't have to match old/invalid ids to the new ones on device reset.
> > 
> > Up to now for prototyping purposes: the way I've been dealing with this is
> > using a SO_ATTACH_REUSEPORT_CBPF program like this:
> > 
> > struct sock_filter code[] = {
> >     /* A = skb->queue_mapping */
> >     { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_QUEUE },
> >     /* A = A % n */
> >     { BPF_ALU | BPF_MOD, 0, 0, n },
> >     /* return A */
> >     { BPF_RET | BPF_A, 0, 0, 0 },
> > };
> > 
> > with SO_BINDTODEVICE. Note that the above uses queue_mapping (not NAPI ID) so
> > even if the NAPI IDs change the filter still distributes connections from the
> > same "queue_mapping" to the same thread.
> > 
> > Since epoll busy poll is based on NAPI ID (and not queue_mapping), this will
> > probably cause some issue if the NAPI ID changes because the NAPI ID associated
> > with the epoll context will suddenly change meaning the "old" NAPI won't be
> > busy polled. This might be fine because if that happens the old NAPI is being
> > disabled anyway?
> > 
> > At any rate the user program doesn't "need" to do anything when the NAPI ID
> > changes... unless it has a more complicated ebpf program that relies on NAPI ID
> > ;)
> 
> Ah, you went a more creative route. We were doing SO_INCOMING_NAPI_ID on
> a single listener and manually adding fds to the appropriate epoll.
> But regardless of the method, having a stable napi_id is still good
> to have and hopefully it's not a lot of work compared to exposing
> extra napi storage id to the userspace.

Yes, your approach makes sense, as well. With a single acceptor
thread I'd also do what you are doing.

I was essentially modifying an existing REUSEPORT user app in which one
process listens on all interfaces, so BINDTODEVICE + the queue_mapping
bpf filter was the only way (that I could find) to make the NAPI ID
based polling work.
Joe Damato Sept. 8, 2024, 3:54 p.m. UTC | #18
On Tue, Sep 03, 2024 at 12:40:08PM -0700, Jakub Kicinski wrote:
> On Tue, 3 Sep 2024 12:04:52 -0700 Samiullah Khawaja wrote:
> > Do we need a queue to napi association to set/persist napi
> > configurations? 
> 
> I'm afraid zero-copy schemes will make multiple queues per NAPI more
> and more common, so pretending the NAPI params (related to polling)
> are pre queue will soon become highly problematic.
> 
> > Can a new index param be added to the netif_napi_add
> > and persist the configurations in napi_storage.
> 
> That'd be my (weak) preference.
> 
> > I guess the problem would be the size of napi_storage.
> 
> I don't think so, we're talking about 16B per NAPI, 
> struct netdev_queue is 320B, struct netdev_rx_queue is 192B. 
> NAPI storage is rounding error next to those :S
> 
> > Also wondering if for some use case persistence would be problematic
> > when the napis are recreated, since the new napi instances might not
> > represent the same context? For example If I resize the dev from 16
> > rx/tx to 8 rx/tx queues and the napi index that was used by TX queue,
> > now polls RX queue.
> 
> We can clear the config when NAPI is activated (ethtool -L /
> set-channels). That seems like a good idea.

I'm probably misunderstanding this bit; I've implemented this by just
memsetting the napi_storage entries to 0 on napi_enable, which obviously
breaks the persistence and is incorrect because it doesn't respect the
sysfs settings.

I'm going to send what I have as an RFC anyway, because it might be
easier to discuss by commenting on code that is (hopefully) moving
in the right direction?

I hope that's OK; I'll explicitly call it out in the cover letter
that I am about to send.

- Joe
Joe Damato Sept. 8, 2024, 3:57 p.m. UTC | #19
On Wed, Sep 04, 2024 at 04:54:17PM -0700, Jakub Kicinski wrote:
> On Wed, 4 Sep 2024 16:40:41 -0700 Stanislav Fomichev wrote:
> > > I think what you are proposing seems fine; I'm just working out the
> > > implementation details and making sure I understand before sending
> > > another revision.  
> > 
> > What if instead of an extra storage index in UAPI, we make napi_id persistent?
> > Then we can keep using napi_id as a user-facing number for the configuration.
> > 
> > Having a stable napi_id would also be super useful for the epoll setup so you
> > don't have to match old/invalid ids to the new ones on device reset.
> 
> that'd be nice, initially I thought that we have some drivers that have
> multiple instances of NAPI enabled for a single "index", but I don't
> see such drivers now.
> 
> > In the code, we can keep the same idea with napi_storage in netdev and
> > ask drivers to provide storage id, but keep that id internal.
> > 
> > The only complication with that is napi_hash_add/napi_hash_del that
> > happen in netif_napi_add_weight. So for the devices that allocate
> > new napi before removing the old ones (most devices?), we'd have to add
> > some new netif_napi_takeover(old_napi, new_napi) to remove the
> > old napi_id from the hash and reuse it in the new one.
> > 
> > So for mlx5, the flow would look like the following:
> > 
> > - mlx5e_safe_switch_params
> >   - mlx5e_open_channels
> >     - netif_napi_add(new_napi)
> >       - adds napi with 'ephemeral' napi id
> >   - mlx5e_switch_priv_channels
> >     - mlx5e_deactivate_priv_channels
> >       - napi_disable(old_napi)
> >       - netif_napi_del(old_napi) - this frees the old napi_id
> >   - mlx5e_activate_priv_channels
> >     - mlx5e_activate_channels
> >       - mlx5e_activate_channel
> >         - netif_napi_takeover(old_napi is gone, so probably take id from napi_storage?)
> > 	  - if napi is not hashed - safe to reuse?
> > 	- napi_enable
> > 
> > This is a bit ugly because we still have random napi ids during reset, but
> > is not super complicated implementation-wise. We can eventually improve
> > the above by splitting netif_napi_add_weight into two steps: allocate and
> > activate (to do the napi_id allocation & hashing). Thoughts?
> 
> The "takeover" would be problematic for drivers which free old NAPI
> before allocating new one (bnxt?). But splitting the two steps sounds
> pretty clean. We can add a helper to mark NAPI as "driver will
> explicitly list/hash later", and have the driver call a new helper
> which takes storage ID and lists the NAPI in the hash.

It sounds like you all are suggesting that napi_id is moved into the
napi_storage array, as well? That way NAPI IDs persist even if the
NAPI structs themselves are recreated?
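
If I follow, then when a new NAPI instance is bound to a storage slot, the
core would re-apply everything in that slot, roughly like this (hypothetical
sketch; assumes a napi_storage slot holding napi_id, defer_hard_irqs and
gro_flush_timeout, and ignores the hashing details discussed above):

/* Sketch only: re-apply the persistent per-slot settings (and id) to a
 * freshly added NAPI instance.
 */
static void napi_restore_storage(struct napi_struct *napi,
                                 struct napi_storage *ns)
{
        napi_set_defer_hard_irqs(napi, ns->defer_hard_irqs);
        napi_set_gro_flush_timeout(napi, ns->gro_flush_timeout);

        /* Reuse the stable id if the slot already has one. */
        if (ns->napi_id)
                napi->napi_id = ns->napi_id;
}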

I think that's interesting and I'm open to supporting that. I wrote
up an RFC that moves things in the direction of napi_storage and
modifies 3 drivers, but:
  - it is incorrect because it breaks the persistence thing we are
    talking about, and
  - it doesn't do the two-step take-over described above to inherit
    NAPI IDs (as far as I understand?)

I'm going to send the RFC anyway because I think it'll be easier to
pick up the discussion on code that is hopefully closer to where we
want to land.

I hope that is OK.

- Joe
Jakub Kicinski Sept. 9, 2024, 11:03 p.m. UTC | #20
On Sun, 8 Sep 2024 17:57:22 +0200 Joe Damato wrote:
> I hope that is OK.

I'll comment on the RFC but FWIW, yes, sharing the code with discussion
point / TODOs listed in the cover letter is the right way of moving
things forward, thanks :)
diff mbox series

Patch

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 290894962ac4..cb7049a1d6d8 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -631,6 +631,17 @@  operations:
             - rx-bytes
             - tx-packets
             - tx-bytes
+    -
+      name: napi-set
+      doc: Set configurable NAPI instance settings.
+      attribute-set: napi
+      flags: [ admin-perm ]
+      do:
+        request:
+          attributes:
+            - id
+            - defer-hard-irqs
+            - gro-flush-timeout
 
 mcast-groups:
   list:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@  enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8350a0afa9ec..5ddb5d926850 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -74,6 +74,13 @@  static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE
 	[NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
 };
 
+/* NETDEV_CMD_NAPI_SET - set */
+static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
+	[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
+	[NETDEV_A_NAPI_DEFER_HARD_IRQS] = { .type = NLA_S32 },
+	[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_U64 },
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -151,6 +158,13 @@  static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_QSTATS_SCOPE,
 		.flags		= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_NAPI_SET,
+		.doit		= netdev_nl_napi_set_doit,
+		.policy		= netdev_napi_set_nl_policy,
+		.maxattr	= NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
+		.flags		= GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index 4db40fd5b4a9..b70cb0f20acb 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,7 @@  int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb);
 
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2eee95d05fe0..43dd30eadbc6 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -299,6 +299,51 @@  int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return err;
 }
 
+static int
+netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
+{
+	u64 gro_flush_timeout = 0;
+	int defer = 0;
+
+	if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
+		defer = nla_get_s32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
+		if (defer < 0)
+			return -EINVAL;
+		napi_set_defer_hard_irqs(napi, defer);
+	}
+
+	if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
+		gro_flush_timeout = nla_get_u64(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
+		napi_set_gro_flush_timeout(napi, gro_flush_timeout);
+	}
+
+	return 0;
+}
+
+int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct napi_struct *napi;
+	u32 napi_id;
+	int err;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
+		return -EINVAL;
+
+	napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
+
+	rtnl_lock();
+
+	napi = napi_by_id(napi_id);
+	if (napi)
+		err = netdev_nl_napi_set_config(napi, info);
+	else
+		err = -EINVAL;
+
+	rtnl_unlock();
+
+	return err;
+}
+
 static int
 netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			 u32 q_idx, u32 q_type, const struct genl_info *info)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index b088a34e9254..4c5bfbc85504 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -188,6 +188,7 @@  enum {
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
 	NETDEV_CMD_QSTATS_GET,
+	NETDEV_CMD_NAPI_SET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)