[iwl-next,v2,0/7] ice: managing MSI-X in driver

Message ID: 20240801093115.8553-1-michal.swiatkowski@linux.intel.com

Message

Michal Swiatkowski Aug. 1, 2024, 9:31 a.m. UTC
Hi,

This is another attempt to let the user manage the number of MSI-X vectors
used by each feature in ice. The first attempt used the devlink resources
API, but it wasn't accepted upstream; static MSI-X allocation via devlink
resources also isn't really user friendly.

This attempt uses a more dynamic approach: "dynamic" across the whole kernel
when the platform supports it, and "dynamic" within the driver when it doesn't.

To achieve that, reuse the global devlink parameters pf_msix_max and
pf_msix_min. This fits how the ice hardware counts MSI-X: the number of
MSI-X reported on PCI is the total for the whole card (including MSI-X for
VFs). pf_msix_max lets the user statically set how many MSI-X vectors should
go to the PF and how many should be left reserved for VFs.

pf_msix_min sets the minimum number of MSI-X with which the ice driver
should still probe correctly.

Meaning of these fields for dynamic vs static allocation:
- on a system with dynamic MSI-X allocation support
 * allocate pf_msix_min statically; the rest is allocated dynamically
- on a system without dynamic MSI-X allocation support
 * try to allocate pf_msix_max statically; the minimum acceptable result is
   pf_msix_min
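
For illustration, a possible flow on the devlink side (the PCI address and
values are placeholders; the parameters are driverinit as of v2, so a reload
is needed to apply them):

  $ devlink dev param set pci/0000:31:00.0 name pf_msix_max value 128 cmode driverinit
  $ devlink dev param set pci/0000:31:00.0 name pf_msix_min value 2 cmode driverinit
  $ devlink dev reload pci/0000:31:00.0 action driver_reinit
  $ devlink dev param show pci/0000:31:00.0 name pf_msix_max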

As Jesse and Piotr suggested, pf_msix_max and pf_msix_min can (and probably
should) be stored in NVM. This patchset doesn't implement that.

The dynamic approach (kernel- or driver-side) means that splitting MSI-X
between RDMA and eth on an MSI-X shortage is no longer correct. It can work
when the dynamic part is only on the driver side, but not when it is on the
kernel side.

Let's remove this code and move to allocating MSI-X feature by feature.
If there are not enough MSI-X left for a feature, the feature either works
with fewer MSI-X or is turned off.

There is a regression here. With MSI-X splitting the user could run RDMA and
eth even on a system without enough MSI-X; now only eth will work. RDMA can
be brought back by lowering the number of PF queues and reprobing the RDMA
driver.

Example:
72 CPUs; eth, RDMA and flow director (1 MSI-X), 1 MSI-X for the OICR on the
PF, and 1 more for RDMA. The card uses 1 + 72 + 1 + 72 + 1 = 147 vectors.

We set pf_msix_min = 2 and pf_msix_max = 128:

OICR: 1
eth: 72
RDMA: 128 - 73 = 55
flow director: turned off (not enough MSI-X)

We can change the number of queues on the PF to 36 and do a devlink reinit,
for example:
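
(Sketch only; the interface name and PCI address are placeholders.)

  $ ethtool -L ens801f0 combined 36
  $ devlink dev reload pci/0000:31:00.0 action driver_reinit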

OICR: 1
eth: 36
RDMA: 73
flow director: 1

We can also turn RDMA off (implemented in "ice: enable_rdma devlink param"),
for example:
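
(Again only a sketch, assuming the parameter follows the generic enable_rdma
devlink parameter and, like the others, is applied on reload.)

  $ devlink dev param set pci/0000:31:00.0 name enable_rdma value false cmode driverinit
  $ devlink dev reload pci/0000:31:00.0 action driver_reinit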

OICR: 1
eth: 72
RDMA: 0 (turned off)
flow director: 1

Maybe flow director should have higher priority than RDMA? It needs only
1 MSI-X, so it seems more logical to lower RDMA by one than to max out MSI-X
on RDMA and turn off flow director (as happens by default now).

After these changes we have a static base vector for SR-IOV (and probably
SIOV in the future). The last patch of this series simplifies the VF MSI-X
management code based on that static vector.
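
Purely as an illustration, assuming VF MSI-X counts are still managed through
the standard sriov_vf_msix_count sysfs attribute (paths are placeholders; the
VF should not be in use while changing it):

  # give the first VF of the PF at 0000:31:00.0 eight MSI-X vectors
  $ echo 8 > /sys/bus/pci/devices/0000:31:00.0/virtfn0/sriov_vf_msix_count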

Changing the number of queues via ethtool now also changes MSI-X. If there
are enough MSI-X, the mapping is always one to one; when there are not,
there will be more queues than MSI-X. There is currently no way to set how
many queues should be used per MSI-X vector. Maybe we should introduce
another ethtool parameter for it, something like queues_per_vector?
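
One rough way to observe this today (interface name is a placeholder):

  $ ethtool -L ens801f0 combined 16      # request 16 queue pairs
  $ grep -c ens801f0 /proc/interrupts    # per-queue vectors currently allocated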

v1 --> v2: [1]
 * change the MSI-X devlink parameters from permanent cmode to driverinit
 * remove locking during devlink parameter registration (it is now
   locked for the whole init/deinit part)

[1] https://lore.kernel.org/netdev/20240213073509.77622-1-michal.swiatkowski@linux.intel.com/

Michal Swiatkowski (7):
  ice: devlink PF MSI-X max and min parameter
  ice: remove splitting MSI-X between features
  ice: get rid of num_lan_msix field
  ice, irdma: move interrupts code to irdma
  ice: treat dyn_allowed only as suggestion
  ice: enable_rdma devlink param
  ice: simplify VF MSI-X managing

 drivers/infiniband/hw/irdma/hw.c              |   2 -
 drivers/infiniband/hw/irdma/main.c            |  46 ++-
 drivers/infiniband/hw/irdma/main.h            |   3 +
 .../net/ethernet/intel/ice/devlink/devlink.c  |  75 ++++-
 drivers/net/ethernet/intel/ice/ice.h          |  21 +-
 drivers/net/ethernet/intel/ice/ice_base.c     |  10 +-
 drivers/net/ethernet/intel/ice/ice_ethtool.c  |   8 +-
 drivers/net/ethernet/intel/ice/ice_idc.c      |  64 +---
 drivers/net/ethernet/intel/ice/ice_irq.c      | 277 ++++++------------
 drivers/net/ethernet/intel/ice/ice_irq.h      |  13 +-
 drivers/net/ethernet/intel/ice/ice_lib.c      |  36 ++-
 drivers/net/ethernet/intel/ice/ice_sriov.c    | 153 +---------
 include/linux/net/intel/iidc.h                |   2 +
 13 files changed, 287 insertions(+), 423 deletions(-)

Comments

Przemek Kitszel Aug. 2, 2024, 12:41 p.m. UTC | #1
On 8/1/24 11:31, Michal Swiatkowski wrote:
> Hi,
> 
> It is another try to allow user to manage amount of MSI-X used for each
> feature in ice. First was via devlink resources API, it wasn't accepted
> in upstream. Also static MSI-X allocation using devlink resources isn't
> really user friendly.
> 
> This try is using more dynamic way. "Dynamic" across whole kernel when
> platform supports it and "dynamic" across the driver when not.
> 
> To achieve that reuse global devlink parameter pf_msix_max and
> pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount
> of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for
> VFs also). Having pf_msix_max allow user to statically set how many
> MSI-X he wants on PF and how many should be reserved for VFs.
> 
> pf_msix_min is used to set minimum number of MSI-X with which ice driver
> should probe correctly.
> 
> Meaning of this field in case of dynamic vs static allocation:
> - on system with dynamic MSI-X allocation support
>   * alloc pf_msix_min as static, rest will be allocated dynamically
> - on system without dynamic MSI-X allocation support
>   * try alloc pf_msix_max as static, minimum acceptable result is
>   pf_msix_min
> 
> As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an
> probably should) be stored in NVM. This patchset isn't implementing
> that.
> 
> Dynamic (kernel or driver) way means that splitting MSI-X across the
> RDMA and eth in case there is a MSI-X shortage isn't correct. Can work
> when dynamic is only on driver site, but can't when dynamic is on kernel
> site.
> 
> Let's remove this code and move to MSI-X allocation feature by feature.
> If there is no more MSI-X for a feature, a feature is working with less
> MSI-X or it is turned off.
> 
> There is a regression here. With MSI-X splitting user can run RDMA and
> eth even on system with not enough MSI-X. Now only eth will work. RDMA
> can be turned on by changing number of PF queues (lowering) and reprobe
> RDMA driver.
> 
> Example:
> 72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
> on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147.
> 
> We set pf_msix_min = 2, pf_msix_max = 128
> 
> OICR: 1
> eth: 72
> RDMA: 128 - 73 = 55
> flow director: turned off not enough MSI-X
> 
> We can change number of queues on pf to 36 and do devlink reinit
> 
> OICR: 1
> eth: 36
> RDMA: 73
> flow director: 1
> 
> We can also (implemented in "ice: enable_rdma devlink param") turned
> RDMA off.
> 
> OICR: 1
> eth: 72
> RDMA: 0 (turned off)
> flow director: 1
> 
> Maybe flow director should have higher priority than RDMA? It needs only
> 1 MSI-X, so it seems more logic to lower RDMA by one then maxing MSI-X
> on RDMA and turning off flow director (as default).

Sounds better, less surprising, with only RDMA being affected by this
series as a "regression".

> 
> After this changes we have a static base vector for SRIOV (SIOV probably
> in the feature). Last patch from this series is simplifying managing VF
> MSI-X code based on static vector.
> 
> Now changing queues using ethtool is also changing MSI-X. If there is
> enough MSI-X it is always one to one. When there is not enough there
> will be more queues than MSI-X. There is a lack of ability to set how
> many queues should be used per MSI-X. Maybe we should introduce another
> ethtool param for it? Sth like queues_per_vector?

Our 1:1 mapping was too rigid a solution (but performant); I like MSI-X
being kept as a detail and [the setting of it] decoupled from being
mandatory on [at least some] flows. Tuning the mapping could be useful,
especially in heterogeneous scenarios (like keeping XDP stuff separate).
Could be left for the future.

What happens when the user decreases the number of MSI-X? Will the queues
just get remapped to the remaining vectors?

> 
> v1 --> v2: [1]
>   * change permanent MSI-X cmode parameters to driverinit
>   * remove locking during devlink parameter registration (it is now
>     locked for whole init/deinit part)
> 
> [1] https://lore.kernel.org/netdev/20240213073509.77622-1-michal.swiatkowski@linux.intel.com/
> 
> Michal Swiatkowski (7):
>    ice: devlink PF MSI-X max and min parameter
>    ice: remove splitting MSI-X between features
>    ice: get rid of num_lan_msix field
>    ice, irdma: move interrupts code to irdma
>    ice: treat dyn_allowed only as suggestion
>    ice: enable_rdma devlink param
>    ice: simplify VF MSI-X managing
> 
>   drivers/infiniband/hw/irdma/hw.c              |   2 -
>   drivers/infiniband/hw/irdma/main.c            |  46 ++-
>   drivers/infiniband/hw/irdma/main.h            |   3 +
>   .../net/ethernet/intel/ice/devlink/devlink.c  |  75 ++++-
>   drivers/net/ethernet/intel/ice/ice.h          |  21 +-
>   drivers/net/ethernet/intel/ice/ice_base.c     |  10 +-
>   drivers/net/ethernet/intel/ice/ice_ethtool.c  |   8 +-
>   drivers/net/ethernet/intel/ice/ice_idc.c      |  64 +---
>   drivers/net/ethernet/intel/ice/ice_irq.c      | 277 ++++++------------
>   drivers/net/ethernet/intel/ice/ice_irq.h      |  13 +-
>   drivers/net/ethernet/intel/ice/ice_lib.c      |  36 ++-
>   drivers/net/ethernet/intel/ice/ice_sriov.c    | 153 +---------
>   include/linux/net/intel/iidc.h                |   2 +
>   13 files changed, 287 insertions(+), 423 deletions(-)
>
Michal Swiatkowski Aug. 5, 2024, 4:54 a.m. UTC | #2
On Fri, Aug 02, 2024 at 02:41:54PM +0200, Przemek Kitszel wrote:
> On 8/1/24 11:31, Michal Swiatkowski wrote:
> > Hi,
> > 
> > It is another try to allow user to manage amount of MSI-X used for each
> > feature in ice. First was via devlink resources API, it wasn't accepted
> > in upstream. Also static MSI-X allocation using devlink resources isn't
> > really user friendly.
> > 
> > This try is using more dynamic way. "Dynamic" across whole kernel when
> > platform supports it and "dynamic" across the driver when not.
> > 
> > To achieve that reuse global devlink parameter pf_msix_max and
> > pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount
> > of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for
> > VFs also). Having pf_msix_max allow user to statically set how many
> > MSI-X he wants on PF and how many should be reserved for VFs.
> > 
> > pf_msix_min is used to set minimum number of MSI-X with which ice driver
> > should probe correctly.
> > 
> > Meaning of this field in case of dynamic vs static allocation:
> > - on system with dynamic MSI-X allocation support
> >   * alloc pf_msix_min as static, rest will be allocated dynamically
> > - on system without dynamic MSI-X allocation support
> >   * try alloc pf_msix_max as static, minimum acceptable result is
> >   pf_msix_min
> > 
> > As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an
> > probably should) be stored in NVM. This patchset isn't implementing
> > that.
> > 
> > Dynamic (kernel or driver) way means that splitting MSI-X across the
> > RDMA and eth in case there is a MSI-X shortage isn't correct. Can work
> > when dynamic is only on driver site, but can't when dynamic is on kernel
> > site.
> > 
> > Let's remove this code and move to MSI-X allocation feature by feature.
> > If there is no more MSI-X for a feature, a feature is working with less
> > MSI-X or it is turned off.
> > 
> > There is a regression here. With MSI-X splitting user can run RDMA and
> > eth even on system with not enough MSI-X. Now only eth will work. RDMA
> > can be turned on by changing number of PF queues (lowering) and reprobe
> > RDMA driver.
> > 
> > Example:
> > 72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
> > on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147.
> > 
> > We set pf_msix_min = 2, pf_msix_max = 128
> > 
> > OICR: 1
> > eth: 72
> > RDMA: 128 - 73 = 55
> > flow director: turned off not enough MSI-X
> > 
> > We can change number of queues on pf to 36 and do devlink reinit
> > 
> > OICR: 1
> > eth: 36
> > RDMA: 73
> > flow director: 1
> > 
> > We can also (implemented in "ice: enable_rdma devlink param") turned
> > RDMA off.
> > 
> > OICR: 1
> > eth: 72
> > RDMA: 0 (turned off)
> > flow director: 1
> > 
> > Maybe flow director should have higher priority than RDMA? It needs only
> > 1 MSI-X, so it seems more logic to lower RDMA by one then maxing MSI-X
> > on RDMA and turning off flow director (as default).
> 
> sounds better, less surprising, with only RDMA being affected by this
> series as "regression"
> 

Sounds reasonable

> > 
> > After this changes we have a static base vector for SRIOV (SIOV probably
> > in the feature). Last patch from this series is simplifying managing VF
> > MSI-X code based on static vector.
> > 
> > Now changing queues using ethtool is also changing MSI-X. If there is
> > enough MSI-X it is always one to one. When there is not enough there
> > will be more queues than MSI-X. There is a lack of ability to set how
> > many queues should be used per MSI-X. Maybe we should introduce another
> > ethtool param for it? Sth like queues_per_vector?
> 
> Our 1:1 mapping was too rigid solution (but performant), I like MSI-Xes
> being kept as a detail and [setting of them] decoupled from being
> mandatory on [at least some] flows. Tuning the mapping could be useful,
> esp in heterotelic scenarios (like keeping XDP stuff separate). Could be
> left for the future.
> 
> What happens when user decreases number of MSI-X, queues will just get
> remapped to other?
>

Yes, the queues will be remapped.

> > 
> > v1 --> v2: [1]
> >   * change permanent MSI-X cmode parameters to driverinit
> >   * remove locking during devlink parameter registration (it is now
> >     locked for whole init/deinit part)
> > 
> > [1] https://lore.kernel.org/netdev/20240213073509.77622-1-michal.swiatkowski@linux.intel.com/
> > 
> > Michal Swiatkowski (7):
> >    ice: devlink PF MSI-X max and min parameter
> >    ice: remove splitting MSI-X between features
> >    ice: get rid of num_lan_msix field
> >    ice, irdma: move interrupts code to irdma
> >    ice: treat dyn_allowed only as suggestion
> >    ice: enable_rdma devlink param
> >    ice: simplify VF MSI-X managing
> > 
> >   drivers/infiniband/hw/irdma/hw.c              |   2 -
> >   drivers/infiniband/hw/irdma/main.c            |  46 ++-
> >   drivers/infiniband/hw/irdma/main.h            |   3 +
> >   .../net/ethernet/intel/ice/devlink/devlink.c  |  75 ++++-
> >   drivers/net/ethernet/intel/ice/ice.h          |  21 +-
> >   drivers/net/ethernet/intel/ice/ice_base.c     |  10 +-
> >   drivers/net/ethernet/intel/ice/ice_ethtool.c  |   8 +-
> >   drivers/net/ethernet/intel/ice/ice_idc.c      |  64 +---
> >   drivers/net/ethernet/intel/ice/ice_irq.c      | 277 ++++++------------
> >   drivers/net/ethernet/intel/ice/ice_irq.h      |  13 +-
> >   drivers/net/ethernet/intel/ice/ice_lib.c      |  36 ++-
> >   drivers/net/ethernet/intel/ice/ice_sriov.c    | 153 +---------
> >   include/linux/net/intel/iidc.h                |   2 +
> >   13 files changed, 287 insertions(+), 423 deletions(-)
> > 
>