mbox series

[v2,00/14] genirq: Cleanup the usage of irq_set_affinity_hint

Message ID 20210629152746.2953364-1-nitesh@redhat.com (mailing list archive)
Headers show
Series genirq: Cleanup the usage of irq_set_affinity_hint | expand

Message

Nitesh Narayan Lal June 29, 2021, 3:27 p.m. UTC
The drivers currently rely on irq_set_affinity_hint() to either set the
affinity_hint that is consumed by the userspace and/or to enforce a custom
affinity.

irq_set_affinity_hint() as the name suggests is originally introduced to
only set the affinity_hint to help the userspace in guiding the interrupts
and not the affinity itself. However, since the commit

        e2e64a932556 "genirq: Set initial affinity in irq_set_affinity_hint()"

irq_set_affinity_hint() also started applying the provided cpumask (if not
NULL) as the affinity for the interrupts. The issue that this commit was
trying to solve is to allow the drivers to enforce their affinity mask to
distribute the interrupts across the CPUs such that they don't always end
up on CPU0. This issue has been resolved within the irq subsystem since the
commit

        a0c9259dc4e1 "irq/matrix: Spread interrupts on allocation"

Hence, there is no need for the drivers to overwrite the affinity to spread
as it is dynamically performed at the time of allocation.

Also, irq_set_affinity_hint() setting affinity unconditionally introduces
issues for the drivers that only want to set their affinity_hint and not the
affinity itself as for these driver interrupts the default_smp_affinity_mask
is completely ignored (for detailed investigation please refer to [1]).

Unfortunately reverting the commit e2e64a932556 is not an option at this
point for two reasons [2]:

- Several drivers for a valid reason (performance) rely on this API to
  enforce their affinity mask

- Until very recently this was the only exported interface that was
  available

To clear this out Thomas has come up with the following interfaces:

- irq_set_affinity(): only sets affinity of an IRQ [3]
- irq_update_affinity_hint(): Only sets the hint [4]
- irq_set_affinity_and_hint(): Sets both affinity and the hint mask [4]

The first API is already merged in the linux-next tree and the patch
that introduces the other two interfaces are included with this patch-set.

To move to the stage where we can safely get rid of the
irq_set_affinity_hint(), which has been marked deprecated, we have to
move all its consumers to these new interfaces. In this patch-set, I have
done that for a few drivers and will hopefully try to move the remaining of
them in the coming days.

Testing
-------
In terms of testing, I have performed some basic testing on x86 to verify
things such as the interrupts are evenly spread on all CPUs, hint mask is
correctly set etc. for the drivers - i40e, iavf, mlx5, mlx4, ixgbe, i40iw
and enic on top of:

        git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

So more testing is probably required for these and the drivers that I didn't
test and any help will be much appreciated.


Notes
-----
- I was told that i40iw driver is going to be replaced by irdma, however,
  the new driver didn't land in Linus's tree yet. Once it does I will send
  a follow up patch for that as well.

- For the mpt3sas driver I decided to go with the usage of
  irq_set_affinity_and_hint over irq_set_affinity based on my little
  analysis of it and the megaraid driver. However, if we are sure that it
  is not required then I can replace it with just irq_set_affinity as one
  of its comment suggests.


Change from v1 [5]
------------------
- Fixed compilation error by adding the new interface definitions for cases
  where CONFIG_SMP is not defined

- Fixed function usage in megaraid_sas and removed unnecessary variable
  (Robin Murphy)

- Removed unwanted #if/endif from mlx4 (Leon Romanovsky)

- Other indentation related fixes

 
[1] https://lore.kernel.org/lkml/1a044a14-0884-eedb-5d30-28b4bec24b23@redhat.com/
[2] https://lore.kernel.org/linux-pci/d1d5e797-49ee-4968-88c6-c07119343492@arm.com/
[3] https://lore.kernel.org/linux-arm-kernel/20210518091725.046774792@linutronix.de/
[4] https://lore.kernel.org/patchwork/patch/1434326/
[5] https://lore.kernel.org/linux-scsi/20210617182242.8637-1-nitesh@redhat.com/


Nitesh Narayan Lal (13):
  iavf: Use irq_update_affinity_hint
  i40e: Use irq_update_affinity_hint
  scsi: megaraid_sas: Use irq_set_affinity_and_hint
  scsi: mpt3sas: Use irq_set_affinity_and_hint
  RDMA/i40iw: Use irq_update_affinity_hint
  enic: Use irq_update_affinity_hint
  be2net: Use irq_update_affinity_hint
  ixgbe: Use irq_update_affinity_hint
  mailbox: Use irq_update_affinity_hint
  scsi: lpfc: Use irq_set_affinity
  hinic: Use irq_set_affinity_and_hint
  net/mlx5: Use irq_update_affinity_hint
  net/mlx4: Use irq_update_affinity_hint

Thomas Gleixner (1):
  genirq: Provide new interfaces for affinity hints

 drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
 drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
 drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
 drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
 drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
 drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
 drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
 drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
 drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
 include/linux/interrupt.h                     | 53 ++++++++++++++++++-
 kernel/irq/manage.c                           |  8 +--
 15 files changed, 113 insertions(+), 64 deletions(-)

--

Comments

Nitesh Lal July 8, 2021, 7:24 p.m. UTC | #1
On Tue, Jun 29, 2021 at 11:28 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> The drivers currently rely on irq_set_affinity_hint() to either set the
> affinity_hint that is consumed by the userspace and/or to enforce a custom
> affinity.
>
> irq_set_affinity_hint() as the name suggests is originally introduced to
> only set the affinity_hint to help the userspace in guiding the interrupts
> and not the affinity itself. However, since the commit
>
>         e2e64a932556 "genirq: Set initial affinity in irq_set_affinity_hint()"
>
> irq_set_affinity_hint() also started applying the provided cpumask (if not
> NULL) as the affinity for the interrupts. The issue that this commit was
> trying to solve is to allow the drivers to enforce their affinity mask to
> distribute the interrupts across the CPUs such that they don't always end
> up on CPU0. This issue has been resolved within the irq subsystem since the
> commit
>
>         a0c9259dc4e1 "irq/matrix: Spread interrupts on allocation"
>
> Hence, there is no need for the drivers to overwrite the affinity to spread
> as it is dynamically performed at the time of allocation.
>
> Also, irq_set_affinity_hint() setting affinity unconditionally introduces
> issues for the drivers that only want to set their affinity_hint and not the
> affinity itself as for these driver interrupts the default_smp_affinity_mask
> is completely ignored (for detailed investigation please refer to [1]).
>
> Unfortunately reverting the commit e2e64a932556 is not an option at this
> point for two reasons [2]:
>
> - Several drivers for a valid reason (performance) rely on this API to
>   enforce their affinity mask
>
> - Until very recently this was the only exported interface that was
>   available
>
> To clear this out Thomas has come up with the following interfaces:
>
> - irq_set_affinity(): only sets affinity of an IRQ [3]
> - irq_update_affinity_hint(): Only sets the hint [4]
> - irq_set_affinity_and_hint(): Sets both affinity and the hint mask [4]
>
> The first API is already merged in the linux-next tree and the patch
> that introduces the other two interfaces are included with this patch-set.
>
> To move to the stage where we can safely get rid of the
> irq_set_affinity_hint(), which has been marked deprecated, we have to
> move all its consumers to these new interfaces. In this patch-set, I have
> done that for a few drivers and will hopefully try to move the remaining of
> them in the coming days.
>
> Testing
> -------
> In terms of testing, I have performed some basic testing on x86 to verify
> things such as the interrupts are evenly spread on all CPUs, hint mask is
> correctly set etc. for the drivers - i40e, iavf, mlx5, mlx4, ixgbe, i40iw
> and enic on top of:
>
>         git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> So more testing is probably required for these and the drivers that I didn't
> test and any help will be much appreciated.
>
>
> Notes
> -----
> - I was told that i40iw driver is going to be replaced by irdma, however,
>   the new driver didn't land in Linus's tree yet. Once it does I will send
>   a follow up patch for that as well.
>
> - For the mpt3sas driver I decided to go with the usage of
>   irq_set_affinity_and_hint over irq_set_affinity based on my little
>   analysis of it and the megaraid driver. However, if we are sure that it
>   is not required then I can replace it with just irq_set_affinity as one
>   of its comment suggests.
>
>
> Change from v1 [5]
> ------------------
> - Fixed compilation error by adding the new interface definitions for cases
>   where CONFIG_SMP is not defined
>
> - Fixed function usage in megaraid_sas and removed unnecessary variable
>   (Robin Murphy)
>
> - Removed unwanted #if/endif from mlx4 (Leon Romanovsky)
>
> - Other indentation related fixes
>
>
> [1] https://lore.kernel.org/lkml/1a044a14-0884-eedb-5d30-28b4bec24b23@redhat.com/
> [2] https://lore.kernel.org/linux-pci/d1d5e797-49ee-4968-88c6-c07119343492@arm.com/
> [3] https://lore.kernel.org/linux-arm-kernel/20210518091725.046774792@linutronix.de/
> [4] https://lore.kernel.org/patchwork/patch/1434326/
> [5] https://lore.kernel.org/linux-scsi/20210617182242.8637-1-nitesh@redhat.com/
>
>
> Nitesh Narayan Lal (13):
>   iavf: Use irq_update_affinity_hint
>   i40e: Use irq_update_affinity_hint
>   scsi: megaraid_sas: Use irq_set_affinity_and_hint
>   scsi: mpt3sas: Use irq_set_affinity_and_hint
>   RDMA/i40iw: Use irq_update_affinity_hint
>   enic: Use irq_update_affinity_hint
>   be2net: Use irq_update_affinity_hint
>   ixgbe: Use irq_update_affinity_hint
>   mailbox: Use irq_update_affinity_hint
>   scsi: lpfc: Use irq_set_affinity
>   hinic: Use irq_set_affinity_and_hint
>   net/mlx5: Use irq_update_affinity_hint
>   net/mlx4: Use irq_update_affinity_hint
>
> Thomas Gleixner (1):
>   genirq: Provide new interfaces for affinity hints
>
>  drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
>  drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
>  drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
>  drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
>  drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
>  drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
>  drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
>  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
>  drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
>  drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
>  drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
>  include/linux/interrupt.h                     | 53 ++++++++++++++++++-
>  kernel/irq/manage.c                           |  8 +--
>  15 files changed, 113 insertions(+), 64 deletions(-)
>
> --
>
>

Gentle ping.
Any comments or suggestions on any of the patches included in this series?
Leon Romanovsky July 11, 2021, 11:31 a.m. UTC | #2
On Thu, Jul 08, 2021 at 03:24:20PM -0400, Nitesh Lal wrote:
> On Tue, Jun 29, 2021 at 11:28 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:

<...>

> >
> >  drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
> >  drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
> >  drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
> >  drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
> >  drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
> >  drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
> >  drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
> >  drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
> >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
> >  drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
> >  drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
> >  drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
> >  include/linux/interrupt.h                     | 53 ++++++++++++++++++-
> >  kernel/irq/manage.c                           |  8 +--
> >  15 files changed, 113 insertions(+), 64 deletions(-)
> >
> > --
> >
> >
> 
> Gentle ping.
> Any comments or suggestions on any of the patches included in this series?

Please wait for -rc1, rebase and resend.
At least i40iw was deleted during merge window.

Thanks

> 
> -- 
> Thanks
> Nitesh
>
Nitesh Lal July 12, 2021, 2:13 p.m. UTC | #3
On Sun, Jul 11, 2021 at 7:32 AM Leon Romanovsky <leonro@nvidia.com> wrote:
>
> On Thu, Jul 08, 2021 at 03:24:20PM -0400, Nitesh Lal wrote:
> > On Tue, Jun 29, 2021 at 11:28 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> <...>
>
> > >
> > >  drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
> > >  drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
> > >  drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
> > >  drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
> > >  drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
> > >  drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
> > >  drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
> > >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
> > >  drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
> > >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
> > >  drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
> > >  drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
> > >  drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
> > >  include/linux/interrupt.h                     | 53 ++++++++++++++++++-
> > >  kernel/irq/manage.c                           |  8 +--
> > >  15 files changed, 113 insertions(+), 64 deletions(-)
> > >
> > > --
> > >
> > >
> >
> > Gentle ping.
> > Any comments or suggestions on any of the patches included in this series?
>
> Please wait for -rc1, rebase and resend.
> At least i40iw was deleted during merge window.
>

Right, will rebase on top of 5.14-rc1 and resend.
Nitesh Lal July 12, 2021, 9:27 p.m. UTC | #4
Hi Leon,

On Sun, Jul 11, 2021 at 7:32 AM Leon Romanovsky <leonro@nvidia.com> wrote:
>
> On Thu, Jul 08, 2021 at 03:24:20PM -0400, Nitesh Lal wrote:
> > On Tue, Jun 29, 2021 at 11:28 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> <...>
>
> > >
> > >  drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
> > >  drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
> > >  drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
> > >  drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
> > >  drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
> > >  drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
> > >  drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
> > >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
> > >  drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
> > >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
> > >  drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
> > >  drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
> > >  drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
> > >  include/linux/interrupt.h                     | 53 ++++++++++++++++++-
> > >  kernel/irq/manage.c                           |  8 +--
> > >  15 files changed, 113 insertions(+), 64 deletions(-)
> > >
> > > --
> > >
> > >
> >
> > Gentle ping.
> > Any comments or suggestions on any of the patches included in this series?
>
> Please wait for -rc1, rebase and resend.
> At least i40iw was deleted during merge window.
>

In -rc1 some non-trivial mlx5 changes also went in.  I was going through
these changes and it seems after your patch

e4e3f24b822f: ("net/mlx5: Provide cpumask at EQ creation phase")

we do want to control the affinity for the mlx5 interrupts from the driver.
Is that correct? This would mean that we should use
irq_set_affinity_and_hint() instead
of irq_update_affinity_hint().
Leon Romanovsky July 13, 2021, 5:01 a.m. UTC | #5
On Mon, Jul 12, 2021 at 05:27:05PM -0400, Nitesh Lal wrote:
> Hi Leon,
> 
> On Sun, Jul 11, 2021 at 7:32 AM Leon Romanovsky <leonro@nvidia.com> wrote:
> >
> > On Thu, Jul 08, 2021 at 03:24:20PM -0400, Nitesh Lal wrote:
> > > On Tue, Jun 29, 2021 at 11:28 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >
> > <...>
> >
> > > >
> > > >  drivers/infiniband/hw/i40iw/i40iw_main.c      |  4 +-
> > > >  drivers/mailbox/bcm-flexrm-mailbox.c          |  4 +-
> > > >  drivers/net/ethernet/cisco/enic/enic_main.c   |  8 +--
> > > >  drivers/net/ethernet/emulex/benet/be_main.c   |  4 +-
> > > >  drivers/net/ethernet/huawei/hinic/hinic_rx.c  |  4 +-
> > > >  drivers/net/ethernet/intel/i40e/i40e_main.c   |  8 +--
> > > >  drivers/net/ethernet/intel/iavf/iavf_main.c   |  8 +--
> > > >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 10 ++--
> > > >  drivers/net/ethernet/mellanox/mlx4/eq.c       |  8 ++-
> > > >  .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  6 +--
> > > >  drivers/scsi/lpfc/lpfc_init.c                 |  4 +-
> > > >  drivers/scsi/megaraid/megaraid_sas_base.c     | 27 +++++-----
> > > >  drivers/scsi/mpt3sas/mpt3sas_base.c           | 21 ++++----
> > > >  include/linux/interrupt.h                     | 53 ++++++++++++++++++-
> > > >  kernel/irq/manage.c                           |  8 +--
> > > >  15 files changed, 113 insertions(+), 64 deletions(-)
> > > >
> > > > --
> > > >
> > > >
> > >
> > > Gentle ping.
> > > Any comments or suggestions on any of the patches included in this series?
> >
> > Please wait for -rc1, rebase and resend.
> > At least i40iw was deleted during merge window.
> >
> 
> In -rc1 some non-trivial mlx5 changes also went in.  I was going through
> these changes and it seems after your patch
> 
> e4e3f24b822f: ("net/mlx5: Provide cpumask at EQ creation phase")
> 
> we do want to control the affinity for the mlx5 interrupts from the driver.
> Is that correct? 

We would like to create devices with correct affinity from the
beginning. For this, we will introduce extension to devlink to control
affinity that will be used prior initialization sequence.

Currently, netdev users who don't want irqbalance are digging into
their procfs, reconfigure affinity on already existing devices and
hope for the best. 

This is even more cumbersome for the SIOV use case, where every physical
NIC PCI device will/can create thousands of lightweights netdevs that will
be forwarded to the containers later. These containers are limited to known
CPU cores, so no reason do not limit netdev device too.

The same goes for other sub-functions of that PCI device, like RDMA,
vdpa e.t.c.

> This would mean that we should use irq_set_affinity_and_hint() instead
> of irq_update_affinity_hint().

I think so.

Thanks

> 
> -- 
> Thanks
> Nitesh
>
Nitesh Lal July 13, 2021, 1:29 p.m. UTC | #6
On Tue, Jul 13, 2021 at 1:01 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Jul 12, 2021 at 05:27:05PM -0400, Nitesh Lal wrote:
> > Hi Leon,
> >

<snip>

> > > >
> > > > Gentle ping.
> > > > Any comments or suggestions on any of the patches included in this series?
> > >
> > > Please wait for -rc1, rebase and resend.
> > > At least i40iw was deleted during merge window.
> > >
> >
> > In -rc1 some non-trivial mlx5 changes also went in.  I was going through
> > these changes and it seems after your patch
> >
> > e4e3f24b822f: ("net/mlx5: Provide cpumask at EQ creation phase")
> >
> > we do want to control the affinity for the mlx5 interrupts from the driver.
> > Is that correct?
>
> We would like to create devices with correct affinity from the
> beginning. For this, we will introduce extension to devlink to control
> affinity that will be used prior initialization sequence.
>
> Currently, netdev users who don't want irqbalance are digging into
> their procfs, reconfigure affinity on already existing devices and
> hope for the best.
>
> This is even more cumbersome for the SIOV use case, where every physical
> NIC PCI device will/can create thousands of lightweights netdevs that will
> be forwarded to the containers later. These containers are limited to known
> CPU cores, so no reason do not limit netdev device too.
>
> The same goes for other sub-functions of that PCI device, like RDMA,
> vdpa e.t.c.
>
> > This would mean that we should use irq_set_affinity_and_hint() instead
> > of irq_update_affinity_hint().
>
> I think so.
>

Thanks, will make that change in the patch and re-send.
I will also drop your reviewed-by for the mlx5 patch so that you can
have a look at it again, please let me know if you have any
objections.