[net-next,0/4] devlink: Introduce cpu_affinity command

Message ID 20220222105812.18668-1-shayd@nvidia.com (mailing list archive)

Shay Drori Feb. 22, 2022, 10:58 a.m. UTC
Currently, a user can configure the IRQ CPU affinity of a device only
via the global /proc/irq/../smp_affinity interface. However, this
interface changes the affinity globally, across all subsystems connected
to the device.
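
For reference, the existing interface takes a hexadecimal CPU bitmask
written to a per-IRQ smp_affinity file. A minimal Python sketch of that
mechanism follows. The helper names and the proc_root parameter are
illustrative assumptions, not part of this series, and the sketch only
handles CPUs 0..31 (larger masks are written as comma-separated 32-bit
groups).

```python
import os


def cpus_to_hex_mask(cpus):
    """Return a hex bitmask string with bit N set for each CPU N (0..31)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            # Assumption of this sketch: a single 32-bit group is enough.
            raise ValueError("this sketch only handles CPUs 0..31")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc/irq"):
    """Write the mask to <proc_root>/<irq>/smp_affinity.

    On a real system this needs root, and it changes the affinity for
    every subsystem that shares the IRQ.
    """
    path = os.path.join(proc_root, str(irq), "smp_affinity")
    with open(path, "w") as f:
        f.write(cpus_to_hex_mask(cpus))
    return path
```

Because the write applies to the IRQ itself, every queue that shares
that vector moves with it, which is the limitation described above.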

Historically, this API has been useful for single-function devices since,
generally speaking, the queue structure created on top of the IRQ
vectors is predictable enough that this control is usable.

However, with complex multi-subsystem devices, like mlx5, the
assignment of queues at every layer throughout the software stack is
complex, and there are multiple queues, each for a different usage, over
the same IRQ. Hence, simply adjusting the base IRQ is no longer
effective.

As an example, mlx5 SFs can share an MSI-X IRQ between themselves, which
means that the user currently has no control over which SF uses which
CPU set. Hence, an application and its IRQ can run on different
CPUs, which leads to lower performance, as shown in the table below.

application=netperf,    SF-IRQ     channel affinity   latency(usec)
                                                      (lower is better)
cpu=0 (numa=0)           cpu={0}   cpu={0}            14.417
cpu=8 (numa=0)           cpu={0}   cpu={0}            15.114 (+5%)
cpu=1 (numa=1)           cpu={0}   cpu={0}            17.784 (+23%)

This series is a start at resolving this problem by inverting the
control of the affinities. Instead of having the user go around behind
the driver and adjust the IRQs the driver already created, we want
the user to tell the software layer which CPUs to use and have the
software layer manage this. The suggested command will then trickle
down to the PCI driver, which will create/share MSI-X IRQs and resources
to achieve it. In the mlx5 SF example, the involved software components
would be devlink, rdma, vdpa and netdev.

This series introduces a devlink control that assigns a CPU set to the
cross-subsystem mlx5_core PCI function device. This can be used on a
PF, VF or SF, and it restricts all the software layers above it to the
given CPU set.

For each specified CPU, the SF either uses an existing IRQ affiliated
with that CPU or a new IRQ available from the device. For example, if
the user gives affinity 3 (11 in binary, i.e. CPUs 0 and 1), the SF will
create the completion EQs the driver requires internally, attached to
the IRQs of these specific CPUs.
If the SF is already fully probed, a devlink reload is required for
cpu_affinity to take effect.
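
To make the bitmask example concrete, here is a small Python sketch
that decodes an affinity value into CPU indices and then models the
"reuse an existing IRQ, otherwise take a new one" rule. It is a toy
model, not the mlx5 implementation: the function names, the existing
and free_irqs arguments, and the IRQ numbers are all hypothetical.

```python
def mask_to_cpus(mask):
    """Return the sorted CPU indices whose bits are set in mask."""
    if mask < 0:
        raise ValueError("mask must be non-negative")
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


def pick_irqs(mask, existing, free_irqs):
    """Map each requested CPU to an IRQ (toy model).

    existing:  dict of cpu -> IRQ already affiliated with that CPU
    free_irqs: list of IRQ numbers still available from the device
    """
    free = list(free_irqs)  # copy so the caller's list is not modified
    chosen = {}
    for cpu in mask_to_cpus(mask):
        if cpu in existing:
            chosen[cpu] = existing[cpu]  # share the IRQ already on this CPU
        elif free:
            chosen[cpu] = free.pop(0)    # otherwise take a new one
        else:
            raise RuntimeError("no IRQ available for CPU %d" % cpu)
    return chosen
```

With affinity 3, mask_to_cpus yields CPUs 0 and 1, and pick_irqs would
share an IRQ already bound to CPU 0 while taking a fresh one for CPU 1.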

The following command sets the affinity of mlx5 PF/VF/SF.
devlink command structure:
$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity value \
          [cpu_bitmask] cmode driverinit
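
As a scripting aid, the sketch below builds the argument lists for the
command above and for the devlink reload that follows it when the
function is already probed. It assumes this series is applied (the
cpu_affinity parameter does not exist otherwise). The helper name is
hypothetical, and because the textual format of [cpu_bitmask] is not
spelled out here, the value is passed through unchanged.

```python
def cpu_affinity_cmds(dev, mask_value):
    """Return the [set, reload] argv lists for one devlink device.

    dev:        devlink handle, e.g. "auxiliary/mlx5_core.sf.4"
    mask_value: the [cpu_bitmask] argument, passed through as given
    """
    set_cmd = [
        "devlink", "dev", "param", "set", dev,
        "name", "cpu_affinity",
        "value", str(mask_value),
        "cmode", "driverinit",
    ]
    # driverinit parameters take effect on the next reload of the device.
    reload_cmd = ["devlink", "dev", "reload", dev]
    return [set_cmd, reload_cmd]
```

Each list can be handed to subprocess.run(cmd, check=True) on a system
where the parameter is available.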

Applications that want to restrict an SF or VF to a CPU set, for
instance container workloads, can use this API to achieve that
easily.
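
For the container case, one way to pick the CPU set is to mirror the
CPUs the workload is already allowed to run on. The Python sketch below
derives that bitmask from the scheduler affinity of a process. It is
Linux-only, the function name is hypothetical, and how the resulting
integer is formatted for devlink is left open.

```python
import os


def affinity_mask_of_process(pid=0):
    """Return a bitmask of the CPUs the process may run on.

    pid=0 means the calling process. Bit N is set when CPU N is allowed.
    """
    mask = 0
    for cpu in os.sched_getaffinity(pid):
        mask |= 1 << cpu
    return mask
```

Matching the device's CPU set to the workload's CPU set keeps the
application and its IRQs on the same CPUs, which is the case the latency
table shows to be fastest.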

Shay Drory (4):
  net netlink: Introduce NLA_BITFIELD type
  devlink: Add support for NLA_BITFIELD for devlink param
  devlink: Add new cpu_affinity generic device param
  net/mlx5: Support cpu_affinity devlink dev param

 .../networking/devlink/devlink-params.rst     |   5 +
 Documentation/networking/devlink/mlx5.rst     |   3 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 123 +++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  39 +++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../ethernet/mellanox/mlx5/core/mlx5_irq.h    |   5 +-
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  85 +++++++++-
 include/net/devlink.h                         |  22 +++
 include/net/netlink.h                         |  30 ++++
 include/uapi/linux/netlink.h                  |  10 ++
 lib/nlattr.c                                  | 145 +++++++++++++++++-
 net/core/devlink.c                            | 143 +++++++++++++++--
 net/netlink/policy.c                          |   4 +
 14 files changed, 594 insertions(+), 24 deletions(-)