
[net-next,00/19] Mellanox, mlx5 sub function support

Message ID 20191107160448.20962-1-parav@mellanox.com (mailing list archive)
Series Mellanox, mlx5 sub function support

Message

Parav Pandit Nov. 7, 2019, 4:04 p.m. UTC
Hi Dave, Jiri, Alex,

This series adds the support for mlx5 sub function devices using
mediated device with eswitch switchdev mode.

This series is currently on top of [1]. Once [1] is merged to the
net-next tree, this series should be resent through Saeed's usual
net-next tree or to netdev net-next.

The mdev alias support patches have already been reviewed on the vfio/kvm
mailing list in the past. Since the mlx5_core driver is the first user of
the mdev alias API, those patches are included in this series to be merged
via the net-next tree.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=net-next-mlx5

Also, once Jason Wang's series [9] is merged, the mlx5_core driver will be
enhanced to use class-id based matching for mdev devices.

Abstract:
---------

The Mellanox ConnectX device supports dynamic creation of sub functions,
which are equivalent to a PCI function handled by the mlx5_core and mlx5_ib
drivers. A sub function has a few differences from a PCI function; they are
described in the overview section below.

The Mellanox sub function capability allows users to create several hundred
networking and/or RDMA devices without depending on PCI SR-IOV support.

Motivation:
-----------
Users want to use multiple netdevices and RDMA devices in a system
without enabling SR-IOV, possibly more devices than the Mellanox
ConnectX device supports using SR-IOV.

Such devices will have most, if not all, of the acceleration and offload
capabilities of mlx5, while still being lightweight and not demanding
dedicated physical resources such as a PCI function or MSI-X vectors.

Such a netdevice and/or RDMA device can be provisioned for use in a
bare-metal server or provisioned to a container.

In the future, such a sub function device can be mapped to a VM once the
ConnectX device supports it. In that case, it should be able to reuse the
kernel's existing mediated device (mdev) framework [2].

Regardless of how a sub function is used, it is desirable to manage its
lifecycle in a unified way, such as using mdev [2].

Users want the same level of control, visibility and offload support as
current SR-IOV VFs provide, using eswitch switchdev mode for Ethernet
devices.

Overview:
---------
Mellanox ConnectX sub functions are exposed to the user as mediated
devices (mdev) [2], as discussed in the RFC [3] and further during
netdevconf 0x13 [4].

The mlx5 mediated device (mdev) enables users to create multiple netdevices
and/or RDMA devices from a single PCI function.

Each mdev maps to an mlx5 sub function.
An mlx5 sub function is similar to a PCI VF; however, it doesn't have its own
PCI function or MSI-X vectors.

mlx5 mdevs share common PCI resources such as the PCI BAR region and
MSI-X interrupts.

Each mdev has its own window in the PCI BAR region, which is
accessible only to that mdev and applications using it.

Each mlx5 sub function has its own resource namespace for RDMA resources.

mdevs are supported when the eswitch mode of the devlink instance
is switchdev, as described in the devlink documentation [5].

An mdev is identified using a UUID as defined by RFC 4122.

Each created mdev has a unique 12-character alias. This alias is used to
derive the phys_port_name attribute of the corresponding representor
netdevice. This establishes a clear link between the mdev device, the
eswitch port and the eswitch representor netdevice, as illustrated below.
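
For example (illustrative only; the exact phys_port_name format and port
index are defined by the devlink mdev port flavour patches in this series,
and <rep_netdev> is a placeholder), the representor of an mdev may show up as:
    $ devlink port show
    pci/0000:06:00.0/0: type eth netdev ens2f0_p0 flavour physical
    pci/0000:06:00.0/1: type eth netdev <rep_netdev> flavour mdev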

systemd udev [6] will be enhanced after this work to provide predictable
netdevice names based on the unique and predictable mdev alias of the
mdev device.

For example, an Ethernet netdevice of an mdev device having alias
'aliasfoo123' will be persistently named enmaliasfoo123, where:
en = Ethernet
m = mediated device bus
alias = unique mdev alias

Design decisions:
-----------------
1. The mdev device (and bus) is used instead of any new subdevice bus,
following the discussion concluded at [7]. This simplifies the design by
eliminating the need for a new bus and avoiding any vendor-specific bus.

2. The mdev device is also chosen to avoid creating multiple tools for
creating the netdevice, RDMA device, devlink port, its representor netdevice
and representor RDMA port.

3. mdev device support in eswitch switchdev mode gives rich isolation
and reuses the existing offload infrastructure of iproute2/tc.

4. The mdev device also benefits from the existing devlink health reporting
infrastructure, uniformly for PF, VF and mdev.

5. The persistent naming scheme for an mdev's netdevice was discussed and
concluded at [8], using the system-wide unique, predictable mdev alias.

6. Derivation of a unique eswitch port phys_port_name from the mdev alias
was concluded at [8].

User commands examples:
-----------------------

- Set eswitch mode as switchdev mode
    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

- Create an mdev
    Generate a UUID
    $ UUID=$(uuidgen)
    Create the mdev using the UUID
    $ echo $UUID > /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/create

- Unbind an mdev from the vfio_mdev driver
    $ echo $UUID > /sys/bus/mdev/drivers/vfio_mdev/unbind

- Bind an mdev to the mlx5_core driver
    $ echo $UUID > /sys/bus/mdev/drivers/mlx5_core/bind

- View netdevice and (optionally) RDMA device in sysfs tree
    $ ls -l /sys/bus/mdev/devices/$UUID/net/
    $ ls -l /sys/bus/mdev/devices/$UUID/infiniband/

- View netdevice and (optionally) RDMA device using iproute2 tools
    $ ip link show
    $ rdma dev show

- Query maximum number of mdevs that can be created
    $ cat /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/max_mdevs

- Query remaining number of mdevs that can be created
    $ cat /sys/class/net/ens2f0_p0/device/mdev_supported_types/mlx5_core-local/available_instances

- Query an alias of the mdev
    $ cat /sys/bus/mdev/devices/$UUID/alias
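
- Remove an mdev when it is no longer needed
    This uses the generic mdev sysfs remove interface documented in [2];
    it is shown here for completeness and is not specific to this series.
    $ echo 1 > /sys/bus/mdev/devices/$UUID/remove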

Security model:
---------------
This section covers the security aspects of mlx5 mediated devices at the
host level and at the network level.

Host side:
- At present an mlx5 mdev is meant to be used only in the host.
It is not meant to be mapped to a VM or accessed by a userspace application
using the VFIO framework.
Hence, the mlx5_core driver doesn't implement any of the VFIO device-specific
callback routines, and an mlx5 mediated device cannot be mapped to a VM or to
a userspace application via the VFIO framework.

- At present an mlx5 mdev can be accessed by an application through
its netdevice and/or RDMA device.

- An mlx5 mdev does not share a PCI BAR with its parent PCI function.

- All mlx5 mdevs of a given parent device share a single PCI BAR.
However, each mdev device has a small dedicated window of the PCI BAR.
Hence, one mdev device cannot access the PCI BAR window or any of the
resources of another mdev device.

- Each mlx5 mdev has its own dedicated event queue through which interrupt
notifications are delivered. Hence, one mlx5 mdev cannot enable/disable the
interrupts of another mlx5 mdev, and an mlx5 mdev cannot enable/disable the
interrupts of the parent PCI function.

Network side:
- By default the netdevice and the RDMA device of an mlx5 mdev cannot send or
receive any packets over the network or to any other mlx5 mdev.

- An mlx5 mdev follows the devlink eswitch and vport model of the PCI SR-IOV
PF and VFs. All traffic is dropped by default in this eswitch model.

- Each mlx5 mdev has one eswitch vport representor netdevice and RDMA port.
The user must do the necessary configuration through this representor to
enable the mlx5 mdev to send and/or receive packets; see the sketch below.
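
A minimal configuration sketch (illustrative only; <rep_netdev> and
<uplink_netdev> are placeholder netdevice names, and the actual rules depend
on the desired forwarding policy):

    Bring up the mdev's representor netdevice
    $ ip link set <rep_netdev> up

    Redirect traffic between the uplink and the mdev representor
    $ tc qdisc add dev <rep_netdev> ingress
    $ tc filter add dev <rep_netdev> ingress flower \
          action mirred egress redirect dev <uplink_netdev>
    $ tc qdisc add dev <uplink_netdev> ingress
    $ tc filter add dev <uplink_netdev> ingress flower \
          action mirred egress redirect dev <rep_netdev>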

Patchset summary:
-----------------
Patches 1 to 6 prepare the mlx5 core driver to create mdevs.

Patch-1 Moves the mlx5 devlink port close to the eswitch port
Patch-2 Implements sub function vport representor handling
Patch-3,4 Introduce mlx5 sub function lifecycle routines
Patch-5 Extends the mlx5 eswitch to enable/disable sub function vports
Patch-6 Registers the mlx5 driver with the mdev subsystem for mdev lifecycle

Patches 7 to 10 enhance the mdev subsystem for unique alias generation

Patch-7 Introduces a sha1-based mdev alias
Patch-8 Ensures the mdev alias is unique among all mdev devices
Patch-9 Exposes the mdev alias in a sysfs file for systemd/udev usage
Patch-10 Introduces an mdev alias API to be used by vendor drivers
Patch-11 Improves mdev to avoid nested locking with the mlx5_core driver
and the devlink subsystem

Patch-12 Introduces a new devlink mdev port flavour
Patch-13 Registers a devlink eswitch port for a sub function

Patches 14 to 17 implement the mdev driver by reusing most of the mlx5_core
driver
Patch-14 Implements IRQ sharing between the sub function and the parent PCI device
Patch-15 Implements load/unload routines for the sub function
Patch-16 Implements DMA ops for the mediated device
Patch-17 Implements the mdev driver that binds to the mdev device

Patch-18 Adds documentation for the mdev overview, example and security model
Patch-19 Extends the sample mtty driver for mdev alias generation

[1] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=net-next-mlx5
[2] https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
[3] https://lkml.org/lkml/2019/3/8/819
[4] https://netdevconf.org/0x13/session.html?workshop-hardware-offload
[5] http://man7.org/linux/man-pages/man8/devlink-dev.8.html.
[6] https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-net_id.c
[7] https://lkml.org/lkml/2019/3/7/696
[8] https://lkml.org/lkml/2019/8/23/146
[9] https://patchwork.ozlabs.org/patch/1190425


Parav Pandit (14):
  net/mlx5: E-switch, Move devlink port close to eswitch port
  net/mlx5: Introduce SF life cycle APIs to allocate/free
  vfio/mdev: Introduce sha1 based mdev alias
  vfio/mdev: Make mdev alias unique among all mdevs
  vfio/mdev: Expose mdev alias in sysfs tree
  vfio/mdev: Introduce an API mdev_alias
  vfio/mdev: Improvise mdev life cycle and parent removal scheme
  devlink: Introduce mdev port flavour
  net/mlx5: Register SF devlink port
  net/mlx5: Add load/unload routines for SF driver binding
  net/mlx5: Implement dma ops and params for mediated device
  net/mlx5: Add mdev driver to bind to mdev devices
  Documentation: net: mlx5: Add mdev usage documentation
  mtty: Optionally support mtty alias

Vu Pham (4):
  net/mlx5: E-Switch, Add SF vport, vport-rep support
  net/mlx5: Introduce SF table framework
  net/mlx5: E-Switch, Enable/disable SF's vport during SF life cycle
  net/mlx5: Add support for mediated devices in switchdev mode

Yuval Avnery (1):
  net/mlx5: Share irqs between SFs and parent PCI device

 .../driver-api/vfio-mediated-device.rst       |   9 +
 .../device_drivers/mellanox/mlx5.rst          | 122 +++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |  11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   4 +
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   6 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |  17 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  69 +++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   8 +
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  94 +---
 .../net/ethernet/mellanox/mlx5/core/en_rep.h  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |   6 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  24 +-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  85 ++++
 .../mellanox/mlx5/core/eswitch_offloads.c     | 147 +++++-
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |   3 +-
 .../net/ethernet/mellanox/mlx5/core/main.c    |  80 +++-
 .../ethernet/mellanox/mlx5/core/meddev/mdev.c | 212 +++++++++
 .../mellanox/mlx5/core/meddev/mdev_driver.c   |  50 +++
 .../ethernet/mellanox/mlx5/core/meddev/sf.c   | 425 ++++++++++++++++++
 .../ethernet/mellanox/mlx5/core/meddev/sf.h   |  73 +++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  54 +++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  12 +
 .../net/ethernet/mellanox/mlx5/core/vport.c   |   4 +-
 drivers/vfio/mdev/mdev_core.c                 | 198 +++++++-
 drivers/vfio/mdev/mdev_private.h              |   8 +-
 drivers/vfio/mdev/mdev_sysfs.c                |  26 +-
 include/linux/mdev.h                          |   5 +
 include/linux/mlx5/driver.h                   |   8 +-
 include/net/devlink.h                         |   9 +
 include/uapi/linux/devlink.h                  |   5 +
 net/core/devlink.c                            |  32 ++
 samples/vfio-mdev/mtty.c                      |  13 +
 32 files changed, 1678 insertions(+), 143 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/mdev.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/mdev_driver.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/sf.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/meddev/sf.h

Comments

Leon Romanovsky Nov. 7, 2019, 5:03 p.m. UTC | #1
On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> Hi Dave, Jiri, Alex,
>

<...>

> - View netdevice and (optionally) RDMA device using iproute2 tools
>     $ ip link show
>     $ rdma dev show

You perfectly explained how ETH devices will be named, but what about RDMA?
How will they be named? I feel that rdma-core needs to be extended to support
such mediated devices.

Thanks
Parav Pandit Nov. 7, 2019, 8:10 p.m. UTC | #2
Hi Leon,

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Thursday, November 7, 2019 11:04 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; davem@davemloft.net;
> kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; kwankhede@nvidia.com; cohuck@redhat.com; Jiri
> Pirko <jiri@mellanox.com>; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> > Hi Dave, Jiri, Alex,
> >
> 
> <...>
> 
> > - View netdevice and (optionally) RDMA device using iproute2 tools
> >     $ ip link show
> >     $ rdma dev show
> 
> You perfectly explained how ETH devices will be named, but what about
> RDMA?
> How will be named? I feel that rdma-core needs to be extended to support such
> mediated devices.
> 
rdma devices are named by default using mlx_X.
After your persistent naming patches, I thought we had a GUID-based naming scheme which doesn't care about the underlying bus.
So mdevs would be able to use the current GUID-based naming scheme we already have.

Additionally, if the user prefers the mdev alias, we can extend systemd/udev to use mdev alias based names (like the PCI BDF).
Such as:
rocem<alias1>
ibm<alias2>
The format is:
<link_layer><m><alias>
m -> stands for mdev device (similar to 'p' for PCI)
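
For example (purely illustrative; the aliases and device attributes below are
hypothetical), 'rdma dev show' could then list:
    $ rdma dev show
    0: rocem1a2b3c4d5e6f: node_type ca ...
    1: ibm9f8e7d6c5b4a: node_type ca ...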
Jakub Kicinski Nov. 7, 2019, 8:32 p.m. UTC | #3
On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> Mellanox sub function capability allows users to create several hundreds
> of networking and/or rdma devices without depending on PCI SR-IOV support.

You call the new port type "sub function" but the devlink port flavour
is mdev.

As I'm sure you remember you nacked my patches exposing NFP's PCI 
sub functions which are just regions of the BAR without any mdev
capability. Am I in the clear to repost those now? Jiri?

> Overview:
> ---------
> Mellanox ConnectX sub functions are exposed to user as a mediated
> device (mdev) [2] as discussed in RFC [3] and further during
> netdevconf0x13 at [4].
> 
> mlx5 mediated device (mdev) enables users to create multiple netdevices
> and/or RDMA devices from single PCI function.
> 
> Each mdev maps to a mlx5 sub function.
> mlx5 sub function is similar to PCI VF. However it doesn't have its own
> PCI function and MSI-X vectors.
> 
> mlx5 mdevs share common PCI resources such as PCI BAR region,
> MSI-X interrupts.
> 
> Each mdev has its own window in the PCI BAR region, which is
> accessible only to that mdev and applications using it.
> 
> Each mlx5 sub function has its own resource namespace for RDMA resources.
> 
> mdevs are supported when eswitch mode of the devlink instance
> is in switchdev mode described in devlink documentation [5].

So presumably the mdevs don't spawn their own devlink instance today,
but once mapped via VIRTIO to a VM they will create one?

It could be useful to specify.

> Network side:
> - By default the netdevice and the rdma device of mlx5 mdev cannot send or
> receive any packets over the network or to any other mlx5 mdev.

Does this mean the frames don't fall back to the repr by default?
Parav Pandit Nov. 7, 2019, 8:52 p.m. UTC | #4
Hi Jakub,

> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Thursday, November 7, 2019 2:33 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; davem@davemloft.net;
> kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; kwankhede@nvidia.com; leon@kernel.org;
> cohuck@redhat.com; Jiri Pirko <jiri@mellanox.com>; linux-
> rdma@vger.kernel.org; Or Gerlitz <gerlitz.or@gmail.com>
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > Mellanox sub function capability allows users to create several
> > hundreds of networking and/or rdma devices without depending on PCI SR-
> IOV support.
> 
> You call the new port type "sub function" but the devlink port flavour is mdev.
> 
The sub function is the internal driver structure. The abstract entity at the user and stack level is the mdev.
Hence the port flavour is mdev.
 
> As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> functions which are just regions of the BAR without any mdev capability. Am I
> in the clear to repost those now? Jiri?
> 
For sure I didn't nack it. :-)
What I remember discussing offline/on the mailing list is
(a) exposing mdev/sub functions as devlink sub ports is not such a good abstraction
(b) user creation/deletion of eswitch sub ports would be hard to fit in the whole usage model

> > Overview:
> > ---------
> > Mellanox ConnectX sub functions are exposed to user as a mediated
> > device (mdev) [2] as discussed in RFC [3] and further during
> > netdevconf0x13 at [4].
> >
> > mlx5 mediated device (mdev) enables users to create multiple
> > netdevices and/or RDMA devices from single PCI function.
> >
> > Each mdev maps to a mlx5 sub function.
> > mlx5 sub function is similar to PCI VF. However it doesn't have its
> > own PCI function and MSI-X vectors.
> >
> > mlx5 mdevs share common PCI resources such as PCI BAR region, MSI-X
> > interrupts.
> >
> > Each mdev has its own window in the PCI BAR region, which is
> > accessible only to that mdev and applications using it.
> >
> > Each mlx5 sub function has its own resource namespace for RDMA resources.
> >
> > mdevs are supported when eswitch mode of the devlink instance is in
> > switchdev mode described in devlink documentation [5].
> 
> So presumably the mdevs don't spawn their own devlink instance today, but
> once mapped via VIRTIO to a VM they will create one?
> 
An mdev doesn't spawn a devlink instance today when the mdev is created by the user, just like PCI.
When the PCI bus driver enumerates and creates a PCI device, there isn't a devlink instance for it.

But the mdev's devlink instance is created when the mlx5_core driver binds to the mdev device
(again similar to PCI: when the mlx5_core driver binds to the PCI device, its devlink instance is created).

I should have put the example in patch-15, which creates/deletes the devlink instance of the mdev.
I will revise the commit log of patch-15 to include that.
Good point.
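
For example (hypothetical output; the exact devlink handle used for an mdev is defined in patch-15), after mlx5_core binds to the mdev one would expect something like:
    $ devlink dev show
    pci/0000:06:00.0
    mdev/$UUID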

> It could be useful to specify.
> 
Yes, it's certainly useful. I missed putting the example in the commit log of patch-15.

> > Network side:
> > - By default the netdevice and the rdma device of mlx5 mdev cannot
> > send or receive any packets over the network or to any other mlx5 mdev.
> 
> Does this mean the frames don't fall back to the repr by default?
Probably I wasn't clear.
What I wanted to say is that frames transmitted by the mdev's netdevice and rdma devices don't go to the network.
These frames go to the representor device.
The user must configure the representor to send/receive/steer traffic to the mdev.
David Miller Nov. 7, 2019, 11:57 p.m. UTC | #5
From: Parav Pandit <parav@mellanox.com>
Date: Thu,  7 Nov 2019 10:04:48 -0600

> This series adds the support for mlx5 sub function devices using
> mediated device with eswitch switchdev mode.

I think at a minimum there needs to be deeper explanations in the commit log
messages and thus I expect a respin of this series.

Thanks.
Jakub Kicinski Nov. 8, 2019, 1:16 a.m. UTC | #6
On Thu, 7 Nov 2019 20:52:29 +0000, Parav Pandit wrote:
> > On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:  
> > > Mellanox sub function capability allows users to create several
> > > hundreds of networking and/or rdma devices without depending on PCI SR-  
> > IOV support.
> > 
> > You call the new port type "sub function" but the devlink port flavour is mdev.
> >   
> Sub function is the internal driver structure. The abstract entity at user and stack level is mdev.
> Hence the port flavour is mdev.

FWIW I agree mdev as flavour seems like the right choice.

> > As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> > functions which are just regions of the BAR without any mdev capability. Am I
> > in the clear to repost those now? Jiri?
> >   
> For sure I didn't nack it. :-)

Well, maybe the word "nack" wasn't exactly used :)

> What I remember discussing offline/mailing list is 
> (a) exposing mdev/sub fuctions as devlink sub ports is not so good abstraction
> (b) user creating/deleting eswitch sub ports would be hard to fit in the whole usage model

Okay, so I can repost the "basic" sub functions?

> > > Overview:
> > > ---------
> > > Mellanox ConnectX sub functions are exposed to user as a mediated
> > > device (mdev) [2] as discussed in RFC [3] and further during
> > > netdevconf0x13 at [4].
> > >
> > > mlx5 mediated device (mdev) enables users to create multiple
> > > netdevices and/or RDMA devices from single PCI function.
> > >
> > > Each mdev maps to a mlx5 sub function.
> > > mlx5 sub function is similar to PCI VF. However it doesn't have its
> > > own PCI function and MSI-X vectors.
> > >
> > > mlx5 mdevs share common PCI resources such as PCI BAR region, MSI-X
> > > interrupts.
> > >
> > > Each mdev has its own window in the PCI BAR region, which is
> > > accessible only to that mdev and applications using it.
> > >
> > > Each mlx5 sub function has its own resource namespace for RDMA resources.
> > >
> > > mdevs are supported when eswitch mode of the devlink instance is in
> > > switchdev mode described in devlink documentation [5].  
> > 
> > So presumably the mdevs don't spawn their own devlink instance today, but
> > once mapped via VIRTIO to a VM they will create one?
> >   
> mdev doesn't spawn the devlink instance today when mdev is created by user, like PCI.
> When PCI bus driver enumerates and creates PCI device, there isn't a devlink instance for it.
> 
> But, mdev's devlink instance is created when mlx5_core driver binds to the mdev device.
> (again similar to PCI, when mlx5_core driver binds to PCI, its devlink instance is created ).
> 
> I should have put the example in patch-15 which creates/deletes devlink instance of mdev.
> I will revise the commit log of patch-15 to include that.
> Good point.

Thanks.

> > It could be useful to specify.
> >   
> Yes, its certainly useful. I missed to put the example in commit log of patch-15.
> 
> > > Network side:
> > > - By default the netdevice and the rdma device of mlx5 mdev cannot
> > > send or receive any packets over the network or to any other mlx5 mdev.  
> > 
> > Does this mean the frames don't fall back to the repr by default?  
> Probably I wasn't clear.
> What I wanted to say is, that frames transmitted by mdev's netdevice and rdma devices don't go to network.
> These frames goes to representor device.
> User must configure representor to send/receive/steer traffic to mdev.


Parav Pandit Nov. 8, 2019, 1:49 a.m. UTC | #7
> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Sent: Thursday, November 7, 2019 7:16 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; davem@davemloft.net;
> kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; kwankhede@nvidia.com; leon@kernel.org;
> cohuck@redhat.com; Jiri Pirko <jiri@mellanox.com>; linux-
> rdma@vger.kernel.org; Or Gerlitz <gerlitz.or@gmail.com>
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> On Thu, 7 Nov 2019 20:52:29 +0000, Parav Pandit wrote:
> > > On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > > > Mellanox sub function capability allows users to create several
> > > > hundreds of networking and/or rdma devices without depending on
> > > > PCI SR-
> > > IOV support.
> > >
> > > You call the new port type "sub function" but the devlink port flavour is
> mdev.
> > >
> > Sub function is the internal driver structure. The abstract entity at user and
> stack level is mdev.
> > Hence the port flavour is mdev.
> 
> FWIW I agree mdev as flavour seems like the right choice.
>
Ok.
 
> > > As I'm sure you remember you nacked my patches exposing NFP's PCI
> > > sub functions which are just regions of the BAR without any mdev
> > > capability. Am I in the clear to repost those now? Jiri?
> > >
> > For sure I didn't nack it. :-)
> 
> Well, maybe the word "nack" wasn't exactly used :)
> 
> > What I remember discussing offline/mailing list is
> > (a) exposing mdev/sub fuctions as devlink sub ports is not so good
> > abstraction
> > (b) user creating/deleting eswitch sub ports would be hard to fit in
> > the whole usage model
> 
> Okay, so I can repost the "basic" sub functions?
> 
I think so. Would you like to post on top of this series, as the port flavour etc. would come by default?
Also there is a vfio/mdev dependency in this series...

> > > > Overview:
> > > > ---------
> > > > Mellanox ConnectX sub functions are exposed to user as a mediated
> > > > device (mdev) [2] as discussed in RFC [3] and further during
> > > > netdevconf0x13 at [4].
> > > >
> > > > mlx5 mediated device (mdev) enables users to create multiple
> > > > netdevices and/or RDMA devices from single PCI function.
> > > >
> > > > Each mdev maps to a mlx5 sub function.
> > > > mlx5 sub function is similar to PCI VF. However it doesn't have
> > > > its own PCI function and MSI-X vectors.
> > > >
> > > > mlx5 mdevs share common PCI resources such as PCI BAR region,
> > > > MSI-X interrupts.
> > > >
> > > > Each mdev has its own window in the PCI BAR region, which is
> > > > accessible only to that mdev and applications using it.
> > > >
> > > > Each mlx5 sub function has its own resource namespace for RDMA
> resources.
> > > >
> > > > mdevs are supported when eswitch mode of the devlink instance is
> > > > in switchdev mode described in devlink documentation [5].
> > >
> > > So presumably the mdevs don't spawn their own devlink instance
> > > today, but once mapped via VIRTIO to a VM they will create one?
> > >
> > mdev doesn't spawn the devlink instance today when mdev is created by
> user, like PCI.
> > When PCI bus driver enumerates and creates PCI device, there isn't a
> devlink instance for it.
> >
> > But, mdev's devlink instance is created when mlx5_core driver binds to the
> mdev device.
> > (again similar to PCI, when mlx5_core driver binds to PCI, its devlink
> instance is created ).
> >
> > I should have put the example in patch-15 which creates/deletes devlink
> instance of mdev.
> > I will revise the commit log of patch-15 to include that.
> > Good point.
> 
> Thanks.
> 
> > > It could be useful to specify.
> > >
> > Yes, its certainly useful. I missed to put the example in commit log of
> patch-15.
> >
> > > > Network side:
> > > > - By default the netdevice and the rdma device of mlx5 mdev cannot
> > > > send or receive any packets over the network or to any other mlx5
> mdev.
> > >
> > > Does this mean the frames don't fall back to the repr by default?
> > Probably I wasn't clear.
> > What I wanted to say is, that frames transmitted by mdev's netdevice and
> rdma devices don't go to network.
> > These frames goes to representor device.
> > User must configure representor to send/receive/steer traffic to mdev.
> 
> 
Jakub Kicinski Nov. 8, 2019, 2:12 a.m. UTC | #8
On Fri, 8 Nov 2019 01:49:09 +0000, Parav Pandit wrote:
> > > What I remember discussing offline/mailing list is
> > > (a) exposing mdev/sub fuctions as devlink sub ports is not so good
> > > abstraction
> > > (b) user creating/deleting eswitch sub ports would be hard to fit in
> > > the whole usage model  
> > 
> > Okay, so I can repost the "basic" sub functions?
> >   
> I think so. Would you like post on top of this series as port flavour
> etc would come by default? Also there is vfio/mdev dependency exist
> in this series...

I don't mind the ordering.
Leon Romanovsky Nov. 8, 2019, 6:20 a.m. UTC | #9
On Thu, Nov 07, 2019 at 08:10:45PM +0000, Parav Pandit wrote:
> Hi Leon,
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Thursday, November 7, 2019 11:04 AM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: alex.williamson@redhat.com; davem@davemloft.net;
> > kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> > <saeedm@mellanox.com>; kwankhede@nvidia.com; cohuck@redhat.com; Jiri
> > Pirko <jiri@mellanox.com>; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> >
> > On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> > > Hi Dave, Jiri, Alex,
> > >
> >
> > <...>
> >
> > > - View netdevice and (optionally) RDMA device using iproute2 tools
> > >     $ ip link show
> > >     $ rdma dev show
> >
> > You perfectly explained how ETH devices will be named, but what about
> > RDMA?
> > How will be named? I feel that rdma-core needs to be extended to support such
> > mediated devices.
> >
> rdma devices are named by default using mlx_X.
> After your persistent naming patches, I thought we have GUID based naming scheme which doesn't care about its underlying bus.

No, that is not how it is done. RDMA persistent naming is modeled exactly
like ETH naming; it means that we do care about the bus and we don't use the GUID
unless the user explicitly asked, exactly like MAC-based names in the ETH world.

> So mdevs will be able to use current GUID based naming scheme we already have.

Unfortunately, no.

>
> Additionally, if user prefers, mdev alias, we can extend systemd/udev to use mdev alias based names (like PCI bdf).

It is not "Additionally", but "must".

> Such as,
> rocem<alias1>
> ibm<alias2>
> Format is:
> <link_layer><m><alias>
> m -> stands for mdev device (similar to 'p' for PCI)
Jiri Pirko Nov. 8, 2019, 12:12 p.m. UTC | #10
Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
>On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
>> Mellanox sub function capability allows users to create several hundreds
>> of networking and/or rdma devices without depending on PCI SR-IOV support.
>
>You call the new port type "sub function" but the devlink port flavour
>is mdev.
>
>As I'm sure you remember you nacked my patches exposing NFP's PCI 
>sub functions which are just regions of the BAR without any mdev
>capability. Am I in the clear to repost those now? Jiri?

Well, the question is whether it makes sense to have SFs without having them as
mdevs. I mean, we discussed the modelling thoroughly and eventually we
realized that in order to model this correctly, we need SFs on "a bus".
Originally we were thinking about a custom bus, but mdev is already there
to handle this.

Our SFs are also just regions of the BAR, same thing as you have.

Can't you do the same for nfp SFs?
Then the "mdev" flavour is enough for all.


>
>> Overview:
>> ---------
>> Mellanox ConnectX sub functions are exposed to user as a mediated
>> device (mdev) [2] as discussed in RFC [3] and further during
>> netdevconf0x13 at [4].
>> 
>> mlx5 mediated device (mdev) enables users to create multiple netdevices
>> and/or RDMA devices from single PCI function.
>> 
>> Each mdev maps to a mlx5 sub function.
>> mlx5 sub function is similar to PCI VF. However it doesn't have its own
>> PCI function and MSI-X vectors.
>> 
>> mlx5 mdevs share common PCI resources such as PCI BAR region,
>> MSI-X interrupts.
>> 
>> Each mdev has its own window in the PCI BAR region, which is
>> accessible only to that mdev and applications using it.
>> 
>> Each mlx5 sub function has its own resource namespace for RDMA resources.
>> 
>> mdevs are supported when eswitch mode of the devlink instance
>> is in switchdev mode described in devlink documentation [5].
>
>So presumably the mdevs don't spawn their own devlink instance today,
>but once mapped via VIRTIO to a VM they will create one?

I don't think it is needed for anything. Maybe one day if there is a
need to create devlink instance for VF or SF, we can add it. But
currently, I don't see the need.


>
>It could be useful to specify.
>
>> Network side:
>> - By default the netdevice and the rdma device of mlx5 mdev cannot send or
>> receive any packets over the network or to any other mlx5 mdev.
>
>Does this mean the frames don't fall back to the repr by default?

That would be the sane default. If I up the representor, I should see
packets coming in from SF/VF and I should be able to send packets back.
Jason Gunthorpe Nov. 8, 2019, 2:40 p.m. UTC | #11
On Fri, Nov 08, 2019 at 01:12:33PM +0100, Jiri Pirko wrote:
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> >> Mellanox sub function capability allows users to create several hundreds
> >> of networking and/or rdma devices without depending on PCI SR-IOV support.
> >
> >You call the new port type "sub function" but the devlink port flavour
> >is mdev.
> >
> >As I'm sure you remember you nacked my patches exposing NFP's PCI 
> >sub functions which are just regions of the BAR without any mdev
> >capability. Am I in the clear to repost those now? Jiri?
> 
> Well question is, if it makes sense to have SFs without having them as
> mdev? I mean, we discussed the modelling thoroughtly and eventually we
> realized that in order to model this correctly, we need SFs on "a bus".
> Originally we were thinking about custom bus, but mdev is already there
> to handle this.

Did anyone consult Greg on this? 

The new intel driver has been having a very similar discussion about
how to model their 'multi function device' ie to bind RDMA and other
drivers to a shared PCI function, and I think that discussion settled
on adding a new bus?

Really these things are all very similar, it would be nice to have a
clear methodology on how to use the device core if a single PCI device
is split by software into multiple different functional units and
attached to different driver instances.

Currently there is a lot of hacking in this area. And a consistent
scheme might resolve the ugliness with the dma_ops wrappers.

We already have the 'mfd' stuff to support splitting platform devices,
maybe we need to create a 'pci-mfd' to support splitting PCI devices? 

I'm not really clear how mfd and mdev relate, I always thought mdev
was strongly linked to vfio.

At the very least if it is agreed mdev should be the vehicle here,
then it should also be able to solve the netdev/rdma hookup problem
too.

Jason
Parav Pandit Nov. 8, 2019, 3:01 p.m. UTC | #12
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Friday, November 8, 2019 12:20 AM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; davem@davemloft.net;
> kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>; kwankhede@nvidia.com; cohuck@redhat.com;
> Jiri Pirko <jiri@mellanox.com>; linux-rdma@vger.kernel.org
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> On Thu, Nov 07, 2019 at 08:10:45PM +0000, Parav Pandit wrote:
> > Hi Leon,
> >
> > > -----Original Message-----
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Thursday, November 7, 2019 11:04 AM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: alex.williamson@redhat.com; davem@davemloft.net;
> > > kvm@vger.kernel.org; netdev@vger.kernel.org; Saeed Mahameed
> > > <saeedm@mellanox.com>; kwankhede@nvidia.com;
> cohuck@redhat.com; Jiri
> > > Pirko <jiri@mellanox.com>; linux-rdma@vger.kernel.org
> > > Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function
> > > support
> > >
> > > On Thu, Nov 07, 2019 at 10:04:48AM -0600, Parav Pandit wrote:
> > > > Hi Dave, Jiri, Alex,
> > > >
> > >
> > > <...>
> > >
> > > > - View netdevice and (optionally) RDMA device using iproute2 tools
> > > >     $ ip link show
> > > >     $ rdma dev show
> > >
> > > You perfectly explained how ETH devices will be named, but what
> > > about RDMA?
> > > How will be named? I feel that rdma-core needs to be extended to
> > > support such mediated devices.
> > >
> > rdma devices are named by default using mlx_X.
> > After your persistent naming patches, I thought we have GUID based
> naming scheme which doesn't care about its underlying bus.
> 
> No, it is not how it is done. RDMA persistent naming is modeled exactly as
> ETH naming, it means that we do care about bus and we don't use GUID
> unless user explicitly asked, exactly as MAC based names in ETH world.
> 
> > So mdevs will be able to use current GUID based naming scheme we
> already have.
> 
> Unfortunately, no.
> 
> >
> > Additionally, if user prefers, mdev alias, we can extend systemd/udev to
> use mdev alias based names (like PCI bdf).
> 
> It is not "Additionally", but "must".
>
OK. So I will do that after this series, with a similar naming scheme.
I will update the cover letter to describe the RDMA naming too.
 
> > Such as,
> > rocem<alias1>
> > ibm<alias2>
> > Format is:
> > <link_layer><m><alias>
> > m -> stands for mdev device (similar to 'p' for PCI)
Parav Pandit Nov. 8, 2019, 3:40 p.m. UTC | #13
Hi Jason,

+ Greg

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, November 8, 2019 8:41 AM
> To: Jiri Pirko <jiri@resnulli.us>; Ertman@ziepe.ca; David M
> <david.m.ertman@intel.com>; gregkh@linuxfoundation.org
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>; Parav Pandit
> <parav@mellanox.com>; alex.williamson@redhat.com;
> davem@davemloft.net; kvm@vger.kernel.org; netdev@vger.kernel.org;
> Saeed Mahameed <saeedm@mellanox.com>; kwankhede@nvidia.com;
> leon@kernel.org; cohuck@redhat.com; Jiri Pirko <jiri@mellanox.com>; linux-
> rdma@vger.kernel.org; Or Gerlitz <gerlitz.or@gmail.com>
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> On Fri, Nov 08, 2019 at 01:12:33PM +0100, Jiri Pirko wrote:
> > Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com
> wrote:
> > >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > >> Mellanox sub function capability allows users to create several
> > >> hundreds of networking and/or rdma devices without depending on PCI
> SR-IOV support.
> > >
> > >You call the new port type "sub function" but the devlink port
> > >flavour is mdev.
> > >
> > >As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> > >functions which are just regions of the BAR without any mdev
> > >capability. Am I in the clear to repost those now? Jiri?
> >
> > Well question is, if it makes sense to have SFs without having them as
> > mdev? I mean, we discussed the modelling thoroughtly and eventually we
> > realized that in order to model this correctly, we need SFs on "a bus".
> > Originally we were thinking about custom bus, but mdev is already
> > there to handle this.
> 
> Did anyone consult Greg on this?
> 
Back when I started with the subdev bus in March, we consulted Greg and the mdev maintainers,
after which we settled on extending mdev for the wider use case; more below.
It has been extended for multiple users, for example for virtio too, in addition to vfio and mlx5_core.

> The new intel driver has been having a very similar discussion about how to
> model their 'multi function device' ie to bind RDMA and other drivers to a
> shared PCI function, and I think that discussion settled on adding a new bus?
> 
> Really these things are all very similar, it would be nice to have a clear
> methodology on how to use the device core if a single PCI device is split by
> software into multiple different functional units and attached to different
> driver instances.
> 
> Currently there is alot of hacking in this area.. And a consistent scheme
> might resolve the ugliness with the dma_ops wrappers.
> 
> We already have the 'mfd' stuff to support splitting platform devices, maybe
> we need to create a 'pci-mfd' to support splitting PCI devices?
> 
> I'm not really clear how mfd and mdev relate, I always thought mdev was
> strongly linked to vfio.
> 
Mdev at the beginning was strongly linked to vfio, but as I mentioned above it is now addressing more use cases.

I observed that discussion, but was not sure about extending mdev further.

One way for the Intel drivers to do this, after series [9], is:
the PCI driver registers an mdev with class id MDEV_CLASS_ID_I40_FOO, and the
RDMA driver calls mdev_register_driver(), matches on it and does the probe().

> At the very least if it is agreed mdev should be the vehicle here, then it
> should also be able to solve the netdev/rdma hookup problem too.
> 
> Jason

[9] https://patchwork.ozlabs.org/patch/1190425
Parav Pandit Nov. 8, 2019, 4:06 p.m. UTC | #14
> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Friday, November 8, 2019 6:13 AM
> To: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Parav Pandit <parav@mellanox.com>; alex.williamson@redhat.com;
> davem@davemloft.net; kvm@vger.kernel.org; netdev@vger.kernel.org;
> Saeed Mahameed <saeedm@mellanox.com>; kwankhede@nvidia.com;
> leon@kernel.org; cohuck@redhat.com; Jiri Pirko <jiri@mellanox.com>; linux-
> rdma@vger.kernel.org; Or Gerlitz <gerlitz.or@gmail.com>
> Subject: Re: [PATCH net-next 00/19] Mellanox, mlx5 sub function support
> 
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> >> Mellanox sub function capability allows users to create several
> >> hundreds of networking and/or rdma devices without depending on PCI
> SR-IOV support.
> >
> >You call the new port type "sub function" but the devlink port flavour
> >is mdev.
> >
> >As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> >functions which are just regions of the BAR without any mdev
> >capability. Am I in the clear to repost those now? Jiri?
> 
> Well question is, if it makes sense to have SFs without having them as mdev?
> I mean, we discussed the modelling thoroughtly and eventually we realized
> that in order to model this correctly, we need SFs on "a bus".
> Originally we were thinking about custom bus, but mdev is already there to
> handle this.
> 
> Our SFs are also just regions of the BAR, same thing as you have.
> 
> Can't you do the same for nfp SFs?
> Then the "mdev" flavour is enough for all.
> 
> 
> >
> >> Overview:
> >> ---------
> >> Mellanox ConnectX sub functions are exposed to user as a mediated
> >> device (mdev) [2] as discussed in RFC [3] and further during
> >> netdevconf0x13 at [4].
> >>
> >> mlx5 mediated device (mdev) enables users to create multiple
> >> netdevices and/or RDMA devices from single PCI function.
> >>
> >> Each mdev maps to a mlx5 sub function.
> >> mlx5 sub function is similar to PCI VF. However it doesn't have its
> >> own PCI function and MSI-X vectors.
> >>
> >> mlx5 mdevs share common PCI resources such as PCI BAR region, MSI-X
> >> interrupts.
> >>
> >> Each mdev has its own window in the PCI BAR region, which is
> >> accessible only to that mdev and applications using it.
> >>
> >> Each mlx5 sub function has its own resource namespace for RDMA
> resources.
> >>
> >> mdevs are supported when eswitch mode of the devlink instance is in
> >> switchdev mode described in devlink documentation [5].
> >
> >So presumably the mdevs don't spawn their own devlink instance today,
> >but once mapped via VIRTIO to a VM they will create one?
> 
> I don't think it is needed for anything. Maybe one day if there is a need to
> create devlink instance for VF or SF, we can add it. But currently, I don't see
> the need.
> 
> 
> >
> >It could be useful to specify.
> >
> >> Network side:
> >> - By default the netdevice and the rdma device of mlx5 mdev cannot
> >> send or receive any packets over the network or to any other mlx5 mdev.
> >
> >Does this mean the frames don't fall back to the repr by default?
> 
> That would be the sane default. If I up the representor, I should see packets
> coming in from SF/VF and I should be able to send packets back.

It is similar to a VF. I clarified this in a previous email.
Will update the cover letter to improve the description.
Jakub Kicinski Nov. 8, 2019, 7:06 p.m. UTC | #15
On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:  
> >> Mellanox sub function capability allows users to create several hundreds
> >> of networking and/or rdma devices without depending on PCI SR-IOV support.  
> >
> >You call the new port type "sub function" but the devlink port flavour
> >is mdev.
> >
> >As I'm sure you remember you nacked my patches exposing NFP's PCI 
> >sub functions which are just regions of the BAR without any mdev
> >capability. Am I in the clear to repost those now? Jiri?  
> 
> Well question is, if it makes sense to have SFs without having them as
> mdev? I mean, we discussed the modelling thoroughtly and eventually we
> realized that in order to model this correctly, we need SFs on "a bus".
> Originally we were thinking about custom bus, but mdev is already there
> to handle this.

But the "main/real" port is not a mdev in your case. NFP is like mlx4. 
It has one PCI PF for multiple ports.

> Our SFs are also just regions of the BAR, same thing as you have.
> 
> Can't you do the same for nfp SFs?
> Then the "mdev" flavour is enough for all.

Absolutely not. 

Why not make the main device of mlx5 an mdev, too, if that's acceptable.
There's (a) a long precedent for multiple ports on one PCI PF in
networking devices, (b) plenty of deployed software
which depends on the main devices hanging off the PCI PF directly.

The point of mdevs is being able to assign them to VMs or run DPDK on
them (map to user space).

For normal devices the existing sysfs hierarchy, where one device has
multiple children of a certain class, without a bus and a separate
driver, is perfectly fine. Do you think we should also slice all serial
chips into mdevs if they have multiple lines?

Exactly as I predicted, much confusion about what's being achieved here,
heh :)
Jakub Kicinski Nov. 8, 2019, 7:12 p.m. UTC | #16
On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > The new intel driver has been having a very similar discussion about how to
> > model their 'multi function device' ie to bind RDMA and other drivers to a
> > shared PCI function, and I think that discussion settled on adding a new bus?
> > 
> > Really these things are all very similar, it would be nice to have a clear
> > methodology on how to use the device core if a single PCI device is split by
> > software into multiple different functional units and attached to different
> > driver instances.
> > 
> > Currently there is alot of hacking in this area.. And a consistent scheme
> > might resolve the ugliness with the dma_ops wrappers.
> > 
> > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > we need to create a 'pci-mfd' to support splitting PCI devices?
> > 
> > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > strongly linked to vfio.
> >   
> Mdev at beginning was strongly linked to vfio, but as I mentioned above it is addressing more use case.
> 
> I observed that discussion, but was not sure of extending mdev further.
> 
> One way to do for Intel drivers to do is after series [9].
> Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> RDMA driver mdev_register_driver(), matches on it and does the probe().

Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
muddying the purpose of mdevs is not a clear trade off.

IMHO MFD should be of more natural use for Intel, since it's about
providing different functionality rather than virtual slices of the
same device.

> > At the very least if it is agreed mdev should be the vehicle here, then it
> > should also be able to solve the netdev/rdma hookup problem too.
Parav Pandit Nov. 8, 2019, 7:34 p.m. UTC | #17
> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
 
> On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
> > Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
> > >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> > >> Mellanox sub function capability allows users to create several
> > >> hundreds of networking and/or rdma devices without depending on PCI
> SR-IOV support.
> > >
> > >You call the new port type "sub function" but the devlink port
> > >flavour is mdev.
> > >
> > >As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> > >functions which are just regions of the BAR without any mdev
> > >capability. Am I in the clear to repost those now? Jiri?
> >
> > Well question is, if it makes sense to have SFs without having them as
> > mdev? I mean, we discussed the modelling thoroughtly and eventually we
> > realized that in order to model this correctly, we need SFs on "a bus".
> > Originally we were thinking about custom bus, but mdev is already
> > there to handle this.
> 
> But the "main/real" port is not a mdev in your case. NFP is like mlx4.
> It has one PCI PF for multiple ports.
> 
> > Our SFs are also just regions of the BAR, same thing as you have.
> >
> > Can't you do the same for nfp SFs?
> > Then the "mdev" flavour is enough for all.
> 
> Absolutely not.
> 
Please explain what is missing in mdev.
And frankly it is too late for such a comment.
I sent out RFC patches back in March 2019,
discussed them further at netdevconf 0x13,
discussed them several times more while we introduced devlink ports,
and discussed them again in August with the mdev alias series and phys_port_name formation for mdev.
This series shouldn't be any surprise to you at all.
Anyway.

> Why not make the main device of mlx5 a mdev, too, if that's acceptable.
> There's (a) long precedence for multiple ports on one PCI PF in networking
> devices, (b) plenty deployed software which depend on the main devices
> hanging off the PCI PF directly.
> 
> The point of mdevs is being able to sign them to VFs or run DPDK on them (map
> to user space).
> 
That can be one use case, but it is not the only use case.
I clearly explained the use case scenarios in the motivation section of the cover letter. Please go through it again.
This series is certainly not targeting the DPDK use case right now.

Also please read the design decisions section of the cover letter...

> For normal devices existing sysfs hierarchy were one device has multiple
> children of a certain class, without a bus and a separate driver is perfectly fine.
> Do you think we should also slice all serial chips into mdevs if they have
> multiple lines.
> 
> Exactly as I predicted much confusion about what's being achieved here, heh :)

We don't see confusion here. Please be specific on the point that confuses you.
A PCI device is sliced into multiple devices to make use of it in different ways. Serial chips are not a good example of this.
This fits in with:
- the overall kernel bus-driver model,
- suitability for multiple types of devices (netdev, rdma and their different link layers),
- the devlink framework's bus/device notion for health reporters, ports, resources,
- mapping to the VM/bare-metal/container use case,
- and more.
I don't want to repeat the cover letter further here; let's talk about specific points to improve upon.
Jiri Pirko Nov. 8, 2019, 7:41 p.m. UTC | #18
Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote:
>On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
>> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:
>> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:  
>> >> Mellanox sub function capability allows users to create several hundreds
>> >> of networking and/or rdma devices without depending on PCI SR-IOV support.  
>> >
>> >You call the new port type "sub function" but the devlink port flavour
>> >is mdev.
>> >
>> >As I'm sure you remember you nacked my patches exposing NFP's PCI 
>> >sub functions which are just regions of the BAR without any mdev
>> >capability. Am I in the clear to repost those now? Jiri?  
>> 
>> Well question is, if it makes sense to have SFs without having them as
>> mdev? I mean, we discussed the modelling thoroughtly and eventually we
>> realized that in order to model this correctly, we need SFs on "a bus".
>> Originally we were thinking about custom bus, but mdev is already there
>> to handle this.
>
>But the "main/real" port is not a mdev in your case. NFP is like mlx4. 
>It has one PCI PF for multiple ports.

I don't see how relevant the number of PFs-vs-uplink_ports is.


>
>> Our SFs are also just regions of the BAR, same thing as you have.
>> 
>> Can't you do the same for nfp SFs?
>> Then the "mdev" flavour is enough for all.
>
>Absolutely not. 
>
>Why not make the main device of mlx5 a mdev, too, if that's acceptable.
>There's (a) long precedence for multiple ports on one PCI PF in
>networking devices, (b) plenty deployed software 
>which depend on the main devices hanging off the PCI PF directly.
>
>The point of mdevs is being able to sign them to VFs or run DPDK on
>them (map to user space).
>
>For normal devices existing sysfs hierarchy were one device has
>multiple children of a certain class, without a bus and a separate
>driver is perfectly fine. Do you think we should also slice all serial
>chips into mdevs if they have multiple lines.
>
>Exactly as I predicted much confusion about what's being achieved here,
>heh :)

Please let me understand how your device is different.
Originally Parav didn't want to have mlx5 subfunctions as mdevs. He
wanted to have them tied to the same pci device as the pf. No
difference from what you describe you want. However, while we thought
about how to fit things in, how to handle the phys_port_name, how to see
things in sysfs, we came up with the idea of a dedicated bus. We took it
upstream and people suggested using the mdev bus for this.

Parav, please correct me if I'm wrong, but I don't think there is a plan
to push SFs into a VM or to userspace as Jakub expects, right?
Jakub Kicinski Nov. 8, 2019, 7:48 p.m. UTC | #19
On Fri, 8 Nov 2019 19:34:07 +0000, Parav Pandit wrote:
> > -----Original Message-----
> > From: Jakub Kicinski <jakub.kicinski@netronome.com>  
>  
> > On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:  
> > > Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:  
> > > >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:  
> > > >> Mellanox sub function capability allows users to create several
> > > >> hundreds of networking and/or rdma devices without depending on PCI  
> > SR-IOV support.  
> > > >
> > > >You call the new port type "sub function" but the devlink port
> > > >flavour is mdev.
> > > >
> > > >As I'm sure you remember you nacked my patches exposing NFP's PCI sub
> > > >functions which are just regions of the BAR without any mdev
> > > >capability. Am I in the clear to repost those now? Jiri?  
> > >
> > > Well question is, if it makes sense to have SFs without having them as
> > > mdev? I mean, we discussed the modelling thoroughtly and eventually we
> > > realized that in order to model this correctly, we need SFs on "a bus".
> > > Originally we were thinking about custom bus, but mdev is already
> > > there to handle this.  
> > 
> > But the "main/real" port is not a mdev in your case. NFP is like mlx4.
> > It has one PCI PF for multiple ports.
> >   
> > > Our SFs are also just regions of the BAR, same thing as you have.
> > >
> > > Can't you do the same for nfp SFs?
> > > Then the "mdev" flavour is enough for all.  
> > 
> > Absolutely not.
> >   
> Please explain what is missing in mdev.
> And frankly it is too late for such a comment.
> I sent out RFC patches back in March-19, 
> further discussed in netdevconf0x13, 
> further discussed several times while we introduced devlink ports,
> further discussed in august with mdev alias series and phys_port_name formation for mdev.
> This series shouldn't be any surprise for you at all.
> Anyways.
> 
> > Why not make the main device of mlx5 a mdev, too, if that's acceptable.
> > There's (a) long precedence for multiple ports on one PCI PF in networking
> > devices, (b) plenty deployed software which depend on the main devices
> > hanging off the PCI PF directly.
> > 
> > The point of mdevs is being able to sign them to VFs or run DPDK on them (map
> > to user space).
> >   
> That can be one use case. That is not the only use case.
> I clearly explained the use case scenarios in motivation section of the cover letter. Please go through it again.
> This series is certainly not targeting the DPDK usecase right now.
> 
> Also please read design decisions section of cover letter...
> 
> > For normal devices existing sysfs hierarchy were one device has multiple
> > children of a certain class, without a bus and a separate driver is perfectly fine.
> > Do you think we should also slice all serial chips into mdevs if they have
> > multiple lines.
> > 
> > Exactly as I predicted much confusion about what's being achieved here, heh :)  
> 
> We don't see confusion here. Please be specific on the point that confuses you.
> A PCI device is sliced to multiple devices to make use of it in different ways. Serial chips are not good example of it.
> This is fitting in with 
> - overall kernel bus-driver model, 
> - suitable for multiple types of devices (netdev, rdma and their different link layers), 
> - devlink framework of bus/device notion for health reporters, ports, resources
> - mapping to VM/bare-metal/container usecase.
> - and more..
> I don't want to further repeat the cover-letter here.. let's talk specific points to improve upon.

Parav, I think you completely missed the point of my email.

I was replying to Jiri saying that mdev does not fit the nfp/mlx4
multi-port devices, and we still need a normal "PCI split" or something
along those lines to have multiple PF ports on one PF.

I'm not talking about your use of mdev at all.

The discussion on patch 12 also seems to indicate you don't read the
emails you reply to...
Jason Gunthorpe Nov. 8, 2019, 8:12 p.m. UTC | #20
On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > The new intel driver has been having a very similar discussion about how to
> > > model their 'multi function device' ie to bind RDMA and other drivers to a
> > > shared PCI function, and I think that discussion settled on adding a new bus?
> > > 
> > > Really these things are all very similar, it would be nice to have a clear
> > > methodology on how to use the device core if a single PCI device is split by
> > > software into multiple different functional units and attached to different
> > > driver instances.
> > > 
> > > Currently there is alot of hacking in this area.. And a consistent scheme
> > > might resolve the ugliness with the dma_ops wrappers.
> > > 
> > > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > > we need to create a 'pci-mfd' to support splitting PCI devices?
> > > 
> > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > strongly linked to vfio.
> > >
> >
> > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > above it is addressing more use case.
> > 
> > I observed that discussion, but was not sure of extending mdev further.
> > 
> > One way to do for Intel drivers to do is after series [9].
> > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > RDMA driver mdev_register_driver(), matches on it and does the probe().
> 
> Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> muddying the purpose of mdevs is not a clear trade off.

IMHO, mdev has a mdev_parent_ops structure clearly intended to link it
to vfio, so using a mdev for something not related to vfio seems like
a poor choice.

I suppose this series is the start and we will eventually see
mlx5's mdev_parent_ops filled in to support vfio - but *right now*
this looks identical to the problem most of the RDMA-capable net
drivers have splitting into a 'core' and a 'function'.

> IMHO MFD should be of more natural use for Intel, since it's about
> providing different functionality rather than virtual slices of the
> same device.

I don't think the 'different functionality' should matter much. 

Generally these multi-function drivers are built around some common
'core' language like queues, interrupts, BAR space, etc, and then these
common things can be specialized into netdev, rdma, scsi, etc. So we
see a general rough design with a core layer managing the raw HW then
drivers on top of that (including netdev) using that API.

The actual layering doesn't come through in the driver model,
generally people put all the core stuff in with the netdev and then
try and shuffle the netdev around as the 'handle' for that core API.

These SFs are pretty similar in that the core physical driver
continues to provide some software API support to the SF children (at
least for mlx it is a small API)

For instance mdev has no generic way to learn the BAR struct
resources, so there is some extra API around the side that does this -
in this series it is done by hackily co-opting the drvdata to
something owned by the struct device instead of the device_driver and
using that to access the API surface on 'struct mlx5_sf *', which
includes the BAR info and so forth.
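
As a rough sketch of that drvdata pattern (not the actual mlx5 code;
the struct layout and helper names below are placeholders):

  #include <linux/device.h>
  #include <linux/mdev.h>

  /* per-SF state owned by the parent PCI driver */
  struct mlx5_sf {
       void __iomem *bar_base;  /* this SF's window into the parent BAR */
       size_t bar_size;
  };

  /* parent side: drvdata ends up owned by the struct device rather
   * than the device_driver, which is the co-opting mentioned above */
  static void sf_attach_drvdata(struct mdev_device *mdev, struct mlx5_sf *sf)
  {
       dev_set_drvdata(mdev_dev(mdev), sf);
  }

  /* mdev driver side: probe pulls the handle back out to reach the
   * BAR window and the rest of the parent's SW API surface */
  static int sf_probe(struct device *dev)
  {
       struct mlx5_sf *sf = dev_get_drvdata(dev);

       if (!sf || !sf->bar_base)
               return -EINVAL;
       /* set up queues/doorbells inside sf->bar_base .. + sf->bar_size */
       return 0;
  }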

This is probably the main difference from MFD. At least the few
drivers I looked at, did not try and expose an SW API from the 'core'
to the 'part', everything was usual generic driver resource stuff.

Jason
Parav Pandit Nov. 8, 2019, 8:20 p.m. UTC | #21
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > > The new intel driver has been having a very similar discussion
> > > > about how to model their 'multi function device' ie to bind RDMA
> > > > and other drivers to a shared PCI function, and I think that discussion
> settled on adding a new bus?
> > > >
> > > > Really these things are all very similar, it would be nice to have
> > > > a clear methodology on how to use the device core if a single PCI
> > > > device is split by software into multiple different functional
> > > > units and attached to different driver instances.
> > > >
> > > > Currently there is alot of hacking in this area.. And a consistent
> > > > scheme might resolve the ugliness with the dma_ops wrappers.
> > > >
> > > > We already have the 'mfd' stuff to support splitting platform
> > > > devices, maybe we need to create a 'pci-mfd' to support splitting PCI
> devices?
> > > >
> > > > I'm not really clear how mfd and mdev relate, I always thought
> > > > mdev was strongly linked to vfio.
> > > >
> > >
> > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > above it is addressing more use case.
> > >
> > > I observed that discussion, but was not sure of extending mdev further.
> > >
> > > One way to do for Intel drivers to do is after series [9].
> > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO RDMA driver
> > > mdev_register_driver(), matches on it and does the probe().
> >
> > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > muddying the purpose of mdevs is not a clear trade off.
> 
> IMHO, mdev has amdev_parent_ops structure clearly intended to link it to vfio,
> so using a mdev for something not related to vfio seems like a poor choice.
> 
Splitting mdev_parent_ops{} is already in the works for a larger use case in series [1] for virtio.

[1] https://patchwork.kernel.org/patch/11233127/
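
To make the shape of that split concrete, a minimal conceptual sketch
(the names below are illustrative only and not taken from series [1]):

  struct vfio_like_ops;
  struct virtio_like_ops;

  struct mdev_class_ops {
       unsigned int class_id;       /* e.g. vfio vs. virtio */
       union {
               const struct vfio_like_ops *vfio_ops;
               const struct virtio_like_ops *virtio_ops;
       };
  };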
Jason Gunthorpe Nov. 8, 2019, 8:32 p.m. UTC | #22
On Fri, Nov 08, 2019 at 08:20:43PM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Gunthorpe <jgg@ziepe.ca>
> > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > > > The new intel driver has been having a very similar discussion
> > > > > about how to model their 'multi function device' ie to bind RDMA
> > > > > and other drivers to a shared PCI function, and I think that discussion
> > settled on adding a new bus?
> > > > >
> > > > > Really these things are all very similar, it would be nice to have
> > > > > a clear methodology on how to use the device core if a single PCI
> > > > > device is split by software into multiple different functional
> > > > > units and attached to different driver instances.
> > > > >
> > > > > Currently there is alot of hacking in this area.. And a consistent
> > > > > scheme might resolve the ugliness with the dma_ops wrappers.
> > > > >
> > > > > We already have the 'mfd' stuff to support splitting platform
> > > > > devices, maybe we need to create a 'pci-mfd' to support splitting PCI
> > devices?
> > > > >
> > > > > I'm not really clear how mfd and mdev relate, I always thought
> > > > > mdev was strongly linked to vfio.
> > > > >
> > > >
> > > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > > above it is addressing more use case.
> > > >
> > > > I observed that discussion, but was not sure of extending mdev further.
> > > >
> > > > One way to do for Intel drivers to do is after series [9].
> > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO RDMA driver
> > > > mdev_register_driver(), matches on it and does the probe().
> > >
> > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > muddying the purpose of mdevs is not a clear trade off.
> > 
> > IMHO, mdev has amdev_parent_ops structure clearly intended to link it to vfio,
> > so using a mdev for something not related to vfio seems like a poor choice.
> > 
> Splitting mdev_parent_ops{} is already in works for larger use case in series [1] for virtio.
> 
> [1] https://patchwork.kernel.org/patch/11233127/

Weird. So what is mdev actually providing and what does it represent
if the entire driver facing API surface is under a union?

This smells a lot like it is re-implementing a bus.. AFAIK bus is
supposed to represent the in-kernel API the struct device presents to
drivers.

Jason
Alex Williamson Nov. 8, 2019, 8:34 p.m. UTC | #23
On Fri, 8 Nov 2019 16:12:53 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:  
> > > > The new intel driver has been having a very similar discussion about how to
> > > > model their 'multi function device' ie to bind RDMA and other drivers to a
> > > > shared PCI function, and I think that discussion settled on adding a new bus?
> > > > 
> > > > Really these things are all very similar, it would be nice to have a clear
> > > > methodology on how to use the device core if a single PCI device is split by
> > > > software into multiple different functional units and attached to different
> > > > driver instances.
> > > > 
> > > > Currently there is alot of hacking in this area.. And a consistent scheme
> > > > might resolve the ugliness with the dma_ops wrappers.
> > > > 
> > > > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > > > we need to create a 'pci-mfd' to support splitting PCI devices?
> > > > 
> > > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > > strongly linked to vfio.
> > > >  
> > >
> > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > above it is addressing more use case.
> > > 
> > > I observed that discussion, but was not sure of extending mdev further.
> > > 
> > > One way to do for Intel drivers to do is after series [9].
> > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > RDMA driver mdev_register_driver(), matches on it and does the probe().  
> > 
> > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > muddying the purpose of mdevs is not a clear trade off.  
> 
> IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> to vfio, so using a mdev for something not related to vfio seems like
> a poor choice.

Unless there's some opposition, I intend to queue this for v5.5:

https://www.spinics.net/lists/kvm/msg199613.html

mdev has started out as tied to vfio, but at its core, it's just a
device life cycle infrastructure with callbacks between bus drivers
and vendor devices.  If virtio is on the wrong path with the above
series, please speak up.  Thanks,

Alex

 
> I suppose this series is the start and we will eventually see the
> mlx5's mdev_parent_ops filled in to support vfio - but *right now*
> this looks identical to the problem most of the RDMA capable net
> drivers have splitting into a 'core' and a 'function'
> 
> > IMHO MFD should be of more natural use for Intel, since it's about
> > providing different functionality rather than virtual slices of the
> > same device.  
> 
> I don't think the 'different functionality' should matter much. 
> 
> Generally these multi-function drivers are build some some common
> 'core' language like queues interrupts, BAR space, etc and then these
> common things can be specialized into netdev, rdma, scsi, etc. So we
> see a general rough design with a core layer managing the raw HW then
> drivers on top of that (including netdev) using that API.
> 
> The actual layering doesn't come through in the driver model,
> generally people put all the core stuff in with the netdev and then
> try and shuffle the netdev around as the 'handle' for that core API.
> 
> These SFs are pretty similar in that the core physical driver
> continues to provide some software API support to the SF children (at
> least for mlx it is a small API)
> 
> For instance mdev has no generic way to learn the BAR struct
> resources, so there is some extra API around the side that does this -
> in this series it is done by hackily co-opting the drvdata to
> something owned by the struct device instead of the device_driver and
> using that to access the API surface on 'struct mlx5_sf *', which
> includes the BAR info and so forth.
> 
> This is probably the main difference from MFD. At least the few
> drivers I looked at, did not try and expose an SW API from the 'core'
> to the 'part', everything was usual generic driver resource stuff.
> 
> Jason
Parav Pandit Nov. 8, 2019, 8:40 p.m. UTC | #24
> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> 
> Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote:
> >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:
> >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com
> wrote:
> >> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:
> >> >> Mellanox sub function capability allows users to create several
> >> >> hundreds of networking and/or rdma devices without depending on PCI
> SR-IOV support.
> >> >
> >> >You call the new port type "sub function" but the devlink port
> >> >flavour is mdev.
> >> >
> >> >As I'm sure you remember you nacked my patches exposing NFP's PCI
> >> >sub functions which are just regions of the BAR without any mdev
> >> >capability. Am I in the clear to repost those now? Jiri?
> >>
> >> Well question is, if it makes sense to have SFs without having them
> >> as mdev? I mean, we discussed the modelling thoroughtly and
> >> eventually we realized that in order to model this correctly, we need SFs on
> "a bus".
> >> Originally we were thinking about custom bus, but mdev is already
> >> there to handle this.
> >
> >But the "main/real" port is not a mdev in your case. NFP is like mlx4.
> >It has one PCI PF for multiple ports.
> 
> I don't see how relevant the number of PFs-vs-uplink_ports is.
> 
> 
> >
> >> Our SFs are also just regions of the BAR, same thing as you have.
> >>
> >> Can't you do the same for nfp SFs?
> >> Then the "mdev" flavour is enough for all.
> >
> >Absolutely not.
> >
> >Why not make the main device of mlx5 a mdev, too, if that's acceptable.
> >There's (a) long precedence for multiple ports on one PCI PF in
> >networking devices, (b) plenty deployed software which depend on the
> >main devices hanging off the PCI PF directly.
> >
> >The point of mdevs is being able to sign them to VFs or run DPDK on
> >them (map to user space).
> >
> >For normal devices existing sysfs hierarchy were one device has
> >multiple children of a certain class, without a bus and a separate
> >driver is perfectly fine. Do you think we should also slice all serial
> >chips into mdevs if they have multiple lines.
> >
> >Exactly as I predicted much confusion about what's being achieved here,
> >heh :)
> 
> Please let me understand how your device is different.
> Originally Parav didn't want to have mlx5 subfunctions as mdev. He wanted to
> have them tight to the same pci device as the pf. No difference from what you
> describe you want.
> However while we thought about how to fit things in, how to
> handle na phys_port_name, how to see things in sysfs we came up with an idea
> of a dedicated bus. We took it upstream and people suggested to use mdev bus
> for this.
> 
You are right. We considered the multiple-ports approach, followed by subdevices and mfd.
Around that time mdev was being proposed, which can address current and future VM/userspace use cases using one way to lifecycle the devices.

> Parav, please correct me if I'm wrong but I don't think where is a plan to push
> SFs into VM or to userspace as Jakub expects, right?
With this series - certainly not.
In future, if mdev is to be used via the vfio/VM framework, why should we prevent it (of course after implementing the necessary isolation method)?
Greg KH Nov. 8, 2019, 8:52 p.m. UTC | #25
On Fri, Nov 08, 2019 at 04:32:09PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 08:20:43PM +0000, Parav Pandit wrote:
> > 
> > 
> > > From: Jason Gunthorpe <jgg@ziepe.ca>
> > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > > > > The new intel driver has been having a very similar discussion
> > > > > > about how to model their 'multi function device' ie to bind RDMA
> > > > > > and other drivers to a shared PCI function, and I think that discussion
> > > settled on adding a new bus?
> > > > > >
> > > > > > Really these things are all very similar, it would be nice to have
> > > > > > a clear methodology on how to use the device core if a single PCI
> > > > > > device is split by software into multiple different functional
> > > > > > units and attached to different driver instances.
> > > > > >
> > > > > > Currently there is alot of hacking in this area.. And a consistent
> > > > > > scheme might resolve the ugliness with the dma_ops wrappers.
> > > > > >
> > > > > > We already have the 'mfd' stuff to support splitting platform
> > > > > > devices, maybe we need to create a 'pci-mfd' to support splitting PCI
> > > devices?
> > > > > >
> > > > > > I'm not really clear how mfd and mdev relate, I always thought
> > > > > > mdev was strongly linked to vfio.
> > > > > >
> > > > >
> > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > > > above it is addressing more use case.
> > > > >
> > > > > I observed that discussion, but was not sure of extending mdev further.
> > > > >
> > > > > One way to do for Intel drivers to do is after series [9].
> > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO RDMA driver
> > > > > mdev_register_driver(), matches on it and does the probe().
> > > >
> > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > > muddying the purpose of mdevs is not a clear trade off.
> > > 
> > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it to vfio,
> > > so using a mdev for something not related to vfio seems like a poor choice.
> > > 
> > Splitting mdev_parent_ops{} is already in works for larger use case in series [1] for virtio.
> > 
> > [1] https://patchwork.kernel.org/patch/11233127/
> 
> Weird. So what is mdev actually providing and what does it represent
> if the entire driver facing API surface is under a union?
> 
> This smells a lot like it is re-implementing a bus.. AFAIK bus is
> supposed to represent the in-kernel API the struct device presents to
> drivers.

Yes, yes yes yes...

I'm getting tired of saying the same thing here, just use a bus, that's
what it is there for.

greg k-h
Jason Gunthorpe Nov. 8, 2019, 9:05 p.m. UTC | #26
On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote:
> On Fri, 8 Nov 2019 16:12:53 -0400
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:  
> > > > > The new intel driver has been having a very similar discussion about how to
> > > > > model their 'multi function device' ie to bind RDMA and other drivers to a
> > > > > shared PCI function, and I think that discussion settled on adding a new bus?
> > > > > 
> > > > > Really these things are all very similar, it would be nice to have a clear
> > > > > methodology on how to use the device core if a single PCI device is split by
> > > > > software into multiple different functional units and attached to different
> > > > > driver instances.
> > > > > 
> > > > > Currently there is alot of hacking in this area.. And a consistent scheme
> > > > > might resolve the ugliness with the dma_ops wrappers.
> > > > > 
> > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > > > > we need to create a 'pci-mfd' to support splitting PCI devices?
> > > > > 
> > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > > > strongly linked to vfio.
> > > > >  
> > > >
> > > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > > above it is addressing more use case.
> > > > 
> > > > I observed that discussion, but was not sure of extending mdev further.
> > > > 
> > > > One way to do for Intel drivers to do is after series [9].
> > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > > RDMA driver mdev_register_driver(), matches on it and does the probe().  
> > > 
> > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > muddying the purpose of mdevs is not a clear trade off.  
> > 
> > IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> > to vfio, so using a mdev for something not related to vfio seems like
> > a poor choice.
> 
> Unless there's some opposition, I'm intended to queue this for v5.5:
> 
> https://www.spinics.net/lists/kvm/msg199613.html
> 
> mdev has started out as tied to vfio, but at it's core, it's just a
> device life cycle infrastructure with callbacks between bus drivers
> and vendor devices.  If virtio is on the wrong path with the above
> series, please speak up.  Thanks,

Well, I think Greg just objected pretty strongly.

IMHO it is wrong to turn mdev into some API multiplexor. That is what
the driver core already does and AFAIK your bus type is supposed to
represent your API contract to your drivers.

Since the bus type is ABI, 'mdev' is really all about vfio I guess?

Maybe mdev should grow by factoring the special GUID life cycle stuff
into a helper library that can make it simpler to build proper API
specific buses using that lifecycle model? ie the virtio I saw
proposed should probably be a mdev-virtio bus type providing this new
virtio API contract using a 'struct mdev_virtio'?

I only looked briefly but mdev seems like an unusual way to use the
driver core. *generally* I would expect that if a driver wants to
provide a foo_device (on a foo bus, providing the foo API contract) it
looks very broadly like:

  struct foo_device {
       struct device dev;
       const struct foo_ops *ops;
  };
  struct my_foo_device {
      struct foo_device fdev;
  };

  foo_device_register(&mydev->fdev);
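
And the driver side of that would broadly be (again just an
illustration, not an existing API):

  struct foo_driver {
       struct device_driver driver;
       int (*probe)(struct foo_device *fdev);
  };

  static int foo_bus_probe(struct device *dev)
  {
       struct foo_device *fdev = container_of(dev, struct foo_device, dev);
       struct foo_driver *fdrv = container_of(dev->driver,
                                              struct foo_driver, driver);

       return fdrv->probe(fdev);
  }

  static int my_probe(struct foo_device *fdev)
  {
       struct my_foo_device *mydev =
               container_of(fdev, struct my_foo_device, fdev);

       dev_info(&mydev->fdev.dev, "bound\n");
       return 0;
  }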

Which means we can use normal container_of() patterns, while mdev
seems to want to allocate all the structs internally.. I guess this is
because of how the lifecycle stuff works? From a device core view it
looks quite unnatural.

Jason
Greg KH Nov. 8, 2019, 9:19 p.m. UTC | #27
On Fri, Nov 08, 2019 at 05:05:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote:
> > On Fri, 8 Nov 2019 16:12:53 -0400
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:  
> > > > > > The new intel driver has been having a very similar discussion about how to
> > > > > > model their 'multi function device' ie to bind RDMA and other drivers to a
> > > > > > shared PCI function, and I think that discussion settled on adding a new bus?
> > > > > > 
> > > > > > Really these things are all very similar, it would be nice to have a clear
> > > > > > methodology on how to use the device core if a single PCI device is split by
> > > > > > software into multiple different functional units and attached to different
> > > > > > driver instances.
> > > > > > 
> > > > > > Currently there is alot of hacking in this area.. And a consistent scheme
> > > > > > might resolve the ugliness with the dma_ops wrappers.
> > > > > > 
> > > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > > > > > we need to create a 'pci-mfd' to support splitting PCI devices?
> > > > > > 
> > > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > > > > strongly linked to vfio.
> > > > > >  
> > > > >
> > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > > > above it is addressing more use case.
> > > > > 
> > > > > I observed that discussion, but was not sure of extending mdev further.
> > > > > 
> > > > > One way to do for Intel drivers to do is after series [9].
> > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > > > RDMA driver mdev_register_driver(), matches on it and does the probe().  
> > > > 
> > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > > muddying the purpose of mdevs is not a clear trade off.  
> > > 
> > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> > > to vfio, so using a mdev for something not related to vfio seems like
> > > a poor choice.
> > 
> > Unless there's some opposition, I'm intended to queue this for v5.5:
> > 
> > https://www.spinics.net/lists/kvm/msg199613.html
> > 
> > mdev has started out as tied to vfio, but at it's core, it's just a
> > device life cycle infrastructure with callbacks between bus drivers
> > and vendor devices.  If virtio is on the wrong path with the above
> > series, please speak up.  Thanks,
> 
> Well, I think Greg just objected pretty strongly.

Yes I did.

I keep saying this again and again, and so did you here:

> IMHO it is wrong to turn mdev into some API multiplexor. That is what
> the driver core already does and AFAIK your bus type is supposed to
> represent your API contract to your drivers.

That is exactly right.  Don't re-create the driver api interface at
another layer please.

thanks,

greg k-h
Jakub Kicinski Nov. 8, 2019, 9:21 p.m. UTC | #28
On Fri, 8 Nov 2019 20:41:18 +0100, Jiri Pirko wrote:
> Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote:
> >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:  
> >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:  
> >> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:    
> >> >> Mellanox sub function capability allows users to create several hundreds
> >> >> of networking and/or rdma devices without depending on PCI SR-IOV support.    
> >> >
> >> >You call the new port type "sub function" but the devlink port flavour
> >> >is mdev.
> >> >
> >> >As I'm sure you remember you nacked my patches exposing NFP's PCI 
> >> >sub functions which are just regions of the BAR without any mdev
> >> >capability. Am I in the clear to repost those now? Jiri?    
> >> 
> >> Well question is, if it makes sense to have SFs without having them as
> >> mdev? I mean, we discussed the modelling thoroughtly and eventually we
> >> realized that in order to model this correctly, we need SFs on "a bus".
> >> Originally we were thinking about custom bus, but mdev is already there
> >> to handle this.  
> >
> >But the "main/real" port is not a mdev in your case. NFP is like mlx4. 
> >It has one PCI PF for multiple ports.  
> 
> I don't see how relevant the number of PFs-vs-uplink_ports is.

Well. We have a slice per external port, the association between the
port and the slice becomes irrelevant once switchdev mode is enabled,
but the queues are assigned statically so it'd be a waste of resources
to not show all slices as netdevs.

> >> Our SFs are also just regions of the BAR, same thing as you have.
> >> 
> >> Can't you do the same for nfp SFs?
> >> Then the "mdev" flavour is enough for all.  
> >
> >Absolutely not. 
> >
> >Why not make the main device of mlx5 a mdev, too, if that's acceptable.
> >There's (a) long precedence for multiple ports on one PCI PF in
> >networking devices, (b) plenty deployed software 
> >which depend on the main devices hanging off the PCI PF directly.
> >
> >The point of mdevs is being able to sign them to VFs or run DPDK on
> >them (map to user space).
> >
> >For normal devices existing sysfs hierarchy were one device has
> >multiple children of a certain class, without a bus and a separate
> >driver is perfectly fine. Do you think we should also slice all serial
> >chips into mdevs if they have multiple lines.
> >
> >Exactly as I predicted much confusion about what's being achieved here,
> >heh :)  
> 
> Please let me understand how your device is different.
> Originally Parav didn't want to have mlx5 subfunctions as mdev. He
> wanted to have them tight to the same pci device as the pf. No
> difference from what you describe you want. However while we thought
> about how to fit things in, how to handle na phys_port_name, how to see
> things in sysfs we came up with an idea of a dedicated bus.

The difference is that there is naturally a main device and subslices
with this new mlx5 code. In mlx4 or nfp all ports are equal and
statically allocated when FW initializes based on port breakout.

Maybe it's the fact I spent last night at an airport but I'm feeling
like I'm arguing about this stronger than I actually care :)

> We took it upstream and people suggested to use mdev bus for this.
> 
> Parav, please correct me if I'm wrong but I don't think where is a plan
> to push SFs into VM or to userspace as Jakub expects, right?

There's definitely a plan to push them to VFs, I believe that was part
of the original requirements, otherwise there'd be absolutely no need
for a bus to begin with.
Jiri Pirko Nov. 8, 2019, 9:39 p.m. UTC | #29
Fri, Nov 08, 2019 at 10:21:20PM CET, jakub.kicinski@netronome.com wrote:
>On Fri, 8 Nov 2019 20:41:18 +0100, Jiri Pirko wrote:
>> Fri, Nov 08, 2019 at 08:06:40PM CET, jakub.kicinski@netronome.com wrote:
>> >On Fri, 8 Nov 2019 13:12:33 +0100, Jiri Pirko wrote:  
>> >> Thu, Nov 07, 2019 at 09:32:34PM CET, jakub.kicinski@netronome.com wrote:  
>> >> >On Thu,  7 Nov 2019 10:04:48 -0600, Parav Pandit wrote:    
>> >> >> Mellanox sub function capability allows users to create several hundreds
>> >> >> of networking and/or rdma devices without depending on PCI SR-IOV support.    
>> >> >
>> >> >You call the new port type "sub function" but the devlink port flavour
>> >> >is mdev.
>> >> >
>> >> >As I'm sure you remember you nacked my patches exposing NFP's PCI 
>> >> >sub functions which are just regions of the BAR without any mdev
>> >> >capability. Am I in the clear to repost those now? Jiri?    
>> >> 
>> >> Well question is, if it makes sense to have SFs without having them as
>> >> mdev? I mean, we discussed the modelling thoroughtly and eventually we
>> >> realized that in order to model this correctly, we need SFs on "a bus".
>> >> Originally we were thinking about custom bus, but mdev is already there
>> >> to handle this.  
>> >
>> >But the "main/real" port is not a mdev in your case. NFP is like mlx4. 
>> >It has one PCI PF for multiple ports.  
>> 
>> I don't see how relevant the number of PFs-vs-uplink_ports is.
>
>Well. We have a slice per external port, the association between the
>port and the slice becomes irrelevant once switchdev mode is enabled,
>but the queues are assigned statically so it'd be a waste of resources
>to not show all slices as netdevs.
>
>> >> Our SFs are also just regions of the BAR, same thing as you have.
>> >> 
>> >> Can't you do the same for nfp SFs?
>> >> Then the "mdev" flavour is enough for all.  
>> >
>> >Absolutely not. 
>> >
>> >Why not make the main device of mlx5 a mdev, too, if that's acceptable.
>> >There's (a) long precedence for multiple ports on one PCI PF in
>> >networking devices, (b) plenty deployed software 
>> >which depend on the main devices hanging off the PCI PF directly.
>> >
>> >The point of mdevs is being able to sign them to VFs or run DPDK on
>> >them (map to user space).
>> >
>> >For normal devices existing sysfs hierarchy were one device has
>> >multiple children of a certain class, without a bus and a separate
>> >driver is perfectly fine. Do you think we should also slice all serial
>> >chips into mdevs if they have multiple lines.
>> >
>> >Exactly as I predicted much confusion about what's being achieved here,
>> >heh :)  
>> 
>> Please let me understand how your device is different.
>> Originally Parav didn't want to have mlx5 subfunctions as mdev. He
>> wanted to have them tight to the same pci device as the pf. No
>> difference from what you describe you want. However while we thought
>> about how to fit things in, how to handle na phys_port_name, how to see
>> things in sysfs we came up with an idea of a dedicated bus.
>
>The difference is that there is naturally a main device and subslices
>with this new mlx5 code. In mlx4 or nfp all ports are equal and
>statically allocated when FW initializes based on port breakout.

Ah, I see. I was missing the static part in nfp. Now I understand. It is
just another "pf", but not a real pf in pci terminology, right?


>
>Maybe it's the fact I spent last night at an airport but I'm feeling
>like I'm arguing about this stronger than I actually care :)
>
>> We took it upstream and people suggested to use mdev bus for this.
>> 
>> Parav, please correct me if I'm wrong but I don't think where is a plan
>> to push SFs into VM or to userspace as Jakub expects, right?
>
>There's definitely a plan to push them to VFs, I believe that was part
>of the original requirements, otherwise there'd be absolutely no need
>for a bus to begin with.
Jakub Kicinski Nov. 8, 2019, 9:45 p.m. UTC | #30
On Fri, 8 Nov 2019 16:12:53 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:  
> > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > above it is addressing more use case.
> > > 
> > > I observed that discussion, but was not sure of extending mdev further.
> > > 
> > > One way to do for Intel drivers to do is after series [9].
> > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > RDMA driver mdev_register_driver(), matches on it and does the probe().  
> > 
> > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > muddying the purpose of mdevs is not a clear trade off.  
> 
> IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> to vfio, so using a mdev for something not related to vfio seems like
> a poor choice.

Yes, my suggestion to use mdev was entirely based on the premise that
the purpose of this work is to get vfio working.. otherwise I'm unclear
as to why we'd need a bus in the first place. If this is just for
containers - we have had macvlan offload for years now, with no need for a
separate device.

> I suppose this series is the start and we will eventually see the
> mlx5's mdev_parent_ops filled in to support vfio - but *right now*
> this looks identical to the problem most of the RDMA capable net
> drivers have splitting into a 'core' and a 'function'

On the RDMA/Intel front, would you mind explaining what the main
motivation for the special buses is? I'm a little confurious.

My understanding is MFD was created to help with cases where a single
device has multiple pieces of common IP in it. Do modern RDMA cards
really share IP across generations? Is there a need to reload the
drivers for the separate pieces (I wonder if the devlink reload doesn't
belong to the device model :().

Or is it purely an abstraction and people like abstractions?
Jakub Kicinski Nov. 8, 2019, 9:51 p.m. UTC | #31
On Fri, 8 Nov 2019 22:39:52 +0100, Jiri Pirko wrote:
> >> Please let me understand how your device is different.
> >> Originally Parav didn't want to have mlx5 subfunctions as mdev. He
> >> wanted to have them tight to the same pci device as the pf. No
> >> difference from what you describe you want. However while we thought
> >> about how to fit things in, how to handle na phys_port_name, how to see
> >> things in sysfs we came up with an idea of a dedicated bus.  
> >
> >The difference is that there is naturally a main device and subslices
> >with this new mlx5 code. In mlx4 or nfp all ports are equal and
> >statically allocated when FW initializes based on port breakout.  
> 
> Ah, I see. I was missing the static part in nfp. Now I understand. It is
> just an another "pf", but not real pf in the pci terminology, right?

Ack, due to (real and perceived) HW limitations what should have been
separate PFs got squished into a single big one.

The biggest NFP chip has an insane (for a NIC) number of Ethernet ports.
Alex Williamson Nov. 8, 2019, 9:52 p.m. UTC | #32
On Fri, 8 Nov 2019 17:05:45 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote:
> > On Fri, 8 Nov 2019 16:12:53 -0400
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:  
> > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:    
> > > > > > The new intel driver has been having a very similar discussion about how to
> > > > > > model their 'multi function device' ie to bind RDMA and other drivers to a
> > > > > > shared PCI function, and I think that discussion settled on adding a new bus?
> > > > > > 
> > > > > > Really these things are all very similar, it would be nice to have a clear
> > > > > > methodology on how to use the device core if a single PCI device is split by
> > > > > > software into multiple different functional units and attached to different
> > > > > > driver instances.
> > > > > > 
> > > > > > Currently there is alot of hacking in this area.. And a consistent scheme
> > > > > > might resolve the ugliness with the dma_ops wrappers.
> > > > > > 
> > > > > > We already have the 'mfd' stuff to support splitting platform devices, maybe
> > > > > > we need to create a 'pci-mfd' to support splitting PCI devices?
> > > > > > 
> > > > > > I'm not really clear how mfd and mdev relate, I always thought mdev was
> > > > > > strongly linked to vfio.
> > > > > >    
> > > > >
> > > > > Mdev at beginning was strongly linked to vfio, but as I mentioned
> > > > > above it is addressing more use case.
> > > > > 
> > > > > I observed that discussion, but was not sure of extending mdev further.
> > > > > 
> > > > > One way to do for Intel drivers to do is after series [9].
> > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > > > RDMA driver mdev_register_driver(), matches on it and does the probe().    
> > > > 
> > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > > muddying the purpose of mdevs is not a clear trade off.    
> > > 
> > > IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> > > to vfio, so using a mdev for something not related to vfio seems like
> > > a poor choice.  
> > 
> > Unless there's some opposition, I'm intended to queue this for v5.5:
> > 
> > https://www.spinics.net/lists/kvm/msg199613.html
> > 
> > mdev has started out as tied to vfio, but at it's core, it's just a
> > device life cycle infrastructure with callbacks between bus drivers
> > and vendor devices.  If virtio is on the wrong path with the above
> > series, please speak up.  Thanks,  
> 
> Well, I think Greg just objected pretty strongly.
> 
> IMHO it is wrong to turn mdev into some API multiplexor. That is what
> the driver core already does and AFAIK your bus type is supposed to
> represent your API contract to your drivers.
> 
> Since the bus type is ABI, 'mdev' is really all about vfio I guess?
> 
> Maybe mdev should grow by factoring the special GUID life cycle stuff
> into a helper library that can make it simpler to build proper API
> specific bus's using that lifecycle model? ie the virtio I saw
> proposed should probably be a mdev-virtio bus type providing this new
> virtio API contract using a 'struct mdev_virtio'?

I see, the bus:API contract is more clear when we're talking about
physical buses and physical devices following a hardware specification.
But if we take PCI for example, each PCI device has its own internal
API that operates on the bus API.  PCI bus drivers match devices based
on vendor and device ID, which defines that internal API, not the bus
API.  The bus API is pretty thin when we're talking virtual devices and
virtual buses though.  The bus "API" is essentially that lifecycle
management, so I'm having a bit of a hard time differentiating this
from saying "hey, that PCI bus is nice, but we can't have drivers using
their own API on the same bus, so can we move the config space, reset,
hotplug, etc, stuff into helpers and come up with an (ex.) mlx5_bus
instead?"  Essentially for virtual devices, we're dictating a bus per
device type, whereas it seemed like a reasonable idea at the time to
create a common virtual device bus, but maybe it went into the weeds
when trying to figure out how device drivers match to devices on that
bus and actually interact with them.

> I only looked briefly but mdev seems like an unusual way to use the
> driver core. *generally* I would expect that if a driver wants to
> provide a foo_device (on a foo bus, providing the foo API contract) it
> looks very broadly like:
> 
>   struct foo_device {
>        struct device dev;
>        const struct foo_ops *ops;
>   };
>   struct my_foo_device {
>       struct foo_device fdev;
>   };
> 
>   foo_device_register(&mydev->fdev);
> 
> Which means we can use normal container_of() patterns, while mdev
> seems to want to allocate all the structs internally.. I guess this is
> because of how the lifecycle stuff works? From a device core view it
> looks quite unnatural.

Right, there's an attempt in mdev to do the common bits of the device
creation in the core and pass it to the vendor driver to fill in the
private bits.  I'm sure it could be cleaner, patches welcome :)  Thanks,

Alex
Jiri Pirko Nov. 8, 2019, 10:21 p.m. UTC | #33
Fri, Nov 08, 2019 at 10:51:09PM CET, jakub.kicinski@netronome.com wrote:
>On Fri, 8 Nov 2019 22:39:52 +0100, Jiri Pirko wrote:
>> >> Please let me understand how your device is different.
>> >> Originally Parav didn't want to have mlx5 subfunctions as mdev. He
>> >> wanted to have them tight to the same pci device as the pf. No
>> >> difference from what you describe you want. However while we thought
>> >> about how to fit things in, how to handle na phys_port_name, how to see
>> >> things in sysfs we came up with an idea of a dedicated bus.  
>> >
>> >The difference is that there is naturally a main device and subslices
>> >with this new mlx5 code. In mlx4 or nfp all ports are equal and
>> >statically allocated when FW initializes based on port breakout.  
>> 
>> Ah, I see. I was missing the static part in nfp. Now I understand. It is
>> just an another "pf", but not real pf in the pci terminology, right?
>
>Ack, due to (real and perceived) HW limitations what should have been
>separate PFs got squished into a single big one.
>
>Biggest NFP chip has an insane (for a NIC) number Ethernet ports.

Okay. So we'll end up having flavour "mdev" for the SFs that are
spawned on the fly by the user and "sf" for the fixed ones - that is your
patchset, if I recall correctly.
Parav Pandit Nov. 8, 2019, 10:48 p.m. UTC | #34
Hi Greg, Jason,

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> 
> On Fri, 8 Nov 2019 17:05:45 -0400
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote:
> > > On Fri, 8 Nov 2019 16:12:53 -0400
> > > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > > On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
> > > > > On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
> > > > > > > The new intel driver has been having a very similar
> > > > > > > discussion about how to model their 'multi function device'
> > > > > > > ie to bind RDMA and other drivers to a shared PCI function, and I
> think that discussion settled on adding a new bus?
> > > > > > >
> > > > > > > Really these things are all very similar, it would be nice
> > > > > > > to have a clear methodology on how to use the device core if
> > > > > > > a single PCI device is split by software into multiple
> > > > > > > different functional units and attached to different driver instances.
> > > > > > >
> > > > > > > Currently there is alot of hacking in this area.. And a
> > > > > > > consistent scheme might resolve the ugliness with the dma_ops
> wrappers.
> > > > > > >
> > > > > > > We already have the 'mfd' stuff to support splitting
> > > > > > > platform devices, maybe we need to create a 'pci-mfd' to support
> splitting PCI devices?
> > > > > > >
> > > > > > > I'm not really clear how mfd and mdev relate, I always
> > > > > > > thought mdev was strongly linked to vfio.
> > > > > > >
> > > > > >
> > > > > > Mdev at beginning was strongly linked to vfio, but as I
> > > > > > mentioned above it is addressing more use case.
> > > > > >
> > > > > > I observed that discussion, but was not sure of extending mdev
> further.
> > > > > >
> > > > > > One way to do for Intel drivers to do is after series [9].
> > > > > > Where PCI driver says, MDEV_CLASS_ID_I40_FOO
> > > > > > RDMA driver mdev_register_driver(), matches on it and does the
> probe().
> > > > >
> > > > > Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
> > > > > muddying the purpose of mdevs is not a clear trade off.
> > > >
> > > > IMHO, mdev has amdev_parent_ops structure clearly intended to link
> > > > it to vfio, so using a mdev for something not related to vfio
> > > > seems like a poor choice.
> > >
> > > Unless there's some opposition, I'm intended to queue this for v5.5:
> > >
> > > https://www.spinics.net/lists/kvm/msg199613.html
> > >
> > > mdev has started out as tied to vfio, but at it's core, it's just a
> > > device life cycle infrastructure with callbacks between bus drivers
> > > and vendor devices.  If virtio is on the wrong path with the above
> > > series, please speak up.  Thanks,
> >
> > Well, I think Greg just objected pretty strongly.
> >
> > IMHO it is wrong to turn mdev into some API multiplexor. That is what
> > the driver core already does and AFAIK your bus type is supposed to
> > represent your API contract to your drivers.
> >
> > Since the bus type is ABI, 'mdev' is really all about vfio I guess?
> >
> > Maybe mdev should grow by factoring the special GUID life cycle stuff
> > into a helper library that can make it simpler to build proper API
> > specific bus's using that lifecycle model? ie the virtio I saw
> > proposed should probably be a mdev-virtio bus type providing this new
> > virtio API contract using a 'struct mdev_virtio'?
> 
> I see, the bus:API contract is more clear when we're talking about physical
> buses and physical devices following a hardware specification.
> But if we take PCI for example, each PCI device has it's own internal API that
> operates on the bus API.  PCI bus drivers match devices based on vendor and
> device ID, which defines that internal API, not the bus API.  The bus API is pretty
> thin when we're talking virtual devices and virtual buses though.  The bus "API"
> is essentially that lifecycle management, so I'm having a bit of a hard time
> differentiating this from saying "hey, that PCI bus is nice, but we can't have
> drivers using their own API on the same bus, so can we move the config space,
> reset, hotplug, etc, stuff into helpers and come up with an (ex.) mlx5_bus
> instead?"  Essentially for virtual devices, we're dictating a bus per device type,
> whereas it seemed like a reasonable idea at the time to create a common
> virtual device bus, but maybe it went into the weeds when trying to figure out
> how device drivers match to devices on that bus and actually interact with
> them.
> 
> > I only looked briefly but mdev seems like an unusual way to use the
> > driver core. *generally* I would expect that if a driver wants to
> > provide a foo_device (on a foo bus, providing the foo API contract) it
> > looks very broadly like:
> >
> >   struct foo_device {
> >        struct device dev;
> >        const struct foo_ops *ops;
> >   };
> >   struct my_foo_device {
> >       struct foo_device fdev;
> >   };
> >
> >   foo_device_register(&mydev->fdev);
> >
If I understood Greg's direction on using a bus and Jason's suggestion of the 'mdev-virtio' example correctly:

The user has one of the three use cases as I described in the cover letter,
i.e. create a sub device and configure it.
Once it is configured,
based on the use case, map it to the right bus driver:
1. mdev-vfio (no demux business)
2. virtio (new bus)
3. mlx5_bus (new bus)

Should we be creating 3 different buses, instead of the mdev bus being a de-multiplexer of that?

Hence, depending on the device flavour specified, create such a device on the right bus?

For example,
$ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
$ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
$ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3

$ devlink subdev pci/0000:05:00.0/<subdev_id> config <params>
$ echo <respective_device_id> > <sysfs_path>/bind

Should we implement power management callbacks on all of the above 3 buses as well?
Should we abstract out mlx5_bus into a more generic virtual bus (vdev bus?) so that multiple vendors can reuse it?
Jason Gunthorpe Nov. 9, 2019, 12:12 a.m. UTC | #35
On Fri, Nov 08, 2019 at 02:52:10PM -0700, Alex Williamson wrote:
> > > 
> > > Unless there's some opposition, I'm intended to queue this for v5.5:
> > > 
> > > https://www.spinics.net/lists/kvm/msg199613.html
> > > 
> > > mdev has started out as tied to vfio, but at it's core, it's just a
> > > device life cycle infrastructure with callbacks between bus drivers
> > > and vendor devices.  If virtio is on the wrong path with the above
> > > series, please speak up.  Thanks,  
> > 
> > Well, I think Greg just objected pretty strongly.
> > 
> > IMHO it is wrong to turn mdev into some API multiplexor. That is what
> > the driver core already does and AFAIK your bus type is supposed to
> > represent your API contract to your drivers.
> > 
> > Since the bus type is ABI, 'mdev' is really all about vfio I guess?
> > 
> > Maybe mdev should grow by factoring the special GUID life cycle stuff
> > into a helper library that can make it simpler to build proper API
> > specific bus's using that lifecycle model? ie the virtio I saw
> > proposed should probably be a mdev-virtio bus type providing this new
> > virtio API contract using a 'struct mdev_virtio'?
> 
> I see, the bus:API contract is more clear when we're talking about
> physical buses and physical devices following a hardware
> specification.

Well, I don't think it matters; this is a software contract inside the
kernel between the 'struct foo_device' (as provided by the foo_bus)
and the 'struct foo_driver'.

This contract is certainly easier to define when a HW specification
dictates basically how it works.

> But if we take PCI for example, each PCI device has it's own internal
> API that operates on the bus API.  PCI bus drivers match devices based
> on vendor and device ID, which defines that internal API, not the bus
> API.  

Yes, this matching is part of the API contract between the bus and
device driver.

But all of the pci_* functions that accept a 'struct pci_device *' are
also part of this API contract toward the driver.

> The bus API is pretty thin when we're talking virtual devices and
> virtual buses though.  The bus "API" is essentially that lifecycle
> management, so I'm having a bit of a hard time differentiating this

But Parav just pointed to a virtio SW API that had something like
20 API entry points.

> instead?"  Essentially for virtual devices, we're dictating a bus per
> device type, whereas it seemed like a reasonable idea at the time to

Well, what does a driver binding to a virtual device need to know?

The virtual device API should provide all of that information.

I think things like vfio and virtio APIs are very reasonable bus
types. virtio in particular has a published 'hw-like' specification
with some good layers that can build a bus API.

Not so sure about the very HW specific things like the Intel driver and
these SFs. These will really only ever bind to one driver and seem to
have no commonalities.

For those we either create a bus per driver-specific proprietary API
(feels kind of wrong) or we have a generic bus essentially for managed
multi-function hardware that uses a simple 'void *hw_data' as the
driver API and some matching logic to support that.
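
ie maybe nothing more than (purely illustrative, name made up):

  struct msd_device {              /* 'managed multi-subsystem device' */
       struct device dev;
       const char *match_name;     /* what the subsystem driver matches on */
       void *hw_data;              /* opaque handle into the 'core' driver */
  };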

> create a common virtual device bus, but maybe it went into the weeds
> when trying to figure out how device drivers match to devices on that
> bus and actually interact with them.

I think it is important to focus on the SW API the 'struct
foo_device' is supposed to provide toward the driver that binds to it.

It should be a sensible API covering some well defined area..

Jason
Jason Gunthorpe Nov. 9, 2019, 12:44 a.m. UTC | #36
On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:

> > IMHO, mdev has amdev_parent_ops structure clearly intended to link it
> > to vfio, so using a mdev for something not related to vfio seems like
> > a poor choice.
> 
> Yes, my suggestion to use mdev was entirely based on the premise that
> the purpose of this work is to get vfio working.. otherwise I'm unclear
> as to why we'd need a bus in the first place. If this is just for
> containers - we have macvlan offload for years now, with no need for a
> separate device.

This SF thing is a full-fledged VF function; it is not at all like
macvlan. This is perhaps less important for the netdev part of the
world, but the difference is very big for the RDMA side, and should
enable VFIO too..
 
> On the RDMA/Intel front, would you mind explaining what the main
> motivation for the special buses is? I'm a little confurious.

Well, the issue is driver binding. For years we have had these
multi-function netdev drivers that have a single PCI device which must
bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
expect when NVMe over TCP rolls out we will have drivers like cxgb4
binding to 6 subsystems in total!]

Today most of this is a big hack where the PCI device binds to the
netdev driver and then the other drivers in different subsystems
'discover' that an appropriate netdev is plugged in using various
unique, hacky and ugly means. For instance cxgb4 duplicates a chunk of
the device core, see cxgb4_register_uld() for example. Other drivers
try to use netdev notifiers, and various other wild things.

So, the general concept is to use the driver model to manage driver
binding. A multi-subsystem driver would have several parts:

- A pci_driver which binds to the pci_device (the core)
  It creates, on a bus, struct ??_device's for the other subsystems
  that this HW supports. ie if the chip supports netdev then a 
  ??_device that binds to the netdev driver is created, same for RDMA

- A ??_driver in netdev binds to the device and accesses the core API
- A ??_driver in RDMA binds to the device and accesses the core API
- A ??_driver in SCSI binds to the device and accesses the core API

Now the driver model directly handles all binding, autoloading,
discovery, etc, and 'netdev' is just another consumer of 'core'
functionality.
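
Very roughly, the 'core' side of that could look like this (mlx5_bus_type,
mlx5_adev and so on are placeholder names here, not an existing API):

  #include <linux/device.h>
  #include <linux/pci.h>
  #include <linux/slab.h>

  struct mlx5_core_dev;                  /* the 'core' SW API handle */
  extern struct bus_type mlx5_bus_type;  /* vendor bus, registered elsewhere */

  struct mlx5_adev {
       struct device dev;
       struct mlx5_core_dev *core;       /* what subsystem drivers program against */
  };

  static void mlx5_adev_release(struct device *dev)
  {
       kfree(container_of(dev, struct mlx5_adev, dev));
  }

  /* called by the pci_driver (the 'core') once per subsystem the HW
   * supports, e.g. "rdma" or "eth"; the matching ??_driver registered
   * in that subsystem then binds through the driver model */
  static int mlx5_core_add_child(struct pci_dev *pdev,
                                 struct mlx5_core_dev *core, const char *name)
  {
       struct mlx5_adev *adev = kzalloc(sizeof(*adev), GFP_KERNEL);
       int ret;

       if (!adev)
               return -ENOMEM;

       adev->core = core;
       adev->dev.parent = &pdev->dev;
       adev->dev.bus = &mlx5_bus_type;
       adev->dev.release = mlx5_adev_release;
       dev_set_name(&adev->dev, "%s.%s", pci_name(pdev), name);

       ret = device_register(&adev->dev);
       if (ret)
               put_device(&adev->dev);   /* release() frees adev */
       return ret;
  }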

For something like mlx5 the 'core' is the stuff in
drivers/net/ethernet/mellanox/mlx5/core/*.c, give or take. It is
broadly generic stuff like send commands, create queues, manage HW
resources, etc.

There has been some lack of clarity on what the ?? should be. People
have proposed platform and MFD, and those seem to be no-goes. So, it
looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's
guidance.

Though I'm wondering if we should have a 'multi_subsystem_device' that
was really just about passing a 'void *core_handle' from the 'core'
(ie the bus) to the driver (ie RDMA, netdev, etc). 

It seems weakly defined, but also exactly what every driver doing this
needs.. It is basically what this series is abusing mdev to accomplish.
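
Roughly, all that 'multi_subsystem_device' would carry is something like
the following (a sketch only, names invented here):

struct multi_subsystem_device {
        struct device dev;
        const char *function;   /* which sub-driver should bind, e.g. "rdma" */
        void *core_handle;      /* interpreted only by that sub-driver */
};

Everything meaningful hides behind core_handle, which is exactly why it
looks weakly defined: the driver core gets matching and lifecycle, while
the real API contract stays private to the two halves of the vendor driver.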

> My understanding is MFD was created to help with cases where single
> device has multiple pieces of common IP in it. 

MFD really seems to be good at splitting a device when the HW is
orthogonal at the register level. Ie you might have regs 100-200 for
ethernet and 200-300 for RDMA.

But this is not how modern HW works, the functional division is more
subtle and more software based. ie on most devices a netdev and rdma
queue are nearly the same, just a few settings make them function
differently.

So what is needed isn't a split of register set like MFD specializes
in, but a unique per-driver API between the 'core' and 'subsystem'
parts of the multi-subsystem device.

> Do modern RDMA cards really share IP across generations? 

What is a generation? Mellanox has had a stable RDMA driver across
many silicon generations. Intel looks like their new driver will
support at least the last two or more silicon generations..

RDMA drivers are monstrous complex things, there is a big incentive to
not respin them every time a new chip comes out.

> Is there a need to reload the drivers for the separate pieces (I
> wonder if the devlink reload doesn't belong to the device model :().

Yes, it is already done, but without driver model support the only way
to reload the rdma driver is to unload the entire module as there is
no 'unbind'

Jason
Parav Pandit Nov. 9, 2019, 12:45 a.m. UTC | #37
Hi Jason,

> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> 
> On Fri, Nov 08, 2019 at 02:52:10PM -0700, Alex Williamson wrote:
> > > >
> > > > Unless there's some opposition, I'm intended to queue this for v5.5:
> > > >
> > > > https://www.spinics.net/lists/kvm/msg199613.html
> > > >
> > > > mdev has started out as tied to vfio, but at it's core, it's just
> > > > a device life cycle infrastructure with callbacks between bus
> > > > drivers and vendor devices.  If virtio is on the wrong path with
> > > > the above series, please speak up.  Thanks,
> > >
> > > Well, I think Greg just objected pretty strongly.
> > >
> > > IMHO it is wrong to turn mdev into some API multiplexor. That is
> > > what the driver core already does and AFAIK your bus type is
> > > supposed to represent your API contract to your drivers.
> > >
> > > Since the bus type is ABI, 'mdev' is really all about vfio I guess?
> > >
> > > Maybe mdev should grow by factoring the special GUID life cycle
> > > stuff into a helper library that can make it simpler to build proper
> > > API specific bus's using that lifecycle model? ie the virtio I saw
> > > proposed should probably be a mdev-virtio bus type providing this
> > > new virtio API contract using a 'struct mdev_virtio'?
> >
> > I see, the bus:API contract is more clear when we're talking about
> > physical buses and physical devices following a hardware
> > specification.
> 
> Well, I don't think it matters, this is a software contract inside the kernel
> between the 'struct foo_device' (as provided by the foo_bus) and the 'struct
> foo_driver'
> 
> This contract is certainly easier to define when a HW specification dictates
> basically how it works.
> 
> > But if we take PCI for example, each PCI device has it's own internal
> > API that operates on the bus API.  PCI bus drivers match devices based
> > on vendor and device ID, which defines that internal API, not the bus
> > API.
> 
> Yes, this matching is part of the API contract between the bus and device
> driver.
> 
> But all of the pci_* functions that accept a 'struct pci_device *' are also part of
> this API contract toward the driver.
> 
> > The bus API is pretty thin when we're talking virtual devices and
> > virtual buses though.  The bus "API" is essentially that lifecycle
> > management, so I'm having a bit of a hard time differentiating this
> 
> But Parav just pointed out to a virtio SW API that had something like
> 20 API entry points.
> 
> > instead?"  Essentially for virtual devices, we're dictating a bus per
> > device type, whereas it seemed like a reasonable idea at the time to
> 
> Well, what does a driver binding to a virtual device need to know?
> 
> The virtual device API should provide all of that information.
> 
> I think things like vfio and virtio APIs are very reasonable bus types. virtio in
> particular has a published 'hw-like' specification with some good layers that
> can build a bus API.
> 
> Not so sure about the very HW specific things like the Intel driver and these SFs.
> These will really only ever bind to one driver and seem to have no
> commonalities.
> 
> For those we either create a bus per driver-specific proprietary API (feels kind
> of wrong) or we have a generic bus essentially for managed multi-function
> hardware that uses a simple 'void *hw_data' as the driver API and some
> matching logic to support that.
> 
It's certainly important to use a generic bus approach at the kernel level so that every vendor doesn't define its own devlink flavour, id scheme, udev naming method, PM and so on.
(It is not just the bus definition.)

Coming to the hw_data part: even if this subdev (vendor) bus is created, it can still follow your foo_device example exactly.

In fact my first published RFC [1] and its specific patch [2] was doing that.

The probe() routine in series [1] didn't have a PCI-like 'struct subdev *' argument, because I wanted to use the core's generic probe().
However, it is still correct because probe() can reach the foo_device using container_of(),
and hence the *hw_data is also resolved.

So struct looks like,

struct subdev {
	struct device device;
	/* resource range */
	/* num of irq vectors */
	const char *hw_addr;
	[..]
};

struct mlx5_subdev {
	struct subdev device;
	[..];
};
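
And a hedged sketch of the probe() path built on those structs (the
function name and the mlx5_subdev_setup() helper here are hypothetical,
only to show the container_of() step):

static int mlx5_subdev_probe(struct device *dev)
{
        struct subdev *sdev = container_of(dev, struct subdev, device);
        struct mlx5_subdev *msdev =
                container_of(sdev, struct mlx5_subdev, device);

        /* sdev carries the generic resources (resource range, irq
         * vectors, hw_addr); msdev carries the mlx5-private state,
         * so no extra 'void *hw_data' indirection is needed.
         */
        return mlx5_subdev_setup(msdev);        /* hypothetical helper */
}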

I request reconsidering RFC [1] for multi-function SF use, with an extension of device flavours such as 'virtio', 'mlx5' etc. as in [3].

[1] https://lkml.org/lkml/2019/3/1/19
[2] https://lore.kernel.org/patchwork/patch/1046997/#1238851
[3] https://lkml.org/lkml/2019/3/1/25

> > create a common virtual device bus, but maybe it went into the weeds
> > when trying to figure out how device drivers match to devices on that
> > bus and actually interact with them.
> 
> I think it is important to focus on the the SW API the 'struct foo_device' is
> supposed to provide toward the driver that binds to it.
> 
> It should be a sensible API covering some well defined area..
> 
> Jason
Jason Gunthorpe Nov. 9, 2019, 12:57 a.m. UTC | #38
On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote:
> We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that?
> 
> Hence, depending the device flavour specified, create such device on right bus?
> 
> For example,
> $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
> $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
> $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3

I like the idea of specifying what kind of interface you want at sub
device creation time. It fits the driver model pretty well and doesn't
require abusing the vfio mdev for binding to a netdev driver.

> $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params>
> $ echo <respective_device_id> <sysfs_path>/bind

Is explicit binding really needed? If you specify a vfio flavour why
shouldn't the vfio driver autoload and bind to it right away? That is
kind of the point of the driver model...
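
For reference, that autoload would just be the usual uevent/modalias
mechanism. A minimal sketch, assuming a hypothetical 'subdev' bus that
records the flavour chosen at creation time:

static int subdev_uevent(struct device *dev, struct kobj_uevent_env *env)
{
        struct subdev *sdev = container_of(dev, struct subdev, device);

        /* udev sees this and modprobes the matching flavour driver */
        return add_uevent_var(env, "MODALIAS=subdev:%s", sdev->flavour);
}

A flavour driver then only needs a matching MODULE_ALIAS("subdev:vfio")
to be loaded and bound automatically when such a device is registered.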

(kind of related, but I don't get why all that GUID and lifecycle
stuff in mdev should apply to something like an SF)

> Implement power management callbacks also on all above 3 buses?
> Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
> that multiple vendors can reuse?

In this specific case, why does the SF in mlx5 mode even need a bus?
Is it only because of devlink? That would be unfortunate

Jason
Greg KH Nov. 9, 2019, 8:46 a.m. UTC | #39
On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote:
> There has been some lack of clarity on what the ?? should be. People
> have proposed platform and MFD, and those seem to be no-goes. So, it
> looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
> an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's
> guidance.

Yes, that is the only way it can work because you really are just
sharing a single PCI device in a vendor-specific way, and they all need
to get along with each one properly for that vendor-specific way.  So
each vendor needs its own "bus" to be able to work out things properly,
I doubt you can make this more generic than that easily.

> Though I'm wondering if we should have a 'multi_subsystem_device' that
> was really just about passing a 'void *core_handle' from the 'core'
> (ie the bus) to the driver (ie RDMA, netdev, etc). 

Ick, no.

> It seems weakly defined, but also exactly what every driver doing this
> needs.. It is basically what this series is abusing mdev to accomplish.

What is so hard about writing a bus?  Last I tried it was just a few
hundred lines of code, if that.  I know it's not the easiest in places,
but we have loads of examples to crib from.  If you have
problems/questions, just ask!

Or, worst case, you just do what I asked in this thread somewhere, and
write a "virtual bus" where you just create devices and bind them to the
driver before registering and away you go.  No auto-loading needed (or
possible), but then you have a generic layer that everyone can use if
they want to (but you lose some functionality at the expense of
generic code.)
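
For a sense of scale, the skeleton of such a bus is indeed tiny. A
minimal sketch with generic placeholder names (not an existing bus):

#include <linux/device.h>
#include <linux/module.h>
#include <linux/string.h>

static int vbus_match(struct device *dev, struct device_driver *drv)
{
        /* Simplest possible contract: device name == driver name.
         * A real bus would match on a flavour or class id instead.
         */
        return strcmp(dev_name(dev), drv->name) == 0;
}

static struct bus_type vbus_type = {
        .name  = "vbus",
        .match = vbus_match,
};

static int __init vbus_init(void)
{
        return bus_register(&vbus_type);
}

static void __exit vbus_exit(void)
{
        bus_unregister(&vbus_type);
}

module_init(vbus_init);
module_exit(vbus_exit);
MODULE_LICENSE("GPL");

Devices and drivers then hang off it with the usual device_register() /
driver_register() calls; the open question in this thread is only whether
each vendor instantiates its own copy of this or shares a generic one.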

Are these constant long email threads a way that people are just trying
to get me to do this work for them?  Because if it is, it's working...

thanks,

greg k-h
Jiri Pirko Nov. 9, 2019, 11:18 a.m. UTC | #40
Sat, Nov 09, 2019 at 09:46:59AM CET, gregkh@linuxfoundation.org wrote:
>On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote:
>> There has been some lack of clarity on what the ?? should be. People
>> have proposed platform and MFD, and those seem to be no-goes. So, it
>> looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
>> an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's
>> guidance.
>
>Yes, that is the only way it can work because you really are just
>sharing a single PCI device in a vendor-specific way, and they all need
>to get along with each one properly for that vendor-specific way.  So
>each vendor needs its own "bus" to be able to work out things properly,
>I doubt you can make this more generic than that easily.
>
>> Though I'm wondering if we should have a 'multi_subsystem_device' that
>> was really just about passing a 'void *core_handle' from the 'core'
>> (ie the bus) to the driver (ie RDMA, netdev, etc). 
>
>Ick, no.
>
>> It seems weakly defined, but also exactly what every driver doing this
>> needs.. It is basically what this series is abusing mdev to accomplish.
>
>What is so hard about writing a bus?  Last I tried it was just a few
>hundred lines of code, if that.  I know it's not the easiest in places,
>but we have loads of examples to crib from.  If you have
>problems/questions, just ask!
>
>Or, worst case, you just do what I asked in this thread somewhere, and
>write a "virtual bus" where you just create devices and bind them to the
>driver before registering and away you go.  No auto-loading needed (or
>possible), but then you have a generic layer that everyone can use if
>they want to (but you loose some functionality at the expense of
>generic code.)

Pardon my ignorance, just to be clear: you suggest having
one virtual bus per driver, or rather some common "xbus" to serve this
purpose for all of them, right?
If so, isn't it a bit ugly to have a bus in every driver? I wonder if
some abstraction can be found.


>
>Are these constant long email threads a way that people are just trying
>to get me to do this work for them?  Because if it is, it's working...

Maybe they are just confused, like I am :)


>
>thanks,
>
>greg k-h
Jakub Kicinski Nov. 9, 2019, 5:27 p.m. UTC | #41
On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:
> > Yes, my suggestion to use mdev was entirely based on the premise that
> > the purpose of this work is to get vfio working.. otherwise I'm unclear
> > as to why we'd need a bus in the first place. If this is just for
> > containers - we have macvlan offload for years now, with no need for a
> > separate device.  
> 
> This SF thing is a full fledged VF function, it is not at all like
> macvlan. This is perhaps less important for the netdev part of the
> world, but the difference is very big for the RDMA side, and should
> enable VFIO too..

Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF.
I'd perhaps need to learn more about RDMA to appreciate the difference.

> > On the RDMA/Intel front, would you mind explaining what the main
> > motivation for the special buses is? I'm a little confurious.  
> 
> Well, the issue is driver binding. For years we have had these
> multi-function netdev drivers that have a single PCI device which must
> bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
> drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
> expect when NVMe over TCP rolls out we will have drivers like cxgb4
> binding to 6 subsytems in total!]

What I'm missing is why it is so bad to have a driver register to
multiple subsystems.

I've seen no end of hacks caused by people trying to split their driver
too deeply by functionality. Separate sub-drivers, buses and modules.

The nfp driver was split up before I upstreamed it, I merged it into
one monolithic driver/module. Code is still split up cleanly internally,
the architecture doesn't change in any major way. Sure 5% of developers
were upset they can't do some partial reloads they were used to, but
they got used to the new ways, and 100% of users were happy about the
simplicity.

For the nfp I think the _real_ reason to have a bus was that it
was expected to have some out-of-tree modules bind to it. Something 
I would not encourage :)

Maybe RDMA and storage have some requirements where the reload of the
part of the driver is important, IDK..

> > My understanding is MFD was created to help with cases where single
> > device has multiple pieces of common IP in it.   
> 
> MFD really seems to be good at splitting a device when the HW is
> orthogonal at the register level. Ie you might have regs 100-200 for
> ethernet and 200-300 for RDMA.
> 
> But this is not how modern HW works, the functional division is more
> subtle and more software based. ie on most devices a netdev and rdma
> queue are nearly the same, just a few settings make them function
> differently.
> 
> So what is needed isn't a split of register set like MFD specializes
> in, but a unique per-driver API between the 'core' and 'subsystem'
> parts of the multi-subsystem device.

Exactly, because the device is one. For my simplistic brain one device
means one driver, which can register to as many subsystems as it wants.

> > Do modern RDMA cards really share IP across generations?   
> 
> What is a generation? Mellanox has had a stable RDMA driver across
> many sillicon generations. Intel looks like their new driver will
> support at least the last two or more sillicon generations..
> 
> RDMA drivers are monstrous complex things, there is a big incentive to
> not respin them every time a new chip comes out.

Ack, but then again none of the drivers gets rewritten from scratch,
right? It's not that some "sub-drivers" get reused and some not, no?

> > Is there a need to reload the drivers for the separate pieces (I
> > wonder if the devlink reload doesn't belong to the device model :().  
> 
> Yes, it is already done, but without driver model support the only way
> to reload the rdma driver is to unload the entire module as there is
> no 'unbind'

The reload is the only thing that I can think of (other than
out-of-tree code), but with devlink no I believe it can be solved
differently.

Thanks a lot for the explanation Jason, much appreciated!

The practicality of this is still a little elusive to me, but since 
Greg seems on board I guess it's just me :)
Jakub Kicinski Nov. 9, 2019, 5:28 p.m. UTC | #42
On Sat, 9 Nov 2019 12:18:09 +0100, Jiri Pirko wrote:
> >Are these constant long email threads a way that people are just trying
> >to get me to do this work for them?  Because if it is, it's working...  
> 
> Maybe they are just confused, like I am :)

+1 :)
Jakub Kicinski Nov. 9, 2019, 5:41 p.m. UTC | #43
On Fri, 8 Nov 2019 20:57:08 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote:
> > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that?
> > 
> > Hence, depending the device flavour specified, create such device on right bus?
> > 
> > For example,
> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
> > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
> > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3  
> 
> I like the idea of specifying what kind of interface you want at sub
> device creation time. It fits the driver model pretty well and doesn't
> require abusing the vfio mdev for binding to a netdev driver.

Aren't the HW resources spun out in all three cases exactly identical?

IMHO creation of sub device should only define which HW resources are
provisioned/relegated. Specifying a driver when recreating a device
seems a little backwards.
Greg KH Nov. 10, 2019, 9:16 a.m. UTC | #44
On Sat, Nov 09, 2019 at 12:18:09PM +0100, Jiri Pirko wrote:
> Sat, Nov 09, 2019 at 09:46:59AM CET, gregkh@linuxfoundation.org wrote:
> >On Fri, Nov 08, 2019 at 08:44:26PM -0400, Jason Gunthorpe wrote:
> >> There has been some lack of clarity on what the ?? should be. People
> >> have proposed platform and MFD, and those seem to be no-goes. So, it
> >> looks like ?? will be a mlx5_driver on a mlx5_bus, and Intel will use
> >> an ice_driver on a ice_bus, ditto for cxgb4, if I understand Greg's
> >> guidance.
> >
> >Yes, that is the only way it can work because you really are just
> >sharing a single PCI device in a vendor-specific way, and they all need
> >to get along with each one properly for that vendor-specific way.  So
> >each vendor needs its own "bus" to be able to work out things properly,
> >I doubt you can make this more generic than that easily.
> >
> >> Though I'm wondering if we should have a 'multi_subsystem_device' that
> >> was really just about passing a 'void *core_handle' from the 'core'
> >> (ie the bus) to the driver (ie RDMA, netdev, etc). 
> >
> >Ick, no.
> >
> >> It seems weakly defined, but also exactly what every driver doing this
> >> needs.. It is basically what this series is abusing mdev to accomplish.
> >
> >What is so hard about writing a bus?  Last I tried it was just a few
> >hundred lines of code, if that.  I know it's not the easiest in places,
> >but we have loads of examples to crib from.  If you have
> >problems/questions, just ask!
> >
> >Or, worst case, you just do what I asked in this thread somewhere, and
> >write a "virtual bus" where you just create devices and bind them to the
> >driver before registering and away you go.  No auto-loading needed (or
> >possible), but then you have a generic layer that everyone can use if
> >they want to (but you loose some functionality at the expense of
> >generic code.)
> 
> Pardon my ignorance, just to be clear: You suggest to have
> one-virtual-bus-per-driver or rather some common "xbus" to serve this
> purpose for all of them, right?

Yes.

> If so, isn't that a bit ugly to have a bus in every driver?

No, not if that's what you want to have for that specific type of
device.  I.e. you want to have multiple drivers all attached to a single
PCI device and somehow "share" the physical resources properly in a sane
way.

> I wonder if there can be some abstraction found.

The abstraction is just that, the bus one.  It's not all that complex,
is it?

thanks,

greg k-h
Greg KH Nov. 10, 2019, 9:18 a.m. UTC | #45
On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:
> > > Yes, my suggestion to use mdev was entirely based on the premise that
> > > the purpose of this work is to get vfio working.. otherwise I'm unclear
> > > as to why we'd need a bus in the first place. If this is just for
> > > containers - we have macvlan offload for years now, with no need for a
> > > separate device.  
> > 
> > This SF thing is a full fledged VF function, it is not at all like
> > macvlan. This is perhaps less important for the netdev part of the
> > world, but the difference is very big for the RDMA side, and should
> > enable VFIO too..
> 
> Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF.
> I'd perhaps need to learn more about RDMA to appreciate the difference.
> 
> > > On the RDMA/Intel front, would you mind explaining what the main
> > > motivation for the special buses is? I'm a little confurious.  
> > 
> > Well, the issue is driver binding. For years we have had these
> > multi-function netdev drivers that have a single PCI device which must
> > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
> > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
> > expect when NVMe over TCP rolls out we will have drivers like cxgb4
> > binding to 6 subsytems in total!]
> 
> What I'm missing is why is it so bad to have a driver register to
> multiple subsystems.

Because these PCI devices seem to do "different" things all in one PCI
resource set.  Blame the hardware designers :)

> I've seen no end of hacks caused people trying to split their driver
> too deeply by functionality. Separate sub-drivers, buses and modules.
> 
> The nfp driver was split up before I upstreamed it, I merged it into
> one monolithic driver/module. Code is still split up cleanly internally,
> the architecture doesn't change in any major way. Sure 5% of developers
> were upset they can't do some partial reloads they were used to, but
> they got used to the new ways, and 100% of users were happy about the
> simplicity.

I agree, you should stick with the "one device/driver" thing where ever
possible, like you did.

> For the nfp I think the _real_ reason to have a bus was that it
> was expected to have some out-of-tree modules bind to it. Something 
> I would not encourage :)

That's not ok, and I agree with you.

But there seems to be some more complex PCI devices that do lots of
different things all at once.  Kind of like a PCI device that wants to
be both a keyboard and a storage device at the same time (i.e. a button
on a disk drive...)

thanks,

greg k-h
Jason Gunthorpe Nov. 10, 2019, 7:04 p.m. UTC | #46
On Sat, Nov 09, 2019 at 09:41:03AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Nov 2019 20:57:08 -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2019 at 10:48:31PM +0000, Parav Pandit wrote:
> > > We should be creating 3 different buses, instead of mdev bus being de-multiplexer of that?
> > > 
> > > Hence, depending the device flavour specified, create such device on right bus?
> > > 
> > > For example,
> > > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo subdev_id 1
> > > $ devlink create subdev pci/0000:05:00.0 flavour mdev <uuid> subdev_id 2
> > > $ devlink create subdev pci/0000:05:00.0 flavour mlx5 id 1 subdev_id 3  
> > 
> > I like the idea of specifying what kind of interface you want at sub
> > device creation time. It fits the driver model pretty well and doesn't
> > require abusing the vfio mdev for binding to a netdev driver.
> 
> Aren't the HW resources spun out in all three cases exactly identical?

Exactly? No, not really. The only constant is that some chunk of the
BAR is dedicated to this subdev. The BAR is flexible, so a BAR chunk
configured for virtio is not going to support mlx5 mode.

Aside from that, there are other differences ie - mlx5 does not need a
dedicated set of MSI-X's while other modes do. There are fewer MSI-X's
than SF's, so managing this is important for the admin.

Even in modes which are very similar, like mlx5 vs mdev-vfio, the HW
still has to be configured to provide global DMA isolation on the NIC
for vfio as the IOMMU cannot be involved. This is extra overhead and
should not be activated unless vfio is being used.

.. and finally the driver core does not support a
'multiple-inheritance' like idea, so we can't have a 'foo_device' that
is three different things.

So somehow the 'flavour' of the 'struct device' has to be exposed to
userspace, and it is best if this is done at device creation time so
the BAR region and HW can be setup once and we don't have to have
complex reconfiguration flows.

Jason
Jason Gunthorpe Nov. 10, 2019, 7:37 p.m. UTC | #47
On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:
> > > Yes, my suggestion to use mdev was entirely based on the premise that
> > > the purpose of this work is to get vfio working.. otherwise I'm unclear
> > > as to why we'd need a bus in the first place. If this is just for
> > > containers - we have macvlan offload for years now, with no need for a
> > > separate device.  
> > 
> > This SF thing is a full fledged VF function, it is not at all like
> > macvlan. This is perhaps less important for the netdev part of the
> > world, but the difference is very big for the RDMA side, and should
> > enable VFIO too..
> 
> Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF.
> I'd perhaps need to learn more about RDMA to appreciate the difference.

It has a lot to do with how the RDMA functionality works in the
HW.. At least for mlx the RDMA is 'below' all the netdev stuff, so
even though netdev has some offloaded vlans, RDMA sees, essentially, the
union of all the vlans on the system.

Which at least breaks the security model of a macvlan device for
net-namespaces.

Maybe with new HW something could be done, but today, the HW is
limited.

> > > On the RDMA/Intel front, would you mind explaining what the main
> > > motivation for the special buses is? I'm a little confurious.  
> > 
> > Well, the issue is driver binding. For years we have had these
> > multi-function netdev drivers that have a single PCI device which must
> > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
> > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
> > expect when NVMe over TCP rolls out we will have drivers like cxgb4
> > binding to 6 subsytems in total!]
> 
> What I'm missing is why is it so bad to have a driver register to
> multiple subsystems.

Well, for example, if you proposed to have a RDMA driver in
drivers/net/ethernet/foo/, I would NAK it, and I hope Dave would
too. Same for SCSI and nvme.

This Linux process is that driver code for a subsystem lives in the
subsystem and should be in a subsystem specific module. While it is
technically possible to have a giant driver, it distorts our process
in a way I don't think is good.

So, we have software layers between the large Linux subsystems just to
make the development side manageable and practical.

.. once the code lives in another subsystem, it is in a new module. A
new module requires some way to connect them all together, the driver
core is the logical way to do this connection.

I don't think a driver should be split beyond that. Even my suggestion
of a 'core' may in practice just be the netdev driver as most of the
other modules can't function without netdev. ie you can't do iSCSI
without an IP stack.

> > What is a generation? Mellanox has had a stable RDMA driver across
> > many sillicon generations. Intel looks like their new driver will
> > support at least the last two or more sillicon generations..
> > 
> > RDMA drivers are monstrous complex things, there is a big incentive to
> > not respin them every time a new chip comes out.
> 
> Ack, but then again none of the drivers gets rewritten from scratch,
> right? It's not that some "sub-drivers" get reused and some not, no?

Remarkably Intel is saying their new RDMA 'sub-driver' will be compatible
with their ICE and pre-ICE (sorry, forget the names) netdev core
drivers. 

netdev will get a different driver for each, but RDMA will use the
same driver.

Jason
Parav Pandit Nov. 10, 2019, 7:48 p.m. UTC | #48
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, November 8, 2019 6:57 PM
> > We should be creating 3 different buses, instead of mdev bus being de-
> multiplexer of that?
> >
> > Hence, depending the device flavour specified, create such device on right
> bus?
> >
> > For example,
> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo
> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev
> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour
> > mlx5 id 1 subdev_id 3
> 
> I like the idea of specifying what kind of interface you want at sub device
> creation time. It fits the driver model pretty well and doesn't require abusing
> the vfio mdev for binding to a netdev driver.
> 
> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ echo
> > <respective_device_id> <sysfs_path>/bind
> 
> Is explicit binding really needed?
No.

> If you specify a vfio flavour why shouldn't
> the vfio driver autoload and bind to it right away? That is kind of the point
> of the driver model...
> 
If some configuration is needed that cannot be passed at device creation time, an explicit bind later can be used.

> (kind of related, but I don't get while all that GUID and lifecycle stuff in mdev
> should apply for something like a SF)
> 
GUID is just the name of the device.
But let's park this aside for a moment.

> > Implement power management callbacks also on all above 3 buses?
> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
> > that multiple vendors can reuse?
> 
> In this specific case, why does the SF in mlx5 mode even need a bus?
> Is it only because of devlink? That would be unfortunate
>
Devlink is one part, since devlink identifies an instance using bus/dev.
How do we refer to the devlink instance of an SF without a bus/device?
Can we extend devlink_register() to optionally accept an sf_id?

If we don't have a bus, creating a sub function (a device) without a 'struct device' that holds the BAR, resources, etc. is odd.

Now, if we cannot see a 'struct device' in sysfs, how do we persistently name them?
Are we OK with adding /sys/class/net/<sf_netdev>/subdev_id
and
/sys/class/infiniband/<rdma_dev>/subdev_id

so that systemd/udev can rename them as en<X><subdev_id> and roce<X><subdev_id>?
If so, what will X be without a bus type?

This route without a bus is certainly helpful to overcome the IOMMU limitation where the IOMMU only listens to the PCI bus type for DMAR setup,
dmar_register_bus_notifier(), and in
intel_iommu_init() -> bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
with other IOMMUs doing similar PCI/AMBA binding.
This is currently worked around using the dma_ops wrappers.
Jason Wang Nov. 11, 2019, 2:19 a.m. UTC | #49
On 2019/11/9 5:05 AM, Jason Gunthorpe wrote:
> On Fri, Nov 08, 2019 at 01:34:35PM -0700, Alex Williamson wrote:
>> On Fri, 8 Nov 2019 16:12:53 -0400
>> Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>>> On Fri, Nov 08, 2019 at 11:12:38AM -0800, Jakub Kicinski wrote:
>>>> On Fri, 8 Nov 2019 15:40:22 +0000, Parav Pandit wrote:
>>>>>> The new intel driver has been having a very similar discussion about how to
>>>>>> model their 'multi function device' ie to bind RDMA and other drivers to a
>>>>>> shared PCI function, and I think that discussion settled on adding a new bus?
>>>>>>
>>>>>> Really these things are all very similar, it would be nice to have a clear
>>>>>> methodology on how to use the device core if a single PCI device is split by
>>>>>> software into multiple different functional units and attached to different
>>>>>> driver instances.
>>>>>>
>>>>>> Currently there is alot of hacking in this area.. And a consistent scheme
>>>>>> might resolve the ugliness with the dma_ops wrappers.
>>>>>>
>>>>>> We already have the 'mfd' stuff to support splitting platform devices, maybe
>>>>>> we need to create a 'pci-mfd' to support splitting PCI devices?
>>>>>>
>>>>>> I'm not really clear how mfd and mdev relate, I always thought mdev was
>>>>>> strongly linked to vfio.
>>>>>>   
>>>>> Mdev at beginning was strongly linked to vfio, but as I mentioned
>>>>> above it is addressing more use case.
>>>>>
>>>>> I observed that discussion, but was not sure of extending mdev further.
>>>>>
>>>>> One way to do for Intel drivers to do is after series [9].
>>>>> Where PCI driver says, MDEV_CLASS_ID_I40_FOO
>>>>> RDMA driver mdev_register_driver(), matches on it and does the probe().
>>>> Yup, FWIW to me the benefit of reusing mdevs for the Intel case vs
>>>> muddying the purpose of mdevs is not a clear trade off.
>>> IMHO, mdev has amdev_parent_ops structure clearly intended to link it
>>> to vfio, so using a mdev for something not related to vfio seems like
>>> a poor choice.
>> Unless there's some opposition, I'm intended to queue this for v5.5:
>>
>> https://www.spinics.net/lists/kvm/msg199613.html
>>
>> mdev has started out as tied to vfio, but at it's core, it's just a
>> device life cycle infrastructure with callbacks between bus drivers
>> and vendor devices.  If virtio is on the wrong path with the above
>> series, please speak up.  Thanks,
> Well, I think Greg just objected pretty strongly.
>
> IMHO it is wrong to turn mdev into some API multiplexor. That is what
> the driver core already does and AFAIK your bus type is supposed to
> represent your API contract to your drivers.
>
> Since the bus type is ABI, 'mdev' is really all about vfio I guess?
>
> Maybe mdev should grow by factoring the special GUID life cycle stuff
> into a helper library that can make it simpler to build proper API
> specific bus's using that lifecycle model? ie the virtio I saw
> proposed should probably be a mdev-virtio bus type providing this new
> virtio API contract using a 'struct mdev_virtio'?


Yes, and probably just decouple vfio a little bit more from mdev
and allow mdev to register multiple types of buses. vfio-mdev would still
go on the mdev bus, but virtio and the others would go on their own.

Thanks


>
> I only looked briefly but mdev seems like an unusual way to use the
> driver core. *generally* I would expect that if a driver wants to
> provide a foo_device (on a foo bus, providing the foo API contract) it
> looks very broadly like:
>
>    struct foo_device {
>         struct device dev;
>         const struct foo_ops *ops;
>    };
>    struct my_foo_device {
>        struct foo_device fdev;
>    };
>
>    foo_device_register(&mydev->fdev);
>
> Which means we can use normal container_of() patterns, while mdev
> seems to want to allocate all the structs internally.. I guess this is
> because of how the lifecycle stuff works? From a device core view it
> looks quite unnatural.
>
> Jason
Jakub Kicinski Nov. 11, 2019, 3:46 a.m. UTC | #50
On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
> > What I'm missing is why is it so bad to have a driver register to
> > multiple subsystems.  
> 
> Because these PCI devices seem to do "different" things all in one PCI
> resource set.  Blame the hardware designers :)

See below, I don't think you can blame the HW designers in this
particular case :)

> > For the nfp I think the _real_ reason to have a bus was that it
> > was expected to have some out-of-tree modules bind to it. Something 
> > I would not encourage :)  
> 
> That's not ok, and I agree with you.
> 
> But there seems to be some more complex PCI devices that do lots of
> different things all at once.  Kind of like a PCI device that wants to
> be both a keyboard and a storage device at the same time (i.e. a button
> on a disk drive...)

The keyboard which is also a storage device may be a clear cut case
where multiple devices were integrated into one bus endpoint.

The case with these advanced networking adapters is a little different
in that they are one HW device which has oodles of FW implementing
clients or acceleration for various networking protocols.

The nice thing about having a fake bus is you can load out-of-tree
drivers to operate extra protocols quite cleanly.

I'm not saying that's what the code in question is doing, I'm saying 
I'd personally like to understand the motivation more clearly before
every networking driver out there starts spawning buses. The only
argument I've heard so far for the separate devices is reloading a subset
of the drivers, which I'd rate as moderately convincing.
Jakub Kicinski Nov. 11, 2019, 3:57 a.m. UTC | #51
On Sun, 10 Nov 2019 15:37:59 -0400, Jason Gunthorpe wrote:
> On Sat, Nov 09, 2019 at 09:27:47AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Nov 2019 20:44:26 -0400, Jason Gunthorpe wrote:  
> > > On Fri, Nov 08, 2019 at 01:45:59PM -0800, Jakub Kicinski wrote:  
> > > > Yes, my suggestion to use mdev was entirely based on the premise that
> > > > the purpose of this work is to get vfio working.. otherwise I'm unclear
> > > > as to why we'd need a bus in the first place. If this is just for
> > > > containers - we have macvlan offload for years now, with no need for a
> > > > separate device.    
> > > 
> > > This SF thing is a full fledged VF function, it is not at all like
> > > macvlan. This is perhaps less important for the netdev part of the
> > > world, but the difference is very big for the RDMA side, and should
> > > enable VFIO too..  
> > 
> > Well, macvlan used VMDq so it was pretty much a "legacy SR-IOV" VF.
> > I'd perhaps need to learn more about RDMA to appreciate the difference.  
> 
> It has a lot to do with the how the RDMA functionality works in the
> HW.. At least for mlx the RDMA is 'below' all the netdev stuff, so
> even though netdev has some offloaded vlan RDMA sees, essentially, the
> union of all the vlan's on the system.
> 
> Which at least breaks the security model of a macvlan device for
> net-namespaces.
> 
> Maybe with new HW something could be done, but today, the HW is
> limited.

Oh, I think we sort of talked past each other there.

I was just pointing to the fact that Intel's macvlan offload did well
without any fake bus or devices. I'm not saying anything about the
particulars of the virtualization from the networking perspective.

> > > > On the RDMA/Intel front, would you mind explaining what the main
> > > > motivation for the special buses is? I'm a little confurious.    
> > > 
> > > Well, the issue is driver binding. For years we have had these
> > > multi-function netdev drivers that have a single PCI device which must
> > > bind into multiple subsystems, ie mlx5 does netdev and RDMA, the cxgb
> > > drivers do netdev, RDMA, SCSI initiator, SCSI target, etc. [And I
> > > expect when NVMe over TCP rolls out we will have drivers like cxgb4
> > > binding to 6 subsytems in total!]  
> > 
> > What I'm missing is why is it so bad to have a driver register to
> > multiple subsystems.  
> 
> Well, for example, if you proposed to have a RDMA driver in
> drivers/net/ethernet/foo/, I would NAK it, and I hope Dave would
> too. Same for SCSI and nvme.
> 
> This Linux process is that driver code for a subsystem lives in the
> subsystem and should be in a subsystem specific module. While it is
> technically possible to have a giant driver, it distorts our process
> in a way I don't think is good.
> 
> So, we have software layers between the large Linux subsystems just to
> make the development side manageable and practical.
> 
> .. once the code lives in another subsystem, it is in a new module. A
> new module requires some way to connect them all together, the driver
> core is the logical way to do this connection.
> 
> I don't think a driver should be split beyond that. Even my suggestion
> of a 'core' may in practice just be the netdev driver as most of the
> other modules can't function without netdev. ie you can't do iSCSI
> without an IP stack.

Okay, yes, that's what I was expecting you'd say. I'm not 100%
convinced a bus is necessary, we lived long enough with drivers 
split across the tree...

> > > What is a generation? Mellanox has had a stable RDMA driver across
> > > many sillicon generations. Intel looks like their new driver will
> > > support at least the last two or more sillicon generations..
> > > 
> > > RDMA drivers are monstrous complex things, there is a big incentive to
> > > not respin them every time a new chip comes out.  
> > 
> > Ack, but then again none of the drivers gets rewritten from scratch,
> > right? It's not that some "sub-drivers" get reused and some not, no?  
> 
> Remarkably Intel is saying their new RDMA 'sub-driver' will be compatible
> with their ICE and pre-ICE (sorry, forget the names) netdev core
> drivers. 
> 
> netdev will get a different driver for each, but RDMA will use the
> same driver.

I see :)
Parav Pandit Nov. 11, 2019, 5:18 a.m. UTC | #52
> -----Original Message-----
> From: kvm-owner@vger.kernel.org <kvm-owner@vger.kernel.org> On Behalf
> Of Jakub Kicinski
> Sent: Sunday, November 10, 2019 9:46 PM
> On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:

[..]
> The nice thing about having a fake bus is you can load out-of-tree drivers to
> operate extra protocols quite cleanly.
> 
This series does NOT intend to enable any out-of-tree driver.
Please do not think in that direction for this series.

> I'm not saying that's what the code in question is doing, I'm saying I'd
> personally like to understand the motivation more clearly before every
> networking driver out there starts spawning buses. The only argument I've
> heard so far for the separate devices is reloading subset of the drivers, which
> I'd rate as moderately convincing.

The primary objectives behind using a bus in this series are:

1. get the same level of device view for an SF as for a PF/VF, via a devlink instance
2. not re-invent the already mature PM (suspend/resume) handling in devlink and/or the vendor driver
3. the ability to bind a sub-function to different drivers depending on the use case, based on an in-kernel defined class-id
(mdev/virtio/kernel) - just like vfio-pci and the regular PF driver, by following the standard driver model
(of course, it can be done using 3 or more buses, as one virtual mdev bus appears to be an abuse)
4. a create->configure->bind process for a sub function (just like a VF)
5. persistent naming of the SF's netdev and rdma device (again, like PF and VF)

I will wait for Jason's and Jiri's views on the alternative proposal I sent a few hours back, to omit the bus for in-kernel use of SFs, and see how far we can run without a bus :-)
Jiri Pirko Nov. 11, 2019, 1:30 p.m. UTC | #53
Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote:
>On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
>> > What I'm missing is why is it so bad to have a driver register to
>> > multiple subsystems.  
>> 
>> Because these PCI devices seem to do "different" things all in one PCI
>> resource set.  Blame the hardware designers :)
>
>See below, I don't think you can blame the HW designers in this
>particular case :)
>
>> > For the nfp I think the _real_ reason to have a bus was that it
>> > was expected to have some out-of-tree modules bind to it. Something 
>> > I would not encourage :)  
>> 
>> That's not ok, and I agree with you.
>> 
>> But there seems to be some more complex PCI devices that do lots of
>> different things all at once.  Kind of like a PCI device that wants to
>> be both a keyboard and a storage device at the same time (i.e. a button
>> on a disk drive...)
>
>The keyboard which is also a storage device may be a clear cut case
>where multiple devices were integrated into one bus endpoint.

Also, I think that a very important differentiator between keyboard/button
and NIC is that the keyboard/button is fixed. You have a driver bus with 2
devices at constant addresses.

However, in the case of NIC subfunctions, you have 0 at the beginning and the user
instructs the kernel to create more (maybe hundreds). Now important questions
appear:

1) How to create devices (what API) - mdev has this figured out
2) How to do the addressing of the devices. Needs to be
   predictable/defined by the user - mdev has this figured out
3) Udev names of netdevices - udev names them according to the
   bus/address. That is straightforward with mdev.
   I can't really see how to figure this one out in particular with
   per-driver busses :/


>
>The case with these advanced networking adapters is a little different
>in that they are one HW device which has oodles of FW implementing
>clients or acceleration for various networking protocols.
>
>The nice thing about having a fake bus is you can load out-of-tree
>drivers to operate extra protocols quite cleanly.
>
>I'm not saying that's what the code in question is doing, I'm saying 
>I'd personally like to understand the motivation more clearly before
>every networking driver out there starts spawning buses. The only
>argument I've heard so far for the separate devices is reloading subset
>of the drivers, which I'd rate as moderately convincing.
Greg KH Nov. 11, 2019, 2:14 p.m. UTC | #54
On Mon, Nov 11, 2019 at 02:30:26PM +0100, Jiri Pirko wrote:
> Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote:
> >On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
> >> > What I'm missing is why is it so bad to have a driver register to
> >> > multiple subsystems.  
> >> 
> >> Because these PCI devices seem to do "different" things all in one PCI
> >> resource set.  Blame the hardware designers :)
> >
> >See below, I don't think you can blame the HW designers in this
> >particular case :)
> >
> >> > For the nfp I think the _real_ reason to have a bus was that it
> >> > was expected to have some out-of-tree modules bind to it. Something 
> >> > I would not encourage :)  
> >> 
> >> That's not ok, and I agree with you.
> >> 
> >> But there seems to be some more complex PCI devices that do lots of
> >> different things all at once.  Kind of like a PCI device that wants to
> >> be both a keyboard and a storage device at the same time (i.e. a button
> >> on a disk drive...)
> >
> >The keyboard which is also a storage device may be a clear cut case
> >where multiple devices were integrated into one bus endpoint.
> 
> Also, I think that very important differentiator between keyboard/button
> and NIC is that keyboard/button is fixed. You have driver bus with 2
> devices on constant addresses.
> 
> However in case of NIC subfunctions. You have 0 at he beginning and user
> instructs to create more (maybe hundreds). Now important questions
> appear:
> 
> 1) How to create devices (what API) - mdev has this figured out
> 2) How to to do the addressing of the devices. Needs to be
>    predictable/defined by the user - mdev has this figured out
> 3) Udev names of netdevices - udev names that according to the
>    bus/address. That is straightforeward with mdev.
>    I can't really see how to figure this one in particular with
>    per-driver busses :/

Are network devices somehow only allowed to be on mdev busses?

No, don't be silly, userspace handles this just fine today on any type
of bus, it's not an issue.

You don't have to like individual "driver busses", but you had better
not be using a fake platform device to use mdev.  That's my main
objection...

thanks,

greg k-h
Jiri Pirko Nov. 11, 2019, 2:17 p.m. UTC | #55
Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote:
>
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Friday, November 8, 2019 6:57 PM
>> > We should be creating 3 different buses, instead of mdev bus being de-
>> multiplexer of that?
>> >
>> > Hence, depending the device flavour specified, create such device on right
>> bus?
>> >
>> > For example,
>> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo
>> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev
>> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour
>> > mlx5 id 1 subdev_id 3
>> 
>> I like the idea of specifying what kind of interface you want at sub device
>> creation time. It fits the driver model pretty well and doesn't require abusing
>> the vfio mdev for binding to a netdev driver.
>> 
>> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $ echo
>> > <respective_device_id> <sysfs_path>/bind
>> 
>> Is explicit binding really needed?
>No.
>
>> If you specify a vfio flavour why shouldn't
>> the vfio driver autoload and bind to it right away? That is kind of the point
>> of the driver model...
>> 
>It some configuration is needed that cannot be passed at device creation time, explicit bind later can be used.
>
>> (kind of related, but I don't get while all that GUID and lifecycle stuff in mdev
>> should apply for something like a SF)
>> 
>GUID is just the name of the device.
>But lets park this aside for a moment.
>
>> > Implement power management callbacks also on all above 3 buses?
>> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
>> > that multiple vendors can reuse?
>> 
>> In this specific case, why does the SF in mlx5 mode even need a bus?
>> Is it only because of devlink? That would be unfortunate
>>
>Devlink is one part due to identifying using bus/dev.
>How do we refer to its devlink instance of SF without bus/device?

Question is, why to have devlink instance for SF itself. Same as VF, you
don't need devlink instance. You only need devlink_port (or
devlink_subdev) instance on the PF devlink parent for it.


>Can we extend devlink_register() to accept optionally have sf_id?
>
>If we don't have a bus, creating sub function (a device), without a 'struct device' which will have BAR, resources, etc is odd.
>
>Now if we cannot see 'struct device' in sysfs, how do we persistently name them?
>Are we ok to add /sys/class/net/sf_netdev/subdev_id
>And
>/sys/class/infiniband/<rdma_dev>/subdev_id
>
>So that systemd/udev can rename them as en<X?><subdev_id> and roce<X><subdev_id>
>If so, what will be X without a bus type?
>
>This route without a bus is certainly helpful to overcome the IOMMU limitation where IOMMU only listens to pci bus type for DMAR setup, 
>dmar_register_bus_notifier(), and in 
>intel_iommu_init()-> bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
>and other IOMMU doing similar PCI/AMBA binding.
>This is currently overcome using WA dma_ops.
Jiri Pirko Nov. 11, 2019, 2:37 p.m. UTC | #56
Mon, Nov 11, 2019 at 03:14:30PM CET, gregkh@linuxfoundation.org wrote:
>On Mon, Nov 11, 2019 at 02:30:26PM +0100, Jiri Pirko wrote:
>> Mon, Nov 11, 2019 at 04:46:01AM CET, jakub.kicinski@netronome.com wrote:
>> >On Sun, 10 Nov 2019 10:18:55 +0100, gregkh@linuxfoundation.org wrote:
>> >> > What I'm missing is why is it so bad to have a driver register to
>> >> > multiple subsystems.  
>> >> 
>> >> Because these PCI devices seem to do "different" things all in one PCI
>> >> resource set.  Blame the hardware designers :)
>> >
>> >See below, I don't think you can blame the HW designers in this
>> >particular case :)
>> >
>> >> > For the nfp I think the _real_ reason to have a bus was that it
>> >> > was expected to have some out-of-tree modules bind to it. Something 
>> >> > I would not encourage :)  
>> >> 
>> >> That's not ok, and I agree with you.
>> >> 
>> >> But there seems to be some more complex PCI devices that do lots of
>> >> different things all at once.  Kind of like a PCI device that wants to
>> >> be both a keyboard and a storage device at the same time (i.e. a button
>> >> on a disk drive...)
>> >
>> >The keyboard which is also a storage device may be a clear cut case
>> >where multiple devices were integrated into one bus endpoint.
>> 
>> Also, I think that very important differentiator between keyboard/button
>> and NIC is that keyboard/button is fixed. You have driver bus with 2
>> devices on constant addresses.
>> 
>> However in case of NIC subfunctions. You have 0 at he beginning and user
>> instructs to create more (maybe hundreds). Now important questions
>> appear:
>> 
>> 1) How to create devices (what API) - mdev has this figured out
>> 2) How to to do the addressing of the devices. Needs to be
>>    predictable/defined by the user - mdev has this figured out
>> 3) Udev names of netdevices - udev names that according to the
>>    bus/address. That is straightforeward with mdev.
>>    I can't really see how to figure this one in particular with
>>    per-driver busses :/
>
>Are network devices somehow only allowed to be on mdev busses?

Of course not. But there is a difference if we are talking about:
a) "the usual" network devices, like PF and VF - they are well defined and
   they have a well defined lifecycle (pci probe, sriov sysfs for the number
   of VFs, etc). In this world all works fine, even if a device has 100
   static subdevices (bus or no bus).
b) dynamically created sub-bar-devices or subfunctions, which could be created
   by the user. This is not handled in the kernel now; we have to find the
   correct interface. I don't really care if it is a fakebus, driverbus, etc.
   I'm just concerned about how to handle 1), 2), 3) above.

>
>No, don't be silly, userspace handles this just fine today on any type
>of bus, it's not an issue.
>
>You don't have to like individual "driver busses", but you had better
>not be using a fake platform device to use mdev.  That's my main
>objection...

Okay, I understand your objection. Do you have a suggestion for how to handle
1), 2), 3) from above?



>
>thanks,
>
>greg k-h
Parav Pandit Nov. 11, 2019, 2:58 p.m. UTC | #57
> -----Original Message-----
> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Monday, November 11, 2019 8:18 AM
> Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote:
> >
> >> From: Jason Gunthorpe <jgg@ziepe.ca>
> >> Sent: Friday, November 8, 2019 6:57 PM
> >> > We should be creating 3 different buses, instead of mdev bus being
> >> > de-
> >> multiplexer of that?
> >> >
> >> > Hence, depending the device flavour specified, create such device
> >> > on right
> >> bus?
> >> >
> >> > For example,
> >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo
> >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev
> >> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour
> >> > mlx5 id 1 subdev_id 3
> >>
> >> I like the idea of specifying what kind of interface you want at sub
> >> device creation time. It fits the driver model pretty well and
> >> doesn't require abusing the vfio mdev for binding to a netdev driver.
> >>
> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $
> >> > echo <respective_device_id> <sysfs_path>/bind
> >>
> >> Is explicit binding really needed?
> >No.
> >
> >> If you specify a vfio flavour why shouldn't the vfio driver autoload
> >> and bind to it right away? That is kind of the point of the driver
> >> model...
> >>
> >It some configuration is needed that cannot be passed at device creation
> time, explicit bind later can be used.
> >
> >> (kind of related, but I don't get while all that GUID and lifecycle
> >> stuff in mdev should apply for something like a SF)
> >>
> >GUID is just the name of the device.
> >But lets park this aside for a moment.
> >
> >> > Implement power management callbacks also on all above 3 buses?
> >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
> >> > that multiple vendors can reuse?
> >>
> >> In this specific case, why does the SF in mlx5 mode even need a bus?
> >> Is it only because of devlink? That would be unfortunate
> >>
> >Devlink is one part due to identifying using bus/dev.
> >How do we refer to its devlink instance of SF without bus/device?
> 
> Question is, why to have devlink instance for SF itself. Same as VF, you don't
mlx5_core has had a devlink instance for the PF and VF for a long time now.
Health reports, txq/rxq dumps etc. are all anchored to this devlink instance even for a VF (similar to the PF).
And so, the same framework should work for an SF.

> need devlink instance. You only need devlink_port (or
> devlink_subdev) instance on the PF devlink parent for it.
> 
devlink_port or devlink_subdev are still on the eswitch/mgmt side.
They are not present on the side where the devlink instance exists, i.e. the side where txq/rxq/eq etc. exist.
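
To illustrate the split I mean: the SF side gets its own devlink
instance, so health reporters, txq/rxq dumps etc. can be anchored to it
just like for PF/VF today, while the eswitch side only registers a
devlink_port for the SF's representor on the PF's devlink instance. A
rough sketch only, not code from this series; the mlx5_sf_* names are
made up and the signatures follow the 2019-era devlink API:

#include <linux/err.h>
#include <net/devlink.h>

static const struct devlink_ops mlx5_sf_devlink_ops = {
	/* SF-level devlink callbacks would go here */
};

static const struct devlink_health_reporter_ops mlx5_sf_fw_reporter_ops = {
	.name = "fw",
	/* .recover/.dump/.diagnose as needed */
};

/* SF side: own devlink instance, where txq/rxq/eq live */
static int mlx5_sf_devlink_init(struct device *sf_dev)
{
	struct devlink_health_reporter *fw_reporter;
	struct devlink *devlink;
	int err;

	devlink = devlink_alloc(&mlx5_sf_devlink_ops, 0);
	if (!devlink)
		return -ENOMEM;

	err = devlink_register(devlink, sf_dev);
	if (err)
		goto err_free;

	/* health reporters hang off the SF's own devlink instance,
	 * exactly as they do today for PF and VF
	 */
	fw_reporter = devlink_health_reporter_create(devlink,
						     &mlx5_sf_fw_reporter_ops,
						     0, false, NULL);
	if (IS_ERR(fw_reporter)) {
		err = PTR_ERR(fw_reporter);
		goto err_unreg;
	}
	return 0;

err_unreg:
	devlink_unregister(devlink);
err_free:
	devlink_free(devlink);
	return err;
}

/* Eswitch/mgmt side: only a devlink_port for the SF's representor is
 * registered on the PF's devlink instance (devlink_port_register());
 * no per-SF devlink instance is needed there.
 */
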
Jiri Pirko Nov. 11, 2019, 3:06 p.m. UTC | #58
Mon, Nov 11, 2019 at 03:58:18PM CET, parav@mellanox.com wrote:
>
>
>> -----Original Message-----
>> From: Jiri Pirko <jiri@resnulli.us>
>> Sent: Monday, November 11, 2019 8:18 AM
>> Sun, Nov 10, 2019 at 08:48:31PM CET, parav@mellanox.com wrote:
>> >
>> >> From: Jason Gunthorpe <jgg@ziepe.ca>
>> >> Sent: Friday, November 8, 2019 6:57 PM
>> >> > We should be creating 3 different buses, instead of mdev bus being
>> >> > de-
>> >> multiplexer of that?
>> >> >
>> >> > Hence, depending the device flavour specified, create such device
>> >> > on right
>> >> bus?
>> >> >
>> >> > For example,
>> >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo
>> >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour mdev
>> >> > <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0 flavour
>> >> > mlx5 id 1 subdev_id 3
>> >>
>> >> I like the idea of specifying what kind of interface you want at sub
>> >> device creation time. It fits the driver model pretty well and
>> >> doesn't require abusing the vfio mdev for binding to a netdev driver.
>> >>
>> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $
>> >> > echo <respective_device_id> <sysfs_path>/bind
>> >>
>> >> Is explicit binding really needed?
>> >No.
>> >
>> >> If you specify a vfio flavour why shouldn't the vfio driver autoload
>> >> and bind to it right away? That is kind of the point of the driver
>> >> model...
>> >>
>> >It some configuration is needed that cannot be passed at device creation
>> time, explicit bind later can be used.
>> >
>> >> (kind of related, but I don't get while all that GUID and lifecycle
>> >> stuff in mdev should apply for something like a SF)
>> >>
>> >GUID is just the name of the device.
>> >But lets park this aside for a moment.
>> >
>> >> > Implement power management callbacks also on all above 3 buses?
>> >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?) so
>> >> > that multiple vendors can reuse?
>> >>
>> >> In this specific case, why does the SF in mlx5 mode even need a bus?
>> >> Is it only because of devlink? That would be unfortunate
>> >>
>> >Devlink is one part due to identifying using bus/dev.
>> >How do we refer to its devlink instance of SF without bus/device?
>> 
>> Question is, why to have devlink instance for SF itself. Same as VF, you don't
>mlx5_core has devlink instance for PF and VF for long time now.
>Health report, txq/rxq dumps etc all anchored to this devlink instance even for VF. (similar to PF).
>And so, SF same framework should work for SF.

Right, for health it makes sense.

>
>> need devlink instance. You only need devlink_port (or
>> devlink_subdev) instance on the PF devlink parent for it.
>> 
>Devlink_port or devlink_subdev are still on eswitch or mgmt side.
>They are not present on the side where devlink instance exist on side where txq/rxq/eq etc exist.
>

Got it.
Parav Pandit Nov. 19, 2019, 4:51 a.m. UTC | #59
Hi All,

> From: Jiri Pirko <jiri@resnulli.us>
> Sent: Monday, November 11, 2019 9:06 AM
> 
> Mon, Nov 11, 2019 at 03:58:18PM CET, parav@mellanox.com wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jiri Pirko <jiri@resnulli.us>
> >> Sent: Monday, November 11, 2019 8:18 AM Sun, Nov 10, 2019 at
> >> 08:48:31PM CET, parav@mellanox.com wrote:
> >> >
> >> >> From: Jason Gunthorpe <jgg@ziepe.ca>
> >> >> Sent: Friday, November 8, 2019 6:57 PM
> >> >> > We should be creating 3 different buses, instead of mdev bus
> >> >> > being
> >> >> > de-
> >> >> multiplexer of that?
> >> >> >
> >> >> > Hence, depending the device flavour specified, create such
> >> >> > device on right
> >> >> bus?
> >> >> >
> >> >> > For example,
> >> >> > $ devlink create subdev pci/0000:05:00.0 flavour virtio name foo
> >> >> > subdev_id 1 $ devlink create subdev pci/0000:05:00.0 flavour
> >> >> > mdev <uuid> subdev_id 2 $ devlink create subdev pci/0000:05:00.0
> >> >> > flavour
> >> >> > mlx5 id 1 subdev_id 3
> >> >>
> >> >> I like the idea of specifying what kind of interface you want at
> >> >> sub device creation time. It fits the driver model pretty well and
> >> >> doesn't require abusing the vfio mdev for binding to a netdev driver.
> >> >>
> >> >> > $ devlink subdev pci/0000:05:00.0/<subdev_id> config <params> $
> >> >> > echo <respective_device_id> <sysfs_path>/bind
> >> >>
> >> >> Is explicit binding really needed?
> >> >No.
> >> >
> >> >> If you specify a vfio flavour why shouldn't the vfio driver
> >> >> autoload and bind to it right away? That is kind of the point of
> >> >> the driver model...
> >> >>
> >> >It some configuration is needed that cannot be passed at device
> >> >creation
> >> time, explicit bind later can be used.
> >> >
> >> >> (kind of related, but I don't get while all that GUID and
> >> >> lifecycle stuff in mdev should apply for something like a SF)
> >> >>
> >> >GUID is just the name of the device.
> >> >But lets park this aside for a moment.
> >> >
> >> >> > Implement power management callbacks also on all above 3 buses?
> >> >> > Abstract out mlx5_bus into more generic virtual bus (vdev bus?)
> >> >> > so that multiple vendors can reuse?
> >> >>
> >> >> In this specific case, why does the SF in mlx5 mode even need a bus?
> >> >> Is it only because of devlink? That would be unfortunate
> >> >>
> >> >Devlink is one part due to identifying using bus/dev.
> >> >How do we refer to its devlink instance of SF without bus/device?
> >>
> >> Question is, why to have devlink instance for SF itself. Same as VF,
> >> you don't
> >mlx5_core has devlink instance for PF and VF for long time now.
> >Health report, txq/rxq dumps etc all anchored to this devlink instance even for
> VF. (similar to PF).
> >And so, SF same framework should work for SF.
> 
> Right, for health it makes sense.
> 
> >
> >> need devlink instance. You only need devlink_port (or
> >> devlink_subdev) instance on the PF devlink parent for it.
> >>
> >Devlink_port or devlink_subdev are still on eswitch or mgmt side.
> >They are not present on the side where devlink instance exist on side where
> txq/rxq/eq etc exist.
> >
> 
> Got it.

I am working on a revised v1 of this series to address the concerns below and to achieve all the requirements captured in this cover letter.

1. Avoid mdev bus abuse (Jason's and Greg's input)
2. Avoid dma_ops overload (Christoph's comment)
3. Update cover letter for devlink examples (Jakub's comment)
4. Update cover letter to describe rdma persistent naming scheme (Leon's comment)
5. Address Jiri's comments on code restructuring