
[net-next,03/13] devlink: Support add and delete devlink port

Message ID 20201112192424.2742-4-parav@nvidia.com (mailing list archive)
State Not Applicable
Series Add mlx5 subfunction support

Commit Message

Parav Pandit Nov. 12, 2020, 7:24 p.m. UTC
Extend the devlink interface so that the user can add and delete ports.
Extend devlink to connect such user requests to the driver, which
adds/deletes the port in the device.

When the driver routines are invoked, the devlink instance lock is not
held. This enables the driver to register and unregister several
devlink objects (port, health reporter, resource, etc.) using existing
devlink APIs.
It also allows the same port unregistration code to be used both during
driver unload and for port deletion initiated by the user.
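
For illustration, a minimal driver-side sketch of these two callbacks (not
part of this patch; "foo" is a hypothetical driver, and struct foo_dev and
the foo_sf_port_* helpers are made up):

#include <net/devlink.h>

static int foo_devlink_port_new(struct devlink *devlink,
				const struct devlink_port_new_attrs *attrs,
				struct netlink_ext_ack *extack)
{
	struct foo_dev *fdev = devlink_priv(devlink);
	struct devlink_port *dl_port;
	int err;

	if (attrs->flavour != DEVLINK_PORT_FLAVOUR_PCI_SF) {
		NL_SET_ERR_MSG_MOD(extack, "Only PCI SF ports are supported");
		return -EOPNOTSUPP;
	}

	/* No devlink instance lock is held here, so the existing
	 * registration APIs can be called directly.
	 */
	dl_port = foo_sf_port_alloc(fdev, attrs);	/* hypothetical helper */
	if (IS_ERR(dl_port))
		return PTR_ERR(dl_port);

	err = devlink_port_register(devlink, dl_port, dl_port->index);
	if (err)
		foo_sf_port_free(fdev, dl_port);	/* hypothetical helper */
	return err;
}

static int foo_devlink_port_del(struct devlink *devlink, unsigned int port_index,
				struct netlink_ext_ack *extack)
{
	struct foo_dev *fdev = devlink_priv(devlink);
	struct devlink_port *dl_port;

	dl_port = foo_sf_port_lookup(fdev, port_index);	/* hypothetical helper */
	if (!dl_port) {
		NL_SET_ERR_MSG_MOD(extack, "Port index not found");
		return -ENODEV;
	}
	devlink_port_unregister(dl_port);
	foo_sf_port_free(fdev, dl_port);
	return 0;
}

static const struct devlink_ops foo_devlink_ops = {
	.port_new = foo_devlink_port_new,
	.port_del = foo_devlink_port_del,
};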

Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ udevadm test-builtin net_id /sys/class/net/eth0
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.

$ devlink port del pci/0000:06:00.0/32768

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
---
 include/net/devlink.h | 38 ++++++++++++++++++++++++
 net/core/devlink.c    | 67 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+)

Comments

David Ahern Nov. 18, 2020, 4:21 p.m. UTC | #1
On 11/12/20 12:24 PM, Parav Pandit wrote:
> Extended devlink interface for the user to add and delete port.
> Extend devlink to connect user requests to driver to add/delete
> such port in the device.
> 
> When driver routines are invoked, devlink instance lock is not held.
> This enables driver to perform several devlink objects registration,
> unregistration such as (port, health reporter, resource etc)
> by using exising devlink APIs.
> This also helps to uniformly use the code for port unregistration
> during driver unload and during port deletion initiated by user.
> 
> Examples of add, show and delete commands:
> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> 
> $ devlink port show
> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false
> 
> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> 
> $ devlink port show pci/0000:06:00.0/32768
> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
>   function:
>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> 

There have to be limits on the number of subfunctions that can be
created for a device. How does a user find that limit?

Also, it seems like there are hardware constraints at play. E.g., can a user
reduce the number of queues used by the physical function to support
more subfunctions? If so, how does a user programmatically learn about
this limitation? E.g., devlink could have support to show resource
sizing and configure constraints similar to what mlxsw has.
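
(For reference, the existing mlxsw devlink resource interface mentioned above
looks roughly like this; output abbreviated, sizes illustrative:)

$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0:
  name kvd size 245760 unit entry
    resources:
      name linear size 98304 occ 0 unit entry size_min 0 size_max 147456 size_gran 128
      name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128
      name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128

$ devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 73088
$ devlink dev reload pci/0000:03:00.0
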
Parav Pandit Nov. 18, 2020, 5:02 p.m. UTC | #2
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 18, 2020 9:51 PM
> 
> On 11/12/20 12:24 PM, Parav Pandit wrote:
> > Extended devlink interface for the user to add and delete port.
> > Extend devlink to connect user requests to driver to add/delete such
> > port in the device.
> >
> > When driver routines are invoked, devlink instance lock is not held.
> > This enables driver to perform several devlink objects registration,
> > unregistration such as (port, health reporter, resource etc) by using
> > exising devlink APIs.
> > This also helps to uniformly use the code for port unregistration
> > during driver unload and during port deletion initiated by user.
> >
> > Examples of add, show and delete commands:
> > $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
> >
> > $ devlink port show
> > pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
> > port 0 splittable false
> >
> > $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
> >
> > $ devlink port show pci/0000:06:00.0/32768
> > pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
> pfnum 0 sfnum 88 external false splittable false
> >   function:
> >     hw_addr 00:00:00:00:88:88 state inactive opstate detached
> >
> 
> There has to be limits on the number of sub functions that can be created for
> a device. How does a user find that limit?
Yes, this came up internally, but we didn't really converge.
The devlink resource interface looked too verbose for average or simple use cases,
but it may be fine.
The hurdle I faced with devlink resource is defining the granularity.

For example, one devlink instance deploys subfunctions on multiple PCI functions.
So how do we name them? Currently we have the controller and PF in the port annotation,
which would give resource names such as:
c0pf0_subfunctions -> for controller 0, pf 0
c1pf2_subfunctions -> for controller 1, pf 2

I couldn't convince myself to name it this way.

The example below looked simpler to use, but the plumbing doesn't exist for it yet.

$ devlink resource show pci/0000:03:00.0
pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1

$ devlink resource set pci/0000:03:00.0/1 max_sfs 100

The second option I was considering was to use port params, which doesn't sound quite right for a resource.
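
(For context, the name-encoded option above could already be expressed with the
existing devlink resource API; a minimal sketch with a hypothetical driver and
made-up resource IDs/limits:)

enum {
	FOO_RES_ID_C0PF0_SFS = 1,	/* made-up resource IDs */
	FOO_RES_ID_C1PF2_SFS,
};

static int foo_register_sf_resources(struct devlink *devlink)
{
	struct devlink_resource_size_params params;
	int err;

	/* controller 0 / PF 0: allow 0..256 subfunctions, adjustable in steps of 1 */
	devlink_resource_size_params_init(&params, 0, 256, 1,
					  DEVLINK_RESOURCE_UNIT_ENTRY);
	err = devlink_resource_register(devlink, "c0pf0_subfunctions", 256,
					FOO_RES_ID_C0PF0_SFS,
					DEVLINK_RESOURCE_ID_PARENT_TOP, &params);
	if (err)
		return err;

	/* controller 1 / PF 2: allow 0..100 subfunctions, adjustable in steps of 1 */
	devlink_resource_size_params_init(&params, 0, 100, 1,
					  DEVLINK_RESOURCE_UNIT_ENTRY);
	return devlink_resource_register(devlink, "c1pf2_subfunctions", 100,
					 FOO_RES_ID_C1PF2_SFS,
					 DEVLINK_RESOURCE_ID_PARENT_TOP, &params);
}

(The simpler per-PF annotated output shown above would need the new plumbing
discussed here.)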

> 
> Also, seems like there are hardware constraint at play. e.g., can a user reduce
> the number of queues used by the physical function to support more sub-
> functions? If so how does a user programmatically learn about this limitation?
> e.g., devlink could have support to show resource sizing and configure
> constraints similar to what mlxsw has.
Yes, we need to figure out the naming. For mlx5, the number of queues has no relation to subfunctions.
But PCI resources do, and that is something we want to do in the future, possibly using devlink resource as you said.
David Ahern Nov. 18, 2020, 6:03 p.m. UTC | #3
On 11/18/20 10:02 AM, Parav Pandit wrote:
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 9:51 PM
>>
>> On 11/12/20 12:24 PM, Parav Pandit wrote:
>>> Extended devlink interface for the user to add and delete port.
>>> Extend devlink to connect user requests to driver to add/delete such
>>> port in the device.
>>>
>>> When driver routines are invoked, devlink instance lock is not held.
>>> This enables driver to perform several devlink objects registration,
>>> unregistration such as (port, health reporter, resource etc) by using
>>> exising devlink APIs.
>>> This also helps to uniformly use the code for port unregistration
>>> during driver unload and during port deletion initiated by user.
>>>
>>> Examples of add, show and delete commands:
>>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>>
>>> $ devlink port show
>>> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
>>> port 0 splittable false
>>>
>>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
>>>
>>> $ devlink port show pci/0000:06:00.0/32768
>>> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
>> pfnum 0 sfnum 88 external false splittable false
>>>   function:
>>>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
>>>
>>
>> There has to be limits on the number of sub functions that can be created for
>> a device. How does a user find that limit?
> Yes, this came up internally, but didn't really converged.
> The devlink resource looked too verbose for an average or simple use cases.
> But it may be fine.
> The hurdle I faced with devlink resource is with defining the granularity.
> 
> For example one devlink instance deploys sub functions on multiple pci functions.
> So how to name them? Currently we have controller and PFs in port annotation.
> So resource name as 
> c0pf0_subfunctions -> for controller 0, pf 0 
> c1pf2_subfunctions -> for controller 1, pf 2
> 
> Couldn't convince my self to name it this way.
> 
> Below example looked simpler to use but plumbing doesn’t exist for it.
> 
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
> pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
> pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1
> 
> $ devlink resource set pci/0000:03:00.0/1 max_sfs 100
> 
> Second option I was considering was use port params which doesn't sound so right as resource.
> 
>>
>> Also, seems like there are hardware constraint at play. e.g., can a user reduce
>> the number of queues used by the physical function to support more sub-
>> functions? If so how does a user programmatically learn about this limitation?
>> e.g., devlink could have support to show resource sizing and configure
>> constraints similar to what mlxsw has.
> Yes, need to figure out its naming. For mlx5 num queues doesn't have relation to subfunctions.
> But PCI resource has relation and this is something we want to do in future, as you said may be using devlink resource.
> 

With ConnectX-4 Lx, for example, the netdev can have at most 63 queues,
leaving 96-CPU servers a bit short - an example of the limited number
of queues that a NIC can handle (or currently exposes to the OS, not sure
which). If I create a subfunction for Ethernet traffic, how many queues
are allocated to it by default, is it managed via ethtool like the PF,
and is there an impact on the resources used by / available to the
primary device?
Jason Gunthorpe Nov. 18, 2020, 6:38 p.m. UTC | #4
On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:

> With Connectx-4 Lx for example the netdev can have at most 63 queues

What netdev calls a queue is really a "can the device deliver
interrupts and packets to a given per-CPU queue" and covers a whole
spectrum of smaller limits like RSS scheme, # of available interrupts,
ability of the device to create queues, etc.

CX4Lx can create a huge number of queues, but hits one of these limits
that mean netdev's specific usage can't scale up. Other stuff like
RDMA doesn't have the same limits, and has tonnes of queues.

What seems to be needed is a resource controller concept like cgroup
has for processes. The system is really organized into a tree:

           physical device
              mlx5_core
        /      |      \      \                        (aux bus)
     netdev   rdma    vdpa   SF  etc
                             |                        (aux bus)
                           mlx5_core
                          /      \                    (aux bus)
                       netdev   vdpa

And it does make a lot of sense to start to talk about limits at each
tree level.

eg the top of the tree may have 128 physical interrupts. With 128 CPU
cores that isn't enough interrupts to support all of those things
concurrently.

So the user may want to configure:
 - The first level netdev only gets 64,
 - 3rd level mlx5_core gets 32 
 - Final level vdpa gets 8

Other stuff has to fight it out with the remaining shared interrupts.

In netdev land # of interrupts governs # of queues

For RDMA # of interrupts limits the CPU affinities for queues

VDPA limits the # of VMs that can use VT-d

The same story repeats for other, less general resources: mlx5 also
has consumption of limited BAR space, and consumption of some limited
memory elements. These numbers are much bigger and may not need
explicit governing, but the general concept holds.

It would be very nice if the limit could be injected when the aux
device is created but before the driver is bound. I'm not sure how to
manage that, though.

I assume other devices will be different, maybe some devices have a
limit on the number of total queues, or a limit on the number of
VDPA or RDMA devices.

Jason
Parav Pandit Nov. 18, 2020, 7:22 p.m. UTC | #5
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, November 18, 2020 11:33 PM
> 
> 
> With Connectx-4 Lx for example the netdev can have at most 63 queues
> leaving 96 cpu servers a bit short - as an example of the limited number of
> queues that a nic can handle (or currently exposes to the OS not sure which).
> If I create a subfunction for ethernet traffic, how many queues are allocated
> to it by default, is it managed via ethtool like the pf and is there an impact to
> the resources used by / available to the primary device?

Jason already answered it in detail.
Thanks a lot, Jason.

Short answer to the ethtool question: yes, ethtool can change the number of queues for a subfunction just like for the PF.
In this patchset the default is the same number of queues for a subfunction as for the PF.
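
(For example, on the subfunction netdev from the commit message, the usual
ethtool channel commands apply; the channel count here is illustrative:)

$ ethtool -l eth0               # show current and maximum channel counts
$ ethtool -L eth0 combined 4    # shrink the subfunction to 4 combined channels
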
David Ahern Nov. 18, 2020, 7:36 p.m. UTC | #6
On 11/18/20 11:38 AM, Jason Gunthorpe wrote:
> On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> 
>> With Connectx-4 Lx for example the netdev can have at most 63 queues
> 
> What netdev calls a queue is really a "can the device deliver
> interrupts and packets to a given per-CPU queue" and covers a whole
> spectrum of smaller limits like RSS scheme, # of available interrupts,
> ability of the device to create queues, etc.
> 
> CX4Lx can create a huge number of queues, but hits one of these limits
> that mean netdev's specific usage can't scale up. Other stuff like
> RDMA doesn't have the same limits, and has tonnes of queues.
> 
> What seems to be needed is a resource controller concept like cgroup
> has for processes. The system is really organized into a tree:
> 
>            physical device
>               mlx5_core
>         /      |      \      \                        (aux bus)
>      netdev   rdma    vdpa   SF  etc
>                              |                        (aux bus)
>                            mlx5_core
>                           /      \                    (aux bus)
>                        netdev   vdpa
> 
> And it does make a lot of sense to start to talk about limits at each
> tree level.
> 
> eg the top of the tree may have 128 physical interrupts. With 128 CPU
> cores that isn't enough interrupts to support all of those things
> concurrently.
> 
> So the user may want to configure:
>  - The first level netdev only gets 64,
>  - 3rd level mlx5_core gets 32 
>  - Final level vdpa gets 8
> 
> Other stuff has to fight it out with the remaining shared interrupts.
> 
> In netdev land # of interrupts governs # of queues
> 
> For RDMA # of interrupts limits the CPU affinities for queues
> 
> VPDA limits the # of VMs that can use VT-d
> 
> The same story repeats for other less general resources, mlx5 also
> has consumption of limited BAR space, and consumption of some limited
> memory elements. These numbers are much bigger and may not need
> explicit governing, but the general concept holds.
> 
> It would be very nice if the limit could be injected when the aux
> device is created but before the driver is bound. I'm not sure how to
> manage that though..
> 
> I assume other devices will be different, maybe some devices have a
> limit on the number of total queues, or a limit on the number of
> VDPA or RDMA devices.
> 
> Jason
> 

A lot of low level resource details that need to be summarized into a
nicer user / config perspective to specify limits / allocations.

Thanks for the detailed response.
Jason Gunthorpe Nov. 18, 2020, 8:42 p.m. UTC | #7
On Wed, Nov 18, 2020 at 12:36:26PM -0700, David Ahern wrote:
> On 11/18/20 11:38 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 18, 2020 at 11:03:24AM -0700, David Ahern wrote:
> > 
> >> With Connectx-4 Lx for example the netdev can have at most 63 queues
> > 
> > What netdev calls a queue is really a "can the device deliver
> > interrupts and packets to a given per-CPU queue" and covers a whole
> > spectrum of smaller limits like RSS scheme, # of available interrupts,
> > ability of the device to create queues, etc.
> > 
> > CX4Lx can create a huge number of queues, but hits one of these limits
> > that mean netdev's specific usage can't scale up. Other stuff like
> > RDMA doesn't have the same limits, and has tonnes of queues.
> > 
> > What seems to be needed is a resource controller concept like cgroup
> > has for processes. The system is really organized into a tree:
> > 
> >            physical device
> >               mlx5_core
> >         /      |      \      \                        (aux bus)
> >      netdev   rdma    vdpa   SF  etc
> >                              |                        (aux bus)
> >                            mlx5_core
> >                           /      \                    (aux bus)
> >                        netdev   vdpa
> > 
> > And it does make a lot of sense to start to talk about limits at each
> > tree level.
> > 
> > eg the top of the tree may have 128 physical interrupts. With 128 CPU
> > cores that isn't enough interrupts to support all of those things
> > concurrently.
> > 
> > So the user may want to configure:
> >  - The first level netdev only gets 64,
> >  - 3rd level mlx5_core gets 32 
> >  - Final level vdpa gets 8
> > 
> > Other stuff has to fight it out with the remaining shared interrupts.
> > 
> > In netdev land # of interrupts governs # of queues
> > 
> > For RDMA # of interrupts limits the CPU affinities for queues
> > 
> > VPDA limits the # of VMs that can use VT-d
> > 
> > The same story repeats for other less general resources, mlx5 also
> > has consumption of limited BAR space, and consumption of some limited
> > memory elements. These numbers are much bigger and may not need
> > explicit governing, but the general concept holds.
> > 
> > It would be very nice if the limit could be injected when the aux
> > device is created but before the driver is bound. I'm not sure how to
> > manage that though..
> > 
> > I assume other devices will be different, maybe some devices have a
> > limit on the number of total queues, or a limit on the number of
> > VDPA or RDMA devices.
> 
> A lot of low level resource details that need to be summarized into a
> nicer user / config perspective to specify limits / allocations.

Well, now that we have the aux bus stuff, there is a nice natural place
to put things.

The aux bus owner device (mlx5_core) could have a list of available
resources

Each aux bus device (netdev/rdma/vdpa) could have a list of consumed
resources

Some API to place a limit on the consumed resources at each aux bus
device.
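
(Purely as a strawman - none of this exists today - the accounting described
above might look something like:)

/* hypothetical structures, for discussion only */
struct foo_aux_res_pool {		/* owned by the aux bus owner (mlx5_core) */
	u32 irqs_total;			/* e.g. 128 physical interrupts */
	u32 irqs_free;
};

struct foo_aux_res_account {		/* one per aux bus device (netdev/rdma/vdpa/SF) */
	u32 irqs_consumed;
	u32 irqs_limit;			/* user-imposed cap */
};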

The tricky bit is the auto-probing/configuration. By the time the user has
a chance to apply a limit, the drivers are already bound and have
already done their setup. So each subsystem has to support dynamically
imposing a limit.

And I simplified things a bit above, too: we actually have two kinds of
interrupt demand, shareable and dedicated. The actual need is to carve
out a bunch of dedicated interrupts and only allow subsystems that are
doing VT-d guest interrupt assignment to consume them (e.g. VDPA).

Jason
Jacob Keller Nov. 19, 2020, 12:41 a.m. UTC | #8
On 11/18/2020 11:22 AM, Parav Pandit wrote:
> 
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 11:33 PM
>>
>>
>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>> leaving 96 cpu servers a bit short - as an example of the limited number of
>> queues that a nic can handle (or currently exposes to the OS not sure which).
>> If I create a subfunction for ethernet traffic, how many queues are allocated
>> to it by default, is it managed via ethtool like the pf and is there an impact to
>> the resources used by / available to the primary device?
> 
> Jason already answered it with details.
> Thanks a lot Jason.
> 
> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
> Default is same number of queues for subfunction as that of PF in this patchset.
> 

But what is the mechanism for partitioning the global resources of the
device into each subfunction?

Are they just evenly divided among the subfunctions? Is there some
maximum limit per subfunction?
Jacob Keller Nov. 19, 2020, 12:52 a.m. UTC | #9
On 11/18/2020 9:02 AM, Parav Pandit wrote:
> 
>> From: David Ahern <dsahern@gmail.com>
>> Sent: Wednesday, November 18, 2020 9:51 PM
>>
>> On 11/12/20 12:24 PM, Parav Pandit wrote:
>>> Extended devlink interface for the user to add and delete port.
>>> Extend devlink to connect user requests to driver to add/delete such
>>> port in the device.
>>>
>>> When driver routines are invoked, devlink instance lock is not held.
>>> This enables driver to perform several devlink objects registration,
>>> unregistration such as (port, health reporter, resource etc) by using
>>> exising devlink APIs.
>>> This also helps to uniformly use the code for port unregistration
>>> during driver unload and during port deletion initiated by user.
>>>
>>> Examples of add, show and delete commands:
>>> $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
>>>
>>> $ devlink port show
>>> pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical
>>> port 0 splittable false
>>>
>>> $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
>>>
>>> $ devlink port show pci/0000:06:00.0/32768
>>> pci/0000:06:00.0/32768: type eth netdev eth0 flavour pcisf controller 0
>> pfnum 0 sfnum 88 external false splittable false
>>>   function:
>>>     hw_addr 00:00:00:00:88:88 state inactive opstate detached
>>>
>>
>> There has to be limits on the number of sub functions that can be created for
>> a device. How does a user find that limit?
> Yes, this came up internally, but didn't really converged.
> The devlink resource looked too verbose for an average or simple use cases.
> But it may be fine.
> The hurdle I faced with devlink resource is with defining the granularity.
> 
> For example one devlink instance deploys sub functions on multiple pci functions.
> So how to name them? Currently we have controller and PFs in port annotation.
> So resource name as 
> c0pf0_subfunctions -> for controller 0, pf 0 
> c1pf2_subfunctions -> for controller 1, pf 2
> 
> Couldn't convince my self to name it this way.

Yea, I think we need to extend the plumbing of resources to allow
specifying or assigning parent resources to a subfunction.

> 
> Below example looked simpler to use but plumbing doesn’t exist for it.
> 
> $ devlink resource show pci/0000:03:00.0
> pci/0000:03:00.0/1: name max_sfs count 256 controller 0 pf 0
> pci/0000:03:00.0/2: name max_sfs count 100 controller 1 pf 0
> pci/0000:03:00.0/3: name max_sfs count 64 controller 1 pf 1
> 
> $ devlink resource set pci/0000:03:00.0/1 max_sfs 100
> 
> Second option I was considering was use port params which doesn't sound so right as resource.
> 

I don't think port parameters make sense here. They only encapsulate
single name -> value pairs, and don't really help show the relationships
between the subfunction ports and the parent device.

>>
>> Also, seems like there are hardware constraint at play. e.g., can a user reduce
>> the number of queues used by the physical function to support more sub-
>> functions? If so how does a user programmatically learn about this limitation?
>> e.g., devlink could have support to show resource sizing and configure
>> constraints similar to what mlxsw has.
> Yes, need to figure out its naming. For mlx5 num queues doesn't have relation to subfunctions.
> But PCI resource has relation and this is something we want to do in future, as you said may be using devlink resource.
> 

I've been looking into queue management and being able to add and remove
queue groups and queues. I'm leaning towards building on top of devlink
resource for this.

Specifically, I have been looking at picking up the work started by
Magnus last year around creating an interface for better representing
queues to the stack for AF_XDP, but it also has other possible uses.

I'd like to make sure it aligns with the ideas here for partitioning
resources. It seems like that should be best done at the devlink level,
where the main devlink instance knows about all the part limitations and
can then have new commands for allowing assignment of resources to ports.
David Ahern Nov. 19, 2020, 1:17 a.m. UTC | #10
On 11/18/20 5:41 PM, Jacob Keller wrote:
> 
> 
> On 11/18/2020 11:22 AM, Parav Pandit wrote:
>>
>>
>>> From: David Ahern <dsahern@gmail.com>
>>> Sent: Wednesday, November 18, 2020 11:33 PM
>>>
>>>
>>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>>> leaving 96 cpu servers a bit short - as an example of the limited number of
>>> queues that a nic can handle (or currently exposes to the OS not sure which).
>>> If I create a subfunction for ethernet traffic, how many queues are allocated
>>> to it by default, is it managed via ethtool like the pf and is there an impact to
>>> the resources used by / available to the primary device?
>>
>> Jason already answered it with details.
>> Thanks a lot Jason.
>>
>> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
>> Default is same number of queues for subfunction as that of PF in this patchset.
>>
> 
> But what is the mechanism for partitioning the global resources of the
> device into each sub function?
> 
> Is it just evenly divided into the subfunctions? is there some maximum
> limit per sub function?
> 

I hope it is not just evenly divided; it should be user-controllable. If
I create a subfunction for, say, a container's networking, I may want to
assign only 1 Rx and 1 Tx queue pair (or 1 channel, depending on
terminology, where a channel includes Rx, Tx and CQ).
Samudrala, Sridhar Nov. 19, 2020, 1:56 a.m. UTC | #11
On 11/18/2020 5:17 PM, David Ahern wrote:
> On 11/18/20 5:41 PM, Jacob Keller wrote:
>>
>>
>> On 11/18/2020 11:22 AM, Parav Pandit wrote:
>>>
>>>
>>>> From: David Ahern <dsahern@gmail.com>
>>>> Sent: Wednesday, November 18, 2020 11:33 PM
>>>>
>>>>
>>>> With Connectx-4 Lx for example the netdev can have at most 63 queues
>>>> leaving 96 cpu servers a bit short - as an example of the limited number of
>>>> queues that a nic can handle (or currently exposes to the OS not sure which).
>>>> If I create a subfunction for ethernet traffic, how many queues are allocated
>>>> to it by default, is it managed via ethtool like the pf and is there an impact to
>>>> the resources used by / available to the primary device?
>>>
>>> Jason already answered it with details.
>>> Thanks a lot Jason.
>>>
>>> Short answer to ethtool question, yes, ethtool can change num queues for subfunction like PF.
>>> Default is same number of queues for subfunction as that of PF in this patchset.
>>>
>>
>> But what is the mechanism for partitioning the global resources of the
>> device into each sub function?
>>
>> Is it just evenly divided into the subfunctions? is there some maximum
>> limit per sub function?
>>
> 
> I hope it is not just evenly divided; it should be user controllable. If
> I create a subfunction for say a container's networking, I may want to
> only assign 1 Rx and 1 Tx queue pair (or 1 channel depending on
> terminology where channel includes Rx, Tx and CQ).

I think we need a way to expose and configure policy for the resources
associated with each type of auxiliary device.
For example: default, min and max queues and interrupt vectors.

Once an auxiliary device is created, the user should be able to
configure the resources within the allowed min-max values.

Patch

diff --git a/include/net/devlink.h b/include/net/devlink.h
index 1b7c9fbc607a..3991345ef3e2 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -152,6 +152,17 @@  struct devlink_port {
 	struct mutex reporters_lock; /* Protects reporter_list */
 };
 
+struct devlink_port_new_attrs {
+	enum devlink_port_flavour flavour;
+	unsigned int port_index;
+	u32 controller;
+	u32 sfnum;
+	u16 pfnum;
+	u8 port_index_valid:1,
+	   controller_valid:1,
+	   sfnum_valid:1;
+};
+
 struct devlink_sb_pool_info {
 	enum devlink_sb_pool_type pool_type;
 	u32 size;
@@ -1360,6 +1371,33 @@  struct devlink_ops {
 	int (*port_function_hw_addr_set)(struct devlink *devlink, struct devlink_port *port,
 					 const u8 *hw_addr, int hw_addr_len,
 					 struct netlink_ext_ack *extack);
+	/**
+	 * @port_new: Port add function.
+	 *
+	 * Should be used by device driver to let caller add new port of a specified flavour
+	 * with optional attributes.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port addition of a specified
+	 * flavour or specified attributes. Driver should set extack error message in case of failure
+	 * to add the port.
+	 * devlink core does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensure synchronization when adding or deleting a port. Driver must
+	 * register a port with devlink core.
+	 */
+	int (*port_new)(struct devlink *devlink, const struct devlink_port_new_attrs *attrs,
+			struct netlink_ext_ack *extack);
+	/**
+	 * @port_del: Port delete function.
+	 *
+	 * Should be used by device driver to let caller delete port which was previously created
+	 * using port_new() callback.
+	 * Driver should return -EOPNOTSUPP if it doesn't support port deletion.
+	 * Driver should set extack error message in case of failure to delete the port.
+	 * devlink core does not hold a devlink instance lock when this callback is invoked.
+	 * Driver must ensure synchronization when adding or deleting a port. Driver must
+	 * register a port with devlink core.
+	 */
+	int (*port_del)(struct devlink *devlink, unsigned int port_index,
+			struct netlink_ext_ack *extack);
 };
 
 static inline void *devlink_priv(struct devlink *devlink)
diff --git a/net/core/devlink.c b/net/core/devlink.c
index b1e849b624a6..dccdf36afba6 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -1124,6 +1124,57 @@  static int devlink_nl_cmd_port_unsplit_doit(struct sk_buff *skb,
 	return devlink_port_unsplit(devlink, port_index, info->extack);
 }
 
+static int devlink_nl_cmd_port_new_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink_port_new_attrs new_attrs = {};
+	struct devlink *devlink = info->user_ptr[0];
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_FLAVOUR] ||
+	    !info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port flavour or PCI PF are not specified");
+		return -EINVAL;
+	}
+	new_attrs.flavour = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_FLAVOUR]);
+	new_attrs.pfnum = nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_PCI_PF_NUMBER]);
+
+	if (info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		new_attrs.port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+		new_attrs.port_index_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]) {
+		new_attrs.controller =
+			nla_get_u16(info->attrs[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER]);
+		new_attrs.controller_valid = true;
+	}
+	if (info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]) {
+		new_attrs.sfnum = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_PCI_SF_NUMBER]);
+		new_attrs.sfnum_valid = true;
+	}
+
+	if (!devlink->ops->port_new)
+		return -EOPNOTSUPP;
+
+	return devlink->ops->port_new(devlink, &new_attrs, extack);
+}
+
+static int devlink_nl_cmd_port_del_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	unsigned int port_index;
+
+	if (!info->attrs[DEVLINK_ATTR_PORT_INDEX]) {
+		NL_SET_ERR_MSG_MOD(extack, "Port index is not specified");
+		return -EINVAL;
+	}
+	port_index = nla_get_u32(info->attrs[DEVLINK_ATTR_PORT_INDEX]);
+
+	if (!devlink->ops->port_del)
+		return -EOPNOTSUPP;
+	return devlink->ops->port_del(devlink, port_index, extack);
+}
+
 static int devlink_nl_sb_fill(struct sk_buff *msg, struct devlink *devlink,
 			      struct devlink_sb *devlink_sb,
 			      enum devlink_command cmd, u32 portid,
@@ -7565,6 +7616,10 @@  static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_RELOAD_ACTION] = NLA_POLICY_RANGE(NLA_U8, DEVLINK_RELOAD_ACTION_DRIVER_REINIT,
 							DEVLINK_RELOAD_ACTION_MAX),
 	[DEVLINK_ATTR_RELOAD_LIMITS] = NLA_POLICY_BITFIELD32(DEVLINK_RELOAD_LIMITS_VALID_MASK),
+	[DEVLINK_ATTR_PORT_FLAVOUR] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_PF_NUMBER] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_PORT_PCI_SF_NUMBER] = { .type = NLA_U32 },
+	[DEVLINK_ATTR_PORT_CONTROLLER_NUMBER] = { .type = NLA_U32 },
 };
 
 static const struct genl_small_ops devlink_nl_ops[] = {
@@ -7604,6 +7659,18 @@  static const struct genl_small_ops devlink_nl_ops[] = {
 		.flags = GENL_ADMIN_PERM,
 		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
 	},
+	{
+		.cmd = DEVLINK_CMD_PORT_NEW,
+		.doit = devlink_nl_cmd_port_new_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
+	{
+		.cmd = DEVLINK_CMD_PORT_DEL,
+		.doit = devlink_nl_cmd_port_del_doit,
+		.flags = GENL_ADMIN_PERM,
+		.internal_flags = DEVLINK_NL_FLAG_NO_LOCK,
+	},
 	{
 		.cmd = DEVLINK_CMD_SB_GET,
 		.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,