[0/8] Introduce fwctl subsystem

Message ID 0-v1-9912f1a11620+2a-fwctl_jgg@nvidia.com (mailing list archive)

Message

Jason Gunthorpe June 3, 2024, 3:53 p.m. UTC
fwctl is a new subsystem intended to bring some common rules and order to
the growing pattern of exposing a secure FW interface directly to
userspace. Unlike existing places like RDMA/DRM/VFIO/uacce that are
exposing a device for datapath operations, fwctl is focused on debugging,
configuration and provisioning of the device. It will not have the
necessary features like interrupt delivery to support a datapath.

This concept is similar to the long standing practice in the "HW" RAID
space of having a device specific misc device to manage the RAID
controller FW. fwctl generalizes this notion of a companion debug and
management interface that goes along with a dataplane implemented in an
appropriate subsystem.

The need for this has reached a critical point as many users are moving to
run lockdown enabled kernels. Several existing devices have had long
standing tooling for management that relied on /sys/../resource0 or PCI
config space access which is not permitted in lockdown. A major point of
fwctl is to define and document the rules that a device must follow to
expose a lockdown compatible RPC.
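
As a rough illustration of the lockdown condition described above, the sketch below checks which lockdown mode a running kernel reports. It relies on the securityfs file /sys/kernel/security/lockdown, which lists the modes on one line with the active one in square brackets. The function names are hypothetical helpers for this example and are not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the active lockdown mode from text in the format used by
 * /sys/kernel/security/lockdown, e.g. "none [integrity] confidentiality".
 * The active mode is the word inside the square brackets.
 * Returns 0 on success, -1 if no bracketed mode is found or it does
 * not fit in the output buffer.
 */
int lockdown_active_mode(const char *text, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(text && out);
	open = strchr(text, '[');
	if (!open)
		return -1;
	close = strchr(open, ']');
	if (!close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the securityfs file; returns -1 if it is missing or unreadable. */
int lockdown_read_mode(char *out, size_t outlen)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/security/lockdown", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return lockdown_active_mode(buf, out, outlen);
}
```

A management tool could call lockdown_read_mode() at startup and, when the result is anything other than "none", skip access paths such as resource0 mappings that lockdown is expected to refuse.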

Based on some discussion, fwctl splits the RPCs into four categories:

	FWCTL_RPC_CONFIGURATION
	FWCTL_RPC_DEBUG_READ_ONLY
	FWCTL_RPC_DEBUG_WRITE
	FWCTL_RPC_DEBUG_WRITE_FULL

Where the latter two trigger a new TAINT_FWCTL, and the final one requires
CAP_SYS_RAWIO - excluding it from lockdown. The device driver and its FW
would be responsible to restrict RPCs to the requested security scope,
while the core code handles the tainting and CAP checks.
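
The split between core and driver responsibilities can be sketched as a small userspace model. The four scope names come from the list above; everything else (the struct, the function names, the boolean standing in for a CAP_SYS_RAWIO check) is a hypothetical stand-in and not the code in drivers/fwctl/main.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Scope names are taken from the cover letter; the values are arbitrary. */
enum fwctl_rpc_scope {
	FWCTL_RPC_CONFIGURATION,
	FWCTL_RPC_DEBUG_READ_ONLY,
	FWCTL_RPC_DEBUG_WRITE,
	FWCTL_RPC_DEBUG_WRITE_FULL,
};

/* What the core would enforce before handing the RPC to the driver. */
struct scope_policy {
	bool taints_kernel;   /* would set the proposed TAINT_FWCTL */
	bool needs_sys_rawio; /* caller must hold CAP_SYS_RAWIO */
};

struct scope_policy fwctl_scope_policy(enum fwctl_rpc_scope scope)
{
	struct scope_policy p = { false, false };

	assert(scope <= FWCTL_RPC_DEBUG_WRITE_FULL);
	/* The latter two scopes taint the kernel. */
	if (scope >= FWCTL_RPC_DEBUG_WRITE)
		p.taints_kernel = true;
	/* Only the final scope additionally requires the capability. */
	if (scope == FWCTL_RPC_DEBUG_WRITE_FULL)
		p.needs_sys_rawio = true;
	return p;
}

/*
 * Returns true if the RPC may be passed on to the driver. *taint
 * reports whether the kernel would be tainted as a result. A rejected
 * RPC never taints.
 */
bool fwctl_rpc_permitted(enum fwctl_rpc_scope scope, bool has_sys_rawio,
			 bool *taint)
{
	struct scope_policy p = fwctl_scope_policy(scope);

	if (p.needs_sys_rawio && !has_sys_rawio) {
		*taint = false;
		return false;
	}
	*taint = p.taints_kernel;
	return true;
}
```

Note that this model covers only the core's half of the contract. Whether a given RPC actually stays within the requested scope is left to the driver and its FW, which is the part the core cannot verify.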

For details see the final patch which introduces the documentation.

This series incorporates a version of the mlx5ctl interface previously
proposed:
  https://lore.kernel.org/r/20240207072435.14182-1-saeed@kernel.org/

For this series the memory registration mechanism was removed, but I
expect it will come back.

This series comes with mlx5 as a driver implementation, and I have soft
commitments for at least three more drivers.

There have been two LWN articles written discussing various aspects of
this proposal:

 https://lwn.net/Articles/955001/
 https://lwn.net/Articles/969383/

Several have expressed general support for this concept:

 Broadcom Networking - https://lore.kernel.org/r/Zf2n02q0GevGdS-Z@C02YVCJELVCG
 Christoph Hellwig - https://lore.kernel.org/r/Zcx53N8lQjkpEu94@infradead.org/
 Enfabrica - https://lore.kernel.org/r/9cc7127f-8674-43bc-b4d7-b1c4c2d96fed@kernel.org/
 NVIDIA Networking
 Oracle Linux - https://lore.kernel.org/r/6lakj6lxlxhdgrewodvj3xh6sxn3d36t5dab6najzyti2navx3@wrge7cyfk6nq

Work is ongoing for a robust multi-device open source userspace. Currently
the mlx5ctl_user that was posted by Saeed has been updated to use fwctl.

  https://github.com/saeedtx/mlx5ctl.git
  https://github.com/jgunthorpe/mlx5ctl.git

This is on github: https://github.com/jgunthorpe/linux/commits/fwctl

Jason Gunthorpe (6):
  fwctl: Add basic structure for a class subsystem with a cdev
  fwctl: Basic ioctl dispatch for the character device
  fwctl: FWCTL_INFO to return basic information about the device
  taint: Add TAINT_FWCTL
  fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
  fwctl: Add documentation

Saeed Mahameed (2):
  fwctl/mlx5: Support for communicating with mlx5 fw
  mlx5: Create an auxiliary device for fwctl_mlx5

 Documentation/admin-guide/tainted-kernels.rst |   5 +
 Documentation/userspace-api/fwctl.rst         | 269 ++++++++++++
 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  16 +
 drivers/Kconfig                               |   2 +
 drivers/Makefile                              |   1 +
 drivers/fwctl/Kconfig                         |  23 +
 drivers/fwctl/Makefile                        |   5 +
 drivers/fwctl/main.c                          | 411 ++++++++++++++++++
 drivers/fwctl/mlx5/Makefile                   |   4 +
 drivers/fwctl/mlx5/main.c                     | 333 ++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |   8 +
 include/linux/fwctl.h                         | 112 +++++
 include/linux/panic.h                         |   3 +-
 include/uapi/fwctl/fwctl.h                    | 137 ++++++
 include/uapi/fwctl/mlx5.h                     |  36 ++
 kernel/panic.c                                |   1 +
 18 files changed, 1367 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/userspace-api/fwctl.rst
 create mode 100644 drivers/fwctl/Kconfig
 create mode 100644 drivers/fwctl/Makefile
 create mode 100644 drivers/fwctl/main.c
 create mode 100644 drivers/fwctl/mlx5/Makefile
 create mode 100644 drivers/fwctl/mlx5/main.c
 create mode 100644 include/linux/fwctl.h
 create mode 100644 include/uapi/fwctl/fwctl.h
 create mode 100644 include/uapi/fwctl/mlx5.h


base-commit: c3f38fa61af77b49866b006939479069cd451173

Comments

Jakub Kicinski June 3, 2024, 6:42 p.m. UTC | #1
On Mon,  3 Jun 2024 12:53:16 -0300 Jason Gunthorpe wrote:
> fwctl is a new subsystem intended to bring some common rules and order to
> the growing pattern of exposing a secure FW interface directly to
> userspace. Unlike existing places like RDMA/DRM/VFIO/uacce that are
> exposing a device for datapath operations fwctl is focused on debugging,
> configuration and provisioning of the device. It will not have the
> necessary features like interrupt delivery to support a datapath.

If you have debug problems in your subsystem, put the APIs in your
subsystem. Don't force your choices on all the subsystems your device
interacts with:

Nacked-by: Jakub Kicinski <kuba@kernel.org>

Somewhat related, I saw nVidia sells various interesting features in
its DOCA stack. Is that Open Source?
David Ahern June 4, 2024, 3:01 a.m. UTC | #2
On 6/3/24 12:42 PM, Jakub Kicinski wrote:
> Somewhat related, I saw nVidia sells various interesting features in its
> DOCA stack. Is that Open Source?

Seriously, Jakub, how is that in any way related to this patch set?

You are basically suggesting that if any vendor ever has an out of tree
option for its hardware every patch it sends should be considered a ruse
to enable or simplify proprietary options.
Jakub Kicinski June 4, 2024, 2:04 p.m. UTC | #3
On Mon, 3 Jun 2024 21:01:58 -0600 David Ahern wrote:
> On 6/3/24 12:42 PM, Jakub Kicinski wrote:
> > Somewhat related, I saw nVidia sells various interesting features in its
> > DOCA stack. Is that Open Source?  
> 
> Seriously, Jakub, how is that in any way related to this patch set?

Whether they admit it or not, DOCA is a major reason nVidia wants
this to be standalone rather than part of RDMA.

> You are basically suggesting that if any vendor ever has an out of tree
> option for its hardware every patch it sends should be considered a ruse
> to enable or simplify proprietary options.

Ooo, is that a sore spot?

I don't begrudge anyone building proprietary options, but leave
upstream out of it.
Saeed Mahameed June 4, 2024, 9:28 p.m. UTC | #4
On 04 Jun 07:04, Jakub Kicinski wrote:
>On Mon, 3 Jun 2024 21:01:58 -0600 David Ahern wrote:
>> On 6/3/24 12:42 PM, Jakub Kicinski wrote:
>> > Somewhat related, I saw nVidia sells various interesting features in its
>> > DOCA stack. Is that Open Source?
>>
>> Seriously, Jakub, how is that in any way related to this patch set?
>
>Whether they admit it or not, DOCA is a major reason nVidia wants
>this to be standalone rather than part of RDMA.
>

No, DOCA isn't on the agenda for this new interface. But what is the point
in arguing? Apparently the vendor is not credible enough in your opinion.
Which is absolutely outrageous grounds for a NAK.

Anyway I don't see your point in bringing up DOCA here, but obviously once 
this interface is accepted, all developers are welcome to use it,
including DOCA developers of course..

That being said, the reason we need this is crystal clear in the
cover-letter and previous submission discussions. Bringing random SDKs
into this discussion is not objective and is counterproductive to the
technical discussion.

>> You are basically suggesting that if any vendor ever has an out of tree
>> option for its hardware every patch it sends should be considered a ruse
>> to enable or simplify proprietary options.
>

It's apparent that you're attributing sinister agendas to patchsets when
you fail to offer valid technical opinions regarding the NAK nature. Let's
address this outside of this patchset, as this isn't the first occurrence.
Consistency in evaluating patches is crucial; some, like the fbnic and
idpf, seem to go unquestioned, while others face scrutiny.

>Ooo, is that a sore spot?
>
>I don't begrudge anyone building proprietary options, but leave
>upstream out of it.
>
Jakub Kicinski June 4, 2024, 10:32 p.m. UTC | #5
On Tue, 4 Jun 2024 14:28:05 -0700 Saeed Mahameed wrote:
> On 04 Jun 07:04, Jakub Kicinski wrote:
> >On Mon, 3 Jun 2024 21:01:58 -0600 David Ahern wrote:  
> >> Seriously, Jakub, how is that in any way related to this patch set?  
> >
> >Whether they admit it or not, DOCA is a major reason nVidia wants
> >this to be standalone rather than part of RDMA.
> 
> No, DOCA isn't on the agenda for this new interface. But what is the point
> in arguing?

I'm not arguing any point, we argued enough. But you failed to disclose
that DOCA is a very likely user of this interface. So whoever you're
planning to submit it to should know.

DOCA was top of mind for me because I noticed it has PSP support, and
I wanted to take a look at the implementation.

> Apparently the vendor is not credible enough in your opinion.

You're creating an interface where you depend on a pinky promise from
a black box that the RPC is not a write. I trust you personally not to
write a patch which abuses this interface. But this cannot possibly
extend to all developers, most of whom just want to ship features.

> Which is an absolute outrageous grounds for a NAK.
> 
> Anyway I don't see your point in bringing up DOCA here, but obviously once 
> this interface is accepted, all developers are welcome to use it,
> including DOCA developers of course..

Of course.

> That being said, the why we need this is crystal clear in the 
> cover-letter and previous submission discussions, bringing random SDKs
> into this discussion is not objective and counter productive to the
> technical discussion.
> 
> >> You are basically suggesting that if any vendor ever has an out of tree
> >> option for its hardware every patch it sends should be considered a ruse
> >> to enable or simplify proprietary options.
> 
> It's apparent that you're attributing sinister agendas to patchsets when
> you fail to offer valid technical opinions regarding the NAK nature. Let's
> address this outside of this patchset, as this isn't the first occurrence.
> Consistency in evaluating patches is crucial;

Exactly :| Netdev people, including multiple prominent developers from
Mellanox/nVidia have been nacking SDK interfaces in Linux networking
for 20 years. How are we going to look to all the companies which have
been doing IPUs for over a decade if we change the rules for nVidia?

> some, like the fbnic and idpf, seem to go unquestioned, while others
> face scrutiny.

fbnic got a nack for any core changes or uAPI not used by other drivers.
idpf got a nack for pretending to be a standard.

You keep saying that I'm nacking your interface because I have some
hatred and distrust for you or nVidia. I really, really don't.
Any vendor posting this would get exactly the same nack from me.

If by "let's address this outside of this patchset" you mean that we
should have a discussion about maintainer favoritism, and subsystem
capture by vendors - you have my full support!
Dan Williams June 4, 2024, 11:56 p.m. UTC | #6
Jakub Kicinski wrote:
[..]
> I don't begrudge anyone building proprietary options, but leave
> upstream out of it.

So I am of 2 minds here. In general, how is upstream benefited by
requiring every vendor command to be wrapped by a Linux command?

Mind you, I am coming at this from the perspective of being a maintainer
of a subsystem that does *not* allow unrestricted vendor commands. Since
day one, the CXL subsystem has matched netdev's general sentiment and
been more restrictive than NVMe. It places all vendor commands and even
all yet-to-be-Linux-wrapped-standard-commands behind a
CONFIG_CXL_MEM_RAW_COMMANDS option. That default-off option, when
enabled, allows any command to be sent but it taints the kernel with a
WARN(). CXL devices theoretically allow direct manipulation of system
memory without IOMMU protection which is in contrast to NVMe which would
need to work harder to violate kernel-lockdown protections.

The expectation that I laid out here [1] is based on the observation
that a significant portion of the vendor commands these devices support
are for pre-release hardware qualification and debug flows. The
recommendation to device vendors was "if you need wide distribution of
kernels that allow unrestricted vendor passthrough, work with Linux
distributions to enable this option in debug kernels, run those debug
kernels for your pre-release hardware flows, ignore the warnings".

3 years on from that recommendation it seems no vendor has even needed
that level of distribution help. I.e. checking a few distro kernels
(Fedora, openSUSE) shows no uptake for CONFIG_CXL_MEM_RAW_COMMANDS=y in
their debug builds. I can only assume that locally compiled custom
kernel binaries are filling the need.

So all seems quiet with current restriction for CXL endpoint vendor
commands, but this stance was recently challenged in this thread [2] by
CXL switch vendors with an assertion that fabric switch configuration
has need for more and varied vendor flows than endpoint configuration.
 
While I am not clear on the veracity of that claim, it at least
challenged me to do the thought experiment of "what would it look like
to relax the CXL command restriction?". Maybe we can come up with a
community answer to the "so you want to build a
userspace-to-device-firmware tunnel?" to at least get all the various
concerns documented in one place, and provide guidance for how device
vendors should navigate this space across subsystems. Between NVMe
"allow all the things", CXL "allow all the things only after tainting
the kernel", and the "never allow vendor passthrough" position (I am
sure there are other nuanced positions) it at least seems useful to
document the concerns. Here is a start for that guidance from the CXL
perspective:

* Integrity: Subsystem has a responsibility to meet kernel-lockdown
  expectations:

  Distros and system owners need to be assured that root's ability to
  modify the running kernel image is mitigated. For CXL there are 2 ways
  to do this: require Linux wrapper commands for all the low level
  commands (status quo), or a new approach, trust the device to publish which
  commands have user data effects in something CXL calls the "Command
  Effects Log". In that "trust Command Effects" scenario the kernel still
  has no idea what the command is actually doing, but it can at least
  assert that the device does not claim that the command changes the
  contents of system-memory. Now, you might say, "the device can just
  lie", but that betrays a conceit of the kernel restriction. A device
  could lie that a Linux wrapped command when passed certain payloads does
  not in turn proxy to a restricted command. So at some point there is
  almost always an out-of-tree way to get around the kernel restriction,
  so the question is: are we better off giving a blessed path or forcing
  vendors into ugly out-of-tree workarounds?

* Introspection / validation: Subsystem community needs to be able to
  audit behavior after the fact.

  To me this means even if the kernel is letting a command through based
  on the stated Command Effect of "Configuration Change after Cold Reset"
  upstream community has a need to be able to read the vendor
  specification for that command. I.e. commands might be vendor-specific,
  but never vendor-private. I see this as similar to the requirement for
  open source userspace for sophisticated accelerators.

* Collaboration: open standards support open driver maintenance.

  Without standards we end up with awkward situations like Confidential
  Computing where every vendor races to implement the same functionality
  in arbitrarily different and vendor specific ways.

  For CXL devices, and I believe the devices fwctl is targeting, there
  are a whole class of commands for vendor specific configuration and
  debug. Commands that the kernel really need not worry about.

  Some subsystems may want to allow high-performance science experiments
  like what NVMe allows, but it seems worth asking the question if
  standardizing device configuration and debug is really the best use of
  upstream's limited time?

  One of the release valves in the CXL space is openly specified
  commands with opaque payloads, like "Read Vendor Debug Log". That is
  clear what it does, likely a payload the kernel need never worry
  about, and the "Command Effects" is empty. However, going forward there
  is a new class of commands called "Set/Get Feature" that allow a wide
  range of vendor toggles to be deployed which will need an upstream
  response for the driver policy to vendor-specific "Features".
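
The "trust Command Effects" option from the Integrity point above can be sketched as a filter over the effects a device publishes. This is a userspace model only: the flag names, their bit positions, the struct layout, and the choice of which effects count as restricted are all assumptions made for the example, and the CXL specification's Command Effects Log definition is the authority on the real encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative effect flags; names and bit positions are assumptions. */
#define EFF_CONFIG_CHANGE_COLD_RESET (1u << 0)
#define EFF_CONFIG_CHANGE_IMMEDIATE  (1u << 1)
#define EFF_DATA_CHANGE_IMMEDIATE    (1u << 2)
#define EFF_SECURITY_STATE_CHANGE    (1u << 5)

/* Assumed policy: effects that are refused without a Linux wrapper. */
#define EFF_RESTRICTED (EFF_DATA_CHANGE_IMMEDIATE | EFF_SECURITY_STATE_CHANGE)

/* One entry of the effects log as published by the device. */
struct cel_entry {
	uint16_t opcode;
	uint16_t effects;
};

/*
 * Allow a vendor command only if the device published an effects entry
 * for it and that entry claims no restricted effect. Opcodes missing
 * from the log are rejected: with no claim there is nothing to trust.
 */
bool cel_allows_passthrough(const struct cel_entry *log, size_t n,
			    uint16_t opcode)
{
	size_t i;

	assert(log || n == 0);
	for (i = 0; i < n; i++) {
		if (log[i].opcode != opcode)
			continue;
		return (log[i].effects & EFF_RESTRICTED) == 0;
	}
	return false;
}
```

The filter never inspects what the command does. It only checks what the device claims, which is exactly the trust trade-off discussed above.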

So if fwctl, or something like it, can strike a balance of enforcing
integrity and introspection while encouraging collaboration on the
aspects that are worth upstream collaboration, I think that is a
conversation worth having.

[1]: http://lore.kernel.org/r/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
[2]: http://lore.kernel.org/r/20240321174423.00007e0d@Huawei.com
Jakub Kicinski June 5, 2024, 3:05 a.m. UTC | #7
On Tue, 4 Jun 2024 16:56:57 -0700 Dan Williams wrote:
> Jakub Kicinski wrote:
> [..]
> > I don't begrudge anyone building proprietary options, but leave
> > upstream out of it.  
> 
> So I am of 2 minds here. In general, how is upstream benefited by
> requiring every vendor command to be wrapped by a Linux command?
> [...]

Thanks for sharing the CXL experience and your perspective.
Also for trying to frame the discussion in a useful way,
although I have little faith that it will help :( Fingers crossed?

> * Integrity: Subsystem has a responsibility to meet kernel-lockdown
>   expectations:
> 
>   Distros and system owners need to be assured that root's ability to
>   modify the running kernel image are mitigated. For CXL there are 2 ways
>   to do this, require Linux wrapper commands for all the low level
>   commands (status quo), or a new trust the device to publish which
>   commands have user data effects in something CXL calls the "Command
>   Effects Log". In that "trust Command Effects" scenario the kernel still
>   has no idea what the command is actually doing, but it can at least
>   assert that the device does not claim that the command changes the
>   contents of system-memory. Now, you might say, "the device can just
>   lie", but that betrays a conceit of the kernel restriction. A device
>   could lie that a Linux wrapped command when passed certain payloads does
>   not in turn proxy to a restricted command. So at some point there is
>   almost always an out-of-tree way to get around the kernel restriction,
>   so the question is are we better off giving a blessed path or force
>   vendors into ugly out-of-tree workarounds?

The integrity thing is a double-edged sword, so I don't have much to say
here. If we take a few wrong turns we'll wrap the vendor commands with
crypto and then the vendor can control which commands you get to run ;)
Obviously I'm joking, and not saying that is the intent of the current
series! But it's about as realistic as "this will only be used for truly
vendor specific things".

> * Introspection / validation: Subsystem community needs to be able to
>   audit behavior after the fact.
> 
>   To me this means even if the kernel is letting a command through based
>   on the stated Command Effect of "Configuration Change after Cold Reset"
>   upstream community has a need to be able to read the vendor
>   specification for that command. I.e. commands might be vendor-specific,
>   but never vendor-private. I see this as similar to the requirement for
>   open source userspace for sophisticated accelerators.

That sounds pretty CXL specific, and IIUC unrealistic.
You assume you have some specification to consult, while this discussion
has been going for over a year now, and I can't get the vendors to share
what those tunables they so desperately need to tweak are.

> * Collaboration: open standards support open driver maintenance.
> 
>   Without standards we end up with awkward situations like Confidential
>   Computing where every vendor races to implement the same functionality
>   in arbitrarily different and vendor specific ways.
> 
>   For CXL devices, and I believe the devices fwctl is targeting, there
>   are a whole class of commands for vendor specific configuration and
>   debug. Commands that the kernel really need not worry about.
> 
>   Some subsystems may want to allow high-performance science experiments
>   like what NVMe allows, but it seems worth asking the question if
>   standardizing device configuration and debug is really the best use of
>   upstream's limited time?

No, but it's not about science experiments, really. It's about
production features. The effort of implementing something properly
upstream is high. It costs time and money to get the right caliber of
people and let them go thru the revisions. I lack confidence that
merging fwctl will not negatively impact motivation for companies to
pay off our accrued technical debt. While all they need is "this simple
little feature". And before competition wins the customer. It's a race
to the bottom.

>   One of the release valves in the CXL space is openly specified
>   commands with opaque payloads, like "Read Vendor Debug Log". That is
>   clear what it does, likely a payload the kernel need never worry
>   about, and the "Command Effects" is empty. However, going forward there
>   is a new class of commands called "Set/Get Feature" that allow a wide
>   range of vendor toggles to be deployed which will need an upstream
>   response for the driver policy to vendor-specific "Features".
>
> So if fwctl, or something like it, can strike a balance of enforcing
> integrity and introspection while encouraging collaboration on the
> aspects that are worth upstream collaboration, I think that is a
> conversation worth having.

I presume you were trying to underscore that the decision is unavoidably
a trade off, which is true. But I don't follow the exact formulation.
Is fwctl helping integrity or collaboration? If we assume use of vendor
tools is unavoidable, then I guess integrity? I really can't see how it
helps collaboration when everyone ships their custom tool set.

Back to the tradeoff. For networking, which is a _very_ mature subsystem
with a ton of standards the need to do "vendor specific things" is
marginal. The downside of the loss of an "upstream advantage" is obvious.
We need to take such decisions on a subsystem by subsystem basis.
You should be able to draw the lines differently for CXL than how we
draw them for TCP/IP.

On the technical level the discussion can't go very far, because I'd
like to hear actual user problems. But I can't even get a list of those
infamous thousands of knobs :|
Jakub Kicinski June 5, 2024, 3:11 a.m. UTC | #8
On Mon,  3 Jun 2024 12:53:16 -0300 Jason Gunthorpe wrote:
>  Broadcom Networking - https://lore.kernel.org/r/Zf2n02q0GevGdS-Z@C02YVCJELVCG

Please double check with Broadcom if they are still supportive, 
in the current form.

Please include lore links to previous postings.

Please carry my nack on future version. At least as long as
the write access checks are.. good-faith-based.
Jonathan Cameron June 5, 2024, 11:19 a.m. UTC | #9
>   One of the release valves in the CXL space is openly specified
>   commands with opaque payloads, like "Read Vendor Debug Log". That is
>   clear what it does, likely a payload the kernel need never worry
>   about, and the "Command Effects" is empty. However, going forward there
>   is a new class of commands called "Set/Get Feature" that allow a wide
>   range of vendor toggles to be deployed which will need an upstream
>   response for the driver policy to vendor-specific "Features".

Irrelevant rat hole time ;)

I don't see those Set / Get feature as any different from other commands.
I see them as a convenience mostly there to cut down on spec duplication
and enforce some consistency across multiple similar commands, but they
are just commands like any other, validation is just one step further
into the payload.

There are already a bunch of them in the main CXL spec and like you mention
above if someone brings a well documented vendor feature (or feature from
another standard etc), then if appropriate we could let that through the
filter as well.

Same will be true of tunneled commands (I think we can ignore the cross
host security aspect of those). Ultimately we can sanity check the payload
much like a top level command.
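
The idea of validating "one step further into the payload" can be sketched as a second-stage filter that looks at the feature identifier carried by a Set/Get Feature command. This is a userspace model under stated assumptions: the opcode values, the layout with a 16-byte feature UUID at the start of the payload, and all names are illustrative and should be checked against the CXL specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed opcode values and payload layout, for illustration only. */
#define OP_GET_FEATURE 0x0501
#define OP_SET_FEATURE 0x0502
#define FEAT_UUID_LEN  16

struct mbox_cmd {
	uint16_t opcode;
	const uint8_t *payload;
	size_t payload_len;
};

/*
 * Second-stage check for feature commands: the opcode alone is not
 * enough, so compare the feature UUID at the start of the payload
 * against an allowlist of documented features. Commands that are not
 * feature commands are outside this filter's scope and are rejected
 * here; an opcode-level filter is expected to handle them.
 */
bool feature_cmd_allowed(const struct mbox_cmd *cmd,
			 const uint8_t (*allow)[FEAT_UUID_LEN], size_t n_allow)
{
	size_t i;

	assert(cmd);
	if (cmd->opcode != OP_GET_FEATURE && cmd->opcode != OP_SET_FEATURE)
		return false;
	/* A payload too short to hold a UUID cannot be validated. */
	if (!cmd->payload || cmd->payload_len < FEAT_UUID_LEN)
		return false;
	for (i = 0; i < n_allow; i++)
		if (memcmp(cmd->payload, allow[i], FEAT_UUID_LEN) == 0)
			return true;
	return false;
}
```

Adding a well documented vendor feature then amounts to adding its UUID to the allowlist, which matches the "let that through the filter" suggestion above.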

So I mostly agree with rest of what you've said, but think this detail
doesn't matter.

> 
> So if fwctl, or something like it, can strike a balance of enforcing
> integrity and introspection while encouraging collaboration on the
> aspects that are worth upstream collaboration, I think that is a
> conversation worth having.
> 
> [1]: http://lore.kernel.org/r/CAPcyv4gDShAYih5iWabKg_eTHhuHm54vEAei8ZkcmHnPp3B0cw@mail.gmail.com/
> [2]: http://lore.kernel.org/r/20240321174423.00007e0d@Huawei.com
>
Jason Gunthorpe June 5, 2024, 12:06 p.m. UTC | #10
On Tue, Jun 04, 2024 at 08:11:03PM -0700, Jakub Kicinski wrote:
> On Mon,  3 Jun 2024 12:53:16 -0300 Jason Gunthorpe wrote:
> >  Broadcom Networking - https://lore.kernel.org/r/Zf2n02q0GevGdS-Z@C02YVCJELVCG
> 
> Please double check with Broadcom if they are still supportive, 
> in the current form.

They are free to comment.

> Please include lore links to previous postings.

The link to mlx5ctl is already in the cover letter and Saeed linked
from there to enough of the prior stuff.
 
> Please carry my nack on future version. At least as long as
> the write access checks are.. good-faith-based.

I will include the acks and nacks related to the general concept on
the documentation patch 6 along with links and a mention in the PR
when we get there.

Jason
Jason Gunthorpe June 5, 2024, 1:59 p.m. UTC | #11
On Tue, Jun 04, 2024 at 04:56:57PM -0700, Dan Williams wrote:
> Jakub Kicinski wrote:
> [..]
> > I don't begrudge anyone building proprietary options, but leave
> > upstream out of it.
> 
> So I am of 2 minds here. In general, how is upstream benefited by
> requiring every vendor command to be wrapped by a Linux command?

People actually can use upstream :)

Amazingly there is inherent benefit to people being able to use the
software we produce.

> 3 years on from that recommendation it seems no vendor has even needed
> that level of distribution help. I.e. checking a few distro kernels
> (Fedora, openSUSE) shows no uptake for CONFIG_CXL_MEM_RAW_COMMANDS=y in
> their debug builds. I can only assume that locally compiled custom
> kernel binaries are filling the need.

My strong advice would be to be careful about this. Android-ism where
nobody runs the upstream kernel is a real thing. For something
emerging like CXL there is a real risk that the hyperscale folks will
go off and do their own OOT stuff and in-tree CXL will be something
usable but inferior. I've seen this happen enough times..

If people come and say we need X and the maintainer says no, they
don't just give up and stop doing X, they go and do X anyhow out of
tree. This has become especially true now that the center of business
activity in server-Linux is driven by the hyperscale crowd that don't
care much about upstream. Linux maintainers don't actually have the
power to force the industry to do things, though people do keep
trying.. Maintainers can only lead, and productive leading is not done
with a NO.

You will start to see this pain in maybe 5-10 years if CXL starts to
be something deployed in an enterprise RedHat/Dell/etc sort of
environment. Then that missing X becomes a critical issue because it
turns out the hyperscale folks long since figured out it is really
important but didn't do anything to enable it upstream.

There is merit in upstream being something people can and do actually
use, not just an ivory tower of architectural perfection. There is
merit in bringing code into the community instead of forcing things to
be OOT.

For instance the thread you linked where there was talk of needing the
signal integrity data is a great example. Sure some of that is
manufacturing time, but also if you deploy a million interfaces in a
datacenter, then yes, there will be need to collect SI information
from live systems and do some analysis on it. You wouldn't believe how
much physically broken HW leaks out into data centers and needs
manufacturing level debugging techniques to properly root cause :(

> userpace-to-device-firmware tunnel?" to at least get all the various
> concerns documented in one place, and provide guidance for how device
> vendors should navigate this space across subsystems. 

This is my effort here. If we document the expectations there is a
much better chance that a standards body or device manufacturer can
implement their interfaces in a way that works with the OS. There is a
much higher chance they will attract CVEs and be forced to fix it if
the security expectations are clearly laid out. You had a good
observation in one of those links about how they are not OS
people. Let's help them do better.

Shunt the less robust stuff to fwctl and then people can also make
their own security choices, don't enable or load the fwctl modules and
you get more protection. It is closer to your
CONFIG_CXL_MEM_RAW_COMMANDS=y but at runtime.

I think I captured most of your commentary below here in patch 6.

>   Effects Log". In that "trust Command Effects" scenario the kernel still
>   has no idea what the command is actually doing, but it can at least
>   assert that the device does not claim that the command changes the
>   contents of system-memory. Now, you might say, "the device can just
>   lie", but that betrays a conceit of the kernel restriction. A device
>   could lie that a Linux wrapped command when passed certain payloads does
>   not in turn proxy to a restricted command.

Yeah, we have to trust the device. If the device is hostile toward the
OS then there are already big problems. We need to allow for
unintentional defects in the devices, but we don't need to be
paranoid.

IMHO a command effects report, in conjunction with a robust OS centric
definition is something we can trust in.

> * Introspection / validation: Subsystem community needs to be able to
>   audit behavior after the fact.
> 
>   To me this means even if the kernel is letting a command through based
>   on the stated Command Effect of "Configuration Change after Cold Reset"
>   upstream community has a need to be able to read the vendor
>   specification for that command. I.e. commands might be vendor-specific,
>   but never vendor-private. I see this as similar to the requirement for
>   open source userspace for sophisticated accelerators.

I'm less hard on this. As long as reasonable open userspace exists I
think it is fine to let other stuff through too. I can appreciate the
DRM stance on this, but IMHO, there is meaningfully more value for open
source in trying to get an open Vulkan implementation vs blocking users
from reading their vendor'd diagnostic SI values.

I don't think we should get into some kind of extremism and insist
that every single bit must be documented/standardized or Linux won't
support it.

This is why I envision fwctl as not being suitable for actual
datapath/performance stuff.

> * Collaboration: open standards support open driver maintenance.
> 
>   Without standards we end up with awkward situations like Confidential
>   Computing where every vendor races to implement the same functionality
>   in arbitrarily different and vendor specific ways.

Standards are important. Linux is not a standards body. Linux
maintainers can only advise, not force, the industry to make
standards. At a certain point Linux's job is to implement software to
support what people have built. CC is a sad example where the industry
did not get together enough, but still Linux will support the CC mess.

>   For CXL devices, and I believe the devices fwctl is targeting, there
>   are a whole class of commands for vendor specific configuration and
>   debug. Commands that the kernel really need not worry about.

Right.

>   Some subsystems may want to allow high-performance science experiments
>   like what NVMe allows, but it seems worth asking the question if
>   standardizing device configuration and debug is really the best use of
>   upstream's limited time?

From what I've been seeing it looks like a significant waste of
time. For example there is minimal industry value in standardizing
values stored in a device's boot time flash configuration. If some
common software wants to access really generic configuration (like
SRIOV enable) then sure there is merit, but that is really the
minority.

Jason
Jason Gunthorpe June 5, 2024, 2:50 p.m. UTC | #12
On Tue, Jun 04, 2024 at 03:32:16PM -0700, Jakub Kicinski wrote:
> On Tue, 4 Jun 2024 14:28:05 -0700 Saeed Mahameed wrote:
> > On 04 Jun 07:04, Jakub Kicinski wrote:
> > >On Mon, 3 Jun 2024 21:01:58 -0600 David Ahern wrote:  
> > >> Seriously, Jakub, how is that in any way related to this patch set?  
> > >
> > >Whether they admit it or not, DOCA is a major reason nVidia wants
> > >this to be standalone rather than part of RDMA.
> > 
> > No, DOCA isn't on the agenda for this new interface. But what is the point
> > in arguing?
> 
> I'm not arguing any point, we argued enough. But you failed to disclose
> that DOCA is a very likely user of this interface. So whoever you're
> planning to submit it to should know.

This is getting ridiculous. Did you disclose in your PSP cover letter
that all that work and new kernel uAPI is to support Meta's proprietary
user space, even to the point that NO open source implementation even
exists yet? Let me check. Nope.

So why this made up double standard for Saeed? Especially after he
already said DOCA isn't on the agenda for mlx5's fwctl?

> > >> You are basically suggesting that if any vendor ever has an out of tree
> > >> option for its hardware every patch it sends should be considered a ruse
> > >> to enable or simplify proprietary options.
> > 
> > It's apparent that you're attributing sinister agendas to patchsets when
> > you fail to offer valid technical opinions regarding the NAK nature. Let's
> > address this outside of this patchset, as this isn't the first occurrence.
> > Consistency in evaluating patches is crucial;
> 
> Exactly :| Netdev people, including multiple prominent developers from
> Mellanox/nVidia have been nacking SDK interfaces in Linux networking
> for 20 years. How are we going to look to all the companies which have
> been doing IPUs for over a decade if we change the rules for nVidia?

That is a bleak way of painting things. fwctl is a developing
consensus on how to solve this class of problems. We get to have a
consensus that is different than the past because Linux does actually
evolve. All your long suffering IPU companies are welcome to use fwctl
with their products going forward just as equally to nvidia/etc.

Amazingly, "rules" are not set in stone in Linux!

> If by "let's address this outside of this patchset" you mean that we
> should have a discussion about maintainer favoritism, and subsystem
> capture by vendors - you have my full support!

This vendor bashing needs to stop. You could have easily used the
word companies and been much more accurate. At this point the
hyperscale companies - your so-called "users" - are much more guilty
of "subsystem capture" than any vendor is, and it certainly has changed
the culture of Linux.

There are many legitimate complaints all around of maintainers being
capricious - it doesn't matter who employs them.

Jason
Jakub Kicinski June 5, 2024, 3:41 p.m. UTC | #13
On Wed, 5 Jun 2024 11:50:39 -0300 Jason Gunthorpe wrote:
> On Tue, Jun 04, 2024 at 03:32:16PM -0700, Jakub Kicinski wrote:
> > On Tue, 4 Jun 2024 14:28:05 -0700 Saeed Mahameed wrote:  

> > > No, DOCA isn't on the agenda for this new interface. But what is the point
> > > in arguing?  
> > 
> > I'm not arguing any point, we argued enough. But you failed to disclose
> > that DOCA is a very likely user of this interface. So whoever you're
> > planning to submit it to should know.  
> 
> This is getting ridiculous. Did you disclose in your PSP cover letter
> that all that work and new kernel uAPI is to support Meta's proprietary
> user space, even to the point that NO open source implementation even
> exists yet? Let me check. Nope.

There is no Meta proprietary implementation. Some Meta folks who are on
the CC of the submission are working on extending Fizz, but it's not
ready. Fizz is here: https://github.com/facebookincubator/fizz
David Ahern June 6, 2024, 1:58 a.m. UTC | #14
On 6/4/24 8:04 AM, Jakub Kicinski wrote:
> Ooo, is that a sore spot?

Maintainer overreach? Absolutely.

The sky is not falling with this proposed subsystem; engineers are
merely trying to solve real, customer problems.
David Ahern June 6, 2024, 2:35 a.m. UTC | #15
On 6/5/24 7:59 AM, Jason Gunthorpe wrote:
> On Tue, Jun 04, 2024 at 04:56:57PM -0700, Dan Williams wrote:
>> Jakub Kicinski wrote:
>> [..]
>>> I don't begrudge anyone building proprietary options, but leave
>>> upstream out of it.
>>
>> So I am of 2 minds here. In general, how is upstream benefited by
>> requiring every vendor command to be wrapped by a Linux command?
> 
> People actually can use upstream :)
> 
> Amazingly there is inherent benefit to people being able to use the
> software we produce.

There is. There is a clear preference for open source kernels and drivers.

Until a feature is standardized and/or commoditized, it does not make
sense to create a uapi for every H/W vendor whim. All of them are
attempting to solve real problems; some of them will stick. We know
which features are valuable when customers use them, ask for them and
other vendors copy them. Until then it is a 1-off by a vendor basically
proposing a solution. Not all ideas are good ideas, and we do not need
the burden of a uapi or the burden of out of tree drivers.

> 
>> 3 years on from that recommendation it seems no vendor has even needed
>> that level of distribution help. I.e. checking a few distro kernels
>> (Fedora, openSUSE) shows no uptake for CONFIG_CXL_MEM_RAW_COMMANDS=y in
>> their debug builds. I can only assume that locally compiled custom
>> kernel binaries are filling the need.
> 
> My strong advice would be to be careful about this. Android-ism where
> nobody runs the upstream kernel is a real thing. For something
> emerging like CXL there is a real risk that the hyperscale folks will
> go off and do their own OOT stuff and in-tree CXL will be something
> usable but inferior. I've seen this happen enough times..
> 
> If people come and say we need X and the maintainer says no, they
> don't just give up and stop doing X, they go and do X anyhow out of
> tree. This has become especially true now that the center of business
> activity in server-Linux is driven by the hyperscale crowd that don't
> care much about upstream. Linux maintainers don't actually have the
> power to force the industry to do things, though people do keep
> trying.. Maintainers can only lead, and productive leading is not done
> with a NO.

+1
Dan Williams June 6, 2024, 4:56 a.m. UTC | #16
Jason Gunthorpe wrote:
[..]
> > 3 years on from that recommendation it seems no vendor has even needed
> > that level of distribution help. I.e. checking a few distro kernels
> > (Fedora, openSUSE) shows no uptake for CONFIG_CXL_MEM_RAW_COMMANDS=y in
> > their debug builds. I can only assume that locally compiled custom
> > kernel binaries are filling the need.
> 
> My strong advice would be to be careful about this. Android-ism where
> nobody runs the upstream kernel is a real thing. For something
> emerging like CXL there is a real risk that the hyperscale folks will
> go off and do their own OOT stuff and in-tree CXL will be something
> usable but inferior. I've seen this happen enough times..

Hence my openness to considering fwctl...

> If people come and say we need X and the maintainer says no, they
> don't just give up and stop doing X, they go and do X anyhow out of
> tree. This has become especially true now that the center of business
> activity in server-Linux is driven by the hyperscale crowd that don't
> care much about upstream.

"...don't care much about upstream...". This could be a whole separate
thread unto itself.

> Linux maintainers don't actually have the power to force the industry
> to do things, though people do keep trying.. Maintainers can only
> lead, and productive leading is not done with a NO.
> 
> You will start to see this pain in maybe 5-10 years if CXL starts to
> be something deployed in an enterprise RedHat/Dell/etc sort of
> environment. Then that missing X becomes a critical issue because it
> turns out the hyperscale folks long since figured out it is really
> important but didn't do anything to enable it upstream.

This matches other feedback I have heard recently. Yes, distros hate
contending with every vendor's userspace toolkit, that was the original
distro feedback motivating CONFIG_CXL_MEM_RAW_COMMANDS to have a poison
pill of WARN() on use. However, allowing more vendor commands is
preferable to contending with vendor out-of-tree drivers that likely
help keep the enterprise-distro-kernel stable-ABI train rolling. In
other words, legalize it in order to centrally regulate it.

[..]
> This is my effort here. If we document the expectations there is a
> much better chance that a standard body or device manufacturer can
> implement their interfaces in a way that works with the OS. There is a
> much higher chance they will attract CVEs and be forced to fix it if
> the security expectations are clearly laid out. You had a good
> observation in one of those links about how they are not OS
> people. Let's help them do better.
> 
> Shunt the less robust stuff to fwctl and then people can also make
> their own security choices, don't enable or load the fwctl modules and
> you get more protection. It is closer to your
> CONFIG_CXL_MEM_RAW_COMMANDS=y but at runtime.
> 
> I think I captured most of your commentary below here in patch 6.

I will take a look...

> >   Effects Log". In that "trust Command Effects" scenario the kernel still
> >   has no idea what the command is actually doing, but it can at least
> >   assert that the device does not claim that the command changes the
> >   contents of system-memory. Now, you might say, "the device can just
> >   lie", but that betrays a conceit of the kernel restriction. A device
> >   could lie that a Linux wrapped command when passed certain payloads does
> >   not in turn proxy to a restricted command.
> 
> Yeah, we have to trust the device. If the device is hostile toward the
> OS then there are already big problems. We need to allow for
> unintentional defects in the devices, but we don't need to be
> paranoid.
> 
> IMHO a command effects report, in conjunction with a robust OS centric
> definition is something we can trust in.

So this is where I want to start and see if we can bridge the trust gap.

I am warming to your assertion that there is a wide array of
vendor-specific configuration and debug that are not an efficient use of
upstream's time to wrap in a shared Linux ABI. I want to explore fwctl
for CXL for that use case, I personally don't want to marshal a Linux
command to each vendor's slightly different backend CXL toggles.

At the same time, I also agree with the contention that a "do anything
you want and get away with it" tunnel invites shenanigans from folks
that may not care about the long term health of the Linux kernel vs
their short term interests. That it is difficult to unring the bell once
a tunnel is in place. While subsystems will rightly take different
stances to fwctl policy, that lack of one-size-fits-all seems not
sufficient reason to keep the concept out of the kernel entirely.

I appreciate that you crafted this interface with an eye towards making
it unsuitable for data-path operations.

So my questions to try to understand the specific sticking points more
are:

1/ Can you think of a Command Effect that the device could enumerate to
address the specific shenanigans that netdev is worried about? In other
words if every command a device enables has the stated effect of
"Configuration Change after Reset" does that cut out a significant
portion of the concern? Make this a debate on finer grained effects not
coarse grained binary decision on whether fwctl should move forward at
all.

2/ About the "what if the device lies?" question. We can't revert code
that used to work, but we can definitely work with enterprise distros to
turn off fwctl where there is concern it may lead or is leading to
shenanigans. So, document what each subsystem's stance towards fwctl is,
like maybe a distro only wants fwctl to front publicly documented vendor
commands, or maybe private vendor commands ok, but only with a
constrained set of Command Effects (I potentially see CXL here). A
distro should know what they are opting into for each fwctl instance, it
likely will always need to be subsystem specific policy. A distro can
also decide lockdown policy based on Command Effects above and beyond
the ones that clearly state they allow the device to modify the running
kernel.
Leon Romanovsky June 6, 2024, 8:50 a.m. UTC | #17
On Wed, Jun 05, 2024 at 09:56:14PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:

<...>

> So my questions to try to understand the specific sticking points more
> are:
> 
> 1/ Can you think of a Command Effect that the device could enumerate to
> address the specific shenanigans that netdev is worried about? In other
> words if every command a device enables has the stated effect of
> "Configuration Change after Reset" does that cut out a significant
> portion of the concern? 

It will prevent SR-IOV devices (or, more accurately, their VFs)
from being configured through fwctl, as they are destroyed in HW
during reboot.

Thanks
Jakub Kicinski June 6, 2024, 2:18 p.m. UTC | #18
On Wed, 5 Jun 2024 20:35:49 -0600 David Ahern wrote:
> Until a feature is standardized and/or commoditized, it does not make
> sense to create a uapi for every H/W vendor whim.

This is not about non-standard features. I work with multiple vendors
as my day job. I ask them how to set basic link configuration and the
support person gives me a link to the vendor tools! I wish I could show
you the emails.

> All of them are attempting to solve real problems; some of them will
> stick. We know which features are valuable when customers use them,

Yes, once customers deploy a feature implemented via a vendor API
they will definitely migrate to a different API. Customers like risk
and wasting their engineering resources reimplementing and redeploying
things? And we have so much success moving users to new APIs in Linux!

> ask for them and other vendors copy them. Until then it is a 1-off by
> a vendor basically proposing a solution.

Certainly. Because... who exactly will ask the second vendor to
implement the common API? 

And the second vendor will most certainly not mind the extra delay and
inconvenience having their product shipped via the publicly reviewed,
and slow to deploy kernel, while the first one is happily selling
the same feature already.

> Not all ideas are good ideas, and we do not need the burden of a uapi
> or the burden of out of tree drivers.

This API gives user space SDKs a trivial way of implementing all
switching, routing, filtering, QoS offloads etc.
An argument can be made that given somewhat mixed switchdev experience
we should just stay out of the way and let that happen. But just make
that argument then, instead of pretending the use of this API will be
limited to custom very vendor specific things.

Again, if someone needs this to ship their custom CXL/Infiniband 
AI fabric magic, which is un-interoperable by design -- none of 
my concern. But keep TCP/IP networking out of this :|
Jason Gunthorpe June 6, 2024, 2:41 p.m. UTC | #19
On Wed, Jun 05, 2024 at 09:56:14PM -0700, Dan Williams wrote:

> > If people come and say we need X and the maintainer says no, they
> > don't just give up and stop doing X, they go and do X anyhow out of
> > tree. This has become especially true now that the center of business
> > activity in server-Linux is driven by the hyperscale crowd that don't
> > care much about upstream.
> 
> "...don't care much about upstream...". This could be a whole separate
> thread unto itself.

Heh, it is a topic, but perhaps not one for polite company :)

> > Linux maintainers don't actually have the power to force the industry
> > to do things, though people do keep trying.. Maintainers can only
> > lead, and productive leading is not done with a NO.
> > 
> > You will start to see this pain in maybe 5-10 years if CXL starts to
> > be something deployed in an enterprise RedHat/Dell/etc sort of
> > environment. Then that missing X becomes a critical issue because it
> > turns out the hyperscale folks long since figured out it is really
> > important but didn't do anything to enable it upstream.
> 
> This matches other feedback I have heard recently. Yes, distros hate
> contending with every vendor's userspace toolkit, that was the
> original

I'm not sure that is 100% true. Sure nobody likes that you have to
type 'abc X' and 'def Y' to do a similar thing, but from a distro
perspective if abc and def are both open sourced and packaged in the
distro it is still a far better outcome than users doing OOT drivers
and binary-only tools.

eg one of the long standing main Mellanox tools that is being ported
to fwctl is open source and in all distros:

 https://rpmfind.net/linux/rpm2html/search.php?query=mstflint

Projects have already experimented building tooling on top of it to
make a more cross-vendor experience in some areas.

In my view it is wrong to think the kernel is the only place we can
make generic things or that allowing userspace to see the raw device
interface immediately means fragmentation and chaos. The industry is
more robust than that. Giving people working in userspace room to
invent their own solutions is actually helpful to driving some
commonality. There are already soft targets in the K8S that people
need to fit into, if the first few steps are with abc/def tools and
that brings us to an eventual true commonality, then great.

> distro feedback motivating CONFIG_CXL_MEM_RAW_COMMANDS to have a poison
> pill of WARN() on use. However, allowing more vendor commands is
> preferable to contending with vendor out-of-tree drivers that likely
> help keep the enterprise-distro-kernel stable-ABI train rolling. In
> other words, legalize it in order to centrally regulate it.

I also liked Jakub's idea of putting a taint in for things that were
likely to have an impact on support and debug, I included that concept
in fwctl.

> > >   Effects Log". In that "trust Command Effects" scenario the kernel still
> > >   has no idea what the command is actually doing, but it can at least
> > >   assert that the device does not claim that the command changes the
> > >   contents of system-memory. Now, you might say, "the device can just
> > >   lie", but that betrays a conceit of the kernel restriction. A device
> > >   could lie that a Linux wrapped command when passed certain payloads does
> > >   not in turn proxy to a restricted command.
> > 
> > Yeah, we have to trust the device. If the device is hostile toward the
> > OS then there are already big problems. We need to allow for
> > unintentional defects in the devices, but we don't need to be
> > paranoid.
> > 
> > IMHO a command effects report, in conjunction with a robust OS centric
> > definition is something we can trust in.
> 
> So this is where I want to start and see if we can bridge the trust gap.
> 
> I am warming to your assertion that there is a wide array of
> vendor-specific configuration and debug that are not an efficient use of
> upstream's time to wrap in a shared Linux ABI. I want to explore fwctl
> for CXL for that use case, I personally don't want to marshal a Linux
> command to each vendor's slightly different backend CXL toggles.

Personally I think this idea to marshal/unmarshal everything in the
kernel is often misguided. If it is truly obvious and actually shared
multi-vendor capability then by all means go and do it.

But if you are spending weeks/months fighting about uAPI because all
the vendors are so different, it isn't obvious what is "generic" then
you've probably already lost. The very worst outcome is a per-device
uAPI masquerading as an obfuscated "generic" uAPI that wasted ages of
people's time to argue out.

> At the same time, I also agree with the contention that a "do anything
> you want and get away with it" tunnel invites shenanigans from folks
> that may not care about the long term health of the Linux kernel vs
> their short term interests.

IMHO this is disproven by history. The above mstflint I linked to is
as old as mlx5 HW, it runs today over PCI config space and an OOT
driver. Where is the real damage to the long term health of Linux or
the ecosystem?

Like I said before I view there is a difference between DRM wanting a
Vulkan stack and doing some device specific
configuration/debugging. One has vastly more open source value than
the other.

> So my questions to try to understand the specific sticking points more
> are:
> 
> 1/ Can you think of a Command Effect that the device could enumerate to
> address the specific shenanigans that netdev is worried about? 

Nothing comes to mind..

> In other words if every command a device enables has the stated
> effect of "Configuration Change after Reset" does that cut out a
> significant portion of the concern?

Related to configuration - one of Saeed's original ideas was to
implement a devlink command to set the configurables in the flash in a
way that mlx5 could implement all of its options, ideally with
configurables discovered dynamically from the running device. This LPC
presentation was so aggressively rejected by Jakub that Saeed abandoned
it. In the discussion it was clear Jakub is requesting to review and
possibly reject every configurable.

On this topic, unfortunately, I don't see any technical middle ground
between "netdev is the gatekeeper for all FLASH configurables" and
"devices can be fully configured regardless of their design".

> 2/ About the "what if the device lies?" question. We can't revert code
> that used to work, but we can definitely work with enterprise distros to
> turn off fwctl where there is concern it may lead or is leading to
> shenanigans. 

Security is the one place where Linus has tolerated userspace
regressions. In this specific case I documented (or at least that was
the intent) there would be regression consequences to breaking the
security rules. Commands can be retroactively restricted to higher CAP
levels and rejected from lockdown if the device attracts a CVE.

IMHO the ecosystem is strongly motivated to do security seriously these
days, I am not so worried.

> So, document what each subsystem's stance towards fwctl is,
> like maybe a distro only wants fwctl to front publicly documented vendor
> commands, or maybe private vendor commands ok, but only with a
> constrained set of Command Effects (I potentially see CXL here). 

I wouldn't say subsystem here, but technology. I think it is
reasonable that a CXL fwctl driver have some kconfig tunables like you
already have. This idea works a lot better if the underlying thing is
already standards based.

Linux subsystem isn't a meaningful concept for a multi-function device
like mlx5 and others.

Thanks,
Jason
Jason Gunthorpe June 6, 2024, 2:48 p.m. UTC | #20
On Thu, Jun 06, 2024 at 07:18:11AM -0700, Jakub Kicinski wrote:

> An argument can be made that given somewhat mixed switchdev experience
> we should just stay out of the way and let that happen. But just make
> that argument then, instead of pretending the use of this API will be
> limited to custom very vendor specific things.

Huh?

At least mlx5 already has a very robust userspace competition to
switchdev using RDMA APIs, available in DPDK. This has long since been
done and is widely deployed.

I have no idea where you get this made up idea that fwctl is somehow
about dataplane SDKs. The accelerated networking industry long ago
moved past netdev in upstream, it is well known to everyone. There
is no trick here.

fwctl is not some scheme to sneak dataplane SDKs into the kernel, you
are just making stuff up.

Jason
Jakub Kicinski June 6, 2024, 2:58 p.m. UTC | #21
On Thu, 6 Jun 2024 11:41:02 -0300 Jason Gunthorpe wrote:
> In my view it is wrong to think the kernel is the only place we can
> make generic things or that allowing userspace to see the raw device
> interface immediately means fragmentation and chaos. The industry is
> more robust than that. Giving people working in userspace room to
> invent their own solutions is actually helpful to driving some
> commonality. There are already soft targets in the K8S that people
> need to fit into, if the first few steps are with abc/def tools and
> that brings us to an eventual true commonality, then great.

Yes, this is the core of our disagreement. And one which is quite hard
to resolve with technical arguments.

I believe the kernel may not be a great place to keep all the controls,
but it is in my opinion the most healthy open source project among 
the available options. You mention K8S, but I'd give SoNiC (the NOS) 
as a more relevant example. A hyperscaler or another trillion dollar
company can certainly have a swing at creating other open layers of
commonality. Together with its other trillion dollar friends.

Removing the minor inconvenience of having to ship an out of tree
module for out of tree tools is not worth the loss.
Jakub Kicinski June 6, 2024, 3:05 p.m. UTC | #22
On Thu, 6 Jun 2024 11:48:18 -0300 Jason Gunthorpe wrote:
> > An argument can be made that given somewhat mixed switchdev experience
> > we should just stay out of the way and let that happen. But just make
> > that argument then, instead of pretending the use of this API will be
> > limited to custom very vendor specific things.  
> 
> Huh?

I'm sorry, David has been working in netdev for a long time.
I have a tendency to address the person I'm replying to,
assuming their level of understanding of the problem space.
Which makes it harder to understand for bystanders.

> At least mlx5 already has a very robust userspace competition to
> switchdev using RDMA APIs, available in DPDK. This has long since been
> done and is widely deployed.

Yeah, we had this discussion multiple times

> I have no idea where you get this made up idea that fwctl is somehow
> about dataplane SDKs. The accelerated networking industry long ago
> moved past netdev in upstream, it is well known to everyone. There
> is no trick here.
> 
> fwctl is not some scheme to sneak dataplane SDKs into the kernel, you
> are just making stuff up.

By dataplane SDK you mean DOCA? I don't even want to go there.
I just meant forwarding offload _which I said_. You didn't understand
and now you're accusing me of "making stuff up".

This whole conversation is such a damn waste of time.
Dan Williams June 6, 2024, 5:24 p.m. UTC | #23
Jason Gunthorpe wrote:
[..]
> > I am warming to your assertion that there is a wide array of
> > vendor-specific configuration and debug that are not an efficient use of
> > upstream's time to wrap in a shared Linux ABI. I want to explore fwctl
> > for CXL for that use case, I personally don't want to marshal a Linux
> > command to each vendor's slightly different backend CXL toggles.
> 
> Personally I think this idea to marshal/unmarshal everything in the
> kernel is often misguided. If it is truly obvious and actually shared
> multi-vendor capability then by all means go and do it.
> 
> But if you are spending weeks/months fighting about uAPI because all
> the vendors are so different, it isn't obvious what is "generic" then
> you've probably already lost. The very worst outcome is a per-device
> uAPI masquerading as an obfuscated "generic" uAPI that wasted ages of
> people's time to argue out.

Certainly once you have gotten to the "months of arguing" point it begs the
question "was there really any generic benefit to reap in the first
place?"

That said, *some* grappling, especially when multiple vendors hit the
list with a similar feature at the same time, has yielded
collaboration in the past. So I might be a few rungs back on the
spectrum from where you are, but I concede that yes, there is a point of
diminishing to negative returns.

> > At the same time, I also agree with the contention that a "do anything
> > you want and get away with it" tunnel invites shenanigans from folks
> > that may not care about the long term health of the Linux kernel vs
> > their short term interests.
> 
> IMHO this is disproven by history. The above mstflint I linked to is
> as old as mlx5 HW, it runs today over PCI config space and an OOT
> driver. Where is the real damage to the long term health of Linux or
> the ecosystem?
> 
> Like I said before I view there is a difference between DRM wanting a
> Vulkan stack and doing some device specific
> configuration/debugging. One has vastly more open source value than
> the other.

Fair.

> > So my questions to try to understand the specific sticking points more
> > are:
> > 
> > 1/ Can you think of a Command Effect that the device could enumerate to
> > address the specific shenanigans that netdev is worried about? 
> 
> Nothing comes to mind..

Ugh, that indeed seems too severe.

> > In other words if every command a device enables has the stated
> > effect of "Configuration Change after Reset" does that cut out a
> > significant portion of the concern?
> 
> Related to configuration - one of Saeed's original ideas was to
> implement a devlink command to set the configurables in the flash in a
> way that mlx5 could implement all of its options, ideally with
> configurables discovered dynamically from the running device. This LPC
> presentation was so aggressively rejected by Jakub that Saeed abandoned
> it. In the discussion it was clear Jakub is requesting to review and
> possibly reject every configurable.
> 
> On this topic, unfortunately, I don't see any technical middle ground
> between "netdev is the gatekeeper for all FLASH configurables" and
> "devices can be fully configured regardless of their design".

This gets back to the unspoken conceit of the kernel restriction that I
mentioned earlier. At some point the kernel restriction begets a cynical
in-tree workaround or an out-of-tree workaround which either way means
upstream Linux loses.

> > 2/ About the "what if the device lies?" question. We can't revert code
> > that used to work, but we can definitely work with enterprise distros to
> > turn off fwctl where there is concern it may lead or is leading to
> > shenanigans. 
> 
> Security is the one place where Linus has tolerated userspace
> regressions. In this specific case I documented (or at least that was
> the intent) there would be regression consequences to breaking the
> security rules. Commands can be retroactively restricted to higher CAP
> levels and rejected from lockdown if the device attracts a CVE.
> 
> IMHO the ecosystem is strongly motivated to do security seriously these
> days, I am not so worried.

That is a good point, if a Command Effect gets tied to a CVE, or a
cynical workaround gets tied to a CVE, both of those demand an upstream
and distro response.

> > So, document what each subsystem's stance towards fwctl is,
> > like maybe a distro only wants fwctl to front publicly documented vendor
> > commands, or maybe private vendor commands ok, but only with a
> > constrained set of Command Effects (I potentially see CXL here). 
> 
> I wouldn't say subsystem here, but technology. I think it is
> reasonable that a CXL fwctl driver have some kconfig tunables like you
> already have. This idea works a lot better if the underlying thing is
> already standards based.

True, I worry about these technologies that cross upstream maintainer
boundaries. When you have a composable switch that enables net, block,
and/or mem use cases, which upstream maintainer policy applies to the
fwctl posture of that thing?
David Ahern June 6, 2024, 5:47 p.m. UTC | #24
On 6/6/24 9:05 AM, Jakub Kicinski wrote:
> On Thu, 6 Jun 2024 11:48:18 -0300 Jason Gunthorpe wrote:
>>> An argument can be made that given somewhat mixed switchdev experience
>>> we should just stay out of the way and let that happen. But just make
>>> that argument then, instead of pretending the use of this API will be
>>> limited to custom very vendor specific things.  
>>
>> Huh?
> 
> I'm sorry, David has been working in netdev for a long time.

And I will continue working on Linux networking stack (netdev) while I
also work with the IB S/W stack, fwctl, and any other part of Linux
relevant to my job. I am not going to pick a silo (and should not be
required to).

> I have a tendency to address the person I'm replying to,
> assuming their level of understanding of the problem space.
> Which makes it harder to understand for bystanders.
> 
>> At least mlx5 already has a very robust userspace competition to
>> switchdev using RDMA APIs, available in DPDK. This has long since been
>> done and is widely deployed.
> 
> Yeah, we had this discussion multiple times

The switchdev / sonic comparison came to mind as well during this
thread. The existence of a kernel way (switchdev) has not stopped sonic
(userspace SDK) from gaining traction. In some cases the SDK is required
for device features that do not have a kernel uapi or vendors refuse to
offer a kernel way, so it is the only option.

The bottom line to me is that these hardline, dogmatic approaches -
resisting the recognition of reality - are only harming users. There is a
middle ground, open source drivers and tools that offer more flexibility.
Dan Williams June 6, 2024, 10:11 p.m. UTC | #25
Leon Romanovsky wrote:
> On Wed, Jun 05, 2024 at 09:56:14PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> 
> <...>
> 
> > So my questions to try to understand the specific sticking points more
> > are:
> > 
> > 1/ Can you think of a Command Effect that the device could enumerate to
> > address the specific shenanigans that netdev is worried about? In other
> > words if every command a device enables has the stated effect of
> > "Configuration Change after Reset" does that cut out a significant
> > portion of the concern? 
> 
> It will prevent SR-IOV devices (or more accurate their VFs)
> to be configured through the fwctl, as they are destroyed in HW
> during reboot.

Right, but between zero configurability and losing live SR-IOV
configurability is there still value? Note, this is just a thought
experiment on what if any Command Effects Linux can comfortably tolerate
vs those that start to be more spicy and dip into removing stimulus /
focus on the commons, or otherwise injuring collaboration.
Jason Gunthorpe June 7, 2024, 12:02 a.m. UTC | #26
On Thu, Jun 06, 2024 at 03:11:21PM -0700, Dan Williams wrote:
> Leon Romanovsky wrote:
> > On Wed, Jun 05, 2024 at 09:56:14PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > 
> > <...>
> > 
> > > So my questions to try to understand the specific sticking points more
> > > are:
> > > 
> > > 1/ Can you think of a Command Effect that the device could enumerate to
> > > address the specific shenanigans that netdev is worried about? In other
> > > words if every command a device enables has the stated effect of
> > > "Configuration Change after Reset" does that cut out a significant
> > > portion of the concern? 
> > 
> > It will prevent SR-IOV devices (or more accurate their VFs)
> > to be configured through the fwctl, as they are destroyed in HW
> > during reboot.
> 
> Right, but between zero configurability and losing live SR-IOV
> configurability is there still value? Note, this is just a thought
> experiment on what if any Command Effects Linux can comfortably tolerate
> vs those that start to be more spicy and dip into removing stimulus /
> focus on the commons, or otherwise injuring collaboration.

I like the idea of "takes effect on _function_ reset". VFs and PFs
both often have configuration that can become current once the function
is reset. A VF is usually reset by something like VFIO while a PF is
usually reset by a power cycle.

The fact configuration doesn't change until reset is, IMHO, a very
strong barrier from making some backdoor into a subsystem driver.
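
A minimal sketch of such an effect-based gate, assuming hypothetical
effect bits and names that are not part of the posted series: a command
is let through only when every effect it reports is either read-only or
deferred until the function is reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-command effect bits a device might enumerate. */
#define EFFECT_READ_ONLY          ((uint32_t)1 << 0) /* no state change */
#define EFFECT_CONFIG_AFTER_RESET ((uint32_t)1 << 1) /* applies after function reset */
#define EFFECT_IMMEDIATE_CONFIG   ((uint32_t)1 << 2) /* changes live behavior */

/* Effects this sketch tolerates without further checks. */
#define EFFECTS_ALLOWED (EFFECT_READ_ONLY | EFFECT_CONFIG_AFTER_RESET)

/*
 * Allow a command only if it reports at least one effect and none of
 * its reported effects fall outside the allowed set.
 */
static inline bool cmd_allowed(uint32_t effects)
{
	return effects != 0 && (effects & ~EFFECTS_ALLOWED) == 0;
}
```

With this shape, a command that mixes a deferred change with an
immediate one is rejected as a whole, since the device has declared that
part of its effect lands before any reset.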

Jason
Jason Gunthorpe June 7, 2024, 12:25 a.m. UTC | #27
On Thu, Jun 06, 2024 at 10:24:46AM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> [..]
> > > I am warming to your assertion that there is a wide array of
> > > vendor-specific configuration and debug that are not an efficient use of
> > > upstream's time to wrap in a shared Linux ABI. I want to explore fwctl
> > > for CXL for that use case, I personally don't want to marshal a Linux
> > > command to each vendor's slightly different backend CXL toggles.
> > 
> > Personally I think this idea to marshal/unmarshal everything in the
> > kernel is often misguided. If it is truly obvious and actually shared
> > multi-vendor capability then by all means go and do it.
> > 
> > But if you are spending weeks/months fighting about uAPI because all
> > the vendors are so different, it isn't obvious what is "generic" then
> > you've probably already lost. The very worst outcome is a per-device
> > uAPI masquerading as an obfuscated "generic" uAPI that wasted ages of
> > peoples time to argue out.
> 
> Certainly once you have gotten to the "months of arguing" point it begs the
> question "was there really any generic benefit to reap in the first
> place?"

Indeed, but I've seen, and participated in, these things many times :)

> That said, *some* grappling, especially when multiple vendors hit the
> list with a similar feature at the same time, has yielded
> collaboration in the past. 

Absolutely! But we have also frequently done that retroactively, like
see three examples and then consolidate the common APIs. The challenge
is uAPI. Since we can't change uAPI people like to rush to make it
future proof without examples. Broadly I lean towards waiting until we
have several examples to build a standard uAPI and let the examples
evolve on their own.

If there is value in the commonality then people will change over.

> This gets back to the unspoken conceit of the kernel restriction that I
> mentioned earlier. At some point the kernel restriction begets a cynical
> in-tree workaround or an out-of-tree workaround which either way means
> upstream Linux loses.

Right.. The kernel just doesn't have the power to say no to the
industry. Things will just go OOT and it is really our community that
suffers in the long run. As I said, you can't lead with NO.

IMHO there has to be a really high quality reason to keep support for
HW that people have built out of the kernel. Especially start ups and
other more vulnerable companies. I don't think Linux maintainers
should be choosing industry winners and losers. I sometimes feel I
have a minority opinion here though :(

> > > So, document what each subsystem's stance towards fwctl is,
> > > like maybe a distro only wants fwctl to front publicly documented vendor
> > > commands, or maybe private vendor commands ok, but only with a
> > > constrained set of Command Effects (I potentially see CXL here). 
> > 
> > I wouldn't say subsystem here, but technology. I think it is
> > reasonable that a CXL fwctl driver have some kconfig tunables like you
> > already have. This idea works a lot better if the underlying thing is
> > already standards based.
> 
> True, I worry about these technologies that cross upstream maintainer
> boundaries. When you have a composable switch that enables net, block,
> and/or mem use cases, which upstream maintainer policy applies to the
> fwctl posture of that thing?

fwctl is intended to sit on its own. I think it is even a bad
architecture direction that Linux has N different ways to flash FW on
devices, N different ways to read diagnostics, etc all because each
subsystem went on its own. With fwctl I'd like to see a greater
consolidation of not re-inventing the low level of fw interaction
differently in each and every subsystem.

Like you mentioned CXL has its own way to program flash. How many ways
does Linux have to update device flash now? :(

So, if you have a real multi-function device fwctl should be the
central place to operate the shared PCI function and the FW
interface. There may be some duplication in subsystems, but that is a
side effect of our sub-system siloed development model (software
architecture tends to follow org chart, after all)

Jason
Jiri Pirko June 7, 2024, 6:48 a.m. UTC | #28
Thu, Jun 06, 2024 at 07:47:20PM CEST, dsahern@kernel.org wrote:
>On 6/6/24 9:05 AM, Jakub Kicinski wrote:
>> On Thu, 6 Jun 2024 11:48:18 -0300 Jason Gunthorpe wrote:
>>>> An argument can be made that given somewhat mixed switchdev experience
>>>> we should just stay out of the way and let that happen. But just make
>>>> that argument then, instead of pretending the use of this API will be
>>>> limited to custom very vendor specific things.  
>>>
>>> Huh?
>> 
>> I'm sorry, David has been working in netdev for a long time.
>
>And I will continue working on Linux networking stack (netdev) while I
>also work with the IB S/W stack, fwctl, and any other part of Linux
>relevant to my job. I am not going to pick a silo (and should not be
>required to).
>
>> I have a tendency to address the person I'm replying to,
>> assuming their level of understanding of the problem space.
>> Which makes it harder to understand for bystanders.
>> 
>>> At least mlx5 already has a very robust userspace competition to
>>> switchdev using RDMA APIs, available in DPDK. This has long since been
>>> done and is widely deployed.
>> 
>> Yeah, we had this discussion multiple times
>
>The switchdev / sonic comparison came to mind as well during this
>thread. The existence of a kernel way (switchdev) has not stopped sonic
>(userspace SDK) from gaining traction. In some cases the SDK is required

Is this discussion technical or political? I'm asking because it makes
a huge difference. There is no technical reason why sonic does not use
a proper in-kernel solution, from what I see.
Yes, they chose technically the wrong way, a shortcut, requiring kernel
bypass. Honestly for reasons that are beyond my understanding :/


>for device features that do not have a kernel uapi or vendors refuse to
>offer a kernel way, so it is the only option.

Political reasons.


>
>The bottom line to me is that these hardline, dogmatic approaches -
>resisting the recognition of reality - are only harming users. There is a
>middle ground, open source drivers and tools that offer more flexibility.
>
Jiri Pirko June 7, 2024, 7:34 a.m. UTC | #29
Thu, Jun 06, 2024 at 04:18:11PM CEST, kuba@kernel.org wrote:
>On Wed, 5 Jun 2024 20:35:49 -0600 David Ahern wrote:
>> Until a feature is standardized and/or commoditized, it does not make
>> sense to create a uapi for every H/W vendor whim.
>
>This is not about non-standard features. I work with multiple vendors
>as my day job. I ask them how to set basic link configuration and the
>support person gives me a link to the vendor tools! I wish I could show
>you the emails.

Even without seeing the emails, I believe you. Well, isn't it just natural? I
mean, it always takes a bigger (sometimes much bigger) effort to
implement things properly by introducing/extending APIs/uAPIs.
Implementing things in a vendor tool is easy, low-hanging fruit, and people
naturally pick it.

I've been around netdev for the better part of a second decade.
I think, for the sake of discussion, it is worth mentioning that
a big part of netdev's success, despite its complexity, is that in the
past, any attempt at kernel bypass (I recall a few) was promptly rejected.
There was always big push for proper abstracted solution. And I believe
it helped a lot all over the place. Is this approach depleted?
I don't know, maybe. (And yes, I'm aware not everything could be done
this way).

I understand the reason and motivation for this patchset and what it
will solve, don't get me wrong. I kind of like it, it will help to
remove all the painful detours we currently have.

My concern is, it opens a Pandora's box for netdev *for sure*.
Is that desired and anticipated?

Do the gains outweigh the potential losses? Will it help the
ecosystem?

What is the motivation for a vendor to take the hard way of using a proper API
(even existing ones) afterwards?

Moreover, wouldn't this let vendors go off the leash and start
to introduce even more H/W vendor whims?

I think these are serious questions we need to ask before this is merged.


>
>> All of them are attempting to solve real problems; some of them will
>> stick. We know which features are valuable when customers use them,
>
>Yes, once customers deploy a feature implemented via a vendor API
>they will definitely migrate to a different API. Customers like risk
>and wasting their engineering resources reimplementing and redeploying
>things? And we have so much success moving users to new APIs in Linux!
>
>> ask for them and other vendors copy them. Until then it is a 1-off by
>> a vendor basically proposing a solution.
>
>Certainly. Because... who exactly will ask the second vendor to
>implement the common API? 
>
>And the second vendor will most certainly not mind the extra delay and
>inconvenience having their product shipped via the publicly reviewed,
>and slow to deploy kernel, while the first one is happily selling
>the same feature already.
>
>> Not all ideas are good ideas, and we do not need the burden of a uapi
>> or the burden of out of tree drivers.
>
>This API gives user space SDKs a trivial way of implementing all
>switching, routing, filtering, QoS offloads etc.
>An argument can be made that given somewhat mixed switchdev experience

Can you elaborate a bit more what you mean by "mixed switchdev
experience" please?



>we should just stay out of the way and let that happen. But just make
>that argument then, instead of pretending the use of this API will be
>limited to custom very vendor specific things.
>
>Again, if someone needs this to ship their custom CXL/Infiniband 
>AI fabric magic, which is un-interoperable by design -- none of 
>my concern. But keep TCP/IP networking out of this :|
>
Przemek Kitszel June 7, 2024, 10:47 a.m. UTC | #30
On 6/7/24 02:25, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2024 at 10:24:46AM -0700, Dan Williams wrote:
>> Jason Gunthorpe wrote:
>> [..]
>>>> I am warming to your assertion that there is a wide array of
>>>> vendor-specific configuration and debug that are not an efficient use of
>>>> upstream's time to wrap in a shared Linux ABI. I want to explore fwctl
>>>> for CXL for that use case, I personally don't want to marshal a Linux
>>>> command to each vendor's slightly different backend CXL toggles.
>>>
>>> Personally I think this idea to marshal/unmarshal everything in the
>>> kernel is often misguided. If it is truly obvious and actually shared
>>> multi-vendor capability then by all means go and do it.
>>>
>>> But if you are spending weeks/months fighting about uAPI because all
>>> the vendors are so different, it isn't obvious what is "generic" then
>>> you've probably already lost. The very worst outcome is a per-device
>>> uAPI masquerading as an obfuscated "generic" uAPI that wasted ages of
>>> peoples time to argue out.
>>
>> Certainly once you have gotten to the "months of arguing" point it begs the
>> question "was there really any generic benefit to reap in the first
>> place?"
> 
> Indeed, but I've seen, and participated in, these things many times :)
> 
>> That said, *some* grappling, especially when multiple vendors hit the
>> list with a similar feature at the same time, has yielded
>> collaboration in the past.
> 
> Absolutely! But we have also frequently done that retroactively, like
> see three examples and then consolidate the common APIs. The challenge
> is uAPI. Since we can't change uAPI people like to rush to make it
> future proof without examples. Broadly I lean towards waiting until we
> have several examples to build a standard uAPI and let the examples
> evolve on their own.
> 
> If there is value in the commonality then people will change over.

What has changed over the decades is that Linux now has many more users
than implementations of any given tool.

I would love to see the uAPI barrier move closer to the user; we would
then be free to refactor kernel APIs, given that "the system tool" is
updated at the same time.
Obviously, for a new uAPI that would (re)move the promise from the very
beginning.
Andrew Lunn June 7, 2024, 12:49 p.m. UTC | #31
> >This API gives user space SDKs a trivial way of implementing all
> >switching, routing, filtering, QoS offloads etc.
> >An argument can be made that given somewhat mixed switchdev experience
> 
> Can you elaborate a bit more what you mean by "mixed switchdev
> experience" please?

I don't want to put words in Jakub's mouth but, in my opinion,
switchdev has been great for SoHo switches. We have over 100
supported, mostly implemented by the community, but some vendors also
supporting their own hardware.

We have two enterprise switch families supported, each by its own
vendor. And we have one TOR switch family supported by the vendor.

So I would say switchdev has worked out great for SoHo, but kernel
bypass is still the norm for most things bigger than SoHo.

Why? My guess is, a product with a SoHo switch is not actually a
switch. It is a wifi box, with a switch. It is a cable modem, with a
switch. It is an inflight entertainment system, with a switch, etc.
It is much easier to build such multi-purpose systems when everything
is nicely integrated into the kernel, you don't have to fight with
multiple vendors supplying SDKs which only work on a disjoint set of
kernels, etc.

For bigger, single purpose devices, it is just a switch, there is less
inconvenience of using just one vendor SDK, on top of the vendor
prescribed kernel.

	Andrew
Leon Romanovsky June 7, 2024, 1:12 p.m. UTC | #32
On Thu, Jun 06, 2024 at 03:11:21PM -0700, Dan Williams wrote:
> Leon Romanovsky wrote:
> > On Wed, Jun 05, 2024 at 09:56:14PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > 
> > <...>
> > 
> > > So my questions to try to understand the specific sticking points more
> > > are:
> > > 
> > > 1/ Can you think of a Command Effect that the device could enumerate to
> > > address the specific shenanigans that netdev is worried about? In other
> > > words if every command a device enables has the stated effect of
> > > "Configuration Change after Reset" does that cut out a significant
> > > portion of the concern? 
> > 
> > It will prevent SR-IOV devices (or more accurate their VFs)
> > to be configured through the fwctl, as they are destroyed in HW
> > during reboot.
> 
> Right, but between zero configurability and losing live SR-IOV
> configurability is there still value?

For the users that are using SR-IOV, it is a big loss. It will require
from them to use two tools now instead of one.

My point is that we need to try and find best solution for the users
and not "compromise variant" that will make everyone unhappy.

Thanks
Jiri Pirko June 7, 2024, 1:34 p.m. UTC | #33
Fri, Jun 07, 2024 at 02:49:19PM CEST, andrew@lunn.ch wrote:
>> >This API gives user space SDKs a trivial way of implementing all
>> >switching, routing, filtering, QoS offloads etc.
>> >An argument can be made that given somewhat mixed switchdev experience
>> 
>> Can you elaborate a bit more what you mean by "mixed switchdev
>> experience" please?
>
>I don't want to put words in Jakub's mouth but, in my opinion,
>switchdev has been great for SoHo switches. We have over 100
>supported, mostly implemented by the community, but some vendors also
>supporting their own hardware.
>
>We have two enterprise switch families supported, each by its own
>vendor. And we have one TOR switch family supported by the vendor.
>
>So I would say switchdev has worked out great for SoHo, but kernel
>bypass is still the norm for most things bigger than SoHo.
>
>Why? My guess is, a product with a SoHo switch is not actually a
>switch. It is a wifi box, with a switch. It is a cable modem, with a
>switch. It is an inflight entertainment system, with a switch, etc.
>It is much easier to build such multi-purpose systems when everything
>is nicely integrated into the kernel, you don't have to fight with
>multiple vendors supplying SDKs which only work on a disjoint set of
>kernels, etc.
>
>For bigger, single purpose devices, it is just a switch, there is less
>inconvenience of using just one vendor SDK, on top of the vendor
>prescribed kernel.

I'm aware of what you wrote and understand it. I just thought Jakub's
mixed experience is about the APIs more than the politics behind vendors'
adoption process..


>
>	Andrew
>
David Ahern June 7, 2024, 2:50 p.m. UTC | #34
On 6/7/24 12:48 AM, Jiri Pirko wrote:
>> The switchdev / sonic comparison came to mind as well during this
>> thread. The existence of a kernel way (switchdev) has not stopped sonic
>> (userspace SDK) from gaining traction. In some cases the SDK is required
> 
> Is this discussion technical or political? I'm asking because it makes
> a huge difference. There is no technical reason why sonic does not use
> a proper in-kernel solution, from what I see.
> Yes, they chose technically the wrong way, a shortcut, requiring kernel
> bypass. Honestly for reasons that are beyond my understanding :/
> 
> 
>> for device features that do not have a kernel uapi or vendors refuse to
>> offer a kernel way, so it is the only option.
> 
> Political reasons.
> 

You meant financial reasons, not political. The dominant player in
switches has zero interest in switchdev, zero interest in open sourcing
their SDK. Nothing has changed on that front in the 9 years of
switchdev's existence and no amount of 'NO' by maintainers is ever going
to pressure said vendor to do that.

Mellanox offers both with the Spectrum line and should have a pretty
good understanding of how many customers deploy with the SDK vs
switchdev. Why is that? There are those who think in logical, simple
designs (switchdev), and those who prefer complex, all userspace designs
with ping-ponging messages across processes (sonic). The latter uses all
kinds of what I call silly rationalizations, from "userspace allows more
flexibility" to "dealing with the kernel is too rigid" or "getting
changes in is too hard" or, my favorite, "Linux does not scale".

The bottom line is that the SDK model is not going away. Period.

The networking stack has accepted kernel bypass compromises (xdp, xdp
sockets, OVS, a lot of the ebpf hooks, ... just examples) with the
rationale that more is brought into the Linux way. fwctl is a similar
effort - an attempt at bringing more into an open source driver and tooling.
Jason Gunthorpe June 7, 2024, 3:14 p.m. UTC | #35
On Fri, Jun 07, 2024 at 08:50:17AM -0600, David Ahern wrote:

> Mellanox offers both with the Spectrum line and should have a pretty
> good understanding of how many customers deploy with the SDK vs
> switchdev. Why is that? 

We offer lots of options with mlx5 switching too, and switchdev is not
being selected by customers principally for performance reasons, in my
view.

The OVS space wants to operate the switch much like a firewall and
this creates a high rate of database updates and exception
packets. DPDK can operate all the same offload HW from userspace and
avoid all the system call and other kernel overhead. It is much more
purpose built to what OVS wants to do. In the >50Gbps space this
matters a lot and overall DPDK performance notably wins over switchdev
for many OVS workloads - even though the high speed path is
near-identical.

In this role DPDK is effectively a switch SDK, an open source one at
least.

Sadly I'm seeing signs that proprietary OVS focused SDKs (think
various P4 offerings and others) are outcompeting open DPDK on
merit :(

For whatever reason the market for switching is not strongly motivated
toward open SDKs, and the available open solutions are struggling a
bit to compete.

But to repeat again, fwctl is not for dataplane, it is not for
implementing a switch SDK (go use RDMA if you want to do that). I will
write here a commitment to accept patches blocking such usages if
drivers try to abuse the purpose of the subsystem.

Jason
Jiri Pirko June 7, 2024, 3:50 p.m. UTC | #36
Fri, Jun 07, 2024 at 05:14:51PM CEST, jgg@nvidia.com wrote:
>On Fri, Jun 07, 2024 at 08:50:17AM -0600, David Ahern wrote:
>
>> Mellanox offers both with the Spectrum line and should have a pretty
>> good understanding of how many customers deploy with the SDK vs
>> switchdev. Why is that? 
>
>We offer lots of options with mlx5 switching too, and switchdev is not
>being selected by customers principally for performance reasons, in my
>view.
>
>The OVS space wants to operate the switch much like a firewall and
>this creates a high rate of database updates and exception
>packets. DPDK can operate all the same offload HW from userspace and
>avoid all the system call and other kernel overhead. It is much more
>purpose built to what OVS wants to do. In the >50Gbps space this
>matters a lot and overall DPDK performance notably wins over switchdev
>for many OVS workloads - even though the high speed path is
>near-identical.
>
>In this role DPDK is effectively a switch SDK, an open source one at
>least.
>
>Sadly I'm seeing signs that proprietary OVS focused SDKs (think
>various P4 offerings and others) are outcompeting open DPDK on
>merit :(
>
>For whatever reason the market for switching is not strongly motivated
>toward open SDKs, and the available open solutions are struggling a
>bit to compete.
>
>But to repeat again, fwctl is not for dataplane, it is not for
>implementing a switch SDK (go use RDMA if you want to do that). I will

switch sdk is all about control plane.


>write here a commitment to accept patches blocking such usages if
>drivers try to abuse the purpose of the subsystem.
>
>Jason
Jason Gunthorpe June 7, 2024, 5:24 p.m. UTC | #37
On Fri, Jun 07, 2024 at 05:50:41PM +0200, Jiri Pirko wrote:

> >But to repeat again, fwctl is not for dataplane, it is not for
> >implementing a switch SDK (go use RDMA if you want to do that). I will
> 
> switch sdk is all about control plane.

Ah, a poor term. I meant any involvement in the data flow of the
device including reaching into the so-called control plane of a switch
to manipulate data flow.

Jason
Jakub Kicinski June 8, 2024, 1:43 a.m. UTC | #38
On Fri, 7 Jun 2024 15:34:48 +0200 Jiri Pirko wrote:
> >For bigger, single purpose devices, it is just a switch, there is less
> >inconvenience of using just one vendor SDK, on top of the vendor
> >prescribed kernel.
> 
> I'm aware of what you wrote and understand it. I just thought Jakub's
> mixed experience is about the APIs more than the politics behind vendors'
> adoption process..

Not the API / implementation, just that the adoption is limited.
The benefits of using a standard Linux approach are outweighed by
the large pool of talent with experience programming using the SDK
of *the* vendor.
Daniel Vetter June 11, 2024, 3:36 p.m. UTC | #39
On Wed, Jun 05, 2024 at 10:59:11AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 04, 2024 at 04:56:57PM -0700, Dan Williams wrote:
> > * Introspection / validation: Subsystem community needs to be able to
> >   audit behavior after the fact.
> > 
> >   To me this means even if the kernel is letting a command through based
> >   on the stated Command Effect of "Configuration Change after Cold Reset"
> >   upstream community has a need to be able to read the vendor
> >   specification for that command. I.e. commands might be vendor-specific,
> >   but never vendor-private. I see this as similar to the requirement for
> >   open source userspace for sophisticated accelerators.
> 
> I'm less hard on this. As long as reasonable open userspace exists I
> think it is fine to let other stuff through too. I can appreciate the
> DRM stance on this, but IMHO, there is meaningfully more value for open
> source in trying get an open Vulkan implementation vs blocking users
> from reading their vendor'd diagnostic SI values.
> 
> I don't think we should get into some kind of extremism and insist
> that every single bit must be documented/standardized or Linux won't
> support it.

I figured it might be useful to paint what we do in DRM with a bit more
nuance. In the principles, we're indeed fairly radical in what we require,
but in practice we aim for a much more pragmatic approach in what we
merge. There are two major axes here:

1. One is ecosystem maturity. One end is 3d, with vulkan as the clear
industry standard, and an upstream full-featured userspace driver in
mesa3d is the only technically reasonable choice. And all gpu vendors
agree and by this year even nvidia started hiring an upstream team. But
this didn't happen magically overnight, it took 1-2 decades of background
discussions and tactical push&pulling to get there.

The other end is currently AI accelerators. It's a complete mess, where
across the platform (client, edge, cloud), customer and vendor dimension
every point has a different stack. And the problem is so obvious that
everyone is working to fix this, which means currently
https://xkcd.com/927/ is happening in parallel. Just to get things going
we're accepting pretty much anything that's a notch above total garbage
for userspace and for merging into the kernel.

2. The other part is how much it impacts applications. If you can't run
the same application across different vendors, the case for an upstream
stack becomes a lot weaker. At the other end is infrastructure enabling
like device configuration, error handling and recovery, hw debugging and
reliability/health reporting. That's a lot more vendor specific in nature
and needs to be customized anyway per deployment. And only much higher in
the stack, maybe in k8s, can a technically reasonable unification even
happen.  So again we're much more lenient about infrastructure enabling
and uapi than stuff applications will use directly.

Currently that's enough of a mess in drm that I feel like enforcing
something like fwctl is still too much. But maybe once fwctl is
established with other subsystems/devices we can start the conversations
with vendors to get this going a few years down the road.

Both together mean we land a lot of code that's questionable at best,
clear garbage at worst. But since we've been in the merging garbage
business just to get things going for decades, we've become pretty good at
dealing with the kernel internal and uapi fallout, some say too good. But
personally I don't think there's a path to where we are with 3d/vulkan
that doesn't go through years of this kind of suck, and very much merged
into upstream kind of suck.

For all the concerns about trusting vendors/devices to not abuse very broad
uapi interfaces: Modern accelerator command submission boils down to "run
this context at this $addr", and the kernel never ever directly sees
anything more fly by. That's the same interface you need for a no-op job
as a full blown AI workload, so in theory maximal abuse potential.

In practice, it doesn't seem to be an issue, at least not beyond the
intentionally pragmatic choices where we merge kernel code with known
sub-par/incomplete userspace. I'm not sure why, but to my knowledge all
attempts to break the spirit of our userspace rules while following the
letter die in vendor-internal discussions, at least for all the
established upstream driver teams.

And for new ones it takes years of private chats to get them going and
fully established in upstream anyway.

Maybe one reason we have a bit of an extremist reputation is that all the
public takes are the radical principled requirements, while the actual
pragmatic discussions mostly happen in private.

tldr; fwctl as I understand it feels like a bridge too far for drm today,
but I'd very much like someone else to make this happen so we could
eventually push towards adoption too.

Cheers, Sima
Jason Gunthorpe June 11, 2024, 4:17 p.m. UTC | #40
On Tue, Jun 11, 2024 at 05:36:17PM +0200, Daniel Vetter wrote:
> reliability/health reporting. That's a lot more vendor specific in nature
> and needs to be customized anyway per deployment. And only much higher in
> the stack, maybe in k8s, can a technically reasonable unification even
> happen.  So again we're much more lenient about infrastructure enabling
> and uapi than stuff applications will use directly.

To be clear, this is the specific niche fwctl is for. It is not for
GPU command submission or something like that, and as I said to Jiri I
would agree to aggressively block such abuses.
 
> Currently that's enough of a mess in drm that I feel like enforcing
> something like fwctl is still too much. But maybe once fwctl is
> established with other subsystems/devices we can start the conversations
> with vendors to get this going a few years down the road.

I wouldn't say enforcing, but instead of having every GPU driver build
their own weird vendor'd way to access their debug/diagnostic stuff,
steer them into fwctl. These data center GPUs with FW at least have
lots of appropriate stuff, and all the vendor OOT stuff has tooling to
inspect the GPUs far more than DRM has code for (e.g.
rocm-smi/nvidia-smi have some features that are potentially good
candidates for fwctl).

> In practice, it doesn't seem to be an issue, at least not beyond the
> intentionally pragmatic choices where we merge kernel code with known
> sub-par/incomplete userspace. I'm not sure why, but to my knowledge all
> attempts to break the spirit of our userspace rules while following the
> letter die in vendor-internal discussions, at least for all the
> established upstream driver teams.

I think the same is broadly true of RDMA as well, except we don't
bother with the kernel trying to police the command stream - direct
submission from userspace. I can't say it has been much of an issue.

> tldr; fwctl as I understand it feels like a bridge too far for drm today,
> but I'd very much like someone else to make this happen so we could
> eventually push towards adoption too.

Hahah, okay, well, I'm pushing :)

Jason
Daniel Vetter June 11, 2024, 4:54 p.m. UTC | #41
On Tue, Jun 11, 2024 at 01:17:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 11, 2024 at 05:36:17PM +0200, Daniel Vetter wrote:
> > reliability/health reporting. That's a lot more vendor specific in nature
> > and needs to be customized anyway per deployment. And only much higher in
> > the stack, maybe in k8s, can a technically reasonable unification even
> > happen.  So again we're much more lenient about infrastructure enabling
> > and uapi than stuff applications will use directly.
> 
> To be clear, this is the specific niche fwctl is for. It is not for
> GPU command submission or something like that, and as I said to Jiri I
> would agree to aggressively block such abuses.
>  
> > Currently that's enough of a mess in drm that I feel like enforcing
> > something like fwctl is still too much. But maybe once fwctl is
> > established with other subsystems/devices we can start the conversations
> > with vendors to get this going a few years down the road.
> 
> I wouldn't say enforcing, but instead of having every GPU driver build
> their own weird vendor'd way to access their debug/diagnostic stuff
> steer them into fwctl. These data center GPUs with FW at least have
> lots of appropriate stuff and all the vendor OOT stuff has tooling to
> inspect the GPUs far more than DRM has code for (ie
> rocm-smi/nvidia-smi have some features that are potentially good
> candidates for fwctl)

Yeah "enforcing" to the level we do with 3d/vulkan would be years down the
road, if ever. Very unlikely imo for debug/diagnostics/tuning stuff.

> > In practice, it doesn't seem to be an issue, at least not beyond the
> > intentionally pragmatic choices where we merge kernel code with known
> > sub-par/incomplete userspace. I'm not sure why, but to my knowledge all
> > attempts to break the spirit of our userspace rules while following the
> > letter die in vendor-internal discussions, at least for all the
> > established upstream driver teams.
> 
> I think the same is broadly true of RDMA as well, except we don't
> bother with the kernel trying to police the command stream - direct
> submission from userspace. I can't say it has been much of an issue.

Maybe just a bit of confusion, but all modern-ish drm drivers stopped parsing
the command stream a while ago. We only ever did that to fill security
gaps, never to enforce any rules about what userspace is allowed to do
beyond that.

The rule that the open userspace needs to be complete, for some reasonably
pragmatic definition of "complete", is entirely a social contract. And I'm
not aware of any real issues with enforcing that beyond just trusting the
established vendor teams. So yeah no real issues with uabi that allows
maximal abuse because it's entirely unchecked by the kernel code.

Or put differently, I think we're trying to make the same point.

> > tldr; fwctl as I understand it feels like a bridge too far for drm today,
> > but I'd very much like someone else to make this happen so we could
> > eventually push towards adoption too.
> 
> Hahah, okay, well, I'm pushing :)

Thanks :-)
-Sima