[v2,6/8] fwctl: Add documentation

Message ID 6-v2-940e479ceba9+3821-fwctl_jgg@nvidia.com
State Superseded
Series Introduce fwctl subsystem

Commit Message

Jason Gunthorpe June 24, 2024, 10:47 p.m. UTC
Document the purpose and rules for the fwctl subsystem.

Link in kdocs to the doc tree.

Nacked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
 Documentation/userspace-api/index.rst |   1 +
 2 files changed, 270 insertions(+)
 create mode 100644 Documentation/userspace-api/fwctl.rst

Comments

Randy Dunlap June 25, 2024, 10:04 p.m. UTC | #1
On 6/24/24 3:47 PM, Jason Gunthorpe wrote:
> Document the purpose and rules for the fwctl subsystem.
> 
> Link in kdocs to the doc tree.
> 
> Nacked-by: Jakub Kicinski <kuba@kernel.org>
> Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
>  Documentation/userspace-api/index.rst |   1 +
>  2 files changed, 270 insertions(+)
>  create mode 100644 Documentation/userspace-api/fwctl.rst
> 
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +fwctl subsystem
> +===============
> +
> +:Author: Jason Gunthorpe
> +
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
> +
> +The FW layer in devices has grown to incredible sizes and devices frequently
> +integrate clusters of fast processors to run it. For example, mlx5 devices have
> +over 30MB of FW code, and big configurations operate with over 1GB of FW managed
> +runtime state.
> +
> +The availability of such a flexible layer has created quite a variety in the
> +industry where single pieces of silicon are now configurable software-defined
> +devices and can operate in substantially different ways depending on the need.
> +Further, we often see cases where specific sites wish to operate devices in ways
> +that are highly specialized and require applications that have been tailored to
> +their unique configuration.
> +
> +Further, devices have become multi-functional and integrated to the point they
> +no longer fit neatly into the kernel's division of subsystems. Modern
> +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
> +subsystems while sharing the underlying hardware using the auxiliary device
> +system.
> +
> +All together this creates a challenge for the operating system, where devices
> +have an expansive FW environment that needs robust device-specific debugging
> +support, and FW-driven functionality that is not well suited to “generic”
> +interfaces. fwctl seeks to allow access to the full device functionality from
> +user space in the areas of debuggability, management, and first-boot/nth-boot
> +provisioning.
> +
> +fwctl is aimed at the common device design pattern where the OS and FW
> +communicate via an RPC message layer constructed with a queue or mailbox scheme.
> +In this case the driver will typically have some layer to deliver RPC messages
> +and collect RPC responses from device FW. The in-kernel subsystem drivers that
> +operate the device for its primary purposes will use these RPCs to build their
> +drivers, but devices also usually have a set of ancillary RPCs that don't really
> +fit into any specific subsystem. For example, a HW RAID controller is primarily
> +operated by the block layer but also comes with a set of RPCs to administer the
> +construction of drives within the HW RAID.
> +
> +In the past when devices were more single function, individual subsystems would
> +grow different approaches to solving some of these common problems. For instance
> +monitoring device health, manipulating its FLASH, debugging the FW,
> +provisioning, all have various unique interfaces across the kernel.
> +
> +fwctl's purpose is to define a common set of limited rules, described below,
> +that allow user space to securely construct and execute RPCs inside device FW.
> +The rules serve as an agreement between the operating system and FW on how to
> +correctly design the RPC interface. As a uAPI the subsystem provides a thin
> +layer of discovery and a generic uAPI to deliver the RPCs and collect the
> +response. It supports a system of user space libraries and tools which will
> +use this interface to control the device using the device native protocols.
> +
> +Scope of Action
> +---------------
> +
> +fwctl drivers are strictly restricted to being a way to operate the device FW.
> +It is not an avenue to access random kernel internals, or other operating system
> +SW states.
> +
> +fwctl instances must operate on a well-defined device function, and the device
> +should have a well-defined security model for what scope within the physical
> +device the function is permitted to access. For instance, the most complex PCIe
> +device today may broadly have several function-level scopes:
> +
> + 1. A privileged function with full access to the on-device global state and
> +    configuration
> +
> + 2. Multiple hypervisor functions with control over itself and child functions
> +    used with VMs
> +
> + 3. Multiple VM functions tightly scoped within the VM
> +
> +The device may create a logical parent/child relationship between these scopes.
> +For instance a child VM's FW may be within the scope of the hypervisor FW. It is
> +quite common in the VFIO world that the hypervisor environment has a complex
> +provisioning/profiling/configuration responsibility for the function VFIO
> +assigns to the VM.
> +
> +Further, within the function, devices often have RPC commands that fall within
> +some general scopes of action:
> +
> + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a

                                                        etc.

> +    function reset.
> +
> + 2. Access to function & child runtime configuration that kernel drivers can
> +    discover at runtime.
> +
> + 3. Read only access to function debug information that may report on FW objects

       Read-only

> +    in the function & child, including FW objects owned by other kernel
> +    subsystems.
> +
> + 4. Write access to function & child debug information strictly compatible with
> +    the principles of kernel lockdown and kernel integrity protection. Triggers
> +    a kernel Taint.
> +
> + 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
> +
> +Userspace will provide a scope label on each RPC and the kernel must enforce the

Some places (above/below here) say "user space" instead of "userspace". Please choose one
and stick with it.

> +above CAP's and taints based on that scope. A combination of kernel and FW can

         CAPs


> +enforce that RPCs are placed in the correct scope by userspace.
> +
> +Denied behavior
> +---------------
> +
> +There are many things this interface must not allow user space to do (without a
> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> +examples:
> +
> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or

An RPC message is going to run code in the device. Should this say something instead
like:

download [or load] code to be executed in the device,

> +    otherwise compromise device or system security and integrity.
> +
> + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
> +    objects owned by kernel drivers.
> +
> + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> +    driver can react to the device configuration at function reset/driver load
> +    time, but otherwise should not be coupled to fwctl.
> +
> + 4. Operate the HW in a way that overlaps with the core purpose of another
> +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> +    network packets, or operate an accelerator's data plane.
> +
> +fwctl is not a replacement for device direct access subsystems like uacce or
> +VFIO.
> +
> +fwctl User API
> +==============
> +
> +.. kernel-doc:: include/uapi/fwctl/fwctl.h
> +.. kernel-doc:: include/uapi/fwctl/mlx5.h
> +
> +sysfs Class
> +-----------
> +
> +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
> +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
> +operates the ioctl uAPI described above.
> +
> +fwctl devices can be related to driver components in other subsystems through
> +sysfs::
> +
> +    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> +    ibp0s10f0
> +
> +    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> +    fwctl0/
> +
> +    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> +    dev  device  power  subsystem  uevent
> +
> +User space Community
> +--------------------
> +
> +Drawing inspiration from nvme-cli, participating in the kernel side must come
> +with a user space in a common TBD git tree, at a minimum to usefully operate the
> +kernel driver. Providing such an implementation is a pre-condition to merging a
> +kernel driver.
> +
> +The goal is to build user space community around some of the shared problems
> +we all have, and ideally develop some common user space programs with some
> +starting themes of:
> +
> + - Device in-field debugging
> +
> + - HW provisioning
> +
> + - VFIO child device profiling before VM boot
> +
> + - Confidential Compute topics (attestation, secure provisioning)
> +
> +That stretch across all subsystems in the kernel. fwupd is a great example of

   that

> +how an excellent user space experience can emerge out of kernel-side diversity.
> +
> +fwctl Kernel API
> +================
> +
> +.. kernel-doc:: drivers/fwctl/main.c
> +   :export:
> +.. kernel-doc:: include/linux/fwctl.h
> +
> +fwctl Driver design
> +-------------------
> +
> +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> +device possibly using the auxiliary_device mechanism. In that case several
> +subsystems are going to be sharing the same device and FW interface layer so the
> +device design must already provide for isolation and cooperation between kernel
> +subsystems. fwctl should fit into that same model.
> +
> +Part of the driver should include a description of how its scope restrictions
> +and security model work. The driver and FW together must ensure that RPCs
> +provided by user space are mapped to the appropriate scope. If the validation is
> +done in the driver then the validation can read a 'command effects' report from
> +the device, or hardwire the enforcement. If the validation is done in the FW,
> +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> +
> +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> +any FW resources, or any resources it does allocate are freed on FD closure.  A
> +driver primarily constructed around FW RPCs may find that its core PCI function
> +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> +subsystems.
> +
> +Each device type must represent a stable FW ABI, such that the userspace
> +components have the same general stability we expect from the kernel. FW upgrade
> +should not break the userspace tools.
> +
> +Security Response
> +=================
> +
> +The kernel remains the gatekeeper for this interface. If violations of the
> +scopes, security or isolation principles are found, we have options to let
> +devices fix them with a FW update, push a kernel patch to parse and block RPC

fwctl does not do FW updates, is that correct?

> +commands or push a kernel patch to block entire firmware versions/devices.
> +
> +While the kernel can always directly parse and restrict RPCs, it is expected
> +that the existing kernel pattern of allowing drivers to delegate validation to
> +FW to be a useful design.
> +
> +Existing Similar Examples
> +=========================
> +
> +The approach described in this document is not a new idea. Direct, or near
> +direct device access has been offered by the kernel in different areas for
> +decades. With more devices wanting to follow this design pattern it is becoming
> +clear that it is not entirely well understood and, more importantly, the
> +security considerations are not well defined or agreed upon.
> +
> +Some examples:
> +
> + - HW RAID controllers. This includes RPCs to do things like compose drives into
> +   a RAID volume, configure RAID parameters, monitor the HW and more.
> +
> + - Baseboard managers. RPCs for configuring settings in the device and more
> +
> + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> +   functions that different products have defined, but more exists.

                                                               exist.

> +
> + - CXL also has a NVMe-like vendor command system.
> +
> + - DRM allows user space drivers to send commands to the device via kernel
> +   mediation
> +
> + - RDMA allows user space drivers to directly push commands to the device
> +   without kernel involvement
> +
> + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc

                                                                           etc.

> +
> +The first 4 are examples of areas that fwctl intends to cover.
> +
> +Some key lessons learned from these past efforts are the importance of having a
> +common user space project to use as a pre-condition for obtaining a kernel
> +driver. Developing good community around useful software in user space is key to
> +getting companies to fund participation to enable their products.
Jason Gunthorpe July 22, 2024, 4:18 p.m. UTC | #2
On Tue, Jun 25, 2024 at 03:04:42PM -0700, Randy Dunlap wrote:
> > +There are many things this interface must not allow user space to do (without a
> > +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> > +examples:
> > +
> > + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
> 
> An RPC message is going to run code in the device. Should this say something instead
> like:
> 
> download [or load] code to be executed in the device,

Yeah, it is a hard concept. It is kind of murky as even today's
devlink flash will let you load untrusted code into the device under
lockdown AFAICR.

How about:

 1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
    untrusted code, or otherwise compromise device or system security and
    integrity.

Which is a little broader I suppose.

> > +The kernel remains the gatekeeper for this interface. If violations of the
> > +scopes, security or isolation principles are found, we have options to let
> > +devices fix them with a FW update, push a kernel patch to parse and block RPC
> 
> fwctl does not do FW updates, is that correct?

I think it is up to the specific RPCs the device supports. Given there
is currently no way to marshal a large amount of data it is not a good
interface for FW update.

I'd encourage people to use devlink flash more broadly, but I also
wouldn't go out of the way to block FW update RPCs that might exist
from here.

I certainly wouldn't want people to make their own FW update ioctls
(as still seems to be happening) out of fear they shouldn't use
fwctl :\

Looking particularly at mlx5, we've had devlink flash for a long time
now, but it hasn't succeeded in displacing the mlx5-specific tools, for
whatever reason.

I grabbed all the changes here, thanks!

Jason
Randy Dunlap July 22, 2024, 8:40 p.m. UTC | #3
Hi,

On 7/22/24 9:18 AM, Jason Gunthorpe wrote:
> On Tue, Jun 25, 2024 at 03:04:42PM -0700, Randy Dunlap wrote:
>>> +There are many things this interface must not allow user space to do (without a
>>> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
>>> +examples:
>>> +
>>> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
>>
>> An RPC message is going to run code in the device. Should this say something instead
>> like:
>>
>> download [or load] code to be executed in the device,
> 
> Yeah, it is a hard concept. It is kind of murky as even today's
> devlink flash will let you load untrusted code into the device under
> lockdown AFAICR.
> 
> How about:
> 
>  1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
>     untrusted code, or otherwise compromise device or system security and
>     integrity.
> 
> Which is a little broader I suppose.

OK, somewhat better.

>>> +The kernel remains the gatekeeper for this interface. If violations of the
>>> +scopes, security or isolation principles are found, we have options to let
>>> +devices fix them with a FW update, push a kernel patch to parse and block RPC
>>
>> fwctl does not do FW updates, is that correct?
> 
> I think it is up to the specific RPCs the device supports. Given there
> is currently no way to marshal a large amount of data it is not a good
> interface for FW update.
> 
> I'd encourage people to use devlink flash more broadly, but I also
> wouldn't go out of the way to block FW update RPCs that might exist
> from here.
> 
> I certainly wouldn't want people to make their own FW update ioctls
> (as still seems to be happening) out of fear they shouldn't use
> fwctl :\

fair enough.

> Looking particularly at mlx5, we've had devlink flash for a long time
> now, but it hasn't succeeded in displacing the mlx5-specific tools, for
> whatever reason.
> 
> I grabbed all the changes here, thanks!
Jonathan Cameron July 26, 2024, 3:50 p.m. UTC | #4
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@

> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely

FW and, in many cases, are

> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).

...

The document lays out where this sits well.

Jonathan
Jason Gunthorpe July 29, 2024, 4:11 p.m. UTC | #5
On Fri, Jul 26, 2024 at 04:50:21PM +0100, Jonathan Cameron wrote:
> 
> > diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> > new file mode 100644
> > index 00000000000000..ece2db2530502f
> > --- /dev/null
> > +++ b/Documentation/userspace-api/fwctl.rst
> > @@ -0,0 +1,269 @@
> 
> > +Overview
> > +========
> > +
> > +Modern devices contain extensive amounts of FW, and in many cases, are largely
> 
> FW and, in many cases, are

Yep, Randy noted it too

> > +software-defined pieces of hardware. The evolution of this approach is largely a
> > +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> > +chip design is extremely large. Replacing fixed HW logic with a flexible and
> > +tightly coupled FW/HW combination is an effective risk mitigation against chip
> > +respin. Problems in the HW design can be counteracted in device FW. This is
> > +especially true for devices which present a stable and backwards compatible
> > +interface to the operating system driver (such as NVMe).
> 
> ...
> 
> The document lays out where this sits well.

Thanks!

Jason
Daniel Vetter Aug. 6, 2024, 8:03 a.m. UTC | #6
On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> Document the purpose and rules for the fwctl subsystem.
> 
> Link in kdocs to the doc tree.
> 
> Nacked-by: Jakub Kicinski <kuba@kernel.org>
> Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

I think we'll need something like fwctl sooner rather than later for GPUs
too, so for all the fwctl patches up to this one:

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

I both really liked the approach to fwctl_unregister and
copy_struct_from_user, and didn't spot anything offensive in the code.

A bunch of optional thoughts below. My take, at least from the drm point of
view, is that it's better to get this going and learn as we go than to try to
get this perfect from the get-go. Some of the nuances will need to be informed
by practice anyway.
-Sima

> ---
>  Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
>  Documentation/userspace-api/index.rst |   1 +
>  2 files changed, 270 insertions(+)
>  create mode 100644 Documentation/userspace-api/fwctl.rst
> 
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +fwctl subsystem
> +===============
> +
> +:Author: Jason Gunthorpe
> +
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
> +
> +The FW layer in devices has grown to incredible sizes and devices frequently
> +integrate clusters of fast processors to run it. For example, mlx5 devices have
> +over 30MB of FW code, and big configurations operate with over 1GB of FW managed
> +runtime state.
> +
> +The availability of such a flexible layer has created quite a variety in the
> +industry where single pieces of silicon are now configurable software-defined
> +devices and can operate in substantially different ways depending on the need.
> +Further, we often see cases where specific sites wish to operate devices in ways
> +that are highly specialized and require applications that have been tailored to
> +their unique configuration.
> +
> +Further, devices have become multi-functional and integrated to the point they
> +no longer fit neatly into the kernel's division of subsystems. Modern
> +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
> +subsystems while sharing the underlying hardware using the auxiliary device
> +system.
> +
> +All together this creates a challenge for the operating system, where devices
> +have an expansive FW environment that needs robust device-specific debugging
> +support, and FW-driven functionality that is not well suited to “generic”
> +interfaces. fwctl seeks to allow access to the full device functionality from
> +user space in the areas of debuggability, management, and first-boot/nth-boot
> +provisioning.
> +
> +fwctl is aimed at the common device design pattern where the OS and FW
> +communicate via an RPC message layer constructed with a queue or mailbox scheme.
> +In this case the driver will typically have some layer to deliver RPC messages
> +and collect RPC responses from device FW. The in-kernel subsystem drivers that
> +operate the device for its primary purposes will use these RPCs to build their
> +drivers, but devices also usually have a set of ancillary RPCs that don't really
> +fit into any specific subsystem. For example, a HW RAID controller is primarily
> +operated by the block layer but also comes with a set of RPCs to administer the
> +construction of drives within the HW RAID.
> +
> +In the past when devices were more single function, individual subsystems would
> +grow different approaches to solving some of these common problems. For instance
> +monitoring device health, manipulating its FLASH, debugging the FW,
> +provisioning, all have various unique interfaces across the kernel.
> +
> +fwctl's purpose is to define a common set of limited rules, described below,
> +that allow user space to securely construct and execute RPCs inside device FW.
> +The rules serve as an agreement between the operating system and FW on how to
> +correctly design the RPC interface. As a uAPI the subsystem provides a thin
> +layer of discovery and a generic uAPI to deliver the RPCs and collect the
> +response. It supports a system of user space libraries and tools which will
> +use this interface to control the device using the device native protocols.
> +
> +Scope of Action
> +---------------
> +
> +fwctl drivers are strictly restricted to being a way to operate the device FW.
> +It is not an avenue to access random kernel internals, or other operating system
> +SW states.
> +
> +fwctl instances must operate on a well-defined device function, and the device
> +should have a well-defined security model for what scope within the physical
> +device the function is permitted to access. For instance, the most complex PCIe
> +device today may broadly have several function-level scopes:
> +
> + 1. A privileged function with full access to the on-device global state and
> +    configuration
> +
> + 2. Multiple hypervisor functions with control over itself and child functions
> +    used with VMs
> +
> + 3. Multiple VM functions tightly scoped within the VM
> +
> +The device may create a logical parent/child relationship between these scopes.
> +For instance a child VM's FW may be within the scope of the hypervisor FW. It is
> +quite common in the VFIO world that the hypervisor environment has a complex
> +provisioning/profiling/configuration responsibility for the function VFIO
> +assigns to the VM.
> +
> +Further, within the function, devices often have RPC commands that fall within
> +some general scopes of action:

In your fwctl_rpc_scope you only have 4, not 5, and I think that's
clearer. Might also be good to put a kerneldoc link to the enum in here
for the details.
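
To make the four-versus-five point concrete, here is a minimal user-space sketch of how a four-entry scope enum could map onto the CAP and taint rules quoted above. The enum names, the policy table and scope_check() are all hypothetical and written for illustration only; they are not copied from the patch, and the real fwctl_rpc_scope may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical four-entry scope enum; names and values are assumptions. */
enum rpc_scope {
	SCOPE_CONFIGURATION = 0,
	SCOPE_DEBUG_READ_ONLY = 1,
	SCOPE_DEBUG_WRITE = 2,
	SCOPE_DEBUG_WRITE_FULL = 3,
};

/* What the kernel side would have to enforce for a given scope. */
struct scope_policy {
	bool taints_kernel;
	bool needs_cap_sys_rawio;
};

/* Policy taken from the quoted text: debug writes taint, full access
 * additionally needs CAP_SYS_RAWIO. */
static const struct scope_policy scope_policies[] = {
	[SCOPE_CONFIGURATION]    = { false, false },
	[SCOPE_DEBUG_READ_ONLY]  = { false, false },
	[SCOPE_DEBUG_WRITE]      = { true,  false },
	[SCOPE_DEBUG_WRITE_FULL] = { true,  true  },
};

/*
 * Returns 0 and sets *taint if an RPC labeled with @scope may proceed,
 * -EINVAL for an unknown scope, -EPERM if the capability is missing.
 */
int scope_check(unsigned int scope, bool has_cap_sys_rawio, bool *taint)
{
	const struct scope_policy *p;

	assert(taint != NULL);
	if (scope >= sizeof(scope_policies) / sizeof(scope_policies[0]))
		return -EINVAL;

	p = &scope_policies[scope];
	if (p->needs_cap_sys_rawio && !has_cap_sys_rawio)
		return -EPERM;

	*taint = p->taints_kernel;
	return 0;
}
```

Keeping the policy in one table means the documentation list and the enforcement code can be compared entry by entry, which is the main argument for having the two agree on the number of scopes.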

> + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> +    function reset.
> +
> + 2. Access to function & child runtime configuration that kernel drivers can
> +    discover at runtime.

This one worries me, since it has the potential for people to get it very
wrong and e.g. expose configuration where it's only safe if the driver
isn't bound, or at least no userspace is using it. I'd drop this and just
leave the one configuration rpc, with maybe more detail on what exactly "When
configuration is written to the device remains in a fully supported
state." from the kerneldoc means. I think the only safe options would be a)
applied on function reset or b) transparent to both kernel and userspace
(beyond maybe device performance).

That might not cut all configuration items, but for those I think it'd be
best if fwctl guarantees through the driver model that there's no driver
bound to that function (or used by vfio/kvm), to guarantee safety. So
explicitly split them out as runtime configuration with a distinct rpc
scope. Maybe an addition for later.

> +
> + 3. Read only access to function debug information that may report on FW objects
> +    in the function & child, including FW objects owned by other kernel
> +    subsystems.
> +
> + 4. Write access to function & child debug information strictly compatible with
> +    the principles of kernel lockdown and kernel integrity protection. Triggers
> +    a kernel Taint.
> +
> + 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
> +
> +Userspace will provide a scope label on each RPC and the kernel must enforce the
> +above CAP's and taints based on that scope. A combination of kernel and FW can
> +enforce that RPCs are placed in the correct scope by userspace.
> +
> +Denied behavior
> +---------------
> +
> +There are many things this interface must not allow user space to do (without a
> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> +examples:
> +
> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
> +    otherwise compromise device or system security and integrity.
> +
> + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
> +    objects owned by kernel drivers.
> +
> + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> +    driver can react to the device configuration at function reset/driver load
> +    time, but otherwise should not be coupled to fwctl.

Kinda the same worry as above, I think this should be "... but otherwise
_must_ not be coupled to fwctl".

> + 4. Operate the HW in a way that overlaps with the core purpose of another
> +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> +    network packets, or operate an accelerator's data plane.

I still think some words are needed about what to do when fwctl exposes some
functionality which later on is covered by a newly added subsystem that
didn't yet exist. Also maybe add some more examples from the RAS side
of things, since that's come up a few times in the ksummit-discuss thread,
plus I think it's where we'll most likely have a new subsystem or extended
functionality of an existing one pop up and cause conflicts with fwctl
rpcs that have landed earlier.

I'm personally fine with "we'll figure that out when it happens."

> +
> +fwctl is not a replacement for device direct access subsystems like uacce or
> +VFIO.
> +
> +fwctl User API
> +==============
> +
> +.. kernel-doc:: include/uapi/fwctl/fwctl.h
> +.. kernel-doc:: include/uapi/fwctl/mlx5.h
> +
> +sysfs Class
> +-----------
> +
> +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
> +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
> +operates the ioctl uAPI described above.
> +
> +fwctl devices can be related to driver components in other subsystems through
> +sysfs::
> +
> +    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> +    ibp0s10f0
> +
> +    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> +    fwctl0/
> +
> +    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> +    dev  device  power  subsystem  uevent
> +
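
For readers who want to see how a tool might use the naming scheme above, here is a small user-space sketch that walks the sysfs class directory and derives the matching character device path. Only the two path prefixes come from the quoted text; the helper names fwctl_dev_path() and fwctl_list() are made up for this example, and no fwctl ioctl is issued.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define FWCTL_SYSFS_CLASS "/sys/class/fwctl"
#define FWCTL_DEV_DIR     "/dev/fwctl"

/*
 * Build "/dev/fwctl/<name>" for a class entry such as "fwctl0".
 * Returns 0 on success, -1 if @name is not "fwctl" followed by digits
 * or if the result does not fit in @buf.
 */
int fwctl_dev_path(const char *name, char *buf, size_t len)
{
	const char *p;
	int n;

	assert(name != NULL && buf != NULL);
	if (strncmp(name, "fwctl", 5) != 0 || name[5] == '\0')
		return -1;
	for (p = name + 5; *p != '\0'; p++)
		if (*p < '0' || *p > '9')
			return -1;

	n = snprintf(buf, len, "%s/%s", FWCTL_DEV_DIR, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Print every fwctl instance found in sysfs together with its device node.
 * Returns the number of instances, or -1 if the class directory is absent
 * (for example on a kernel without fwctl).
 */
int fwctl_list(FILE *out)
{
	DIR *dir = opendir(FWCTL_SYSFS_CLASS);
	struct dirent *ent;
	char path[64];
	int count = 0;

	if (dir == NULL)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		/* Skips ".", ".." and anything not named fwctlNN. */
		if (fwctl_dev_path(ent->d_name, path, sizeof(path)) == 0) {
			fprintf(out, "%s -> %s\n", ent->d_name, path);
			count++;
		}
	}
	closedir(dir);
	return count;
}
```

A caller would typically run fwctl_list(stdout) and then open() the printed device node to issue ioctls.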
> +User space Community
> +--------------------
> +
> +Drawing inspiration from nvme-cli, participating in the kernel side must come
> +with a user space in a common TBD git tree, at a minimum to usefully operate the
> +kernel driver. Providing such an implementation is a pre-condition to merging a
> +kernel driver.
> +
> +The goal is to build user space community around some of the shared problems
> +we all have, and ideally develop some common user space programs with some
> +starting themes of:
> +
> + - Device in-field debugging
> +
> + - HW provisioning
> +
> + - VFIO child device profiling before VM boot
> +
> + - Confidential Compute topics (attestation, secure provisioning)
> +
> +That stretch across all subsystems in the kernel. fwupd is a great example of
> +how an excellent user space experience can emerge out of kernel-side diversity.
> +
> +fwctl Kernel API
> +================
> +
> +.. kernel-doc:: drivers/fwctl/main.c
> +   :export:
> +.. kernel-doc:: include/linux/fwctl.h
> +
> +fwctl Driver design
> +-------------------
> +
> +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> +device possibly using the auxiliary_device mechanism. In that case several
> +subsystems are going to be sharing the same device and FW interface layer so the
> +device design must already provide for isolation and cooperation between kernel
> +subsystems. fwctl should fit into that same model.
> +
> +Part of the driver should include a description of how its scope restrictions
> +and security model work. The driver and FW together must ensure that RPCs
> +provided by user space are mapped to the appropriate scope. If the validation is
> +done in the driver then the validation can read a 'command effects' report from
> +the device, or hardwire the enforcement. If the validation is done in the FW,
> +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> +
> +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> +any FW resources, or any resources it does allocate are freed on FD closure.  A
> +driver primarily constructed around FW RPCs may find that its core PCI function
> +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> +subsystems.
> +
> +Each device type must represent a stable FW ABI, such that the userspace
> +components have the same general stability we expect from the kernel. FW upgrade
> +should not break the userspace tools.

I think an exception for the debug rpcs (or maybe only
FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
for fwctl. Still not allowing fw to break individual rpcs, but maybe allowing
it to remove outdated ones. With gpu fw we already struggle with abi
breakages where the kernel driver can do some amount of impedance
matching. If this is extended to debug tooling I fear it just won't happen,
forcing big chunks of what fwctl could support to stay out of tree.

And especially for field and even more in-house debug tooling, you really
want the userspace version matching your fw anyway.

Currently that mess tends to live in debugfs and/or out-of-tree, so
there's no stable uapi guarantee anyway. And I don't see the point in
requiring it - if there is a need for stable tooling, maybe that
indicates more the need for a new subsystem that standardizes things
across devices/vendors.

Another case in point is tracepoints, where the stable abi question also
has a much more nuanced answer. Debug fw support imo falls into that same
bucket.

> +
> +Security Response
> +=================
> +
> +The kernel remains the gatekeeper for this interface. If violations of the
> +scopes, security or isolation principles are found, we have options to let
> +devices fix them with a FW update, push a kernel patch to parse and block RPC
> +commands or push a kernel patch to block entire firmware versions/devices.
> +
> +While the kernel can always directly parse and restrict RPCs, it is expected
> +that the existing kernel pattern of allowing drivers to delegate validation to
> +FW to be a useful design.
> +
> +Existing Similar Examples
> +=========================
> +
> +The approach described in this document is not a new idea. Direct, or near
> +direct device access has been offered by the kernel in different areas for
> +decades. With more devices wanting to follow this design pattern it is becoming
> +clear that it is not entirely well understood and, more importantly, the
> +security considerations are not well defined or agreed upon.
> +
> +Some examples:
> +
> + - HW RAID controllers. This includes RPCs to do things like compose drives into
> +   a RAID volume, configure RAID parameters, monitor the HW and more.
> +
> + - Baseboard managers. RPCs for configuring settings in the device and more
> +
> + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> +   functions that different products have defined, but more exists.
> +
> + - CXL also has a NVMe-like vendor command system.
> +
> + - DRM allows user space drivers to send commands to the device via kernel
> +   mediation
> +
> + - RDMA allows user space drivers to directly push commands to the device
> +   without kernel involvement
> +
> + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> +
> +The first 4 are examples of areas that fwctl intends to cover.

Maybe add a sentence here why the latter 3 aren't, just to strengthen that
point?

> +
> +Some key lessons learned from these past efforts are the importance of having a
> +common user space project to use as a pre-condition for obtaining a kernel
> +driver. Developing good community around useful software in user space is key to
> +getting companies to fund participation to enable their products.
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 8a251d71fa6e14..990b4c0710c99e 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -44,6 +44,7 @@ Devices and I/O
>  
>     accelerators/ocxl
>     dma-buf-alloc-exchange
> +   fwctl
>     gpio/index
>     iommu
>     iommufd
> -- 
> 2.45.2
>
Jason Gunthorpe Aug. 8, 2024, 12:24 p.m. UTC | #7
On Tue, Aug 06, 2024 at 10:03:36AM +0200, Daniel Vetter wrote:
> On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> > Document the purpose and rules for the fwctl subsystem.
> > 
> > Link in kdocs to the doc tree.
> > 
> > Nacked-by: Jakub Kicinski <kuba@kernel.org>
> > Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> I think we'll need something like fwctl rather sooner than later for gpus
> too, so for all the fwctl patches up to this one:

Yes, I think so as well. You can see it already in the various GPU out
of tree stuff where there is usually some expansive
monitoring/debug/provisioning tool there too.

> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Thanks!
 
> I both really liked the approach to fwctl_unregister and
> copy_struct_from_user, and didn't spot anything offensive in the code.

It is copied from iommufd which copied concepts from the fixedup modern
rdma :) Proven over a few years now.

> > +Further, within the function, devices often have RPC commands that fall within
> > +some general scopes of action:
> 
> In your fwctl_rpc_scope you only have 4, not 5, and I think that's
> clearer. Might also be good to put a kerneldoc link to the enum in here
> for the details.

I bundled these two together in the enum FWCTL_RPC_CONFIGURATION:

> > + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> > +    function reset.
> > +
> > + 2. Access to function & child runtime configuration that kernel drivers can
> > +    discover at runtime.
> 
> This one worries me, since it has potential for people to get it very
> wrong and e.g. expose configuration where it's only safe if the driver
> isn't bound, or at least no userspace is using it. 

The notion was "at runtime" meaning any active user of the device
would either not be aware of whatever change or already have some way
to learn about it.

Especially when we think about child configuration - ie configuration
of a VF from the hypervisor while a VM is running - there can be
useful things that fit under this category. For instance you might
throttle the device to support live migration.

Throttling is a really complex topic for an autonomous device like
GPU/RDMA.

> I'd drop this and just
> leave the one configuration rpc, with maybe more detail what exactly "When
> configuration is written to the device remains in a fully supported
> state." 

Ah, this language is incorporating the distro feedback around losing
the ability to support the system.

> from the kerneldoc means. I think only safe options woulb be a)
> applied on function reset b) transparent to both kernel and userspace
> (beyond maybe device performance).

Let's lean into transparent more:

 1. Access to function & child configuration, FLASH, etc. that becomes live at a
    function reset. Access to function & child runtime configuration that is
    transparent or non-disruptive to any driver or VM.
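
As a rough illustration of how the scope labels line up with the taint and
CAP_SYS_RAWIO rules from the doc, here is a standalone sketch. It is not the
fwctl code: the enum, struct and function names below are all made up for
illustration, and it only assumes the four-scope split (configuration,
read-only debug, debug write, full debug access) discussed above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the four RPC scopes; not the real uAPI enum. */
enum sketch_scope {
	SKETCH_CONFIGURATION,    /* reset-time or transparent runtime config */
	SKETCH_DEBUG_READ_ONLY,  /* read-only debug information */
	SKETCH_DEBUG_WRITE,      /* lockdown-compatible debug writes */
	SKETCH_DEBUG_WRITE_FULL, /* full debug device access */
};

/* What the kernel side would have to enforce for one scope. */
struct sketch_policy {
	bool taints_kernel;
	bool needs_cap_sys_rawio;
};

/* Fill *policy for a scope label; returns false for an unknown label. */
static bool sketch_scope_policy(int scope, struct sketch_policy *policy)
{
	switch (scope) {
	case SKETCH_CONFIGURATION:
	case SKETCH_DEBUG_READ_ONLY:
		policy->taints_kernel = false;
		policy->needs_cap_sys_rawio = false;
		return true;
	case SKETCH_DEBUG_WRITE:
		policy->taints_kernel = true;
		policy->needs_cap_sys_rawio = false;
		return true;
	case SKETCH_DEBUG_WRITE_FULL:
		policy->taints_kernel = true;
		policy->needs_cap_sys_rawio = true;
		return true;
	default:
		return false;
	}
}

/*
 * Decide whether an RPC with the given scope label may proceed. On success
 * *taint reports whether the caller must also taint the kernel.
 */
static bool sketch_rpc_allowed(int scope, bool has_cap_sys_rawio, bool *taint)
{
	struct sketch_policy policy;

	if (!sketch_scope_policy(scope, &policy))
		return false;
	if (policy.needs_cap_sys_rawio && !has_cap_sys_rawio)
		return false;
	*taint = policy.taints_kernel;
	return true;
}
```

The point of keeping the table this small is that the kernel only checks the
label; whether a given RPC really belongs in the scope userspace claimed is
still up to the driver and FW to validate.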

> That might not cut all configuration items, but for those I think it'd be
> best if fwctl guarantees through the driver model that there's no driver
> bound to that function (or used by vfio/kvm), to guarantee safety. So
> explicitly split them out as runtime configuration with a distinct rpc
> scope. Maybe an addition for later.

No driver bound is too strong. VFIO and even mlx5_core can trigger a
FLR while still bound in a controlled way. Taking effect at FLR time
is a reasonable restriction.

Like if you reconfigure a child VF and then start a VM there will be
an automatic FLR in that process that can make the updated VF
configuration active.

> > + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> > +    driver can react to the device configuration at function reset/driver load
> > +    time, but otherwise should not be coupled to fwctl.
> 
> Kinda the same worry as above, I think this should be "... but otherwise
> _must_ not be coupled to fwctl".

Yep

> > + 4. Operate the HW in a way that overlaps with the core purpose of another
> > +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> > +    network packets, or operate an accelerator's data plane.
> 
> I still think some words about what to do when fwctl exposes some
> functional which later on is covered by a newly added subsystem that
> didn't yet exist. Also maybe adding some more examples from the RAS side
> of things, since that's come up a few time in the ksummit-discuss thread,
> plus I think it's where we'll most likely have a new subsystem or extended
> functionality of an existing one pop up and cause conflicts with fwctl
> rpcs that have landed earlier.

How about:

Operations exposed through fwctl's non-tainting interfaces should be fully
sharable with other users of the device. For instance exposing a RPC through
fwctl should never prevent a kernel subsystem from also concurrently using that
same RPC or hardware unit down the road. In such cases fwctl will be less
important than proper kernel subsystems that eventually emerge. Mistakes in this
area resulting in clashes will be resolved in favour of a kernel implementation.

> > +fwctl Driver design
> > +-------------------
> > +
> > +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> > +device possibly using the auxiliary_device mechanism. In that case several
> > +subsystems are going to be sharing the same device and FW interface layer so the
> > +device design must already provide for isolation and cooperation between kernel
> > +subsystems. fwctl should fit into that same model.
> > +
> > +Part of the driver should include a description of how its scope restrictions
> > +and security model work. The driver and FW together must ensure that RPCs
> > +provided by user space are mapped to the appropriate scope. If the validation is
> > +done in the driver then the validation can read a 'command effects' report from
> > +the device, or hardwire the enforcement. If the validation is done in the FW,
> > +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> > +
> > +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> > +any FW resources, or any resources it does allocate are freed on FD closure.  A
> > +driver primarily constructed around FW RPCs may find that its core PCI function
> > +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> > +subsystems.
> > +
> > +Each device type must represent a stable FW ABI, such that the userspace
> > +components have the same general stability we expect from the kernel. FW upgrade
> > +should not break the userspace tools.
> 
> I think an exception for the debug rpcs (or maybe only
> FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
> for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
> to remove outdated ones. 

I'm definitely mindful of Linus's pragmatic view of ABI stability, where
existing tools that *people actually use* shouldn't break. You
shouldn't have to upgrade your tools when you upgrade your FW.

I think it is important to convey that as the gold standard here
too.

> And especially for field and even more in-house debug tooling, you really
> want the userspace version matching your fw anyway.

Yes, this is definitely the case. But those tools are also private and
I think fall under the *people actually use* exception.

So, let's try again:

Each device type must be mindful of Linux's philosophy for stable ABI. The FW
RPC interface does not have to meet a strictly stable ABI, but it does need to
meet an expectation that userspace tools that are deployed and in significant
use don't needlessly break. FW upgrade and kernel upgrade should keep widely
deployed tooling working.

Development and debugging focused RPCs under more permissive scopes can have
less stability if the tools using them are only run under exceptional
circumstances and not for everyday use of the device. Debugging tools may even
require exact version matching as they may require something similar to DWARF
debug information from the FW binary.

> Currently that mess tends to leave in debugfs and/or out-of-tree, so
> there's no stable uapi guarantee anyway. And I don't see the point in
> requiring it - if there is a need for stabling tooling, maybe that
> indicates more the need for a new subsystem that standardized things
> across devices/vendors.

Yes, fwctl needs to have both. I would expect things like the
configuration to have a fairly stable ABI. Maybe the list of
configurables will change but access to them should be ABI stable.

> > +Some examples:
> > +
> > + - HW RAID controllers. This includes RPCs to do things like compose drives into
> > +   a RAID volume, configure RAID parameters, monitor the HW and more.
> > +
> > + - Baseboard managers. RPCs for configuring settings in the device and more
> > +
> > + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> > +   functions that different products have defined, but more exists.
> > +
> > + - CXL also has a NVMe-like vendor command system.
> > +
> > + - DRM allows user space drivers to send commands to the device via kernel
> > +   mediation
> > +
> > + - RDMA allows user space drivers to directly push commands to the device
> > +   without kernel involvement
> > +
> > + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> > +
> > +The first 4 are examples of areas that fwctl intends to cover.
> 
> Maybe add a sentence here why the latter 3 aren't, just to strengthen that
> point?

How about:

The first 4 are examples of areas that fwctl intends to cover. The latter three
are examples of denied behavior as they fully overlap with the primary purpose
of a kernel subsystem.

Thanks
Jason
Daniel Vetter Aug. 9, 2024, 9:21 a.m. UTC | #8
On Thu, Aug 08, 2024 at 09:24:13AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 06, 2024 at 10:03:36AM +0200, Daniel Vetter wrote:
> > On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> > > Document the purpose and rules for the fwctl subsystem.
> > > 
> > > Link in kdocs to the doc tree.
> > > 
> > > Nacked-by: Jakub Kicinski <kuba@kernel.org>
> > > Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > I think we'll need something like fwctl rather sooner than later for gpus
> > too, so for all the fwctl patches up to this one:
> 
> Yes, I think so as well. You can see it already in the various GPU out
> of tree stuff where there is usually some expansive
> monitoring/debug/provisioning tool there too.
> 
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> 
> Thanks!
>  
> > I both really liked the approach to fwctl_unregister and
> > copy_struct_from_user, and didn't spot anything offensive in the code.
> 
> It is copied from iommufd which copied concepts from the fixedup modern
> rdma :) Proven over a few years now.
> 
> > > +Further, within the function, devices often have RPC commands that fall within
> > > +some general scopes of action:
> > 
> > In your fwctl_rpc_scope you only have 4, not 5, and I think that's
> > clearer. Might also be good to put a kerneldoc link to the enum in here
> > for the details.
> 
> I bundled these two together in the enum FWCTL_RPC_CONFIGURATION:

Yeah I figured, but if you want to keep the split it might help that the
kerneldoc for FWCTL_RPC_CONFIGURATION mentions that it includes both
delayed configuration and runtime configuration from the overview doc. Or
maybe group them here into 1a and 1b.

Also this one is an extremely minor bikeshed, feel free to ignore :-)
 
> > > + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> > > +    function reset.
> > > +
> > > + 2. Access to function & child runtime configuration that kernel drivers can
> > > +    discover at runtime.
> > 
> > This one worries me, since it has potential for people to get it very
> > wrong and e.g. expose configuration where it's only safe if the driver
> > isn't bound, or at least no userspace is using it. 
> 
> The notion was "at runtime" meaning any active user of the device
> would either not be aware of whatever change or already have some way
> to learn about it.
> 
> Especially when we think about child configuration - ie configuration
> of a VF from the hypervisor while a VM is running - there can be
> useful things that fit under this category. For instance you might
> throttle the device to support live migration.
> 
> Throttling is a really complex topic for an autonomous device like
> GPU/RDMA.

Yeah these kinds of things are imo fine. But when I read your description
of "can discover at runtime" I'm more thinking of stuff like the number of
compute cores, or channels for communication or whatever. Which you can
discover, but if you don't, you find out by failing because the thing
you thought was there is now suddenly gone.

Ofc the guest can discover throttling by noticing that its gpu suddenly
got a bit (or a lot) slower, but that's a different kind of discover imo.

> > I'd drop this and just
> > leave the one configuration rpc, with maybe more detail what exactly "When
> > configuration is written to the device remains in a fully supported
> > state." 
> 
> Ah, this language is incorporating the distro feedback around loosing
> the ability to support the system.

Maybe I'm just dense, but for me it'd be good to differentiate between
runtime changes like throttling, which generally shouldn't upset
drivers/guests, and changes which can upset them if they don't actively go
and discover them, making it really clear that the latter must be delayed
until the next reset. Currently I read what you have as allowing the latter
without requiring a reset, as long as the driver/guest _can_ discover the
change somehow.

So maybe something like this?

2. Access to function & child runtime configuration that is either
transparent to users of that function, or which can be discovered without
disruption (like when the fw/hw function interface makes provisions for
such runtime configuration changes in the programming model).

> > from the kerneldoc means. I think only safe options woulb be a)
> > applied on function reset b) transparent to both kernel and userspace
> > (beyond maybe device performance).
> 
> Let's lean into transparent more:
> 
>  1. Access to function & child configuration, FLASH, etc. that becomes live at a
>     function reset. Access to function & child runtime configuration that is
>     transparent or non-disruptive to any driver or VM.

Ack, sounds really clear to me now.

> 
> > That might not cut all configuration items, but for those I think it'd be
> > best if fwctl guarantees through the driver model that there's no driver
> > bound to that function (or used by vfio/kvm), to guarantee safety. So
> > explicitly split them out as runtime configuration with a distinct rpc
> > scope. Maybe an addition for later.
> 
> No driver bound is too strong. VFIO and even mlx5_core can trigger a
> FLR while still bound in a controlled way. Taking effect at FLR time
> is a reasonable restriction.
> 
> Like if you reconfigure a child VF and then start a VM there will be
> an automatic FLR in that process that can make the updated VF
> configuration active.

Ah yeah if the driver copes/initiates, then it's fine too. Maybe a case of
both hw reset and "no driver bound", but also a case of "details we can
figure out later".
 
> > > + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> > > +    driver can react to the device configuration at function reset/driver load
> > > +    time, but otherwise should not be coupled to fwctl.
> > 
> > Kinda the same worry as above, I think this should be "... but otherwise
> > _must_ not be coupled to fwctl".
> 
> Yep
> 
> > > + 4. Operate the HW in a way that overlaps with the core purpose of another
> > > +    primary kernel subsystem, such as read/write to LBAs, send/receive of
> > > +    network packets, or operate an accelerator's data plane.
> > 
> > I still think some words about what to do when fwctl exposes some
> > functional which later on is covered by a newly added subsystem that
> > didn't yet exist. Also maybe adding some more examples from the RAS side
> > of things, since that's come up a few time in the ksummit-discuss thread,
> > plus I think it's where we'll most likely have a new subsystem or extended
> > functionality of an existing one pop up and cause conflicts with fwctl
> > rpcs that have landed earlier.
> 
> How about:
> 
> Operations exposed through fwctl's non-taining interfaces should be fully
> sharable with other users of the device. For instance exposing a RPC through
> fwctl should never prevent a kernel subsystem from also concurrently using that
> same RPC or hardware unit down the road. In such cases fwctl will be less
> important than proper kernel subsystems that eventually emerge. Mistakes in this
> area resulting in clashes will be resolved in favour of a kernel implementation.

Ack.

> 
> > > +fwctl Driver design
> > > +-------------------
> > > +
> > > +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> > > +device possibly using the auxiliary_device mechanism. In that case several
> > > +subsystems are going to be sharing the same device and FW interface layer so the
> > > +device design must already provide for isolation and cooperation between kernel
> > > +subsystems. fwctl should fit into that same model.
> > > +
> > > +Part of the driver should include a description of how its scope restrictions
> > > +and security model work. The driver and FW together must ensure that RPCs
> > > +provided by user space are mapped to the appropriate scope. If the validation is
> > > +done in the driver then the validation can read a 'command effects' report from
> > > +the device, or hardwire the enforcement. If the validation is done in the FW,
> > > +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> > > +
> > > +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> > > +any FW resources, or any resources it does allocate are freed on FD closure.  A
> > > +driver primarily constructed around FW RPCs may find that its core PCI function
> > > +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> > > +subsystems.
> > > +
> > > +Each device type must represent a stable FW ABI, such that the userspace
> > > +components have the same general stability we expect from the kernel. FW upgrade
> > > +should not break the userspace tools.
> > 
> > I think an exception for the debug rpcs (or maybe only
> > FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
> > for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
> > to remove outdated ones. 
> 
> I'm definitely mindful of Linus's pragmatic view of ABI stability, where
> existing tools that *people actually use* shouldn't break. You
> shouldn't have to upgrade your tools when you upgrade your FW.
> 
> I think it is important to convey that as a the gold standard here
> too.
> 
> > And especially for field and even more in-house debug tooling, you really
> > want the userspace version matching your fw anyway.
> 
> Yes, this is definately the case. But those tools are also private and
> I think fall under the *people actually use* exception.
> 
> So, lets try again:
> 
> Each device type must be mindful of Linux's philosophy for stable ABI. The FW
> RPC interface does not have to meet a strictly stable ABI, but it does need to
> meet an expectation that userspace tools that are deployed and in significant
> use don't needlessly break. FW upgrade and kernel upgrade should keep widely
> deployed tooling working.
> 
> Development and debugging focused RPCs under more permissive scopes can have
> less stablitiy if the tools using them are only run under exceptional
> circumstances and not for every day use of the device. Debugging tools may even
> require exact version matching as they may require something similar to DWARF
> debug information from the FW binary.

Perfect imo, ack.

> > Currently that mess tends to leave in debugfs and/or out-of-tree, so
> > there's no stable uapi guarantee anyway. And I don't see the point in
> > requiring it - if there is a need for stabling tooling, maybe that
> > indicates more the need for a new subsystem that standardized things
> > across devices/vendors.
> 
> Yes, fwctl needs to have both. I would expect things like the
> configuration to have a fairly stable ABI. Maybe the list of
> configurables will change but access to them should be ABI stable.
> 
> > > +Some examples:
> > > +
> > > + - HW RAID controllers. This includes RPCs to do things like compose drives into
> > > +   a RAID volume, configure RAID parameters, monitor the HW and more.
> > > +
> > > + - Baseboard managers. RPCs for configuring settings in the device and more
> > > +
> > > + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> > > +   functions that different products have defined, but more exists.
> > > +
> > > + - CXL also has a NVMe-like vendor command system.
> > > +
> > > + - DRM allows user space drivers to send commands to the device via kernel
> > > +   mediation
> > > +
> > > + - RDMA allows user space drivers to directly push commands to the device
> > > +   without kernel involvement
> > > +
> > > + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> > > +
> > > +The first 4 are examples of areas that fwctl intends to cover.
> > 
> > Maybe add a sentence here why the latter 3 aren't, just to strengthen that
> > point?
> 
> How about:
> 
> The first 4 are examples of areas that fwctl intends to cover. The latter three
> are examples of denied behavior as they fully overlap with the primary purpose
> of a kernel subsystem.

Ack.

Cheers, Sima

Patch

diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
new file mode 100644
index 00000000000000..ece2db2530502f
--- /dev/null
+++ b/Documentation/userspace-api/fwctl.rst
@@ -0,0 +1,269 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+fwctl subsystem
+===============
+
+:Author: Jason Gunthorpe
+
+Overview
+========
+
+Modern devices contain extensive amounts of FW, and in many cases, are largely
+software-defined pieces of hardware. The evolution of this approach is largely a
+reaction to Moore's Law where a chip tape out is now highly expensive, and the
+chip design is extremely large. Replacing fixed HW logic with a flexible and
+tightly coupled FW/HW combination is an effective risk mitigation against chip
+respin. Problems in the HW design can be counteracted in device FW. This is
+especially true for devices which present a stable and backwards compatible
+interface to the operating system driver (such as NVMe).
+
+The FW layer in devices has grown to incredible sizes and devices frequently
+integrate clusters of fast processors to run it. For example, mlx5 devices have
+over 30MB of FW code, and big configurations operate with over 1GB of FW managed
+runtime state.
+
+The availability of such a flexible layer has created quite a variety in the
+industry where single pieces of silicon are now configurable software-defined
+devices and can operate in substantially different ways depending on the need.
+Further, we often see cases where specific sites wish to operate devices in ways
+that are highly specialized and require applications that have been tailored to
+their unique configuration.
+
+Further, devices have become multi-functional and integrated to the point they
+no longer fit neatly into the kernel's division of subsystems. Modern
+multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
+subsystems while sharing the underlying hardware using the auxiliary device
+system.
+
+All together this creates a challenge for the operating system, where devices
+have an expansive FW environment that needs robust device-specific debugging
+support, and FW-driven functionality that is not well suited to “generic”
+interfaces. fwctl seeks to allow access to the full device functionality from
+user space in the areas of debuggability, management, and first-boot/nth-boot
+provisioning.
+
+fwctl is aimed at the common device design pattern where the OS and FW
+communicate via an RPC message layer constructed with a queue or mailbox scheme.
+In this case the driver will typically have some layer to deliver RPC messages
+and collect RPC responses from device FW. The in-kernel subsystem drivers that
+operate the device for its primary purposes will use these RPCs to build their
+drivers, but devices also usually have a set of ancillary RPCs that don't really
+fit into any specific subsystem. For example, a HW RAID controller is primarily
+operated by the block layer but also comes with a set of RPCs to administer the
+construction of drives within the HW RAID.
+
+In the past when devices were more single function, individual subsystems would
+grow different approaches to solving some of these common problems. For instance
+monitoring device health, manipulating its FLASH, debugging the FW,
+provisioning, all have various unique interfaces across the kernel.
+
+fwctl's purpose is to define a common set of limited rules, described below,
+that allow user space to securely construct and execute RPCs inside device FW.
+The rules serve as an agreement between the operating system and FW on how to
+correctly design the RPC interface. As a uAPI the subsystem provides a thin
+layer of discovery and a generic uAPI to deliver the RPCs and collect the
+response. It supports a system of user space libraries and tools which will
+use this interface to control the device using the device native protocols.
+
+Scope of Action
+---------------
+
+fwctl drivers are strictly restricted to being a way to operate the device FW.
+It is not an avenue to access random kernel internals, or other operating system
+SW states.
+
+fwctl instances must operate on a well-defined device function, and the device
+should have a well-defined security model for what scope within the physical
+device the function is permitted to access. For instance, the most complex PCIe
+device today may broadly have several function-level scopes:
+
+ 1. A privileged function with full access to the on-device global state and
+    configuration
+
+ 2. Multiple hypervisor functions, each with control over itself and its child
+    functions used with VMs
+
+ 3. Multiple VM functions tightly scoped within the VM
+
+The device may create a logical parent/child relationship between these scopes.
+For instance, a child VM's FW may be within the scope of the hypervisor FW. It
+is quite common in the VFIO world that the hypervisor environment has a complex
+provisioning/profiling/configuration responsibility for the function VFIO
+assigns to the VM.
+
+Further, within the function, devices often have RPC commands that fall within
+some general scopes of action:
+
+ 1. Access to function & child configuration, FLASH, etc. that becomes live at a
+    function reset.
+
+ 2. Access to function & child runtime configuration that kernel drivers can
+    discover at runtime.
+
+ 3. Read only access to function debug information that may report on FW objects
+    in the function & child, including FW objects owned by other kernel
+    subsystems.
+
+ 4. Write access to function & child debug information strictly compatible with
+    the principles of kernel lockdown and kernel integrity protection. Triggers
+    a kernel Taint.
+
+ 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
+
+User space will provide a scope label on each RPC and the kernel must enforce
+the above CAPs and taints based on that scope. A combination of kernel and FW
+can enforce that RPCs are placed in the correct scope by user space.
+
+Denied behavior
+---------------
+
+There are many things this interface must not allow user space to do (without a
+Taint or CAP), broadly derived from the principles of kernel lockdown. Some
+examples:
+
+ 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
+    otherwise compromise device or system security and integrity.
+
+ 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
+    objects owned by kernel drivers.
+
+ 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
+    driver can react to the device configuration at function reset/driver load
+    time, but otherwise should not be coupled to fwctl.
+
+ 4. Operate the HW in a way that overlaps with the core purpose of another
+    primary kernel subsystem, such as read/write to LBAs, send/receive of
+    network packets, or operate an accelerator's data plane.
+
+fwctl is not a replacement for device direct access subsystems like uacce or
+VFIO.
+
+fwctl User API
+==============
+
+.. kernel-doc:: include/uapi/fwctl/fwctl.h
+.. kernel-doc:: include/uapi/fwctl/mlx5.h
+
+sysfs Class
+-----------
+
+fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
+(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
+operates the ioctl uAPI described above.
+
+fwctl devices can be related to driver components in other subsystems through
+sysfs::
+
+    $ ls /sys/class/fwctl/fwctl0/device/infiniband/
+    ibp0s10f0
+
+    $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
+    fwctl0/
+
+    $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
+    dev  device  power  subsystem  uevent
+
+User space Community
+--------------------
+
+Drawing inspiration from nvme-cli, participating in the kernel side must come
+with a user space implementation in a common TBD git tree, at a minimum to
+usefully operate the kernel driver. Providing such an implementation is a
+pre-condition to merging a kernel driver.
+
+The goal is to build a user space community around some of the shared problems
+we all have, and ideally develop some common user space programs with some
+starting themes of:
+
+ - Device in-field debugging
+
+ - HW provisioning
+
+ - VFIO child device profiling before VM boot
+
+ - Confidential Compute topics (attestation, secure provisioning)
+
+These stretch across all subsystems in the kernel. fwupd is a great example of
+how an excellent user space experience can emerge out of kernel-side diversity.
+
+fwctl Kernel API
+================
+
+.. kernel-doc:: drivers/fwctl/main.c
+   :export:
+.. kernel-doc:: include/linux/fwctl.h
+
+fwctl Driver design
+-------------------
+
+In many cases a fwctl driver is going to be part of a larger cross-subsystem
+device, possibly using the auxiliary_device mechanism. In that case, several
+subsystems are going to be sharing the same device and FW interface layer so the
+device design must already provide for isolation and cooperation between kernel
+subsystems. fwctl should fit into that same model.
+
+Part of the driver should include a description of how its scope restrictions
+and security model work. The driver and FW together must ensure that RPCs
+provided by user space are mapped to the appropriate scope. If the validation is
+done in the driver then the validation can read a 'command effects' report from
+the device, or hardwire the enforcement. If the validation is done in the FW,
+then the driver should pass the fwctl_rpc_scope to the FW along with the command.
+
+The driver and FW must cooperate to ensure that either fwctl cannot allocate
+any FW resources, or any resources it does allocate are freed on FD closure. A
+driver primarily constructed around FW RPCs may find that its core PCI function
+and RPC layer belong under fwctl with auxiliary devices connecting to other
+subsystems.
+
+Each device type must represent a stable FW ABI, such that the user space
+components have the same general stability we expect from the kernel. FW upgrade
+should not break the user space tools.
+
+Security Response
+=================
+
+The kernel remains the gatekeeper for this interface. If violations of the
+scopes, security or isolation principles are found, we have options to let
+devices fix them with a FW update, push a kernel patch to parse and block RPC
+commands or push a kernel patch to block entire firmware versions/devices.
+
+While the kernel can always directly parse and restrict RPCs, it is expected
+that the existing kernel pattern of allowing drivers to delegate validation to
+FW will be a useful design.
+
+Existing Similar Examples
+=========================
+
+The approach described in this document is not a new idea. Direct, or near
+direct, device access has been offered by the kernel in different areas for
+decades. With more devices wanting to follow this design pattern, it is becoming
+clear that it is not entirely well understood and, more importantly, the
+security considerations are not well defined or agreed upon.
+
+Some examples:
+
+ - HW RAID controllers. This includes RPCs to do things like compose drives into
+   a RAID volume, configure RAID parameters, monitor the HW and more.
+
+ - Baseboard managers. RPCs for configuring settings in the device and more
+
+ - NVMe vendor command capsules. nvme-cli provides access to some monitoring
+   functions that different products have defined, but more exist.
+
+ - CXL also has an NVMe-like vendor command system.
+
+ - DRM allows user space drivers to send commands to the device via kernel
+   mediation
+
+ - RDMA allows user space drivers to directly push commands to the device
+   without kernel involvement
+
+ - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
+
+The first 4 are examples of areas that fwctl intends to cover.
+
+A key lesson learned from these past efforts is the importance of having a
+common user space project to use as a pre-condition for obtaining a kernel
+driver. Developing good community around useful software in user space is key to
+getting companies to fund participation to enable their products.
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a251d71fa6e14..990b4c0710c99e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -44,6 +44,7 @@  Devices and I/O
 
    accelerators/ocxl
    dma-buf-alloc-exchange
+   fwctl
    gpio/index
    iommu
    iommufd