[V6,04/10] PCI/DOE: Introduce pci_doe_create_doe_devices

Message ID	20220201071952.900068-5-ira.weiny@intel.com
State	Superseded
Headers	show Return-Path: <linux-cxl-owner@kernel.org> From: ira.weiny@intel.com To: Dan Williams <dan.j.williams@intel.com>, Jonathan Cameron <Jonathan.Cameron@huawei.com>, Bjorn Helgaas <bhelgaas@google.com> Cc: Alison Schofield <alison.schofield@intel.com>, Vishal Verma <vishal.l.verma@intel.com>, Ira Weiny <ira.weiny@intel.com>, Ben Widawsky <ben.widawsky@intel.com>, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-pci@vger.kernel.org Subject: [PATCH V6 04/10] PCI/DOE: Introduce pci_doe_create_doe_devices Date: Mon, 31 Jan 2022 23:19:46 -0800 Message-Id: <20220201071952.900068-5-ira.weiny@intel.com> In-Reply-To: <20220201071952.900068-1-ira.weiny@intel.com> References: <20220201071952.900068-1-ira.weiny@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	CXL: Read CDAT and DSMAS data from the device \| expand [V6,00/10] CXL: Read CDAT and DSMAS data from the device [V6,01/10] PCI: Add vendor ID for the PCI SIG [V6,02/10] PCI: Replace magic constant for PCI Sig Vendor ID [V6,03/10] PCI/DOE: Add Data Object Exchange Aux Driver [V6,04/10] PCI/DOE: Introduce pci_doe_create_doe_devices [V6,05/10] cxl/pci: Create DOE auxiliary devices [V6,06/10] cxl/pci: Find the DOE mailbox which supports CDAT [V6,07/10] cxl/mem: Read CDAT table [V6,08/10] cxl/cdat: Introduce cdat_hdr_valid() [V6,09/10] cxl/mem: Retry reading CDAT on failure [V6,10/10] cxl/cdat: Parse out DSMAS data from CDAT table

Ira Weiny Feb. 1, 2022, 7:19 a.m. UTC

From: Ira Weiny <ira.weiny@intel.com>

CXL and/or PCI devices can define DOE mailboxes.  Normally the kernel
will want to maintain control of all of these mailboxes.  However, under
a limited number of use cases users may want to allow user space access
to some of these mailboxes while the kernel retains control of the rest.
An example of this is for CXL Compliance Testing (see CXL 2.0 14.16.4
Compliance Mode DOE) which offers a mechanism to set different test
modes for a device.

Rather than re-invent the wheel the architecture creates auxiliary
devices for each DOE mailbox which can then be driven by a generic DOE
mailbox driver.  If access to an individual mailbox is required by user
space the driver for that mailbox can be unloaded and access handed to
user space.

Create the helper pci_doe_create_doe_devices() which iterates each DOE
mailbox found in the device and creates a DOE auxiliary device on the
auxiliary bus.  While doing so ensure that the auxiliary DOE driver
loads to drive that device.

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V5:
	Rebased to latest
	Split this off from the CXL specific patch.  This introduces the
	support in PCI and the CXL code can call it in a future patch.
	Remove soft dep from the cxl_pci code because use of the helper
		function ensures that the pci_doe driver is already
		loaded.
	From Jonathan and Bjorn
		Move DOE device creation to the PCI core via
			pci_doe_create_doe_devices() helper
		document need for pci_set_master()
	From Bjorn
		Reword commit message for clarity
		put DOE_DEV_NAME in this patch from the previous
		Remove '__' prefix
	From Jonathan
		remove CXL_ADDRSPACE_* defines

Changes from V4:
	Make this an Auxiliary Driver rather than library functions
	Split this out into it's own patch
	Base on the new cxl_dev_state structure

Changes from Ben
	s/CXL_DOE_DEV_NAME/DOE_DEV_NAME/
---
 drivers/pci/doe.c       | 123 ++++++++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h |   3 +
 2 files changed, 126 insertions(+)

Bjorn Helgaas Feb. 3, 2022, 10:44 p.m. UTC | #1

On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> CXL and/or PCI devices can define DOE mailboxes.  

In concrete terms, "DOE mailbox" refers to a DOE Capability, right?
PCIe devices are allowed to implement several instances of the DOE
Capability, of course.  I'm kind of partial to concreteness because it
makes it easier to map between the code and the spec.

> Normally the kernel will want to maintain control of all of these
> mailboxes.  However, under a limited number of use cases users may
> want to allow user space access to some of these mailboxes while the
> kernel retains control of the rest.

Is there something in this patch related to user-space vs kernel
control of things?  To me this patch looks like "for every DOE
Capability on a device, create an auxiliary device and try to attach
an auxiliary device driver to it."

If part of creating the auxiliary devices is adding things in sysfs, I
think it would be useful to mention that here.

> An example of this is for CXL Compliance Testing (see CXL 2.0
> 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> different test modes for a device.

Not sure exactly what this contributes here.  I guess you're saying
you might want user-space access to this, but I don't see anything in
this patch related to that.

> Rather than re-invent the wheel the architecture creates auxiliary
> devices for each DOE mailbox which can then be driven by a generic
> DOE mailbox driver.  If access to an individual mailbox is required
> by user space the driver for that mailbox can be unloaded and access
> handed to user space.

IIUC a device can have several DOE Capabilities, and each Capability
can support several protocols.  So I would think the granularity might
be "protocol" rather than "mailbox" (DOE Capability).

But either way this text seems like it would go with a different patch
since this patch has nothing to specify a particular protocol or even
a particular mailbox/DOE Capability.

> Create the helper pci_doe_create_doe_devices() which iterates each DOE
> mailbox found in the device and creates a DOE auxiliary device on the
> auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> loads to drive that device.

Here's a case where "iterating over DOE mailboxes found in the device"
is slightly abstract.  The code obviously iterates over DOE
*Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
find in the spec.

Knowing that this is a PCIe Capability is useful because it puts it in
the context of other capabilities ("optional things that live in
config space") and the mechanisms for synchronization and user-space
access.

> +/**
> + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> + *                              mailboxes found
> + * @pci_dev: The PCI device to scan for DOE mailboxes
> + *
> + * There is no coresponding destroy of these devices.  This function associates
> + * the DOE auxiliary devices created with the pci_dev passed in.  That
> + * association is device managed (devm_*) such that the DOE auxiliary device
> + * lifetime is always greater than or equal to the lifetime of the pci_dev.

This seems backwards.  What does it mean if the DOE aux dev lifetime
is *greater* than that of the pci_dev?  Surely you can't access a PCI
DOE Capability if the pci_dev is gone?

> + * RETURNS: 0 on success -ERRNO on failure.
> + */
> +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	int irqs, rc;
> +	u16 pos = 0;
> +
> +	/*
> +	 * An implementation may support an unknown number of interrupts.
> +	 * Assume that number is not that large and request them all.

This doesn't really inspire confidence :)  Playing devil's advocate,
since pdev is an arbitrary device, I would assume the number *is*
large.

> +	irqs = pci_msix_vec_count(pdev);
> +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);

pci_msix_vec_count() is apparently sort of discouraged; see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

A DOE Capability may be implemented by any device, e.g., a NIC or
storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
here and in the driver for the device's primary functionality.  Can
you reassure me that this is actually OK and safe?

Sorry if I've asked this before.  If I have, perhaps a comment would
be useful.

> +	if (rc != irqs) {
> +		/* No interrupt available - carry on */
> +		pci_dbg(pdev, "No interrupts available for DOE\n");
> +	} else {
> +		/*
> +		 * Enabling bus mastering is require for MSI/MSIx.  It could be

s/require/required/
s/MSIx/MSI-X/ to match spec usage.

But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
"PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

> +		 * done later within the DOE initialization, but as it
> +		 * potentially has other impacts keep it here when setting up
> +		 * the IRQ's.

s/IRQ's/IRQs/

"Potentially has other impacts" is too vague, and this doesn't explain
why bus mastering should be enabled here rather than later.  The
device should not issue an MSI-X until DOE Interrupt Enable is set, so
near there seems like a logical place.

> +		 */
> +		pci_set_master(pdev);
> +		rc = devm_add_action_or_reset(dev,
> +					      pci_doe_free_irq_vectors,
> +					      pdev);
> +		if (rc)
> +			return rc;
> +	}

> +++ b/include/linux/pci-doe.h
> @@ -13,6 +13,8 @@
>  #ifndef LINUX_PCI_DOE_H
>  #define LINUX_PCI_DOE_H
>  
> +#define DOE_DEV_NAME "doe"

This is only used once, above.  Why not just use the string there
directly and skip the #define?  If it's needed elsewhere eventually,
we can add a #define then.

Bjorn

Jonathan Cameron Feb. 4, 2022, 2:51 p.m. UTC | #2

On Thu, 3 Feb 2022 16:44:37 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL and/or PCI devices can define DOE mailboxes.    
> 
> In concrete terms, "DOE mailbox" refers to a DOE Capability, right?
> PCIe devices are allowed to implement several instances of the DOE
> Capability, of course.  I'm kind of partial to concreteness because it
> makes it easier to map between the code and the spec.

I'll throw my opinions in here, though a lot of this is really
about the justification for representing this via an Aux bus.
Personally I wouldn't mind if we decided the kernel was always in
charge of DOEs.  Others have a different opinion :)

Changing references to be the DOE Capability makes sense to me to me
though may be worth a one line statement that each DOE capability
provides a single mailbox interface.

> 
> > Normally the kernel will want to maintain control of all of these
> > mailboxes.  However, under a limited number of use cases users may
> > want to allow user space access to some of these mailboxes while the
> > kernel retains control of the rest.  
> 
> Is there something in this patch related to user-space vs kernel
> control of things?  To me this patch looks like "for every DOE
> Capability on a device, create an auxiliary device and try to attach
> an auxiliary device driver to it."

Only the 'unbind' and access it directly (setpci etc) route.
These section of commentary was introduced because some people felt
we needed a standard way to do that unbind - that's one of the main
arguments in favour of using an auxbus as I understand it.

> 
> If part of creating the auxiliary devices is adding things in sysfs, I
> think it would be useful to mention that here.

The sysfs side of things is protocol specific. We dropped the plan
to have a generic interface. It will add a device though, with basic
controls, so good to call that out here (it's mentioned in a later
patch description).

> 
> > An example of this is for CXL Compliance Testing (see CXL 2.0
> > 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> > different test modes for a device.  
> 
> Not sure exactly what this contributes here.  I guess you're saying
> you might want user-space access to this, but I don't see anything in
> this patch related to that.

Again, was part of the requirement raised to allow unbinding of the
DOE driver. Don't need to mention specifics here though so I'd be
fine with dropping this text.

> 
> > Rather than re-invent the wheel the architecture creates auxiliary
> > devices for each DOE mailbox which can then be driven by a generic
> > DOE mailbox driver.  If access to an individual mailbox is required
> > by user space the driver for that mailbox can be unloaded and access
> > handed to user space.  
> 
> IIUC a device can have several DOE Capabilities, and each Capability
> can support several protocols.  So I would think the granularity might
> be "protocol" rather than "mailbox" (DOE Capability).

The granularity of whether to have control from a driver or not has
to be at the capability level, because we need to mediate access
to the capability, not the protocol (that may need additional mediation,
but that's a problem for whatever controls the capability).
So either the capability is in control of userspace, or it is in control
of kernel.

> 
> But either way this text seems like it would go with a different patch
> since this patch has nothing to specify a particular protocol or even
> a particular mailbox/DOE Capability.
> 
> > Create the helper pci_doe_create_doe_devices() which iterates each DOE
> > mailbox found in the device and creates a DOE auxiliary device on the
> > auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> > loads to drive that device.  
> 
> Here's a case where "iterating over DOE mailboxes found in the device"
> is slightly abstract.  The code obviously iterates over DOE
> *Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
> find in the spec.
> 
> Knowing that this is a PCIe Capability is useful because it puts it in
> the context of other capabilities ("optional things that live in
> config space") and the mechanisms for synchronization and user-space
> access.

Agreed.

> 
> > +/**
> > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > + *                              mailboxes found
> > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no coresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always greater than or equal to the lifetime of the pci_dev.  
> 
> This seems backwards.  What does it mean if the DOE aux dev lifetime
> is *greater* than that of the pci_dev?  Surely you can't access a PCI
> DOE Capability if the pci_dev is gone?

I think the description is inaccurate - the end of life is the same
as that of the PCI driver binding to the pci_dev.  It'll get cleared
up if that is unbound etc.


> 
> > + * RETURNS: 0 on success -ERRNO on failure.
> > + */
> > +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation may support an unknown number of interrupts.
> > +	 * Assume that number is not that large and request them all.  
> 
> This doesn't really inspire confidence :)  Playing devil's advocate,
> since pdev is an arbitrary device, I would assume the number *is*
> large.

Thomas's recent series on enabling expanding msi-X vectors after enabling
msi-X may be useful.

https://lore.kernel.org/linux-arm-kernel/20211210221642.869015045@linutronix.de/
  
But as I understand it that is still a work in progress and only applies
to MSI-X anwyay.

Perhaps we are better off just making this the callers problem and
documenting that MSI or MSI-X must be enabled and vectors requested
before we call this function.

> 
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);  
> 
> pci_msix_vec_count() is apparently sort of discouraged; see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

We can just pass in INT_MAX then if we want the maximum available.

> 
> A DOE Capability may be implemented by any device, e.g., a NIC or
> storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
> here and in the driver for the device's primary functionality.  Can
> you reassure me that this is actually OK and safe?

> Sorry if I've asked this before.  If I have, perhaps a comment would
> be useful.

I went looking and earlier versions (pre Aux bus) made these calls in
the main driver before calling the DOE specific element.  So I don't think this came up
in previous reviews.

> 
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		pci_dbg(pdev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering is require for MSI/MSIx.  It could be  
> 
> s/require/required/
> s/MSIx/MSI-X/ to match spec usage.
> 
> But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
> "PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

Good point. I think it should be both.

> 
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts keep it here when setting up
> > +		 * the IRQ's.  
> 
> s/IRQ's/IRQs/
> 
> "Potentially has other impacts" is too vague, and this doesn't explain
> why bus mastering should be enabled here rather than later.  The
> device should not issue an MSI-X until DOE Interrupt Enable is set, so
> near there seems like a logical place.

I can't remember what lead to that comment so hopefully moving to just
before the enable would be fine - if there was somewhere to do it.
I'm not sure there is as the IRQ enable is in the Auxilliary
Bus driver.  If we pull the pci_alloc_irq_vectors() out of here
into the caller, then the pci_set_master() should go with it.


> 
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      pci_doe_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}  
> 
> > +++ b/include/linux/pci-doe.h
> > @@ -13,6 +13,8 @@
> >  #ifndef LINUX_PCI_DOE_H
> >  #define LINUX_PCI_DOE_H
> >  
> > +#define DOE_DEV_NAME "doe"  
> 
> This is only used once, above.  Why not just use the string there
> directly and skip the #define?  If it's needed elsewhere eventually,
> we can add a #define then.
> 
> Bjorn

Bjorn Helgaas Feb. 4, 2022, 4:27 p.m. UTC | #3

On Fri, Feb 04, 2022 at 02:51:16PM +0000, Jonathan Cameron wrote:
> On Thu, 3 Feb 2022 16:44:37 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:

> > > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > > + *                              mailboxes found
> > > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > > + *
> > > + * There is no coresponding destroy of these devices.  This function associates
> > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > + * lifetime is always greater than or equal to the lifetime of the pci_dev.  
> > 
> > This seems backwards.  What does it mean if the DOE aux dev
> > lifetime is *greater* than that of the pci_dev?  Surely you can't
> > access a PCI DOE Capability if the pci_dev is gone?
> 
> I think the description is inaccurate - the end of life is the same
> as that of the PCI driver binding to the pci_dev.  It'll get cleared
> up if that is unbound etc.

I don't know much about devm, but I *think* the devm things get
released by devres_release_all(), which is called by
__device_release_driver() after it calls the bus or driver's .remove()
method (pci_device_remove(), in this case).

So in this case, I think the aux dev is created after the pci_dev and
released after the PCI driver and the PCI core are done with the
pci_dev.  I assume some refcounting prevents the pci_dev from actually
being deallocated until the aux dev is done with it.

I'm not confident that this is a robust situation.

> > > +		 * done later within the DOE initialization, but as it
> > > +		 * potentially has other impacts keep it here when setting up
> > > +		 * the IRQ's.  
> > 
> > s/IRQ's/IRQs/
> > 
> > "Potentially has other impacts" is too vague, and this doesn't
> > explain why bus mastering should be enabled here rather than
> > later.  The device should not issue an MSI-X until DOE Interrupt
> > Enable is set, so near there seems like a logical place.
> 
> I can't remember what lead to that comment so hopefully moving to
> just before the enable would be fine - if there was somewhere to do
> it.  I'm not sure there is as the IRQ enable is in the Auxilliary
> Bus driver.  If we pull the pci_alloc_irq_vectors() out of here into
> the caller, then the pci_set_master() should go with it.

I think pci_set_master() is tied to setting PCI_DOE_CTRL_INT_EN, not
to pci_alloc_irq_vectors().

Bjorn

Dan Williams Feb. 11, 2022, 2:54 a.m. UTC | #4

On Fri, Feb 4, 2022 at 8:28 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Fri, Feb 04, 2022 at 02:51:16PM +0000, Jonathan Cameron wrote:
> > On Thu, 3 Feb 2022 16:44:37 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
>
> > > > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > > > + *                              mailboxes found
> > > > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > > > + *
> > > > + * There is no coresponding destroy of these devices.  This function associates
> > > > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > > > + * association is device managed (devm_*) such that the DOE auxiliary device
> > > > + * lifetime is always greater than or equal to the lifetime of the pci_dev.
> > >
> > > This seems backwards.  What does it mean if the DOE aux dev
> > > lifetime is *greater* than that of the pci_dev?  Surely you can't
> > > access a PCI DOE Capability if the pci_dev is gone?
> >
> > I think the description is inaccurate - the end of life is the same
> > as that of the PCI driver binding to the pci_dev.  It'll get cleared
> > up if that is unbound etc.
>
> I don't know much about devm, but I *think* the devm things get
> released by devres_release_all(), which is called by
> __device_release_driver() after it calls the bus or driver's .remove()
> method (pci_device_remove(), in this case).
>
> So in this case, I think the aux dev is created after the pci_dev and
> released after the PCI driver and the PCI core are done with the
> pci_dev.  I assume some refcounting prevents the pci_dev from actually
> being deallocated until the aux dev is done with it.
>
> I'm not confident that this is a robust situation.

devm is a replacement for hand coding driver ->remove() handlers.
Anything devm allocated at ->probe() will be freed in the proper
reverse order by the driver core after it calls ->remove(). Ideally
for pure devm usage the ->remove() handler can be elided altogether.
I'll go read this patch to make sure it follows the expected pattern
which is:

1/ Parent device driver performs kmalloc(), device_initialize(), and
device_add() of a child device.
2/ Parent registers a devm handler for that child device that will
trigger device_unregister() at remove

During parent device unregister or unbind the devm action will
complete device_unregister() for all children first.

That process is independent of the device lifetime that can be
arbitrarily extended by 3rd party get_device() or
CONFIG_DEBUG_KOBJECT_RELEASE. The device core / kobject hierarchy
guarantees that the parent device is pinned until after child-device
final put event. I.e. final put_device() on a child also triggers a
put_device() on the parent paired with the get_device() taken on the
parent at device_add() time.

Ira Weiny March 24, 2022, 12:26 a.m. UTC | #5

On Thu, Feb 03, 2022 at 04:44:37PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 31, 2022 at 11:19:46PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > CXL and/or PCI devices can define DOE mailboxes.  
> 
> In concrete terms, "DOE mailbox" refers to a DOE Capability, right?

Right.

> PCIe devices are allowed to implement several instances of the DOE
> Capability, of course.  I'm kind of partial to concreteness because it
> makes it easier to map between the code and the spec.

I agree.  I'm just not great at remembering the terminology.  I've fixed it up
in this next go round.

> 
> > Normally the kernel will want to maintain control of all of these
> > mailboxes.  However, under a limited number of use cases users may
> > want to allow user space access to some of these mailboxes while the
> > kernel retains control of the rest.
> 
> Is there something in this patch related to user-space vs kernel
> control of things?  To me this patch looks like "for every DOE
> Capability on a device, create an auxiliary device and try to attach
> an auxiliary device driver to it."

That was the way this series worked yest.

> 
> If part of creating the auxiliary devices is adding things in sysfs, I
> think it would be useful to mention that here.

Nothing was added in sysfs for this series and nothing is planned in the next
series.

That said Dan and I discussed internally and have settled on the PCI layer
being agnostic to the aux bus idea.

The new PCI layer code I have simply provides helpers to create struct
pci_doe_mb (mailbox objects) which control the state machine and accept
pci_doe_tasks as work requests.

The idea of using auxiliary bus devices is now used in the CXL layer as a way
to allow individual mailboxes to be controlled via user space directly by
unlinking that auxiliary device.

I have also added a modified version of Dan's patch from here:

https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/

The new version taints the kernel if a write occurs but allows any read.

I'm curious if allowing reads is really ok though.  Because allowing a read at
just the right time could allow snooping of DOE protocol data.  Could this be a
potential security issue with protocols like IDE?  It would be a difficult
attack for sure but...  :-/

> 
> > An example of this is for CXL Compliance Testing (see CXL 2.0
> > 14.16.4 Compliance Mode DOE) which offers a mechanism to set
> > different test modes for a device.
> 
> Not sure exactly what this contributes here.  I guess you're saying
> you might want user-space access to this, but I don't see anything in
> this patch related to that.

The detail in this patch was that if user space unlinked a specific DOE device
then user space could control that device directly and without interference
from this kernel code.  As I stated above we will use this mechanism for CXL
but other subsystems could decide to do something else and own each DOE MB
capability directly.

> 
> > Rather than re-invent the wheel the architecture creates auxiliary
> > devices for each DOE mailbox which can then be driven by a generic
> > DOE mailbox driver.  If access to an individual mailbox is required
> > by user space the driver for that mailbox can be unloaded and access
> > handed to user space.
> 
> IIUC a device can have several DOE Capabilities, and each Capability
> can support several protocols.  So I would think the granularity might
> be "protocol" rather than "mailbox" (DOE Capability).

We debated that and decided that was to fine a granularity.

> 
> But either way this text seems like it would go with a different patch
> since this patch has nothing to specify a particular protocol or even
> a particular mailbox/DOE Capability.

Again, this was just more justification for the aux bus architecture.

> 
> > Create the helper pci_doe_create_doe_devices() which iterates each DOE
> > mailbox found in the device and creates a DOE auxiliary device on the
> > auxiliary bus.  While doing so ensure that the auxiliary DOE driver
> > loads to drive that device.
> 
> Here's a case where "iterating over DOE mailboxes found in the device"
> is slightly abstract.  The code obviously iterates over DOE
> *Capabilities* (PCI_EXT_CAP_ID_DOE), and that's something I can easily
> find in the spec.

I've clarified that thanks.  Also in the new version I've created an iterator
to find the capabilities.

	pci_doe_for_each_off()

> 
> Knowing that this is a PCIe Capability is useful because it puts it in
> the context of other capabilities ("optional things that live in
> config space") and the mechanisms for synchronization and user-space
> access.

Yes thanks.

> 
> > +/**
> > + * pci_doe_create_doe_devices - Create auxiliary DOE devices for all DOE
> > + *                              mailboxes found
> > + * @pci_dev: The PCI device to scan for DOE mailboxes
> > + *
> > + * There is no coresponding destroy of these devices.  This function associates
> > + * the DOE auxiliary devices created with the pci_dev passed in.  That
> > + * association is device managed (devm_*) such that the DOE auxiliary device
> > + * lifetime is always greater than or equal to the lifetime of the pci_dev.
> 
> This seems backwards.  What does it mean if the DOE aux dev lifetime
> is *greater* than that of the pci_dev?  Surely you can't access a PCI
> DOE Capability if the pci_dev is gone?

No you could not.  Thus the idea that the pci_dev's lifetime was greater than
the lifetime of the auxiliary devices.

Regardless this has all changed away from being part of the core and more tied
to the management of particular devices.  Which I think is much more clear.

> 
> > + * RETURNS: 0 on success -ERRNO on failure.
> > + */
> > +int pci_doe_create_doe_devices(struct pci_dev *pdev)
> > +{
> > +	struct device *dev = &pdev->dev;
> > +	int irqs, rc;
> > +	u16 pos = 0;
> > +
> > +	/*
> > +	 * An implementation may support an unknown number of interrupts.
> > +	 * Assume that number is not that large and request them all.
> 
> This doesn't really inspire confidence :)  Playing devil's advocate,
> since pdev is an arbitrary device, I would assume the number *is*
> large.

I've moved the call to pci_alloc_irq_vectors() to the CXL code which is
managing the pci_dev itself (rather than being buried in this auxiliary device
stuff.)

> 
> > +	irqs = pci_msix_vec_count(pdev);
> > +	rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSIX);
> 
> pci_msix_vec_count() is apparently sort of discouraged; see
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/PCI/msi-howto.rst?id=v5.16#n179

I've removed pci_msix_vec_count() in favor of counting the DOE capabilities and
allocating the vectors based on that count.

> 
> A DOE Capability may be implemented by any device, e.g., a NIC or
> storage HBA, etc.  I'm a little queasy about IRQ alloc happening both
> here and in the driver for the device's primary functionality.  Can
> you reassure me that this is actually OK and safe?

I think it was perfectly safe for this implementation but it was probably not
a good idea generally.

> 
> Sorry if I've asked this before.  If I have, perhaps a comment would
> be useful.
> 
> > +	if (rc != irqs) {
> > +		/* No interrupt available - carry on */
> > +		pci_dbg(pdev, "No interrupts available for DOE\n");
> > +	} else {
> > +		/*
> > +		 * Enabling bus mastering is require for MSI/MSIx.  It could be
> 
> s/require/required/
> s/MSIx/MSI-X/ to match spec usage.
> 
> But I think you only support MSI-X, since you passed "PCI_IRQ_MSIX", not
> "PCI_IRQ_MSI | PCI_IRQ_MSIX" above?

Done.

> 
> > +		 * done later within the DOE initialization, but as it
> > +		 * potentially has other impacts keep it here when setting up
> > +		 * the IRQ's.
> 
> s/IRQ's/IRQs/
> 
> "Potentially has other impacts" is too vague, and this doesn't explain
> why bus mastering should be enabled here rather than later.  The
> device should not issue an MSI-X until DOE Interrupt Enable is set, so
> near there seems like a logical place.

Is it safe to call pci_set_master() more than once?

The reason I am asking is because I've debated between having the new create
mailbox command [pci_doe_create_mb()] request the irq or not.

The issue is the irq handler is part of the DOE state machine and so
pci_request_irq() needs to pass that handler.  I would rather not make that a
globally visible function.  Nor do I think it is appropriate for the DOE state
machine to trust callers setting the correct handler.

So currently the pci_alloc_irq_vectors() is the responsibility of the consumer
(CXL layer in this series) and pci_{request,free}_irq() is handled in the PCI
layer.

But placing pci_set_master() near the DOE Interrupt Enable would then cause
pci_set_master() to be called for each mailbox create.

For now I have left the pci_set_master() call next to pci_alloc_irq_vectors()
in the CXL layer.  As in Jonathans original code it gets called if the
allocation gets enough vectors for all mailboxes found.  And the use of irq's
is all or nothing for each CXL device.

Here is the code to be more clear...


drivers/cxl/pci.c:

int cxl_pci_create_doe_devices(struct pci_dev *pdev)
{               
        struct device *dev = &pdev->dev;
        bool use_irq = true;
        int irqs = 0;
        u16 off = 0;         
        int rc;
        
        pci_doe_for_each_off(pdev, off)
                irqs++;
        pci_info(pdev, "Found %d DOE mailbox's\n", irqs);
        
        /*                         
         * Allocate enough vectors for the DOE's
         */     
        rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
                                                     PCI_IRQ_MSIX);
        if (rc != irqs) {
                pci_err(pdev, "Not enough interrupts for all the DOEs; use polling\n");
                use_irq = false;
                /* Some got allocated; clean them up */
                if (rc > 0)
                        cxl_pci_free_irq_vectors(pdev); 
        } else {
                /*
                 * Enabling bus mastering is require for MSI/MSIx.  It could be
                 * done later within the DOE initialization, but as it
                 * potentially has other impacts keep it here when setting up
                 * the IRQ's.
                 */
                pci_set_master(pdev);
                rc = devm_add_action_or_reset(dev,
                                              cxl_pci_free_irq_vectors,
                                              pdev);
                if (rc)
                        return rc;
        }

        pci_doe_for_each_off(pdev, off) {
...
		/* Create each auxiliary device which internally calls */
		pci_doe_create_mb(pdev, off, use_irq);
...
	}
...
}


drivers/pci/pci-doe.c:

static irqreturn_t pci_doe_irq_handler(int irq, void *data)
{
...
}

static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
{
        struct pci_dev *pdev = doe_mb->pdev;
        int offset = doe_mb->cap_offset;
        int doe_irq, rc;
        u32 val;

        pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);

        if (!FIELD_GET(PCI_DOE_CAP_INT, val))
                return -ENOTSUPP;

        doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
        rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
                             NULL, doe_mb,
                             "DOE[%d:%s]", doe_irq, pci_name(pdev));
        if (rc) 
                return rc;

        doe_mb->irq = doe_irq;
        pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
                               PCI_DOE_CTRL_INT_EN);
        return 0;
}

struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
                                     bool use_irq)
{
...
        if (use_irq) {
                rc = pci_doe_request_irq(doe_mb);
                if (rc) 
                        pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
                                cap_offset, rc);
        }
...
}


Does this look reasonable?

> 
> > +		 */
> > +		pci_set_master(pdev);
> > +		rc = devm_add_action_or_reset(dev,
> > +					      pci_doe_free_irq_vectors,
> > +					      pdev);
> > +		if (rc)
> > +			return rc;
> > +	}
> 
> > +++ b/include/linux/pci-doe.h
> > @@ -13,6 +13,8 @@
> >  #ifndef LINUX_PCI_DOE_H
> >  #define LINUX_PCI_DOE_H
> >  
> > +#define DOE_DEV_NAME "doe"
> 
> This is only used once, above.  Why not just use the string there
> directly and skip the #define?  If it's needed elsewhere eventually,
> we can add a #define then.

This is now moved elsewhere in the series.

Thanks for the feedback and sorry it's taken so long to respond.
Ira

> 
> Bjorn

Jonathan Cameron March 24, 2022, 2:05 p.m. UTC | #6

Hi Ira,

> Here is the code to be more clear...
> 
> 
> drivers/cxl/pci.c:
> 
> int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> {               
>         struct device *dev = &pdev->dev;
>         bool use_irq = true;
>         int irqs = 0;
>         u16 off = 0;         
>         int rc;
>         
>         pci_doe_for_each_off(pdev, off)
>                 irqs++;
>         pci_info(pdev, "Found %d DOE mailbox's\n", irqs);
>         
>         /*                         
>          * Allocate enough vectors for the DOE's
>          */     
>         rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
>                                                      PCI_IRQ_MSIX);
>         if (rc != irqs) {
>                 pci_err(pdev, "Not enough interrupts for all the DOEs; use polling\n");
>                 use_irq = false;
>                 /* Some got allocated; clean them up */
>                 if (rc > 0)
>                         cxl_pci_free_irq_vectors(pdev); 
>         } else {
>                 /*
>                  * Enabling bus mastering is require for MSI/MSIx.  It could be
>                  * done later within the DOE initialization, but as it
>                  * potentially has other impacts keep it here when setting up
>                  * the IRQ's.
>                  */
>                 pci_set_master(pdev);
>                 rc = devm_add_action_or_reset(dev,
>                                               cxl_pci_free_irq_vectors,
>                                               pdev);
>                 if (rc)
>                         return rc;
>         }
> 
>         pci_doe_for_each_off(pdev, off) {
> ...
> 		/* Create each auxiliary device which internally calls */
> 		pci_doe_create_mb(pdev, off, use_irq);
> ...
> 	}
> ...
> }
> 
> 
> drivers/pci/pci-doe.c:
> 
> static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> {
> ...
> }
> 
> static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> {
>         struct pci_dev *pdev = doe_mb->pdev;
>         int offset = doe_mb->cap_offset;
>         int doe_irq, rc;
>         u32 val;
> 
>         pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> 
>         if (!FIELD_GET(PCI_DOE_CAP_INT, val))
>                 return -ENOTSUPP;
> 
>         doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
>         rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
>                              NULL, doe_mb,
>                              "DOE[%d:%s]", doe_irq, pci_name(pdev));
>         if (rc) 
>                 return rc;
> 
>         doe_mb->irq = doe_irq;
>         pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
>                                PCI_DOE_CTRL_INT_EN);
>         return 0;
> }
> 
> struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
>                                      bool use_irq)
> {
> ...
>         if (use_irq) {
>                 rc = pci_doe_request_irq(doe_mb);
>                 if (rc) 
>                         pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
>                                 cap_offset, rc);
>         }
> ...
> }
> 
> 
> Does this look reasonable?

I'm a little nervous about how we are going to make DOEs on switches work.
Guess I'll do an experiment once your next version is out and check we
can do that reasonably cleanly.  For switches we'll probably have to
check for DOEs on all such ports and end up with infrastructure to
map to all protocols we might see on a switch.

Jonathan

>

Ira Weiny March 24, 2022, 11:44 p.m. UTC | #7

On Thu, Mar 24, 2022 at 02:05:39PM +0000, Jonathan Cameron wrote:
> Hi Ira,
> 
> > Here is the code to be more clear...
> > 
> > 
> > drivers/cxl/pci.c:
> > 
> > int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > {               
> >         struct device *dev = &pdev->dev;
> >         bool use_irq = true;
> >         int irqs = 0;
> >         u16 off = 0;         
> >         int rc;
> >         
> >         pci_doe_for_each_off(pdev, off)
> >                 irqs++;
> >         pci_info(pdev, "Found %d DOE mailbox's\n", irqs);
> >         
> >         /*                         
> >          * Allocate enough vectors for the DOE's
> >          */     
> >         rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> >                                                      PCI_IRQ_MSIX);
> >         if (rc != irqs) {
> >                 pci_err(pdev, "Not enough interrupts for all the DOEs; use polling\n");
> >                 use_irq = false;
> >                 /* Some got allocated; clean them up */
> >                 if (rc > 0)
> >                         cxl_pci_free_irq_vectors(pdev); 
> >         } else {
> >                 /*
> >                  * Enabling bus mastering is require for MSI/MSIx.  It could be
> >                  * done later within the DOE initialization, but as it
> >                  * potentially has other impacts keep it here when setting up
> >                  * the IRQ's.
> >                  */
> >                 pci_set_master(pdev);
> >                 rc = devm_add_action_or_reset(dev,
> >                                               cxl_pci_free_irq_vectors,
> >                                               pdev);
> >                 if (rc)
> >                         return rc;
> >         }
> > 
> >         pci_doe_for_each_off(pdev, off) {
> > ...
> > 		/* Create each auxiliary device which internally calls */
> > 		pci_doe_create_mb(pdev, off, use_irq);
> > ...
> > 	}
> > ...
> > }
> > 
> > 
> > drivers/pci/pci-doe.c:
> > 
> > static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > {
> > ...
> > }
> > 
> > static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> > {
> >         struct pci_dev *pdev = doe_mb->pdev;
> >         int offset = doe_mb->cap_offset;
> >         int doe_irq, rc;
> >         u32 val;
> > 
> >         pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > 
> >         if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> >                 return -ENOTSUPP;
> > 
> >         doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> >         rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> >                              NULL, doe_mb,
> >                              "DOE[%d:%s]", doe_irq, pci_name(pdev));
> >         if (rc) 
> >                 return rc;
> > 
> >         doe_mb->irq = doe_irq;
> >         pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> >                                PCI_DOE_CTRL_INT_EN);
> >         return 0;
> > }
> > 
> > struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> >                                      bool use_irq)
> > {
> > ...
> >         if (use_irq) {
> >                 rc = pci_doe_request_irq(doe_mb);
> >                 if (rc) 
> >                         pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
> >                                 cap_offset, rc);
> >         }
> > ...
> > }
> > 
> > 
> > Does this look reasonable?
> 
> I'm a little nervous about how we are going to make DOEs on switches work.
> Guess I'll do an experiment once your next version is out and check we
> can do that reasonably cleanly.  For switches we'll probably have to
> check for DOEs on all such ports and end up with infrastructure to
> map to all protocols we might see on a switch.

Are the switches not represented as PCI devices in linux?

If my vision of switches is correct I think that problem is independent of what
I'm solving here.  In other words the relationship between a port on a switch
and a DOE capability on that switch will have to be established somehow and
nothing I'm doing precludes doing that, but at the same time nothing I'm doing
helps that either.

Ira

> 
> Jonathan
> 
> >

Jonathan Cameron March 25, 2022, 12:02 p.m. UTC | #8

On Thu, 24 Mar 2022 16:44:33 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Thu, Mar 24, 2022 at 02:05:39PM +0000, Jonathan Cameron wrote:
> > Hi Ira,
> >   
> > > Here is the code to be more clear...
> > > 
> > > 
> > > drivers/cxl/pci.c:
> > > 
> > > int cxl_pci_create_doe_devices(struct pci_dev *pdev)
> > > {               
> > >         struct device *dev = &pdev->dev;
> > >         bool use_irq = true;
> > >         int irqs = 0;
> > >         u16 off = 0;         
> > >         int rc;
> > >         
> > >         pci_doe_for_each_off(pdev, off)
> > >                 irqs++;
> > >         pci_info(pdev, "Found %d DOE mailbox's\n", irqs);
> > >         
> > >         /*                         
> > >          * Allocate enough vectors for the DOE's
> > >          */     
> > >         rc = pci_alloc_irq_vectors(pdev, irqs, irqs, PCI_IRQ_MSI |
> > >                                                      PCI_IRQ_MSIX);
> > >         if (rc != irqs) {
> > >                 pci_err(pdev, "Not enough interrupts for all the DOEs; use polling\n");
> > >                 use_irq = false;
> > >                 /* Some got allocated; clean them up */
> > >                 if (rc > 0)
> > >                         cxl_pci_free_irq_vectors(pdev); 
> > >         } else {
> > >                 /*
> > >                  * Enabling bus mastering is require for MSI/MSIx.  It could be
> > >                  * done later within the DOE initialization, but as it
> > >                  * potentially has other impacts keep it here when setting up
> > >                  * the IRQ's.
> > >                  */
> > >                 pci_set_master(pdev);
> > >                 rc = devm_add_action_or_reset(dev,
> > >                                               cxl_pci_free_irq_vectors,
> > >                                               pdev);
> > >                 if (rc)
> > >                         return rc;
> > >         }
> > > 
> > >         pci_doe_for_each_off(pdev, off) {
> > > ...
> > > 		/* Create each auxiliary device which internally calls */
> > > 		pci_doe_create_mb(pdev, off, use_irq);
> > > ...
> > > 	}
> > > ...
> > > }
> > > 
> > > 
> > > drivers/pci/pci-doe.c:
> > > 
> > > static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > {
> > > ...
> > > }
> > > 
> > > static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> > > {
> > >         struct pci_dev *pdev = doe_mb->pdev;
> > >         int offset = doe_mb->cap_offset;
> > >         int doe_irq, rc;
> > >         u32 val;
> > > 
> > >         pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > > 
> > >         if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> > >                 return -ENOTSUPP;
> > > 
> > >         doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> > >         rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> > >                              NULL, doe_mb,
> > >                              "DOE[%d:%s]", doe_irq, pci_name(pdev));
> > >         if (rc) 
> > >                 return rc;
> > > 
> > >         doe_mb->irq = doe_irq;
> > >         pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> > >                                PCI_DOE_CTRL_INT_EN);
> > >         return 0;
> > > }
> > > 
> > > struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > >                                      bool use_irq)
> > > {
> > > ...
> > >         if (use_irq) {
> > >                 rc = pci_doe_request_irq(doe_mb);
> > >                 if (rc) 
> > >                         pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
> > >                                 cap_offset, rc);
> > >         }
> > > ...
> > > }
> > > 
> > > 
> > > Does this look reasonable?  
> > 
> > I'm a little nervous about how we are going to make DOEs on switches work.
> > Guess I'll do an experiment once your next version is out and check we
> > can do that reasonably cleanly.  For switches we'll probably have to
> > check for DOEs on all such ports and end up with infrastructure to
> > map to all protocols we might see on a switch.  
> 
> Are the switches not represented as PCI devices in linux?
> 
> If my vision of switches is correct I think that problem is independent of what
> I'm solving here.  In other words the relationship between a port on a switch
> and a DOE capability on that switch will have to be established somehow and
> nothing I'm doing precludes doing that, but at the same time nothing I'm doing
> helps that either.

Sure, I'm just expressing nervousness and would want a PoC of that at least
to check it's not too nasty.  The port drivers are rather 'unusual' in PCI
so touching them always ends up more complex than I expect.

Anyhow, start of cycle so should be plenty of time to do such an RFC
once your code is out there.

Jonathan

> 
> Ira
> 
> > 
> > Jonathan
> >   
> > >

[V6,04/10] PCI/DOE: Introduce pci_doe_create_doe_devices

Commit Message

Comments

Patch