Message ID: 20211105235056.3711389-3-ira.weiny@intel.com
State: New, archived
Series: CXL: Read CDAT and DSMAS data from the device.
On Fri, 5 Nov 2021 16:50:53 -0700 <ira.weiny@intel.com> wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > with standard protocol discovery. Each mailbox is accessed through a > DOE Extended Capability. > > Define an auxiliary device driver which control DOE auxiliary devices > registered on the auxiliary bus. > > A DOE mailbox is allowed to support any number of protocols while some > DOE protocol specifications apply additional restrictions. > > The protocols supported are queried and cached. pci_doe_supports_prot() > can be used to determine if the DOE device supports the protocol > specified. > > A synchronous interface is provided in pci_doe_exchange_sync() to > perform a single query / response exchange from the driver through the > device specified. > > Testing was conducted against QEMU using: > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > This code is based on Jonathan's V4 series here: > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > Data Object Exchange (DOE) - Approved 12 March 2020 > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Hi Ira, Thanks for taking this on! I'm sure at least half the comments below are about things I wrote then forgot about. I'm not sure if it's a good thing but I've ignored this for long enough I'm almost reviewing it as fresh code :( I was carrying a local patch for the interrupt handler having figured out I'd missread the spec. Note that I've since concluded my local patch has it's own issues (it was unnecessarily complex) so I've made some suggestions below that I'm fairly sure fix things up. Note these paths are hard to test and require adding some fiddly state machines to QEMU to open up race windows... > > --- > Changes from Jonathan's V4 > Move the DOE MB code into the DOE auxiliary driver > Remove Task List in favor of a wait queue > > Changes from Ben > remove CXL references > propagate rc from pci functions on error ... > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > new file mode 100644 > index 000000000000..2e702fdc7879 > --- /dev/null > +++ b/drivers/pci/doe.c > @@ -0,0 +1,701 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Data Object Exchange ECN > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/bitfield.h> > +#include <linux/delay.h> > +#include <linux/jiffies.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/pci.h> > +#include <linux/pci-doe.h> > +#include <linux/workqueue.h> > +#include <linux/module.h> > + > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > + > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > + > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > +#define PCI_DOE_TIMEOUT HZ > + > +enum pci_doe_state { > + DOE_IDLE, > + DOE_WAIT_RESP, > + DOE_WAIT_ABORT, > + DOE_WAIT_ABORT_ON_ERR, > +}; > + > +/* /** Given it's in kernel-doc syntax, we might as well mark it as such. > + * struct pci_doe_task - description of a query / response task > + * @ex: The details of the task to be done > + * @rv: Return value. 
Length of received response or error > + * @cb: Callback for completion of task > + * @private: Private data passed to callback on completion > + */ > +struct pci_doe_task { > + struct pci_doe_exchange *ex; > + int rv; > + void (*cb)(void *private); > + void *private; > +}; > + > +/** > + * struct pci_doe - A single DOE mailbox driver > + * > + * @doe_dev: The DOE Auxiliary device being driven > + * @abort_c: Completion used for initial abort handling > + * @irq: Interrupt used for signaling DOE ready or abort > + * @irq_name: Name used to identify the irq for a particular DOE > + * @prots: Array of identifiers for protocols supported > + * @num_prots: Size of prots array > + * @cur_task: Current task the state machine is working on > + * @wq: Wait queue to wait on if a query is in progress > + * @state_lock: Protect the state of cur_task, abort, and dead > + * @statemachine: Work item for the DOE state machine > + * @state: Current state of this DOE > + * @timeout_jiffies: 1 second after GO set > + * @busy_retries: Count of retry attempts > + * @abort: Request a manual abort (e.g. on init) > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages > + * will immediately be aborted with error > + */ > +struct pci_doe { > + struct pci_doe_dev *doe_dev; > + struct completion abort_c; > + int irq; > + char *irq_name; > + struct pci_doe_protocol *prots; > + int num_prots; > + > + struct pci_doe_task *cur_task; > + wait_queue_head_t wq; > + struct mutex state_lock; > + struct delayed_work statemachine; > + enum pci_doe_state state; > + unsigned long timeout_jiffies; > + unsigned int busy_retries; > + unsigned int abort:1; > + unsigned int dead:1; > +}; > + > +static irqreturn_t pci_doe_irq(int irq, void *data) I was carrying a rework of this locally because I managed to convince myself this is wrong. It's been a while and naturally I didn't write a comprehensive set of notes on why it was wrong... (Note you can't trigger the problem paths in QEMU without some nasty hacks as it relies on opening up race windows that make limited sense for the QEMU implementation). It's all centered on some details of exactly what causes an interrupt on a DOE. Section 6.xx.3 Interrupt Generation states: If enabled, an interrupt message must be triggered every time the logical AND of the following conditions transitions from FALSE to TRUE: * The associated vector is unmasked ... * The value of the DOE interrupt enable bit is 1b * The value of the DOE interrupt status bit is 1b (only last one really maters to us I think). The interrupt status bit is an OR conditional. Must be set.. Data Object Read bit or DOE error bit set or DOE busy bit cleared. > +{ > + struct pci_doe *doe = data; > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { So this bit is set on any of: BUSY dropped, READY or ERROR. If it's set on BUSY drop, but then in between the read above and this clear READY becomes true, then my reading is that we will not get another interrupt. That is fine because we will read it again in the state machine and see the new state. We could do more of the dance in the interrupt controller by doing a reread after clear of INT_STATUS but I think it's cleaner to leave it in the state machine. It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as all the rest of the register is RO. 
> + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + /* Leave the error case to be handled outside IRQ */ > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { I don't think we can get here because int status already true. So should do this before the above general check to avoid clearning the interrupt (we don't want more interrupts during the abort though I'd hope the hardware wouldn't generate them). So move this before the previous check. > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + > + /* > + * Busy being cleared can result in an interrupt, but as > + * the original Busy may not have been detected, there is no > + * way to separate such an interrupt from a spurious interrupt. > + */ This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above (that was a misread of the spec from me in v4). So I don't think we can get here in any valid path. return IRQ_NONE; should be safe. > + return IRQ_HANDLED; > +} Summary of above suggested changes: 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one 2) Possibly uses pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); to be explicit on the write one to clear bit. 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. ... > + > +static void pci_doe_task_complete(void *private) > +{ > + complete(private); > +} I wonder why this is up here? I'd move it down to just above the _sync() function where it's used. This one was definitely one of mine :) > + > +static void doe_statemachine_work(struct work_struct *work) I developed an interesting "relationship" with this state machine during the original development ;) I've just walked the paths and convinced myself it works so all good. > +{ > + struct delayed_work *w = to_delayed_work(work); > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + struct pci_doe_task *task; > + bool abort; > + u32 val; > + int rc; > + > + mutex_lock(&doe->state_lock); > + task = doe->cur_task; > + abort = doe->abort; > + doe->abort = false; > + mutex_unlock(&doe->state_lock); > + > + if (abort) { > + /* > + * Currently only used during init - care needed if > + * pci_doe_abort() is generally exposed as it would impact > + * queries in flight. > + */ > + WARN_ON(task); > + doe->state = DOE_WAIT_ABORT; > + pci_doe_abort_start(doe); > + return; > + } > + > + switch (doe->state) { > + case DOE_IDLE: > + if (task == NULL) > + return; > + > + /* Nothing currently in flight so queue a task */ > + rc = pci_doe_send_req(doe, task->ex); > + /* > + * The specification does not provide any guidance on how long > + * some other entity could keep the DOE busy, so try for 1 > + * second then fail. Busy handling is best effort only, because > + * there is no way of avoiding racing against another user of > + * the DOE. 
> + */ > + if (rc == -EBUSY) { > + doe->busy_retries++; > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > + /* Long enough, fail this request */ > + pci_WARN(pdev, true, "DOE busy for too long\n"); > + doe->busy_retries = 0; > + goto err_busy; > + } > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > + return; > + } > + if (rc) > + goto err_abort; > + doe->busy_retries = 0; > + > + doe->state = DOE_WAIT_RESP; > + doe->timeout_jiffies = jiffies + HZ; > + /* Now poll or wait for IRQ with timeout */ > + if (doe->irq > 0) > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > + else > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + > + case DOE_WAIT_RESP: > + /* Not possible to get here with NULL task */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + rc = -EIO; > + goto err_abort; > + } > + > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > + /* If not yet at timeout reschedule otherwise abort */ > + if (time_after(jiffies, doe->timeout_jiffies)) { > + rc = -ETIMEDOUT; > + goto err_abort; > + } > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + } > + > + rc = pci_doe_recv_resp(doe, task->ex); > + if (rc < 0) > + goto err_abort; > + > + doe->state = DOE_IDLE; > + > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* Set the return value to the length of received payload */ > + task->rv = rc; > + task->cb(task->private); > + > + return; > + > + case DOE_WAIT_ABORT: > + case DOE_WAIT_ABORT_ON_ERR: > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > + /* Back to normal state - carry on */ > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* > + * For deliberately triggered abort, someone is > + * waiting. > + */ > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + > + doe->state = DOE_IDLE; > + return; > + } > + if (time_after(jiffies, doe->timeout_jiffies)) { > + /* Task has timed out and is dead - abort */ > + pci_err(pdev, "DOE ABORT timed out\n"); > + mutex_lock(&doe->state_lock); > + doe->dead = true; > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + } > + return; > + } > + > +err_abort: > + doe->state = DOE_WAIT_ABORT_ON_ERR; > + pci_doe_abort_start(doe); > +err_busy: > + task->rv = rc; > + task->cb(task->private); > + /* If here via err_busy, signal the task done. */ > + if (doe->state == DOE_IDLE) { > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + } > +} > + > +/** > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > + * @doe: DOE mailbox state structure > + * @ex: Description of the buffers and Vendor ID + type used in this > + * request/response pair > + * > + * Excess data will be discarded. 
> + * > + * RETURNS: payload in bytes on success, < 0 on error > + */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + struct pci_doe_task task; > + DECLARE_COMPLETION_ONSTACK(c); > + > + if (!doe) > + return -EAGAIN; > + > + /* DOE requests must be a whole number of DW */ > + if (ex->request_pl_sz % sizeof(u32)) > + return -EINVAL; > + > + task.ex = ex; > + task.cb = pci_doe_task_complete; > + task.private = &c; > + > +again: Hmm. Whether having this code at this layer makes sense hinges on whether we want to easily support async use of the DOE in future. In v4 some of the async handling had ended up in this function and should probably have been factored out to give us a 'queue up work' then 'wait for completion' sequence. Given there is now more to be done in here perhaps we need to think about such a separation to keep it clear that this is fundamentally a synchronous wrapper around an asynchronous operation. > + mutex_lock(&doe->state_lock); > + if (doe->cur_task) { > + mutex_unlock(&doe->state_lock); > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > + goto again; > + } > + > + if (doe->dead) { > + mutex_unlock(&doe->state_lock); > + return -EIO; > + } > + doe->cur_task = &task; > + schedule_delayed_work(&doe->statemachine, 0); > + mutex_unlock(&doe->state_lock); > + > + wait_for_completion(&c); > + > + return task.rv; > +} > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > + > +/** > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > + * @pdev: Device on which to find the DOE instance > + * @vid: Protocol Vendor ID > + * @type: protocol type > + * > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > + * exchange through that DOE mailbox. > + * > + * RETURNS: True if the DOE device supports the protocol specified > + */ > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + int i; > + > + if (!doe) > + return false; How would this happen? I don't think it can... Probably false paranoia from me... > + > + for (i = 0; i < doe->num_prots; i++) > + if ((doe->prots[i].vid == vid) && > + (doe->prots[i].type == type)) > + return true; > + > + return false; > +} > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); ... > +static void pci_doe_release_irq(struct pci_doe *doe) > +{ > + if (doe->irq > 0) > + free_irq(doe->irq, doe); Is this trivial wrapper worth bothering with? Maybe just put the code inline? > +} > + ... > + > +static void pci_doe_unregister(struct pci_doe *doe) > +{ > + pci_doe_release_irq(doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); This makes me wonder if we should be doing the get_device() earlier in probe? Limited harm in moving it to near the start and then ending up with it being 'obviously' correct... 
> +} > + > +/* > + * pci_doe_probe() - Set up the Mailbox > + * @aux_dev: Auxiliary Device > + * @id: Auxiliary device ID > + * > + * Probe the mailbox found for all protocols and set up the Mailbox > + * > + * RETURNS: 0 on success, < 0 on error > + */ > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > + const struct auxiliary_device_id *id) > +{ > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > + struct pci_doe_dev, > + adev); > + struct pci_doe *doe; > + int rc; > + > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); Could go devm_ for this I think, though may not be worthwhile. > + if (!doe) > + return -ENOMEM; > + > + mutex_init(&doe->state_lock); > + init_completion(&doe->abort_c); > + doe->doe_dev = doe_dev; > + init_waitqueue_head(&doe->wq); > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > + dev_set_drvdata(&aux_dev->dev, doe); > + > + rc = pci_doe_register(doe); > + if (rc) > + goto err_free; > + > + rc = pci_doe_cache_protocols(doe); > + if (rc) { > + pci_doe_unregister(doe); Mixture of different forms of error handling here. I'd move this below and add an err_unregister label. > + goto err_free; > + } > + > + return 0; > + > +err_free: > + kfree(doe); > + return rc; > +} > + > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > +{ > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > + > + /* First halt the state machine */ > + cancel_delayed_work_sync(&doe->statemachine); > + kfree(doe->prots); Logical flow to me is unregister first, free protocols second (to reverse what we do in probe) > + pci_doe_unregister(doe); > + kfree(doe); > +} > + > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > + {.name = "cxl_pci.doe", }, I'd like to hear from Bjorn on whether registering this from the CXL device is the right approach or if we should perhaps just do it directly from somewhere in PCI. (really applies to patch 3) I'll talk more about this there. > + {}, > +}; > + > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > + > +struct auxiliary_driver pci_doe_auxiliary_drv = { > + .name = "pci_doe_drv", I would assume this is only used in contexts where the _drv is obvious? I would go with "pci_doe". > + .id_table = pci_doe_auxiliary_id_table, > + .probe = pci_doe_probe, > + .remove = pci_doe_remove > +}; > + > +static int __init pci_doe_init_module(void) > +{ > + int ret; > + > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > + if (ret) { > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > + ret); > + return ret; > + } > + > + return 0; > +} > + > +static void __exit pci_doe_exit_module(void) > +{ > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > +} > + > +module_init(pci_doe_init_module); > +module_exit(pci_doe_exit_module); Seems like the auxiliary bus would benefit from a module_auxiliary_driver() macro to cover this simple registration stuff similar to module_i2c_driver() etc. Mind you, looking at 5.15 this would be the only user, so maybe one for the 'next' case on basis two instances proves it's 'common' ;) > +MODULE_LICENSE("GPL v2"); > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > new file mode 100644 > index 000000000000..8380b7ad33d4 > --- /dev/null > +++ b/include/linux/pci-doe.h > @@ -0,0 +1,63 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. 
> + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/completion.h> > +#include <linux/list.h> > +#include <linux/mutex.h> Not used in this header that I can see, so push down to the c files. > +#include <linux/auxiliary_bus.h> > + > +#ifndef LINUX_PCI_DOE_H > +#define LINUX_PCI_DOE_H > + > +#define DOE_DEV_NAME "doe" Not sure this is used?
On Mon, Nov 08, 2021 at 12:15:46PM +0000, Jonathan Cameron wrote: > On Fri, 5 Nov 2021 16:50:53 -0700 > <ira.weiny@intel.com> wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > with standard protocol discovery. Each mailbox is accessed through a > > DOE Extended Capability. > > > > Define an auxiliary device driver which control DOE auxiliary devices > > registered on the auxiliary bus. > > > > A DOE mailbox is allowed to support any number of protocols while some > > DOE protocol specifications apply additional restrictions. > > > > The protocols supported are queried and cached. pci_doe_supports_prot() > > can be used to determine if the DOE device supports the protocol > > specified. > > > > A synchronous interface is provided in pci_doe_exchange_sync() to > > perform a single query / response exchange from the driver through the > > device specified. > > > > Testing was conducted against QEMU using: > > > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > > > This code is based on Jonathan's V4 series here: > > > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > > Data Object Exchange (DOE) - Approved 12 March 2020 > > > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Hi Ira, > > Thanks for taking this on! NP I'm just sorry I'm so slow to get it moving. > > I'm sure at least half the comments below are about things I wrote > then forgot about. I'm not sure if it's a good thing but I've ignored > this for long enough I'm almost reviewing it as fresh code :( > > I was carrying a local patch for the interrupt handler having > figured out I'd missread the spec. Note that I've since concluded > my local patch has it's own issues (it was unnecessarily complex) > so I've made some suggestions below that I'm fairly sure > fix things up. Note these paths are hard to test and require adding > some fiddly state machines to QEMU to open up race windows... > > > > > --- > > Changes from Jonathan's V4 > > Move the DOE MB code into the DOE auxiliary driver > > Remove Task List in favor of a wait queue > > > > Changes from Ben > > remove CXL references > > propagate rc from pci functions on error > > ... 
> > > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > > new file mode 100644 > > index 000000000000..2e702fdc7879 > > --- /dev/null > > +++ b/drivers/pci/doe.c > > @@ -0,0 +1,701 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Data Object Exchange ECN > > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > > + * > > + * Copyright (C) 2021 Huawei > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > + */ > > + > > +#include <linux/bitfield.h> > > +#include <linux/delay.h> > > +#include <linux/jiffies.h> > > +#include <linux/list.h> > > +#include <linux/mutex.h> > > +#include <linux/pci.h> > > +#include <linux/pci-doe.h> > > +#include <linux/workqueue.h> > > +#include <linux/module.h> > > + > > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > > + > > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > > + > > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > > +#define PCI_DOE_TIMEOUT HZ > > + > > +enum pci_doe_state { > > + DOE_IDLE, > > + DOE_WAIT_RESP, > > + DOE_WAIT_ABORT, > > + DOE_WAIT_ABORT_ON_ERR, > > +}; > > + > > +/* > > /** > > Given it's in kernel-doc syntax, we might as well mark it as such. Yep done. > > > + * struct pci_doe_task - description of a query / response task > > + * @ex: The details of the task to be done > > + * @rv: Return value. Length of received response or error > > + * @cb: Callback for completion of task > > + * @private: Private data passed to callback on completion > > + */ > > +struct pci_doe_task { > > + struct pci_doe_exchange *ex; > > + int rv; > > + void (*cb)(void *private); > > + void *private; > > +}; > > + > > +/** > > + * struct pci_doe - A single DOE mailbox driver > > + * > > + * @doe_dev: The DOE Auxiliary device being driven > > + * @abort_c: Completion used for initial abort handling > > + * @irq: Interrupt used for signaling DOE ready or abort > > + * @irq_name: Name used to identify the irq for a particular DOE > > + * @prots: Array of identifiers for protocols supported > > + * @num_prots: Size of prots array > > + * @cur_task: Current task the state machine is working on > > + * @wq: Wait queue to wait on if a query is in progress > > + * @state_lock: Protect the state of cur_task, abort, and dead > > + * @statemachine: Work item for the DOE state machine > > + * @state: Current state of this DOE > > + * @timeout_jiffies: 1 second after GO set > > + * @busy_retries: Count of retry attempts > > + * @abort: Request a manual abort (e.g. on init) > > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages > > + * will immediately be aborted with error > > + */ > > +struct pci_doe { > > + struct pci_doe_dev *doe_dev; > > + struct completion abort_c; > > + int irq; > > + char *irq_name; > > + struct pci_doe_protocol *prots; > > + int num_prots; > > + > > + struct pci_doe_task *cur_task; > > + wait_queue_head_t wq; > > + struct mutex state_lock; > > + struct delayed_work statemachine; > > + enum pci_doe_state state; > > + unsigned long timeout_jiffies; > > + unsigned int busy_retries; > > + unsigned int abort:1; > > + unsigned int dead:1; > > +}; > > + > > +static irqreturn_t pci_doe_irq(int irq, void *data) > > I was carrying a rework of this locally because I managed > to convince myself this is wrong. It's been a while and naturally > I didn't write a comprehensive set of notes on why it was wrong... 
> (Note you can't trigger the problem paths in QEMU without some > nasty hacks as it relies on opening up race windows that make > limited sense for the QEMU implementation). > > It's all centered on some details of exactly what causes an interrupt > on a DOE. Section 6.xx.3 Interrupt Generation states: > > If enabled, an interrupt message must be triggered every time the > logical AND of the following conditions transitions from FALSE to TRUE: > > * The associated vector is unmasked ... > * The value of the DOE interrupt enable bit is 1b > * The value of the DOE interrupt status bit is 1b > (only last one really maters to us I think). > > The interrupt status bit is an OR conditional. > > Must be set.. Data Object Read bit or DOE error bit set or DOE busy bit cleared. > > > +{ > > + struct pci_doe *doe = data; > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + u32 val; > > + > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > > So this bit is set on any of: BUSY dropped, READY or ERROR. > If it's set on BUSY drop, but then in between the read above and this clear > READY becomes true, then my reading is that we will not get another interrupt. > That is fine because we will read it again in the state machine and see the > new state. We could do more of the dance in the interrupt controller by doing > a reread after clear of INT_STATUS but I think it's cleaner to leave > it in the state machine. > > It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as > all the rest of the register is RO. > > > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + /* Leave the error case to be handled outside IRQ */ > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > I don't think we can get here because int status already true. > So should do this before the above general check to avoid clearning > the interrupt (we don't want more interrupts during the abort though > I'd hope the hardware wouldn't generate them). > > So move this before the previous check. > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + > > + /* > > + * Busy being cleared can result in an interrupt, but as > > + * the original Busy may not have been detected, there is no > > + * way to separate such an interrupt from a spurious interrupt. > > + */ > > This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above > (that was a misread of the spec from me in v4). > So I don't think we can get here in any valid path. > > return IRQ_NONE; should be safe. > > > > + return IRQ_HANDLED; > > +} > > Summary of above suggested changes: > 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one > 2) Possibly uses > pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); > to be explicit on the write one to clear bit. > 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. > Done. But just to ensure that I understand. If STATUS_ERROR is indicated we are basically not clearing the irq because we are resetting the mailbox? Because with this new code I don't see a pci_write_config_dword to clear INT_STATUS. But if we are resetting the mailbox I think that is ok. > ... 
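Just to be sure we are looking at the same thing, here is roughly where the handler ends up after folding in 1-3 above (untested sketch, same register and field names as the original handler):

static irqreturn_t pci_doe_irq(int irq, void *data)
{
	struct pci_doe *doe = data;
	struct pci_dev *pdev = doe->doe_dev->pdev;
	int offset = doe->doe_dev->cap_offset;
	u32 val;

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);

	/* Error is handled outside the IRQ; leave INT_STATUS set for the abort */
	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
		/* Explicit RW1C of only the interrupt status bit */
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	return IRQ_NONE;
}

With the error case ordered first, INT_STATUS is deliberately left set on that path and everything, including the abort, is deferred to the state machine, which is what my question above is getting at.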
> > > + > > +static void pci_doe_task_complete(void *private) > > +{ > > + complete(private); > > +} > > I wonder why this is up here? I'd move it down to just above the _sync() > function where it's used. This one was definitely one of mine :) Done. > > > + > > +static void doe_statemachine_work(struct work_struct *work) > > I developed an interesting "relationship" with this state machine during > the original development ;) I've just walked the paths and convinced > myself it works so all good. Sweet! :-D > > > +{ > > + struct delayed_work *w = to_delayed_work(work); > > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + struct pci_doe_task *task; > > + bool abort; > > + u32 val; > > + int rc; > > + > > + mutex_lock(&doe->state_lock); > > + task = doe->cur_task; > > + abort = doe->abort; > > + doe->abort = false; > > + mutex_unlock(&doe->state_lock); > > + > > + if (abort) { > > + /* > > + * Currently only used during init - care needed if > > + * pci_doe_abort() is generally exposed as it would impact > > + * queries in flight. > > + */ > > + WARN_ON(task); > > + doe->state = DOE_WAIT_ABORT; > > + pci_doe_abort_start(doe); > > + return; > > + } > > + > > + switch (doe->state) { > > + case DOE_IDLE: > > + if (task == NULL) > > + return; > > + > > + /* Nothing currently in flight so queue a task */ > > + rc = pci_doe_send_req(doe, task->ex); > > + /* > > + * The specification does not provide any guidance on how long > > + * some other entity could keep the DOE busy, so try for 1 > > + * second then fail. Busy handling is best effort only, because > > + * there is no way of avoiding racing against another user of > > + * the DOE. 
> > + */ > > + if (rc == -EBUSY) { > > + doe->busy_retries++; > > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > + /* Long enough, fail this request */ > > + pci_WARN(pdev, true, "DOE busy for too long\n"); > > + doe->busy_retries = 0; > > + goto err_busy; > > + } > > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > > + return; > > + } > > + if (rc) > > + goto err_abort; > > + doe->busy_retries = 0; > > + > > + doe->state = DOE_WAIT_RESP; > > + doe->timeout_jiffies = jiffies + HZ; > > + /* Now poll or wait for IRQ with timeout */ > > + if (doe->irq > 0) > > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > > + else > > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > > + return; > > + > > + case DOE_WAIT_RESP: > > + /* Not possible to get here with NULL task */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > + rc = -EIO; > > + goto err_abort; > > + } > > + > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > + /* If not yet at timeout reschedule otherwise abort */ > > + if (time_after(jiffies, doe->timeout_jiffies)) { > > + rc = -ETIMEDOUT; > > + goto err_abort; > > + } > > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > > + return; > > + } > > + > > + rc = pci_doe_recv_resp(doe, task->ex); > > + if (rc < 0) > > + goto err_abort; > > + > > + doe->state = DOE_IDLE; > > + > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + /* Set the return value to the length of received payload */ > > + task->rv = rc; > > + task->cb(task->private); > > + > > + return; > > + > > + case DOE_WAIT_ABORT: > > + case DOE_WAIT_ABORT_ON_ERR: > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + > > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > > + /* Back to normal state - carry on */ > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + /* > > + * For deliberately triggered abort, someone is > > + * waiting. > > + */ > > + if (doe->state == DOE_WAIT_ABORT) > > + complete(&doe->abort_c); > > + > > + doe->state = DOE_IDLE; > > + return; > > + } > > + if (time_after(jiffies, doe->timeout_jiffies)) { > > + /* Task has timed out and is dead - abort */ > > + pci_err(pdev, "DOE ABORT timed out\n"); > > + mutex_lock(&doe->state_lock); > > + doe->dead = true; > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + if (doe->state == DOE_WAIT_ABORT) > > + complete(&doe->abort_c); > > + } > > + return; > > + } > > + > > +err_abort: > > + doe->state = DOE_WAIT_ABORT_ON_ERR; > > + pci_doe_abort_start(doe); > > +err_busy: > > + task->rv = rc; > > + task->cb(task->private); > > + /* If here via err_busy, signal the task done. */ > > + if (doe->state == DOE_IDLE) { > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + } > > +} > > + > > +/** > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > > + * @doe: DOE mailbox state structure > > + * @ex: Description of the buffers and Vendor ID + type used in this > > + * request/response pair > > + * > > + * Excess data will be discarded. 
> > + * > > + * RETURNS: payload in bytes on success, < 0 on error > > + */ > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > + struct pci_doe_task task; > > + DECLARE_COMPLETION_ONSTACK(c); > > + > > + if (!doe) > > + return -EAGAIN; > > + > > + /* DOE requests must be a whole number of DW */ > > + if (ex->request_pl_sz % sizeof(u32)) > > + return -EINVAL; > > + > > + task.ex = ex; > > + task.cb = pci_doe_task_complete; > > + task.private = &c; > > + > > +again: > > Hmm. Whether having this code at this layer makes sense hinges on > whether we want to easily support async use of the DOE in future. I struggled with this. I was trying to strike a balance with making this a synchronous call with only 1 outstanding task while leaving the statemachine alone. FWIW I think the queue you had was just fine even though there was only this synchronous call. > > In v4 some of the async handling had ended up in this function and > should probably have been factored out to give us a > 'queue up work' then 'wait for completion' sequence. > > Given there is now more to be done in here perhaps we need to think > about such a separation to keep it clear that this is fundamentally > a synchronous wrapper around an asynchronous operation. I think that would be moving back in a direction of having a queue like you defined in V4. Eliminating the queue really defined this function to sleep waiting for the state machine to be available. Doing anything more would have messed with the state machine you wrote and I did not want to do that. Dan should we move back to having a queue_task/wait_task like Jonathan had before? > > > + mutex_lock(&doe->state_lock); > > + if (doe->cur_task) { > > + mutex_unlock(&doe->state_lock); > > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > > + goto again; > > + } > > + > > + if (doe->dead) { > > + mutex_unlock(&doe->state_lock); > > + return -EIO; > > + } > > + doe->cur_task = &task; > > + schedule_delayed_work(&doe->statemachine, 0); > > + mutex_unlock(&doe->state_lock); > > + > > + wait_for_completion(&c); > > + > > + return task.rv; > > +} > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > > + > > +/** > > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > > + * @pdev: Device on which to find the DOE instance > > + * @vid: Protocol Vendor ID > > + * @type: protocol type > > + * > > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > > + * exchange through that DOE mailbox. > > + * > > + * RETURNS: True if the DOE device supports the protocol specified > > + */ > > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > + int i; > > + > > + if (!doe) > > + return false; > > How would this happen? I don't think it can... Probably > false paranoia from me... The driver may not be loaded at this point. The call operates on the aux device not the driver. Without a driver loaded I don't think we should return any protocol support. Even if the driver was loaded and there were some protocols previously supported. > > > + > > + for (i = 0; i < doe->num_prots; i++) > > + if ((doe->prots[i].vid == vid) && > > + (doe->prots[i].type == type)) > > + return true; > > + > > + return false; > > +} > > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); > > ... 
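For what it's worth, the calling pattern I have in mind for users of these two exports is roughly the following (the vendor ID, type and buffer sizes are made up here, purely to show how pci_doe_supports_prot() and pci_doe_exchange_sync() pair up):

	u32 req[2] = { 0 };
	u32 resp[32];
	struct pci_doe_exchange ex = {
		.prot.vid = 0x1234,	/* hypothetical vendor ID */
		.prot.type = 2,		/* hypothetical protocol type */
		.request_pl = req,
		.request_pl_sz = sizeof(req),
		.response_pl = resp,
		.response_pl_sz = sizeof(resp),
	};
	int rc;

	if (!pci_doe_supports_prot(doe_dev, ex.prot.vid, ex.prot.type))
		return -EOPNOTSUPP;

	rc = pci_doe_exchange_sync(doe_dev, &ex);
	if (rc < 0)
		return rc;
	/* On success rc is the length of the received payload in bytes */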
> > > +static void pci_doe_release_irq(struct pci_doe *doe) > > +{ > > + if (doe->irq > 0) > > + free_irq(doe->irq, doe); > > Is this trivial wrapper worth bothering with? Maybe just > put the code inline? Personally I like it this way because it is called in 2 places. > > > +} > > + > > ... > > > + > > +static void pci_doe_unregister(struct pci_doe *doe) > > +{ > > + pci_doe_release_irq(doe); > > + kfree(doe->irq_name); > > + put_device(&doe->doe_dev->pdev->dev); > > This makes me wonder if we should be doing the get_device() > earlier in probe? Limited harm in moving it to near the start > and then ending up with it being 'obviously' correct... Well... get_device() is in pci_doe_register... And it does it's own irq unwinding. I guess we could call pci_doe_unregister() from that if we refactored this... How about this? (Diff to this code) diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c index 76acf4063b6b..6f2a419b3c93 100644 --- a/drivers/pci/doe.c +++ b/drivers/pci/doe.c @@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe) return 0; } -static void pci_doe_release_irq(struct pci_doe *doe) +static void pci_doe_unregister(struct pci_doe *doe) { if (doe->irq > 0) free_irq(doe->irq, doe); + kfree(doe->irq_name); + put_device(&doe->doe_dev->pdev->dev); } static int pci_doe_register(struct pci_doe *doe) @@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe) int rc, irq; u32 val; + /* Ensure the pci device remains until this driver is done with it */ + get_device(&pdev->dev); + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); - if (irq < 0) - return irq; + if (irq < 0) { + rc = irq; + goto unregister; + } doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", doe->doe_dev->adev.name); - if (!doe->irq_name) - return -ENOMEM; + if (!doe->irq_name) { + rc = -ENOMEM; + goto unregister; + } rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); if (rc) - goto err_free_name; + goto unregister; doe->irq = irq; pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, @@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe) /* Reset the mailbox by issuing an abort */ rc = pci_doe_abort(doe); if (rc) - goto err_free_irq; - - /* Ensure the pci device remains until this driver is done with it */ - get_device(&pdev->dev); + goto unregister; return 0; -err_free_irq: - pci_doe_release_irq(doe); -err_free_name: - kfree(doe->irq_name); +unregister: + pci_doe_unregister(doe); return rc; } -static void pci_doe_unregister(struct pci_doe *doe) -{ - pci_doe_release_irq(doe); - kfree(doe->irq_name); - put_device(&doe->doe_dev->pdev->dev); -} - /* * pci_doe_probe() - Set up the Mailbox * @aux_dev: Auxiliary Device > > > +} > > + > > +/* > > + * pci_doe_probe() - Set up the Mailbox > > + * @aux_dev: Auxiliary Device > > + * @id: Auxiliary device ID > > + * > > + * Probe the mailbox found for all protocols and set up the Mailbox > > + * > > + * RETURNS: 0 on success, < 0 on error > > + */ > > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > > + const struct auxiliary_device_id *id) > > +{ > > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > > + struct pci_doe_dev, > > + adev); > > + struct pci_doe *doe; > > + int rc; > > + > > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > > Could go devm_ for this I think, though may not be worthwhile. Yes I think it is worth it... I should use it more. BTW why did you not use devm_krealloc() for the protocols? 
I did not realize that call existed before you mentioned it in the other patch review. Any issue with using it there? > > > + if (!doe) > > + return -ENOMEM; > > + > > + mutex_init(&doe->state_lock); > > + init_completion(&doe->abort_c); > > + doe->doe_dev = doe_dev; > > + init_waitqueue_head(&doe->wq); > > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > > + dev_set_drvdata(&aux_dev->dev, doe); > > + > > + rc = pci_doe_register(doe); > > + if (rc) > > + goto err_free; > > + > > + rc = pci_doe_cache_protocols(doe); > > + if (rc) { > > + pci_doe_unregister(doe); > > Mixture of different forms of error handling here. > I'd move this below and add an err_unregister label. Actually with the devm_kzalloc() we don't need the goto at all. We can just return. I _think_? Right? > > > + goto err_free; > > + } > > + > > + return 0; > > + > > +err_free: > > + kfree(doe); > > + return rc; > > +} > > + > > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > > + > > + /* First halt the state machine */ > > + cancel_delayed_work_sync(&doe->statemachine); > > + kfree(doe->prots); > > Logical flow to me is unregister first, free protocols second > (to reverse what we do in probe) No this is the reverse of the probe order I think. Order is register cache protocols Then we free 'uncache' protocols unregister Right? > > > + pci_doe_unregister(doe); > > + kfree(doe); > > +} > > + > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > > + {.name = "cxl_pci.doe", }, > > I'd like to hear from Bjorn on whether registering this from the CXL > device is the right approach or if we should perhaps just do it directly from > somewhere in PCI. (really applies to patch 3) I'll talk more about this there. Actually I think this could be left blank until the next patch... It's just odd to define an empty table in the next few structures. But technically this is not needed until the devices are defined. I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement though. > > > + {}, > > +}; > > + > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > > + > > +struct auxiliary_driver pci_doe_auxiliary_drv = { > > + .name = "pci_doe_drv", > > I would assume this is only used in contexts where the _drv is > obvious? I would go with "pci_doe". Sure. done. > > > + .id_table = pci_doe_auxiliary_id_table, > > + .probe = pci_doe_probe, > > + .remove = pci_doe_remove > > +}; > > + > > +static int __init pci_doe_init_module(void) > > +{ > > + int ret; > > + > > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > > + if (ret) { > > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > > + ret); > > + return ret; > > + } > > + > > + return 0; > > +} > > + > > +static void __exit pci_doe_exit_module(void) > > +{ > > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > > +} > > + > > +module_init(pci_doe_init_module); > > +module_exit(pci_doe_exit_module); > > Seems like the auxiliary bus would benefit from a > module_auxiliary_driver() macro to cover this simple registration stuff > similar to module_i2c_driver() etc. > > Mind you, looking at 5.15 this would be the only user, so maybe one > for the 'next' case on basis two instances proves it's 'common' ;) I'm inclined to leave this alone ATM. I tried to clean up the auxiliary device documentation and got a bunch more work asked of me by Greg KH. So I'm behind on that ATM. Later we can investigate that a bit I think. 
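If a second user does show up, I'd expect the helper to mirror the other module_*_driver() wrappers, i.e. something along these lines (sketch only, no such macro exists in 5.15):

/* In auxiliary_bus.h, hypothetically */
#define module_auxiliary_driver(__auxiliary_driver) \
	module_driver(__auxiliary_driver, auxiliary_driver_register, \
		      auxiliary_driver_unregister)

/* which would collapse the init/exit boilerplate above into */
module_auxiliary_driver(pci_doe_auxiliary_drv);

That would drop the pr_err() on registration failure, which I think is acceptable for a plain registration call.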
> > > +MODULE_LICENSE("GPL v2"); > > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > > new file mode 100644 > > index 000000000000..8380b7ad33d4 > > --- /dev/null > > +++ b/include/linux/pci-doe.h > > @@ -0,0 +1,63 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. > > + * > > + * Copyright (C) 2021 Huawei > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > + */ > > + > > +#include <linux/completion.h> > > +#include <linux/list.h> > > +#include <linux/mutex.h> > > Not used in this header that I can see, so push down to the c files. oops... thanks. > > > +#include <linux/auxiliary_bus.h> > > + > > +#ifndef LINUX_PCI_DOE_H > > +#define LINUX_PCI_DOE_H > > + > > +#define DOE_DEV_NAME "doe" > > Not sure this is used? Used in the next patch... and it kind of goes along with the table_id name... I'll see about moving both of those to the next patch where it makes more sense for now. Thanks for the review, Ira
On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > with standard protocol discovery. Each mailbox is accessed through a > DOE Extended Capability. > > Define an auxiliary device driver which control DOE auxiliary devices > registered on the auxiliary bus. What do we gain by making this an auxiliary driver? This doesn't really feel like a "driver," and apparently it used to be a library. I'd like to see the rationale and benefits of the driver approach (in the eventual commit log as well as the current email thread). > A DOE mailbox is allowed to support any number of protocols while some > DOE protocol specifications apply additional restrictions. This sounds something like a fancy version of VPD, and VPD has been a huge headache. I hope DOE avoids that ;) > The protocols supported are queried and cached. pci_doe_supports_prot() > can be used to determine if the DOE device supports the protocol > specified. > > A synchronous interface is provided in pci_doe_exchange_sync() to > perform a single query / response exchange from the driver through the > device specified. > > Testing was conducted against QEMU using: > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > This code is based on Jonathan's V4 series here: > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > Data Object Exchange (DOE) - Approved 12 March 2020 > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> I think these sign-offs are out of order, per https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.14#n365 Last sign-off should be from the person who posted this. > --- > Changes from Jonathan's V4 > Move the DOE MB code into the DOE auxiliary driver > Remove Task List in favor of a wait queue > > Changes from Ben > remove CXL references > propagate rc from pci functions on error > --- > drivers/pci/Kconfig | 10 + > drivers/pci/Makefile | 3 + > drivers/pci/doe.c | 701 ++++++++++++++++++++++++++++++++++ > include/linux/pci-doe.h | 63 +++ > include/uapi/linux/pci_regs.h | 29 +- > 5 files changed, 805 insertions(+), 1 deletion(-) > create mode 100644 drivers/pci/doe.c > create mode 100644 include/linux/pci-doe.h > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > index 0c473d75e625..b512295538ba 100644 > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND > The PCI device frontend driver allows the kernel to import arbitrary > PCI devices from a PCI backend to support PCI driver domains. > > +config PCI_DOE_DRIVER > + tristate "PCI Data Object Exchange (DOE) driver" > + select AUXILIARY_BUS > + help > + Driver for DOE auxiliary devices. > + > + DOE provides a simple mailbox in PCI config space that is used by a > + number of different protocols. DOE is defined in the Data Object > + Exchange ECN to the PCIe r5.0 spec. 
> + > config PCI_ATS > bool > > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile > index d62c4ac4ae1b..afd9d7bd2b82 100644 > --- a/drivers/pci/Makefile > +++ b/drivers/pci/Makefile > @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB) += pci-stub.o > obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o > obj-$(CONFIG_PCI_ECAM) += ecam.o > obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o > +obj-$(CONFIG_PCI_DOE_DRIVER) += pci-doe.o > obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o > > +pci-doe-y := doe.o > + > # Endpoint library must be initialized before its users > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > new file mode 100644 > index 000000000000..2e702fdc7879 > --- /dev/null > +++ b/drivers/pci/doe.c > @@ -0,0 +1,701 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Data Object Exchange ECN > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/bitfield.h> > +#include <linux/delay.h> > +#include <linux/jiffies.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/pci.h> > +#include <linux/pci-doe.h> > +#include <linux/workqueue.h> > +#include <linux/module.h> > + > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > + > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > + > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > +#define PCI_DOE_TIMEOUT HZ > + > +enum pci_doe_state { > + DOE_IDLE, > + DOE_WAIT_RESP, > + DOE_WAIT_ABORT, > + DOE_WAIT_ABORT_ON_ERR, > +}; > + > +/* > + * struct pci_doe_task - description of a query / response task > + * @ex: The details of the task to be done > + * @rv: Return value. Length of received response or error > + * @cb: Callback for completion of task > + * @private: Private data passed to callback on completion > + */ > +struct pci_doe_task { > + struct pci_doe_exchange *ex; > + int rv; > + void (*cb)(void *private); > + void *private; > +}; > + > +/** > + * struct pci_doe - A single DOE mailbox driver > + * > + * @doe_dev: The DOE Auxiliary device being driven > + * @abort_c: Completion used for initial abort handling > + * @irq: Interrupt used for signaling DOE ready or abort > + * @irq_name: Name used to identify the irq for a particular DOE > + * @prots: Array of identifiers for protocols supported > + * @num_prots: Size of prots array > + * @cur_task: Current task the state machine is working on > + * @wq: Wait queue to wait on if a query is in progress > + * @state_lock: Protect the state of cur_task, abort, and dead > + * @statemachine: Work item for the DOE state machine > + * @state: Current state of this DOE > + * @timeout_jiffies: 1 second after GO set > + * @busy_retries: Count of retry attempts > + * @abort: Request a manual abort (e.g. on init) > + * @dead: Used to mark a DOE for which an ABORT has timed out. 
Further messages > + * will immediately be aborted with error > + */ > +struct pci_doe { > + struct pci_doe_dev *doe_dev; > + struct completion abort_c; > + int irq; > + char *irq_name; > + struct pci_doe_protocol *prots; > + int num_prots; > + > + struct pci_doe_task *cur_task; > + wait_queue_head_t wq; > + struct mutex state_lock; > + struct delayed_work statemachine; > + enum pci_doe_state state; > + unsigned long timeout_jiffies; > + unsigned int busy_retries; > + unsigned int abort:1; > + unsigned int dead:1; > +}; > + > +static irqreturn_t pci_doe_irq(int irq, void *data) > +{ > + struct pci_doe *doe = data; > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + /* Leave the error case to be handled outside IRQ */ > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + > + /* > + * Busy being cleared can result in an interrupt, but as > + * the original Busy may not have been detected, there is no > + * way to separate such an interrupt from a spurious interrupt. > + */ > + return IRQ_HANDLED; > +} > + > +/* > + * Only call when safe to directly access the DOE, either because no tasks yet > + * queued, or called from doe_statemachine_work() which has exclusive access to > + * the DOE config space. > + */ > +static void pci_doe_abort_start(struct pci_doe *doe) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + val = PCI_DOE_CTRL_ABORT; > + if (doe->irq) > + val |= PCI_DOE_CTRL_INT_EN; > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > + > + doe->timeout_jiffies = jiffies + HZ; > + schedule_delayed_work(&doe->statemachine, HZ); > +} > + > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + int i; > + > + /* > + * Check the DOE busy bit is not set. If it is set, this could indicate > + * someone other than Linux (e.g. firmware) is using the mailbox. Note > + * it is expected that firmware and OS will negotiate access rights via > + * an, as yet to be defined method. 
> + */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) > + return -EBUSY; > + > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > + return -EIO; > + > + /* Write DOE Header */ > + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) | > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type); > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); > + /* Length is 2 DW of header + length of payload in DW */ > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, > + 2 + ex->request_pl_sz / sizeof(u32))); > + for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++) > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > + ex->request_pl[i]); > + > + val = PCI_DOE_CTRL_GO; > + if (doe->irq) > + val |= PCI_DOE_CTRL_INT_EN; > + > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > + /* Request is sent - now wait for poll or IRQ */ > + return 0; > +} > + > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + size_t length; > + u32 val; > + int i; > + > + /* Read the first dword to get the protocol */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) || > + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) { > + pci_err(pdev, > + "Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n", Maybe "%#x" so this is less ambiguous? > + ex->prot.vid, ex->prot.type, > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); > + return -EIO; > + } > + > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + /* Read the second dword to get the length */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + > + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); > + if (length > SZ_1M || length < 2) > + return -EIO; > + > + /* First 2 dwords have already been read */ > + length -= 2; > + /* Read the rest of the response payload */ > + for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, > + &ex->response_pl[i]); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + > + /* Flush excess length */ > + for (; i < length; i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + /* Final error check to pick up on any since Data Object Ready */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > + return -EIO; > + > + return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32); > +} > + > +static void pci_doe_task_complete(void *private) > +{ > + complete(private); > +} > + > +static void doe_statemachine_work(struct work_struct *work) > +{ > + struct delayed_work *w = to_delayed_work(work); > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + struct pci_doe_task *task; > + bool abort; > + u32 val; > + int rc; > + > + mutex_lock(&doe->state_lock); > + task = doe->cur_task; > + abort = doe->abort; > + doe->abort = false; > + mutex_unlock(&doe->state_lock); > + > + 
if (abort) { > + /* > + * Currently only used during init - care needed if > + * pci_doe_abort() is generally exposed as it would impact > + * queries in flight. > + */ > + WARN_ON(task); > + doe->state = DOE_WAIT_ABORT; > + pci_doe_abort_start(doe); > + return; > + } > + > + switch (doe->state) { > + case DOE_IDLE: > + if (task == NULL) > + return; > + > + /* Nothing currently in flight so queue a task */ > + rc = pci_doe_send_req(doe, task->ex); > + /* > + * The specification does not provide any guidance on how long > + * some other entity could keep the DOE busy, so try for 1 > + * second then fail. Busy handling is best effort only, because > + * there is no way of avoiding racing against another user of > + * the DOE. > + */ > + if (rc == -EBUSY) { > + doe->busy_retries++; > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > + /* Long enough, fail this request */ > + pci_WARN(pdev, true, "DOE busy for too long\n"); I think pci_WARN() (as opposed to pci_warn()) gives us a register dump or stacktrace. That's useful if it might be a software problem. But this looks like an issue with the hardware or firmware on the adapter, where the registers or stacktrace don't seem useful. Maybe the busy duration would be useful? > + doe->busy_retries = 0; > + goto err_busy; > + } > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > + return; > + } > + if (rc) > + goto err_abort; > + doe->busy_retries = 0; > + > + doe->state = DOE_WAIT_RESP; > + doe->timeout_jiffies = jiffies + HZ; > + /* Now poll or wait for IRQ with timeout */ > + if (doe->irq > 0) > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > + else > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + > + case DOE_WAIT_RESP: > + /* Not possible to get here with NULL task */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + rc = -EIO; > + goto err_abort; > + } > + > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > + /* If not yet at timeout reschedule otherwise abort */ > + if (time_after(jiffies, doe->timeout_jiffies)) { > + rc = -ETIMEDOUT; > + goto err_abort; > + } > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + } > + > + rc = pci_doe_recv_resp(doe, task->ex); > + if (rc < 0) > + goto err_abort; > + > + doe->state = DOE_IDLE; > + > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* Set the return value to the length of received payload */ > + task->rv = rc; > + task->cb(task->private); > + > + return; > + > + case DOE_WAIT_ABORT: > + case DOE_WAIT_ABORT_ON_ERR: > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > + /* Back to normal state - carry on */ > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* > + * For deliberately triggered abort, someone is > + * waiting. 
> + */ > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + > + doe->state = DOE_IDLE; > + return; > + } > + if (time_after(jiffies, doe->timeout_jiffies)) { > + /* Task has timed out and is dead - abort */ > + pci_err(pdev, "DOE ABORT timed out\n"); > + mutex_lock(&doe->state_lock); > + doe->dead = true; > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + } > + return; > + } > + > +err_abort: > + doe->state = DOE_WAIT_ABORT_ON_ERR; > + pci_doe_abort_start(doe); > +err_busy: > + task->rv = rc; > + task->cb(task->private); > + /* If here via err_busy, signal the task done. */ > + if (doe->state == DOE_IDLE) { > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + } > +} > + > +/** > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response Wrap to fit in 80 columns. There are a few more. > + * @doe: DOE mailbox state structure > + * @ex: Description of the buffers and Vendor ID + type used in this > + * request/response pair > + * > + * Excess data will be discarded. > + * > + * RETURNS: payload in bytes on success, < 0 on error > + */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + struct pci_doe_task task; > + DECLARE_COMPLETION_ONSTACK(c); > + > + if (!doe) > + return -EAGAIN; > + > + /* DOE requests must be a whole number of DW */ > + if (ex->request_pl_sz % sizeof(u32)) > + return -EINVAL; > + > + task.ex = ex; > + task.cb = pci_doe_task_complete; > + task.private = &c; > + > +again: > + mutex_lock(&doe->state_lock); > + if (doe->cur_task) { > + mutex_unlock(&doe->state_lock); > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > + goto again; > + } > + > + if (doe->dead) { > + mutex_unlock(&doe->state_lock); > + return -EIO; > + } > + doe->cur_task = &task; > + schedule_delayed_work(&doe->statemachine, 0); > + mutex_unlock(&doe->state_lock); > + > + wait_for_completion(&c); > + > + return task.rv; > +} > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > + > +/** > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > + * @pdev: Device on which to find the DOE instance > + * @vid: Protocol Vendor ID > + * @type: protocol type > + * > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > + * exchange through that DOE mailbox. 
> + * > + * RETURNS: True if the DOE device supports the protocol specified > + */ > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + int i; > + > + if (!doe) > + return false; > + > + for (i = 0; i < doe->num_prots; i++) > + if ((doe->prots[i].vid == vid) && > + (doe->prots[i].type == type)) > + return true; > + > + return false; > +} > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); > + > +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid, > + u8 *protocol) > +{ > + u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index); > + u32 response_pl; > + struct pci_doe_exchange ex = { > + .prot.vid = PCI_VENDOR_ID_PCI_SIG, > + .prot.type = PCI_DOE_PROTOCOL_DISCOVERY, > + .request_pl = &request_pl, > + .request_pl_sz = sizeof(request_pl), > + .response_pl = &response_pl, > + .response_pl_sz = sizeof(response_pl), > + }; > + int ret; > + > + ret = pci_doe_exchange_sync(doe->doe_dev, &ex); > + if (ret < 0) > + return ret; > + > + if (ret != sizeof(response_pl)) > + return -EIO; > + > + *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl); > + *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl); > + *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl); > + > + return 0; > +} > + > +static int pci_doe_cache_protocols(struct pci_doe *doe) > +{ > + u8 index = 0; > + int rc; > + > + /* Discovery protocol must always be supported and must report itself */ > + doe->num_prots = 1; > + doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL); > + if (doe->prots == NULL) > + return -ENOMEM; > + > + do { > + struct pci_doe_protocol *prot; > + > + prot = &doe->prots[doe->num_prots - 1]; > + rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type); > + if (rc) > + goto err_free_prots; > + > + if (index) { > + struct pci_doe_protocol *prot_new; > + > + doe->num_prots++; > + prot_new = krealloc(doe->prots, > + sizeof(*doe->prots) * doe->num_prots, > + GFP_KERNEL); > + if (prot_new == NULL) { > + rc = -ENOMEM; > + goto err_free_prots; > + } > + doe->prots = prot_new; > + } > + } while (index); > + > + return 0; > + > +err_free_prots: > + kfree(doe->prots); > + doe->num_prots = 0; > + doe->prots = NULL; > + return rc; > +} > + > +static int pci_doe_abort(struct pci_doe *doe) > +{ > + reinit_completion(&doe->abort_c); > + mutex_lock(&doe->state_lock); > + doe->abort = true; > + mutex_unlock(&doe->state_lock); > + schedule_delayed_work(&doe->statemachine, 0); > + wait_for_completion(&doe->abort_c); > + > + if (doe->dead) > + return -EIO; > + > + return 0; > +} > + > +static void pci_doe_release_irq(struct pci_doe *doe) > +{ > + if (doe->irq > 0) > + free_irq(doe->irq, doe); > +} > + > +static int pci_doe_register(struct pci_doe *doe) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + bool poll = !pci_dev_msi_enabled(pdev); > + int offset = doe->doe_dev->cap_offset; > + int rc, irq; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); > + > + if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { > + irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); > + if (irq < 0) > + return irq; > + > + doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", > + doe->doe_dev->adev.name); > + if (!doe->irq_name) > + return -ENOMEM; > + > + rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); > + if (rc) > + goto err_free_name; > + > + doe->irq = irq; > + pci_write_config_dword(pdev, offset + 
PCI_DOE_CTRL, > + PCI_DOE_CTRL_INT_EN); > + } > + > + /* Reset the mailbox by issuing an abort */ > + rc = pci_doe_abort(doe); > + if (rc) > + goto err_free_irqs; > + > + /* Ensure the pci device remains until this driver is done with it */ s/pci/PCI/ > + get_device(&pdev->dev); pci_dev_get() There's kind of a mix of both in drivers/pci/, but since it exists, we might as well use it. > + > + return 0; > + > +err_free_irqs: > + pci_doe_release_irq(doe); > +err_free_name: > + kfree(doe->irq_name); > + return rc; > +} > + > +static void pci_doe_unregister(struct pci_doe *doe) > +{ > + pci_doe_release_irq(doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); pci_dev_put() > +} > + > +/* > + * pci_doe_probe() - Set up the Mailbox > + * @aux_dev: Auxiliary Device > + * @id: Auxiliary device ID > + * > + * Probe the mailbox found for all protocols and set up the Mailbox > + * > + * RETURNS: 0 on success, < 0 on error > + */ > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > + const struct auxiliary_device_id *id) > +{ > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > + struct pci_doe_dev, > + adev); > + struct pci_doe *doe; > + int rc; > + > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > + if (!doe) > + return -ENOMEM; > + > + mutex_init(&doe->state_lock); > + init_completion(&doe->abort_c); > + doe->doe_dev = doe_dev; > + init_waitqueue_head(&doe->wq); > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > + dev_set_drvdata(&aux_dev->dev, doe); > + > + rc = pci_doe_register(doe); > + if (rc) > + goto err_free; > + > + rc = pci_doe_cache_protocols(doe); > + if (rc) { > + pci_doe_unregister(doe); > + goto err_free; > + } > + > + return 0; > + > +err_free: > + kfree(doe); > + return rc; > +} > + > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > +{ > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > + > + /* First halt the state machine */ > + cancel_delayed_work_sync(&doe->statemachine); > + kfree(doe->prots); > + pci_doe_unregister(doe); > + kfree(doe); > +} > + > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > + {.name = "cxl_pci.doe", }, > + {}, > +}; > + > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > + > +struct auxiliary_driver pci_doe_auxiliary_drv = { > + .name = "pci_doe_drv", > + .id_table = pci_doe_auxiliary_id_table, > + .probe = pci_doe_probe, > + .remove = pci_doe_remove > +}; > + > +static int __init pci_doe_init_module(void) > +{ > + int ret; > + > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > + if (ret) { > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > + ret); > + return ret; > + } > + > + return 0; > +} > + > +static void __exit pci_doe_exit_module(void) > +{ > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > +} > + > +module_init(pci_doe_init_module); > +module_exit(pci_doe_exit_module); > +MODULE_LICENSE("GPL v2"); What is the benefit of this being loadable as opposed to being static? > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > new file mode 100644 > index 000000000000..8380b7ad33d4 > --- /dev/null > +++ b/include/linux/pci-doe.h > @@ -0,0 +1,63 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. 
> + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/completion.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/auxiliary_bus.h> > + > +#ifndef LINUX_PCI_DOE_H > +#define LINUX_PCI_DOE_H > + > +#define DOE_DEV_NAME "doe" This isn't used here and should move to the patch that needs it. > +struct pci_doe_protocol { > + u16 vid; > + u8 type; > +}; > + > +/** > + * struct pci_doe_exchange - represents a single query/response > + * > + * @prot: DOE Protocol > + * @request_pl: The request payload > + * @request_pl_sz: Size of the request payload > + * @response_pl: The response payload > + * @response_pl_sz: Size of the response payload > + */ > +struct pci_doe_exchange { > + struct pci_doe_protocol prot; > + u32 *request_pl; > + size_t request_pl_sz; > + u32 *response_pl; > + size_t response_pl_sz; > +}; > + > +/** > + * struct pci_doe_dev - DOE mailbox device > + * > + * @adrv: Auxiliary Driver data > + * @pdev: PCI device this belongs to > + * @offset: Capability offset > + * > + * This represents a single DOE mailbox device. Devices should create this > + * device and register it on the Auxiliary bus for the DOE driver to maintain. > + * > + */ > +struct pci_doe_dev { > + struct auxiliary_device adev; > + struct pci_dev *pdev; > + int cap_offset; > +}; > + > +/* Library operations */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, > + struct pci_doe_exchange *ex); > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type); > + > +#endif > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h > index e709ae8235e7..1073cd1916e1 100644 > --- a/include/uapi/linux/pci_regs.h > +++ b/include/uapi/linux/pci_regs.h > @@ -730,7 +730,8 @@ > #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ > #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ > #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ > -#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT > +#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ > +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE > > #define PCI_EXT_CAP_DSN_SIZEOF 12 > #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40 > @@ -1092,4 +1093,30 @@ > #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 > #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 > > +/* Data Object Exchange */ > +#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ > +#define PCI_DOE_CAP_INT 0x00000001 /* Interrupt Support */ > +#define PCI_DOE_CAP_IRQ 0x00000ffe /* Interrupt Message Number */ > +#define PCI_DOE_CTRL 0x08 /* DOE Control Register */ > +#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */ > +#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */ > +#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */ > +#define PCI_DOE_STATUS 0x0c /* DOE Status Register */ > +#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */ > +#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */ > +#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */ > +#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */ > +#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */ > +#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */ > + > +/* DOE Data Object - note not actually registers */ > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000 > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff > + > +#define 
PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000 > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000 Conventional indentation is via tabs when possible; the above all appear to be spaces.
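For example (illustrative only, with a tab between the name and the mask, matching the rest of the file):

#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff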
.... Sorry for the delay, I managed to miss this email. Only realized I hadn't replied because I wanted to point out the docs issue I'd missed in the original review... See below. > > I was carrying a rework of this locally because I managed > > to convince myself this is wrong. It's been a while and naturally > > I didn't write a comprehensive set of notes on why it was wrong... > > (Note you can't trigger the problem paths in QEMU without some > > nasty hacks as it relies on opening up race windows that make > > limited sense for the QEMU implementation). > > > > It's all centered on some details of exactly what causes an interrupt > > on a DOE. Section 6.xx.3 Interrupt Generation states: > > > > If enabled, an interrupt message must be triggered every time the > > logical AND of the following conditions transitions from FALSE to TRUE: > > > > * The associated vector is unmasked ... > > * The value of the DOE interrupt enable bit is 1b > > * The value of the DOE interrupt status bit is 1b > > (only the last one really matters to us I think). > > > > The interrupt status bit is an OR conditional. > > > > Must be set.. Data Object Ready bit or DOE error bit set or DOE busy bit cleared. > > > > > +{ > > > + struct pci_doe *doe = data; > > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > > + int offset = doe->doe_dev->cap_offset; > > > + u32 val; > > > + > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > > > > So this bit is set on any of: BUSY dropped, READY or ERROR. > > If it's set on BUSY drop, but then in between the read above and this clear > > READY becomes true, then my reading is that we will not get another interrupt. > > That is fine because we will read it again in the state machine and see the > > new state. We could do more of the dance in the interrupt handler by doing > > a reread after clear of INT_STATUS but I think it's cleaner to leave > > it in the state machine. > > > > It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as > > all the rest of the register is RO. > > > > > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > > + return IRQ_HANDLED; > > > + } > > > + /* Leave the error case to be handled outside IRQ */ > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > > > I don't think we can get here because int status already true. > > So should do this before the above general check to avoid clearing > > the interrupt (we don't want more interrupts during the abort though > > I'd hope the hardware wouldn't generate them). > > > > So move this before the previous check. > > > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > > + return IRQ_HANDLED; > > > + } > > > + > > > + /* > > > + * Busy being cleared can result in an interrupt, but as > > > + * the original Busy may not have been detected, there is no > > > + * way to separate such an interrupt from a spurious interrupt. > > > + */ > > > > This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above > > (that was a misread of the spec from me in v4). > > So I don't think we can get here in any valid path. > > > > return IRQ_NONE; should be safe.
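Putting those suggestions together, the handler would end up looking something like this (an untested sketch of the shape only, not necessarily what the next revision will contain):

static irqreturn_t pci_doe_irq(int irq, void *data)
{
	struct pci_doe *doe = data;
	struct pci_dev *pdev = doe->doe_dev->pdev;
	int offset = doe->doe_dev->cap_offset;
	u32 val;

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);

	/* Leave the error case to be handled outside IRQ; the abort resets the mailbox */
	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
		/* Interrupt status is the only RW1C bit; be explicit about what is cleared */
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	/* No valid route here, so treat anything else as spurious */
	return IRQ_NONE;
}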
> > > > > > > + return IRQ_HANDLED; > > > +} > > > > Summary of above suggested changes: > > 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one > > 2) Possibly uses > > pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); > > to be explicit on the write one to clear bit. > > 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. > > > > Done. > > But just to ensure that I understand. If STATUS_ERROR is indicated we are > basically not clearing the irq because we are resetting the mailbox? Because > with this new code I don't see a pci_write_config_dword to clear INT_STATUS. Exactly. > > But if we are resetting the mailbox I think that is ok. > > > ... > > ... > > > +/** > > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > > > + * @doe: DOE mailbox state structure This should be doe_dev, another thing during build tests of a rebase as I wanted to put the CMA stuff on top of this. > > > + * @ex: Description of the buffers and Vendor ID + type used in this > > > + * request/response pair > > > + * > > > + * Excess data will be discarded. > > > + * > > > + * RETURNS: payload in bytes on success, < 0 on error > > > + */ > > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > > > +{ > > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > > + struct pci_doe_task task; > > > + DECLARE_COMPLETION_ONSTACK(c); > > > + > > > + if (!doe) > > > + return -EAGAIN; > > > + > > > + /* DOE requests must be a whole number of DW */ > > > + if (ex->request_pl_sz % sizeof(u32)) > > > + return -EINVAL; > > > + > > > + task.ex = ex; > > > + task.cb = pci_doe_task_complete; > > > + task.private = &c; > > > + > > > +again: > > > > Hmm. Whether having this code at this layer makes sense hinges on > > whether we want to easily support async use of the DOE in future. > > I struggled with this. I was trying to strike a balance with making this a > synchronous call with only 1 outstanding task while leaving the statemachine > alone. > > FWIW I think the queue you had was just fine even though there was only this > synchronous call. We can put it back easily if we ever need it. Until then this is fine. I'm not convinced any DOE use will be sufficiently high bandwidth that it really matters if we support async accessors > > > > > In v4 some of the async handling had ended up in this function and > > should probably have been factored out to give us a > > 'queue up work' then 'wait for completion' sequence. > > > > Given there is now more to be done in here perhaps we need to think > > about such a separation to keep it clear that this is fundamentally > > a synchronous wrapper around an asynchronous operation. > > I think that would be moving back in a direction of having a queue like you > defined in V4. Eliminating the queue really defined this function to sleep > waiting for the state machine to be available. Doing anything more would have > messed with the state machine you wrote and I did not want to do that. > > Dan should we move back to having a queue_task/wait_task like Jonathan had > before? 
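(To be concrete about the shape being discussed, and with purely made-up names rather than the v4 API, the synchronous wrapper would reduce to something like:

	rc = pci_doe_queue_task(doe, &task);	/* hypothetical helper: enqueue and kick the state machine */
	if (rc)
		return rc;
	wait_for_completion(&c);		/* the sync wrapper only waits */
	return task.rv;

with the cur_task/wait_event handling moved into the queue helper.)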
> > > > > > + mutex_lock(&doe->state_lock); > > > + if (doe->cur_task) { > > > + mutex_unlock(&doe->state_lock); > > > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > > > + goto again; > > > + } > > > + > > > + if (doe->dead) { > > > + mutex_unlock(&doe->state_lock); > > > + return -EIO; > > > + } > > > + doe->cur_task = &task; > > > + schedule_delayed_work(&doe->statemachine, 0); > > > + mutex_unlock(&doe->state_lock); > > > + > > > + wait_for_completion(&c); > > > + > > > + return task.rv; > > > +} > > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > > ... > > > > > + > > > +static void pci_doe_unregister(struct pci_doe *doe) > > > +{ > > > + pci_doe_release_irq(doe); > > > + kfree(doe->irq_name); > > > + put_device(&doe->doe_dev->pdev->dev); > > > > This makes me wonder if we should be doing the get_device() > > earlier in probe? Limited harm in moving it to near the start > > and then ending up with it being 'obviously' correct... > > Well... get_device() is in pci_doe_register... And it does it's own irq > unwinding. > > I guess we could call pci_doe_unregister() from that if we refactored this... > > How about this? (Diff to this code) While it should work, I think I'd keep the error handling paths explicit and not rely on irq_name == 0 and doe->irq == 0 making the error path calls of pci_doe_unregister() safe. That takes some thought when reviewing (a little bit) whereas explicit error handling doesn't take as much. I just don't like unregister having put_device() last when it isn't the first thing done in register(). So moving that first perhaps gets a reference before we strictly speaking need one, but it makes it clear we can definitely release that reference where it's done in unregister. > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > index 76acf4063b6b..6f2a419b3c93 100644 > --- a/drivers/pci/doe.c > +++ b/drivers/pci/doe.c > @@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe) > return 0; > } > > -static void pci_doe_release_irq(struct pci_doe *doe) > +static void pci_doe_unregister(struct pci_doe *doe) > { > if (doe->irq > 0) > free_irq(doe->irq, doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); > } > > static int pci_doe_register(struct pci_doe *doe) > @@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe) > int rc, irq; > u32 val; > > + /* Ensure the pci device remains until this driver is done with it */ > + get_device(&pdev->dev); > + > pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); > > if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { > irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); > - if (irq < 0) > - return irq; > + if (irq < 0) { > + rc = irq; > + goto unregister; > + } > > doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", > doe->doe_dev->adev.name); > - if (!doe->irq_name) > - return -ENOMEM; > + if (!doe->irq_name) { > + rc = -ENOMEM; > + goto unregister; > + } > > rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); > if (rc) > - goto err_free_name; > + goto unregister; > > doe->irq = irq; > pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, > @@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe) > /* Reset the mailbox by issuing an abort */ > rc = pci_doe_abort(doe); > if (rc) > - goto err_free_irq; > - > - /* Ensure the pci device remains until this driver is done with it */ > - get_device(&pdev->dev); > + goto unregister; > > return 0; > > -err_free_irq: > - pci_doe_release_irq(doe); > -err_free_name: > - kfree(doe->irq_name); > +unregister: 
> + pci_doe_unregister(doe); > return rc; > } > > -static void pci_doe_unregister(struct pci_doe *doe) > -{ > - pci_doe_release_irq(doe); > - kfree(doe->irq_name); > - put_device(&doe->doe_dev->pdev->dev); > -} > - > /* > * pci_doe_probe() - Set up the Mailbox > * @aux_dev: Auxiliary Device > > > > > > > +} > > > + > > > +/* > > > + * pci_doe_probe() - Set up the Mailbox > > > + * @aux_dev: Auxiliary Device > > > + * @id: Auxiliary device ID > > > + * > > > + * Probe the mailbox found for all protocols and set up the Mailbox > > > + * > > > + * RETURNS: 0 on success, < 0 on error > > > + */ > > > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > > > + const struct auxiliary_device_id *id) > > > +{ > > > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > > > + struct pci_doe_dev, > > > + adev); > > > + struct pci_doe *doe; > > > + int rc; > > > + > > > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > > > > Could go devm_ for this I think, though may not be worthwhile. > > Yes I think it is worth it... I should use it more. > > BTW why did you not use devm_krealloc() for the protocols? I have a sneaky suspicion this has been around long enough it predates devm_krealloc. > > I did not realize that call existed before you mentioned it in the other patch > review. It is rather new and shiny :) > > Any issue with using it there? Should be fine I think. > > > > > > + if (!doe) > > > + return -ENOMEM; > > > + > > > + mutex_init(&doe->state_lock); > > > + init_completion(&doe->abort_c); > > > + doe->doe_dev = doe_dev; > > > + init_waitqueue_head(&doe->wq); > > > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > > > + dev_set_drvdata(&aux_dev->dev, doe); > > > + > > > + rc = pci_doe_register(doe); > > > + if (rc) > > > + goto err_free; > > > + > > > + rc = pci_doe_cache_protocols(doe); > > > + if (rc) { > > > + pci_doe_unregister(doe); > > > > Mixture of different forms of error handling here. > > I'd move this below and add an err_unregister label. > > Actually with the devm_kzalloc() we don't need the goto at all. We can just > return. I _think_? Right? yes > > > > > > + goto err_free; > > > + } > > > + > > > + return 0; > > > + > > > +err_free: > > > + kfree(doe); > > > + return rc; > > > +} > > > + > > > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > > > +{ > > > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > > > + > > > + /* First halt the state machine */ > > > + cancel_delayed_work_sync(&doe->statemachine); > > > + kfree(doe->prots); > > > > Logical flow to me is unregister first, free protocols second > > (to reverse what we do in probe) > > No this is the reverse of the probe order I think. > > Order is > register > cache protocols > > Then we > free 'uncache' protocols > unregister > > Right? Huh. No idea what I was going on about. > > > > > > + pci_doe_unregister(doe); > > > + kfree(doe); > > > +} > > > + > > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > > > + {.name = "cxl_pci.doe", }, > > > > I'd like to hear from Bjorn on whether registering this from the CXL > > device is the right approach or if we should perhaps just do it directly from > > somewhere in PCI. (really applies to patch 3) I'll talk more about this there. > > Actually I think this could be left blank until the next patch... It's just > odd to define an empty table in the next few structures. But technically this > is not needed until the devices are defined. 
> > I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement > though. > > > > > > + {}, > > > +}; > > > + > > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > > > + > > > +struct auxiliary_driver pci_doe_auxiliary_drv = { > > > + .name = "pci_doe_drv", > > > > I would assume this is only used in contexts where the _drv is > > obvious? I would go with "pci_doe". > > Sure. done. > > > > > > + .id_table = pci_doe_auxiliary_id_table, > > > + .probe = pci_doe_probe, > > > + .remove = pci_doe_remove > > > +}; > > > + > > > +static int __init pci_doe_init_module(void) > > > +{ > > > + int ret; > > > + > > > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > > > + if (ret) { > > > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > > > + ret); > > > + return ret; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static void __exit pci_doe_exit_module(void) > > > +{ > > > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > > > +} > > > + > > > +module_init(pci_doe_init_module); > > > +module_exit(pci_doe_exit_module); > > > > Seems like the auxiliary bus would benefit from a > > module_auxiliary_driver() macro to cover this simple registration stuff > > similar to module_i2c_driver() etc. > > > > Mind you, looking at 5.15 this would be the only user, so maybe one > > for the 'next' case on basis two instances proves it's 'common' ;) > > I'm inclined to leave this alone ATM. I tried to clean up the auxiliary device > documentation and got a bunch more work asked of me by Greg KH. So I'm behind > on that ATM. > > Later we can investigate that a bit I think. Sure. No rush on that. > > > > > > +MODULE_LICENSE("GPL v2"); > > > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > > > new file mode 100644 > > > index 000000000000..8380b7ad33d4 > > > --- /dev/null > > > +++ b/include/linux/pci-doe.h > > > @@ -0,0 +1,63 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +/* > > > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. > > > + * > > > + * Copyright (C) 2021 Huawei > > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > + */ > > > + > > > +#include <linux/completion.h> > > > +#include <linux/list.h> > > > +#include <linux/mutex.h> > > > > Not used in this header that I can see, so push down to the c files. > > oops... thanks. > > > > > > +#include <linux/auxiliary_bus.h> > > > + > > > +#ifndef LINUX_PCI_DOE_H > > > +#define LINUX_PCI_DOE_H > > > + > > > +#define DOE_DEV_NAME "doe" > > > > Not sure this is used? > > Used in the next patch... and it kind of goes along with the table_id name... > > I'll see about moving both of those to the next patch where it makes more sense > for now. > > Thanks for the review, > Ira > Thanks, Jonathan
On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > with standard protocol discovery. Each mailbox is accessed through a > > DOE Extended Capability. > > > > Define an auxiliary device driver which control DOE auxiliary devices > > registered on the auxiliary bus. > > What do we gain by making this an auxiliary driver? > > This doesn't really feel like a "driver," and apparently it used to be > a library. I'd like to see the rationale and benefits of the driver > approach (in the eventual commit log as well as the current email > thread). > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it offers for userspace to manage kernel vs userspace access to a device. CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not clobber mmio space that is actively claimed by a kernel driver. I submit that DOE merits the same protection for DOE instances that the kernel consumes. Unlike other PCI configuration registers that root userspace has no reason to touch unless it wants to actively break things, DOE is a mechanism that root userspace may need to access directly in some cases. There are a few examples that come to mind. CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) offers a mechanism to set different test modes for a DOE device. The kernel has no reason to ever use that interface, and it has strong reasons to want to block access to it in production. However, hardware vendors also use debug Linux builds for hardware bringup. So I would like to be able to say that the mechanism to gain access to the compliance DOE is to detach the aux DOE driver from the right aux DOE device. Could we build a custom way to do the same for the DOE library, sure, but why re-invent the wheel when udev and the driver model can handle this type of policy question already? Another use case is SPDM where an agent can establish a secure message passing channel to a device, or paravirtualized device to exchange protected messages with the hypervisor. My expectation is that in addition to the kernel establishing SPDM sessions for PCI IDE and CXL.cachemem IDE (link Integrity and Data Encryption) there will be use cases for root userspace to establish their own SPDM session. In that scenario as well the kernel can be told to give up control of a specific DOE instance by detaching the aux device for its driver, but otherwise the kernel driver can be assured that userspace will not clobber its communications with its own attempts to talk over the DOE. Lastly, and perhaps this is minor, the PCI core is a built-in object and aux-bus allows for breaking out device library functionality like this into a dedicated module. But yes, that's not a good reason unto itself because you could "auxify" almost anything past the point of reason just to get more modules. > > A DOE mailbox is allowed to support any number of protocols while some > > DOE protocol specifications apply additional restrictions. > > This sounds something like a fancy version of VPD, and VPD has been a > huge headache. I hope DOE avoids that ;) Please say a bit more, I think DOE is a rather large headache as evidenced by us fine folks grappling with how to architect the Linux enabling.
On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > with standard protocol discovery. Each mailbox is accessed through a > > > DOE Extended Capability. > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > registered on the auxiliary bus. > > > > What do we gain by making this an auxiliary driver? > > > > This doesn't really feel like a "driver," and apparently it used to be > > a library. I'd like to see the rationale and benefits of the driver > > approach (in the eventual commit log as well as the current email > > thread). > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > offers for userspace to manage kernel vs userspace access to a device. > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > clobber mmio space that is actively claimed by a kernel driver. I > submit that DOE merits the same protection for DOE instances that the > kernel consumes. > > Unlike other PCI configuration registers that root userspace has no > reason to touch unless it wants to actively break things, DOE is a > mechanism that root userspace may need to access directly in some > cases. There are a few examples that come to mind. It's useful for root to read/write config registers with setpci, e.g., to test ASPM configuration, test power management behavior, etc. That can certainly break things and interfere with kernel access (and IMO should taint the kernel) but we have so far accepted that risk. I think the same will be true for DOE. In addition, I would think you might want a safe userspace interface via sysfs, e.g., something like the "vpd" file, but I missed that if it was in this series. > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > offers a mechanism to set different test modes for a DOE device. The > kernel has no reason to ever use that interface, and it has strong > reasons to want to block access to it in production. However, hardware > vendors also use debug Linux builds for hardware bringup. So I would > like to be able to say that the mechanism to gain access to the > compliance DOE is to detach the aux DOE driver from the right aux DOE > device. Could we build a custom way to do the same for the DOE > library, sure, but why re-invent the wheel when udev and the driver > model can handle this type of policy question already? > > Another use case is SPDM where an agent can establish a secure message > passing channel to a device, or paravirtualized device to exchange > protected messages with the hypervisor. My expectation is that in > addition to the kernel establishing SPDM sessions for PCI IDE and > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > use cases for root userspace to establish their own SPDM session. In > that scenario as well the kernel can be told to give up control of a > specific DOE instance by detaching the aux device for its driver, but > otherwise the kernel driver can be assured that userspace will not > clobber its communications with its own attempts to talk over the DOE. I assume the kernel needs to control access to DOE in all cases, doesn't it? For example, DOE can generate interrupts, and only the kernel can field them. 
Maybe if I saw the userspace interface this would make more sense to me. I'm hoping there's a reasonable "send this query and give me the response" primitive that can be implemented in the kernel, used by drivers, and exposed safely to userspace. > Lastly, and perhaps this is minor, the PCI core is a built-in object > and aux-bus allows for breaking out device library functionality like > this into a dedicated module. But yes, that's not a good reason unto > itself because you could "auxify" almost anything past the point of > reason just to get more modules. > > > > A DOE mailbox is allowed to support any number of protocols while some > > > DOE protocol specifications apply additional restrictions. > > > > This sounds something like a fancy version of VPD, and VPD has been a > > huge headache. I hope DOE avoids that ;) > > Please say a bit more, I think DOE is a rather large headache as > evidenced by us fine folks grappling with how to architect the Linux > enabling. VPD is not widely used, so gets poor testing. The device doesn't tell us how much VPD it has, so we have to read until the end. The data is theoretically self-describing (series of elements, each containing a type and size), but of course some devices don't format it correctly, so we read to the absolute limit, which takes a long time. Typical contents of unimplemented or uninitialized VPD are all 0x00 or all 0xff, neither of which is defined as a valid "end of VPD" signal, so we have hacky "this doesn't look like VPD" code. The PCI core doesn't need VPD itself and should only need to look for the size of each element and the "end" tag, but it works around some issues by doing more interpretation of the data. The spec is a little ambiguous and leaves room for vendors to use types not mentioned in the spec. Some devices share VPD hardware across functions but don't protect it correctly (a hardware defect, granted). VPD has address/data registers, so it requires locking, polling for completion, and timeouts. Bjorn
Typical contents of unimplemented or uninitialized VPD are all 0x00 or all 0xff, neither of which is defined as a valid "end of VPD" signal, so we have hacky "this doesn't look like VPD" code.

The PCI core doesn't need VPD itself and should only need to look for the size of each element and the "end" tag, but it works around some issues by doing more interpretation of the data. The spec is a little ambiguous and leaves room for vendors to use types not mentioned in the spec.

Some devices share VPD hardware across functions but don't protect it correctly (a hardware defect, granted).

VPD has address/data registers, so it requires locking, polling for completion, and timeouts.

Bjorn
On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > > with standard protocol discovery. Each mailbox is accessed through a > > > > DOE Extended Capability. > > > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > > registered on the auxiliary bus. > > > > > > What do we gain by making this an auxiliary driver? > > > > > > This doesn't really feel like a "driver," and apparently it used to be > > > a library. I'd like to see the rationale and benefits of the driver > > > approach (in the eventual commit log as well as the current email > > > thread). > > > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > > offers for userspace to manage kernel vs userspace access to a device. > > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > > clobber mmio space that is actively claimed by a kernel driver. I > > submit that DOE merits the same protection for DOE instances that the > > kernel consumes. > > > > Unlike other PCI configuration registers that root userspace has no > > reason to touch unless it wants to actively break things, DOE is a > > mechanism that root userspace may need to access directly in some > > cases. There are a few examples that come to mind. > > It's useful for root to read/write config registers with setpci, e.g., > to test ASPM configuration, test power management behavior, etc. That > can certainly break things and interfere with kernel access (and IMO > should taint the kernel) but we have so far accepted that risk. I > think the same will be true for DOE. I think DOE is a demonstrable step worse than those examples and pushes into the unacceptable risk category. It invites a communication protocol with an unbounded range of side effects (especially when controlling a device like a CXL memory expander that affects "System RAM" directly). Part of what drives platform / device vendors to the standards negotiation table is the OS encouraging common interfaces. If Linux provides an unfettered DOE interface it reduces an incentive for device vendors to collaborate with the kernel community. I do like the taint proposal though, if I can't convince you that DOE merits explicit root userspace lockout beyond CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel warning loudly about DOE usage that falls outside the kernel's expectations. > In addition, I would think you might want a safe userspace interface > via sysfs, e.g., something like the "vpd" file, but I missed that if > it was in this series. I do not think the kernel is well served by a generic userspace passthrough for DOE for the same reason that there is no generic passthrough for the ACPI _DSM facility. When the kernel becomes a generic pipe to a vendor specific interface it impedes the kernel from developing standard interfaces across vendors. Each vendor will ship their own quirky feature and corresponding vendor-specific tool with minimal incentive to coordinate with other vendors doing similar things. 
Each vendor will ship their own quirky feature and corresponding vendor-specific tool with minimal incentive to coordinate with other vendors doing similar things.
At a minimum the userspace interface for DOE should be at a level above the raw transport and be enabled per standardized / published DOE protocol. I.e. a userspace interface to read the CDAT table retrieved over DOE, or a userspace interface to enumerate IDE capabilities, etc. > > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > > offers a mechanism to set different test modes for a DOE device. The > > kernel has no reason to ever use that interface, and it has strong > > reasons to want to block access to it in production. However, hardware > > vendors also use debug Linux builds for hardware bringup. So I would > > like to be able to say that the mechanism to gain access to the > > compliance DOE is to detach the aux DOE driver from the right aux DOE > > device. Could we build a custom way to do the same for the DOE > > library, sure, but why re-invent the wheel when udev and the driver > > model can handle this type of policy question already? > > > > Another use case is SPDM where an agent can establish a secure message > > passing channel to a device, or paravirtualized device to exchange > > protected messages with the hypervisor. My expectation is that in > > addition to the kernel establishing SPDM sessions for PCI IDE and > > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > > use cases for root userspace to establish their own SPDM session. In > > that scenario as well the kernel can be told to give up control of a > > specific DOE instance by detaching the aux device for its driver, but > > otherwise the kernel driver can be assured that userspace will not > > clobber its communications with its own attempts to talk over the DOE. > > I assume the kernel needs to control access to DOE in all cases, > doesn't it? For example, DOE can generate interrupts, and only the > kernel can field them. Maybe if I saw the userspace interface this > would make more sense to me. I'm hoping there's a reasonable "send > this query and give me the response" primitive that can be implemented > in the kernel, used by drivers, and exposed safely to userspace. A DOE can generate interrupts, but I have yet to see a protocol that demands that. The userspace interface in the patches is just a binary attribute to dump the "CDAT" table retrieved over DOE. No generic passthrough is provided per the concerns above. > > Lastly, and perhaps this is minor, the PCI core is a built-in object > > and aux-bus allows for breaking out device library functionality like > > this into a dedicated module. But yes, that's not a good reason unto > > itself because you could "auxify" almost anything past the point of > > reason just to get more modules. > > > > > > A DOE mailbox is allowed to support any number of protocols while some > > > > DOE protocol specifications apply additional restrictions. > > > > > > This sounds something like a fancy version of VPD, and VPD has been a > > > huge headache. I hope DOE avoids that ;) > > > > Please say a bit more, I think DOE is a rather large headache as > > evidenced by us fine folks grappling with how to architect the Linux > > enabling. > > VPD is not widely used, so gets poor testing. The device doesn't tell > us how much VPD it has, so we have to read until the end. The data is > theoretically self-describing (series of elements, each containing a > type and size), but of course some devices don't format it correctly, > so we read to the absolute limit, which takes a long time. 
> > Typical contents of unimplemented or uninitialized VPD are all 0x00 or > all 0xff, neither of which is defined as a valid "end of VPD" signal, > so we have hacky "this doesn't look like VPD" code. > > The PCI core doesn't need VPD itself and should only need to look for > the size of each element and the "end" tag, but it works around some > issues by doing more interpretation of the data. The spec is a little > ambiguous and leaves room for vendors to use types not mentioned in > the spec. > > Some devices share VPD hardware across functions but don't protect it > correctly (a hardware defect, granted). > > VPD has address/data registers, so it requires locking, polling for > completion, and timeouts. Yikes, yes, that sounds like a headache.
On Sat, 4 Dec 2021 07:47:59 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > > > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > > > with standard protocol discovery. Each mailbox is accessed through a > > > > > DOE Extended Capability. > > > > > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > > > registered on the auxiliary bus. > > > > > > > > What do we gain by making this an auxiliary driver? > > > > > > > > This doesn't really feel like a "driver," and apparently it used to be > > > > a library. I'd like to see the rationale and benefits of the driver > > > > approach (in the eventual commit log as well as the current email > > > > thread). > > > > > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > > > offers for userspace to manage kernel vs userspace access to a device. > > > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > > > clobber mmio space that is actively claimed by a kernel driver. I > > > submit that DOE merits the same protection for DOE instances that the > > > kernel consumes. > > > > > > Unlike other PCI configuration registers that root userspace has no > > > reason to touch unless it wants to actively break things, DOE is a > > > mechanism that root userspace may need to access directly in some > > > cases. There are a few examples that come to mind. > > > > It's useful for root to read/write config registers with setpci, e.g., > > to test ASPM configuration, test power management behavior, etc. That > > can certainly break things and interfere with kernel access (and IMO > > should taint the kernel) but we have so far accepted that risk. I > > think the same will be true for DOE. > > I think DOE is a demonstrable step worse than those examples and > pushes into the unacceptable risk category. It invites a communication > protocol with an unbounded range of side effects (especially when > controlling a device like a CXL memory expander that affects "System > RAM" directly). Part of what drives platform / device vendors to the > standards negotiation table is the OS encouraging common interfaces. > If Linux provides an unfettered DOE interface it reduces an incentive > for device vendors to collaborate with the kernel community. > > I do like the taint proposal though, if I can't convince you that DOE > merits explicit root userspace lockout beyond > CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel > warning loudly about DOE usage that falls outside the kernel's > expectations. > > > In addition, I would think you might want a safe userspace interface > > via sysfs, e.g., something like the "vpd" file, but I missed that if > > it was in this series. > > I do not think the kernel is well served by a generic userspace > passthrough for DOE for the same reason that there is no generic > passthrough for the ACPI _DSM facility. When the kernel becomes a > generic pipe to a vendor specific interface it impedes the kernel from > developing standard interfaces across vendors. 
Each vendor will ship > their own quirky feature and corresponding vendor-specific tool with > minimal incentive to coordinate with other vendors doing similar > things. At a minimum the userspace interface for DOE should be at a > level above the raw transport and be enabled per standardized / > published DOE protocol. I.e. a userspace interface to read the CDAT > table retrieved over DOE, or a userspace interface to enumerate IDE > capabilities, etc. I agree with Dan that a generic pass-through is a bad idea, but we do have code for one in an earlier version... https://lore.kernel.org/linux-pci/20210524133938.2815206-5-Jonathan.Cameron@huawei.com/ We could take the approach of an allow list for this, if we can figure out an appropriate way to manage that list. > > > > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > > > offers a mechanism to set different test modes for a DOE device. The > > > kernel has no reason to ever use that interface, and it has strong > > > reasons to want to block access to it in production. However, hardware > > > vendors also use debug Linux builds for hardware bringup. So I would > > > like to be able to say that the mechanism to gain access to the > > > compliance DOE is to detach the aux DOE driver from the right aux DOE > > > device. Could we build a custom way to do the same for the DOE > > > library, sure, but why re-invent the wheel when udev and the driver > > > model can handle this type of policy question already? > > > > > > Another use case is SPDM where an agent can establish a secure message > > > passing channel to a device, or paravirtualized device to exchange > > > protected messages with the hypervisor. My expectation is that in > > > addition to the kernel establishing SPDM sessions for PCI IDE and > > > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > > > use cases for root userspace to establish their own SPDM session. In > > > that scenario as well the kernel can be told to give up control of a > > > specific DOE instance by detaching the aux device for its driver, but > > > otherwise the kernel driver can be assured that userspace will not > > > clobber its communications with its own attempts to talk over the DOE. > > > > I assume the kernel needs to control access to DOE in all cases, > > doesn't it? For example, DOE can generate interrupts, and only the > > kernel can field them. Maybe if I saw the userspace interface this > > would make more sense to me. I'm hoping there's a reasonable "send > > this query and give me the response" primitive that can be implemented > > in the kernel, used by drivers, and exposed safely to userspace. > > A DOE can generate interrupts, but I have yet to see a protocol that > demands that. The userspace interface in the patches is just a binary > attribute to dump the "CDAT" table retrieved over DOE. No generic > passthrough is provided per the concerns above. I don't think it would be that hard to set up a protocol-specific interface for establishment of secure channels to cover that particular case. As long as we ensure userspace can see / manage the crypto elements it won't matter if the data is passed through another interface on its way to the DOE. Jonathan
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 0c473d75e625..b512295538ba 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND The PCI device frontend driver allows the kernel to import arbitrary PCI devices from a PCI backend to support PCI driver domains. +config PCI_DOE_DRIVER + tristate "PCI Data Object Exchange (DOE) driver" + select AUXILIARY_BUS + help + Driver for DOE auxiliary devices. + + DOE provides a simple mailbox in PCI config space that is used by a + number of different protocols. DOE is defined in the Data Object + Exchange ECN to the PCIe r5.0 spec. + config PCI_ATS bool diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index d62c4ac4ae1b..afd9d7bd2b82 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB) += pci-stub.o obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o obj-$(CONFIG_PCI_ECAM) += ecam.o obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o +obj-$(CONFIG_PCI_DOE_DRIVER) += pci-doe.o obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o +pci-doe-y := doe.o + # Endpoint library must be initialized before its users obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c new file mode 100644 index 000000000000..2e702fdc7879 --- /dev/null +++ b/drivers/pci/doe.c @@ -0,0 +1,701 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Data Object Exchange ECN + * https://members.pcisig.com/wg/PCI-SIG/document/14143 + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + */ + +#include <linux/bitfield.h> +#include <linux/delay.h> +#include <linux/jiffies.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/pci.h> +#include <linux/pci-doe.h> +#include <linux/workqueue.h> +#include <linux/module.h> + +#define PCI_DOE_PROTOCOL_DISCOVERY 0 + +#define PCI_DOE_BUSY_MAX_RETRIES 16 +#define PCI_DOE_POLL_INTERVAL (HZ / 128) + +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ +#define PCI_DOE_TIMEOUT HZ + +enum pci_doe_state { + DOE_IDLE, + DOE_WAIT_RESP, + DOE_WAIT_ABORT, + DOE_WAIT_ABORT_ON_ERR, +}; + +/* + * struct pci_doe_task - description of a query / response task + * @ex: The details of the task to be done + * @rv: Return value. Length of received response or error + * @cb: Callback for completion of task + * @private: Private data passed to callback on completion + */ +struct pci_doe_task { + struct pci_doe_exchange *ex; + int rv; + void (*cb)(void *private); + void *private; +}; + +/** + * struct pci_doe - A single DOE mailbox driver + * + * @doe_dev: The DOE Auxiliary device being driven + * @abort_c: Completion used for initial abort handling + * @irq: Interrupt used for signaling DOE ready or abort + * @irq_name: Name used to identify the irq for a particular DOE + * @prots: Array of identifiers for protocols supported + * @num_prots: Size of prots array + * @cur_task: Current task the state machine is working on + * @wq: Wait queue to wait on if a query is in progress + * @state_lock: Protect the state of cur_task, abort, and dead + * @statemachine: Work item for the DOE state machine + * @state: Current state of this DOE + * @timeout_jiffies: 1 second after GO set + * @busy_retries: Count of retry attempts + * @abort: Request a manual abort (e.g. on init) + * @dead: Used to mark a DOE for which an ABORT has timed out. 
Further messages + * will immediately be aborted with error + */ +struct pci_doe { + struct pci_doe_dev *doe_dev; + struct completion abort_c; + int irq; + char *irq_name; + struct pci_doe_protocol *prots; + int num_prots; + + struct pci_doe_task *cur_task; + wait_queue_head_t wq; + struct mutex state_lock; + struct delayed_work statemachine; + enum pci_doe_state state; + unsigned long timeout_jiffies; + unsigned int busy_retries; + unsigned int abort:1; + unsigned int dead:1; +}; + +static irqreturn_t pci_doe_irq(int irq, void *data) +{ + struct pci_doe *doe = data; + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); + mod_delayed_work(system_wq, &doe->statemachine, 0); + return IRQ_HANDLED; + } + /* Leave the error case to be handled outside IRQ */ + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { + mod_delayed_work(system_wq, &doe->statemachine, 0); + return IRQ_HANDLED; + } + + /* + * Busy being cleared can result in an interrupt, but as + * the original Busy may not have been detected, there is no + * way to separate such an interrupt from a spurious interrupt. + */ + return IRQ_HANDLED; +} + +/* + * Only call when safe to directly access the DOE, either because no tasks yet + * queued, or called from doe_statemachine_work() which has exclusive access to + * the DOE config space. + */ +static void pci_doe_abort_start(struct pci_doe *doe) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + + val = PCI_DOE_CTRL_ABORT; + if (doe->irq) + val |= PCI_DOE_CTRL_INT_EN; + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); + + doe->timeout_jiffies = jiffies + HZ; + schedule_delayed_work(&doe->statemachine, HZ); +} + +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + int i; + + /* + * Check the DOE busy bit is not set. If it is set, this could indicate + * someone other than Linux (e.g. firmware) is using the mailbox. Note + * it is expected that firmware and OS will negotiate access rights via + * an, as yet to be defined method. 
+ */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) + return -EBUSY; + + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + /* Write DOE Header */ + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) | + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type); + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); + /* Length is 2 DW of header + length of payload in DW */ + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, + 2 + ex->request_pl_sz / sizeof(u32))); + for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++) + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + ex->request_pl[i]); + + val = PCI_DOE_CTRL_GO; + if (doe->irq) + val |= PCI_DOE_CTRL_INT_EN; + + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); + /* Request is sent - now wait for poll or IRQ */ + return 0; +} + +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + size_t length; + u32 val; + int i; + + /* Read the first dword to get the protocol */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) || + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) { + pci_err(pdev, + "Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n", + ex->prot.vid, ex->prot.type, + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); + return -EIO; + } + + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + /* Read the second dword to get the length */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); + if (length > SZ_1M || length < 2) + return -EIO; + + /* First 2 dwords have already been read */ + length -= 2; + /* Read the rest of the response payload */ + for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, + &ex->response_pl[i]); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + + /* Flush excess length */ + for (; i < length; i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + /* Final error check to pick up on any since Data Object Ready */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32); +} + +static void pci_doe_task_complete(void *private) +{ + complete(private); +} + +static void doe_statemachine_work(struct work_struct *work) +{ + struct delayed_work *w = to_delayed_work(work); + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + struct pci_doe_task *task; + bool abort; + u32 val; + int rc; + + mutex_lock(&doe->state_lock); + task = doe->cur_task; + abort = doe->abort; + doe->abort = false; + mutex_unlock(&doe->state_lock); + + if (abort) { + /* + * Currently only used during init - care needed if + * pci_doe_abort() is generally exposed as it would impact + * queries in flight. 
+ */ + WARN_ON(task); + doe->state = DOE_WAIT_ABORT; + pci_doe_abort_start(doe); + return; + } + + switch (doe->state) { + case DOE_IDLE: + if (task == NULL) + return; + + /* Nothing currently in flight so queue a task */ + rc = pci_doe_send_req(doe, task->ex); + /* + * The specification does not provide any guidance on how long + * some other entity could keep the DOE busy, so try for 1 + * second then fail. Busy handling is best effort only, because + * there is no way of avoiding racing against another user of + * the DOE. + */ + if (rc == -EBUSY) { + doe->busy_retries++; + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { + /* Long enough, fail this request */ + pci_WARN(pdev, true, "DOE busy for too long\n"); + doe->busy_retries = 0; + goto err_busy; + } + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); + return; + } + if (rc) + goto err_abort; + doe->busy_retries = 0; + + doe->state = DOE_WAIT_RESP; + doe->timeout_jiffies = jiffies + HZ; + /* Now poll or wait for IRQ with timeout */ + if (doe->irq > 0) + schedule_delayed_work(w, PCI_DOE_TIMEOUT); + else + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); + return; + + case DOE_WAIT_RESP: + /* Not possible to get here with NULL task */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { + rc = -EIO; + goto err_abort; + } + + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { + /* If not yet at timeout reschedule otherwise abort */ + if (time_after(jiffies, doe->timeout_jiffies)) { + rc = -ETIMEDOUT; + goto err_abort; + } + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); + return; + } + + rc = pci_doe_recv_resp(doe, task->ex); + if (rc < 0) + goto err_abort; + + doe->state = DOE_IDLE; + + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + /* Set the return value to the length of received payload */ + task->rv = rc; + task->cb(task->private); + + return; + + case DOE_WAIT_ABORT: + case DOE_WAIT_ABORT_ON_ERR: + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { + /* Back to normal state - carry on */ + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + /* + * For deliberately triggered abort, someone is + * waiting. + */ + if (doe->state == DOE_WAIT_ABORT) + complete(&doe->abort_c); + + doe->state = DOE_IDLE; + return; + } + if (time_after(jiffies, doe->timeout_jiffies)) { + /* Task has timed out and is dead - abort */ + pci_err(pdev, "DOE ABORT timed out\n"); + mutex_lock(&doe->state_lock); + doe->dead = true; + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + if (doe->state == DOE_WAIT_ABORT) + complete(&doe->abort_c); + } + return; + } + +err_abort: + doe->state = DOE_WAIT_ABORT_ON_ERR; + pci_doe_abort_start(doe); +err_busy: + task->rv = rc; + task->cb(task->private); + /* If here via err_busy, signal the task done. */ + if (doe->state == DOE_IDLE) { + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + } +} + +/** + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response + * @doe: DOE mailbox state structure + * @ex: Description of the buffers and Vendor ID + type used in this + * request/response pair + * + * Excess data will be discarded. 
+ * + * RETURNS: payload in bytes on success, < 0 on error + */ +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) +{ + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); + struct pci_doe_task task; + DECLARE_COMPLETION_ONSTACK(c); + + if (!doe) + return -EAGAIN; + + /* DOE requests must be a whole number of DW */ + if (ex->request_pl_sz % sizeof(u32)) + return -EINVAL; + + task.ex = ex; + task.cb = pci_doe_task_complete; + task.private = &c; + +again: + mutex_lock(&doe->state_lock); + if (doe->cur_task) { + mutex_unlock(&doe->state_lock); + wait_event_interruptible(doe->wq, doe->cur_task == NULL); + goto again; + } + + if (doe->dead) { + mutex_unlock(&doe->state_lock); + return -EIO; + } + doe->cur_task = &task; + schedule_delayed_work(&doe->statemachine, 0); + mutex_unlock(&doe->state_lock); + + wait_for_completion(&c); + + return task.rv; +} +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); + +/** + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol + * @pdev: Device on which to find the DOE instance + * @vid: Protocol Vendor ID + * @type: protocol type + * + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox + * exchange through that DOE mailbox. + * + * RETURNS: True if the DOE device supports the protocol specified + */ +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) +{ + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); + int i; + + if (!doe) + return false; + + for (i = 0; i < doe->num_prots; i++) + if ((doe->prots[i].vid == vid) && + (doe->prots[i].type == type)) + return true; + + return false; +} +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); + +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid, + u8 *protocol) +{ + u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index); + u32 response_pl; + struct pci_doe_exchange ex = { + .prot.vid = PCI_VENDOR_ID_PCI_SIG, + .prot.type = PCI_DOE_PROTOCOL_DISCOVERY, + .request_pl = &request_pl, + .request_pl_sz = sizeof(request_pl), + .response_pl = &response_pl, + .response_pl_sz = sizeof(response_pl), + }; + int ret; + + ret = pci_doe_exchange_sync(doe->doe_dev, &ex); + if (ret < 0) + return ret; + + if (ret != sizeof(response_pl)) + return -EIO; + + *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl); + *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl); + *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl); + + return 0; +} + +static int pci_doe_cache_protocols(struct pci_doe *doe) +{ + u8 index = 0; + int rc; + + /* Discovery protocol must always be supported and must report itself */ + doe->num_prots = 1; + doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL); + if (doe->prots == NULL) + return -ENOMEM; + + do { + struct pci_doe_protocol *prot; + + prot = &doe->prots[doe->num_prots - 1]; + rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type); + if (rc) + goto err_free_prots; + + if (index) { + struct pci_doe_protocol *prot_new; + + doe->num_prots++; + prot_new = krealloc(doe->prots, + sizeof(*doe->prots) * doe->num_prots, + GFP_KERNEL); + if (prot_new == NULL) { + rc = -ENOMEM; + goto err_free_prots; + } + doe->prots = prot_new; + } + } while (index); + + return 0; + +err_free_prots: + kfree(doe->prots); + doe->num_prots = 0; + doe->prots = NULL; + return rc; +} + +static int pci_doe_abort(struct pci_doe *doe) +{ + reinit_completion(&doe->abort_c); + mutex_lock(&doe->state_lock); + 
doe->abort = true; + mutex_unlock(&doe->state_lock); + schedule_delayed_work(&doe->statemachine, 0); + wait_for_completion(&doe->abort_c); + + if (doe->dead) + return -EIO; + + return 0; +} + +static void pci_doe_release_irq(struct pci_doe *doe) +{ + if (doe->irq > 0) + free_irq(doe->irq, doe); +} + +static int pci_doe_register(struct pci_doe *doe) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + bool poll = !pci_dev_msi_enabled(pdev); + int offset = doe->doe_dev->cap_offset; + int rc, irq; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); + + if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { + irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); + if (irq < 0) + return irq; + + doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", + doe->doe_dev->adev.name); + if (!doe->irq_name) + return -ENOMEM; + + rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); + if (rc) + goto err_free_name; + + doe->irq = irq; + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, + PCI_DOE_CTRL_INT_EN); + } + + /* Reset the mailbox by issuing an abort */ + rc = pci_doe_abort(doe); + if (rc) + goto err_free_irqs; + + /* Ensure the pci device remains until this driver is done with it */ + get_device(&pdev->dev); + + return 0; + +err_free_irqs: + pci_doe_release_irq(doe); +err_free_name: + kfree(doe->irq_name); + return rc; +} + +static void pci_doe_unregister(struct pci_doe *doe) +{ + pci_doe_release_irq(doe); + kfree(doe->irq_name); + put_device(&doe->doe_dev->pdev->dev); +} + +/* + * pci_doe_probe() - Set up the Mailbox + * @aux_dev: Auxiliary Device + * @id: Auxiliary device ID + * + * Probe the mailbox found for all protocols and set up the Mailbox + * + * RETURNS: 0 on success, < 0 on error + */ +static int pci_doe_probe(struct auxiliary_device *aux_dev, + const struct auxiliary_device_id *id) +{ + struct pci_doe_dev *doe_dev = container_of(aux_dev, + struct pci_doe_dev, + adev); + struct pci_doe *doe; + int rc; + + doe = kzalloc(sizeof(*doe), GFP_KERNEL); + if (!doe) + return -ENOMEM; + + mutex_init(&doe->state_lock); + init_completion(&doe->abort_c); + doe->doe_dev = doe_dev; + init_waitqueue_head(&doe->wq); + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); + dev_set_drvdata(&aux_dev->dev, doe); + + rc = pci_doe_register(doe); + if (rc) + goto err_free; + + rc = pci_doe_cache_protocols(doe); + if (rc) { + pci_doe_unregister(doe); + goto err_free; + } + + return 0; + +err_free: + kfree(doe); + return rc; +} + +static void pci_doe_remove(struct auxiliary_device *aux_dev) +{ + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); + + /* First halt the state machine */ + cancel_delayed_work_sync(&doe->statemachine); + kfree(doe->prots); + pci_doe_unregister(doe); + kfree(doe); +} + +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { + {.name = "cxl_pci.doe", }, + {}, +}; + +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); + +struct auxiliary_driver pci_doe_auxiliary_drv = { + .name = "pci_doe_drv", + .id_table = pci_doe_auxiliary_id_table, + .probe = pci_doe_probe, + .remove = pci_doe_remove +}; + +static int __init pci_doe_init_module(void) +{ + int ret; + + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); + if (ret) { + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", + ret); + return ret; + } + + return 0; +} + +static void __exit pci_doe_exit_module(void) +{ + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); +} + +module_init(pci_doe_init_module); +module_exit(pci_doe_exit_module); +MODULE_LICENSE("GPL 
v2"); diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h new file mode 100644 index 000000000000..8380b7ad33d4 --- /dev/null +++ b/include/linux/pci-doe.h @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + */ + +#include <linux/completion.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/auxiliary_bus.h> + +#ifndef LINUX_PCI_DOE_H +#define LINUX_PCI_DOE_H + +#define DOE_DEV_NAME "doe" + +struct pci_doe_protocol { + u16 vid; + u8 type; +}; + +/** + * struct pci_doe_exchange - represents a single query/response + * + * @prot: DOE Protocol + * @request_pl: The request payload + * @request_pl_sz: Size of the request payload + * @response_pl: The response payload + * @response_pl_sz: Size of the response payload + */ +struct pci_doe_exchange { + struct pci_doe_protocol prot; + u32 *request_pl; + size_t request_pl_sz; + u32 *response_pl; + size_t response_pl_sz; +}; + +/** + * struct pci_doe_dev - DOE mailbox device + * + * @adrv: Auxiliary Driver data + * @pdev: PCI device this belongs to + * @offset: Capability offset + * + * This represents a single DOE mailbox device. Devices should create this + * device and register it on the Auxiliary bus for the DOE driver to maintain. + * + */ +struct pci_doe_dev { + struct auxiliary_device adev; + struct pci_dev *pdev; + int cap_offset; +}; + +/* Library operations */ +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, + struct pci_doe_exchange *ex); +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type); + +#endif diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index e709ae8235e7..1073cd1916e1 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -730,7 +730,8 @@ #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ -#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT +#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE #define PCI_EXT_CAP_DSN_SIZEOF 12 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40 @@ -1092,4 +1093,30 @@ #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 +/* Data Object Exchange */ +#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ +#define PCI_DOE_CAP_INT 0x00000001 /* Interrupt Support */ +#define PCI_DOE_CAP_IRQ 0x00000ffe /* Interrupt Message Number */ +#define PCI_DOE_CTRL 0x08 /* DOE Control Register */ +#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */ +#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */ +#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */ +#define PCI_DOE_STATUS 0x0c /* DOE Status Register */ +#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */ +#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */ +#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */ +#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */ +#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */ +#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */ + +/* DOE Data Object - note not actually registers */ +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000 +#define 
PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff + +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000 +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000 + #endif /* LINUX_PCI_REGS_H */
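[Editor's note: for reference, a rough sketch of the other side of the
auxiliary-bus contract described in pci-doe.h above: the parent PCI driver
allocates one struct pci_doe_dev per discovered DOE capability and
registers it so this driver can bind. This is an illustration, not part
of the patch; the function names, the error-handling granularity and the
@id scheme are assumptions, and matching the "cxl_pci.doe" id table entry
additionally assumes the registering code lives in the cxl_pci module.]

#include <linux/auxiliary_bus.h>
#include <linux/pci.h>
#include <linux/pci-doe.h>
#include <linux/slab.h>

/* Free the wrapper once the auxiliary core drops its last reference. */
static void example_doe_dev_release(struct device *dev)
{
	struct auxiliary_device *adev =
		container_of(dev, struct auxiliary_device, dev);

	kfree(container_of(adev, struct pci_doe_dev, adev));
}

/*
 * Illustrative registration of a single DOE mailbox found at @cap_offset.
 * @id must be unique per registering module for the "doe" name.
 */
static int example_register_doe(struct pci_dev *pdev, int cap_offset, u32 id)
{
	struct pci_doe_dev *doe_dev;
	int rc;

	doe_dev = kzalloc(sizeof(*doe_dev), GFP_KERNEL);
	if (!doe_dev)
		return -ENOMEM;

	doe_dev->pdev = pdev;
	doe_dev->cap_offset = cap_offset;

	/* Matched by the id table above as "<KBUILD_MODNAME>.doe" */
	doe_dev->adev.name = DOE_DEV_NAME;
	doe_dev->adev.id = id;
	doe_dev->adev.dev.parent = &pdev->dev;
	doe_dev->adev.dev.release = example_doe_dev_release;

	rc = auxiliary_device_init(&doe_dev->adev);
	if (rc) {
		kfree(doe_dev);
		return rc;
	}

	rc = auxiliary_device_add(&doe_dev->adev);
	if (rc)
		/* Drops the init reference; frees doe_dev via release(). */
		auxiliary_device_uninit(&doe_dev->adev);

	return rc;
}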