Message ID: 20211105235056.3711389-3-ira.weiny@intel.com
State: New, archived
Series: CXL: Read CDAT and DSMAS data from the device.
On Fri, 5 Nov 2021 16:50:53 -0700 <ira.weiny@intel.com> wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > with standard protocol discovery. Each mailbox is accessed through a > DOE Extended Capability. > > Define an auxiliary device driver which control DOE auxiliary devices > registered on the auxiliary bus. > > A DOE mailbox is allowed to support any number of protocols while some > DOE protocol specifications apply additional restrictions. > > The protocols supported are queried and cached. pci_doe_supports_prot() > can be used to determine if the DOE device supports the protocol > specified. > > A synchronous interface is provided in pci_doe_exchange_sync() to > perform a single query / response exchange from the driver through the > device specified. > > Testing was conducted against QEMU using: > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > This code is based on Jonathan's V4 series here: > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > Data Object Exchange (DOE) - Approved 12 March 2020 > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Hi Ira, Thanks for taking this on! I'm sure at least half the comments below are about things I wrote then forgot about. I'm not sure if it's a good thing but I've ignored this for long enough I'm almost reviewing it as fresh code :( I was carrying a local patch for the interrupt handler having figured out I'd missread the spec. Note that I've since concluded my local patch has it's own issues (it was unnecessarily complex) so I've made some suggestions below that I'm fairly sure fix things up. Note these paths are hard to test and require adding some fiddly state machines to QEMU to open up race windows... > > --- > Changes from Jonathan's V4 > Move the DOE MB code into the DOE auxiliary driver > Remove Task List in favor of a wait queue > > Changes from Ben > remove CXL references > propagate rc from pci functions on error ... > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > new file mode 100644 > index 000000000000..2e702fdc7879 > --- /dev/null > +++ b/drivers/pci/doe.c > @@ -0,0 +1,701 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Data Object Exchange ECN > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/bitfield.h> > +#include <linux/delay.h> > +#include <linux/jiffies.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/pci.h> > +#include <linux/pci-doe.h> > +#include <linux/workqueue.h> > +#include <linux/module.h> > + > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > + > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > + > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > +#define PCI_DOE_TIMEOUT HZ > + > +enum pci_doe_state { > + DOE_IDLE, > + DOE_WAIT_RESP, > + DOE_WAIT_ABORT, > + DOE_WAIT_ABORT_ON_ERR, > +}; > + > +/* /** Given it's in kernel-doc syntax, we might as well mark it as such. > + * struct pci_doe_task - description of a query / response task > + * @ex: The details of the task to be done > + * @rv: Return value. 
Length of received response or error > + * @cb: Callback for completion of task > + * @private: Private data passed to callback on completion > + */ > +struct pci_doe_task { > + struct pci_doe_exchange *ex; > + int rv; > + void (*cb)(void *private); > + void *private; > +}; > + > +/** > + * struct pci_doe - A single DOE mailbox driver > + * > + * @doe_dev: The DOE Auxiliary device being driven > + * @abort_c: Completion used for initial abort handling > + * @irq: Interrupt used for signaling DOE ready or abort > + * @irq_name: Name used to identify the irq for a particular DOE > + * @prots: Array of identifiers for protocols supported > + * @num_prots: Size of prots array > + * @cur_task: Current task the state machine is working on > + * @wq: Wait queue to wait on if a query is in progress > + * @state_lock: Protect the state of cur_task, abort, and dead > + * @statemachine: Work item for the DOE state machine > + * @state: Current state of this DOE > + * @timeout_jiffies: 1 second after GO set > + * @busy_retries: Count of retry attempts > + * @abort: Request a manual abort (e.g. on init) > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages > + * will immediately be aborted with error > + */ > +struct pci_doe { > + struct pci_doe_dev *doe_dev; > + struct completion abort_c; > + int irq; > + char *irq_name; > + struct pci_doe_protocol *prots; > + int num_prots; > + > + struct pci_doe_task *cur_task; > + wait_queue_head_t wq; > + struct mutex state_lock; > + struct delayed_work statemachine; > + enum pci_doe_state state; > + unsigned long timeout_jiffies; > + unsigned int busy_retries; > + unsigned int abort:1; > + unsigned int dead:1; > +}; > + > +static irqreturn_t pci_doe_irq(int irq, void *data) I was carrying a rework of this locally because I managed to convince myself this is wrong. It's been a while and naturally I didn't write a comprehensive set of notes on why it was wrong... (Note you can't trigger the problem paths in QEMU without some nasty hacks as it relies on opening up race windows that make limited sense for the QEMU implementation). It's all centered on some details of exactly what causes an interrupt on a DOE. Section 6.xx.3 Interrupt Generation states: If enabled, an interrupt message must be triggered every time the logical AND of the following conditions transitions from FALSE to TRUE: * The associated vector is unmasked ... * The value of the DOE interrupt enable bit is 1b * The value of the DOE interrupt status bit is 1b (only last one really maters to us I think). The interrupt status bit is an OR conditional. Must be set.. Data Object Read bit or DOE error bit set or DOE busy bit cleared. > +{ > + struct pci_doe *doe = data; > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { So this bit is set on any of: BUSY dropped, READY or ERROR. If it's set on BUSY drop, but then in between the read above and this clear READY becomes true, then my reading is that we will not get another interrupt. That is fine because we will read it again in the state machine and see the new state. We could do more of the dance in the interrupt controller by doing a reread after clear of INT_STATUS but I think it's cleaner to leave it in the state machine. It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as all the rest of the register is RO. 
> + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + /* Leave the error case to be handled outside IRQ */ > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { I don't think we can get here because int status already true. So should do this before the above general check to avoid clearning the interrupt (we don't want more interrupts during the abort though I'd hope the hardware wouldn't generate them). So move this before the previous check. > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + > + /* > + * Busy being cleared can result in an interrupt, but as > + * the original Busy may not have been detected, there is no > + * way to separate such an interrupt from a spurious interrupt. > + */ This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above (that was a misread of the spec from me in v4). So I don't think we can get here in any valid path. return IRQ_NONE; should be safe. > + return IRQ_HANDLED; > +} Summary of above suggested changes: 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one 2) Possibly uses pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); to be explicit on the write one to clear bit. 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. ... > + > +static void pci_doe_task_complete(void *private) > +{ > + complete(private); > +} I wonder why this is up here? I'd move it down to just above the _sync() function where it's used. This one was definitely one of mine :) > + > +static void doe_statemachine_work(struct work_struct *work) I developed an interesting "relationship" with this state machine during the original development ;) I've just walked the paths and convinced myself it works so all good. > +{ > + struct delayed_work *w = to_delayed_work(work); > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + struct pci_doe_task *task; > + bool abort; > + u32 val; > + int rc; > + > + mutex_lock(&doe->state_lock); > + task = doe->cur_task; > + abort = doe->abort; > + doe->abort = false; > + mutex_unlock(&doe->state_lock); > + > + if (abort) { > + /* > + * Currently only used during init - care needed if > + * pci_doe_abort() is generally exposed as it would impact > + * queries in flight. > + */ > + WARN_ON(task); > + doe->state = DOE_WAIT_ABORT; > + pci_doe_abort_start(doe); > + return; > + } > + > + switch (doe->state) { > + case DOE_IDLE: > + if (task == NULL) > + return; > + > + /* Nothing currently in flight so queue a task */ > + rc = pci_doe_send_req(doe, task->ex); > + /* > + * The specification does not provide any guidance on how long > + * some other entity could keep the DOE busy, so try for 1 > + * second then fail. Busy handling is best effort only, because > + * there is no way of avoiding racing against another user of > + * the DOE. 
> + */ > + if (rc == -EBUSY) { > + doe->busy_retries++; > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > + /* Long enough, fail this request */ > + pci_WARN(pdev, true, "DOE busy for too long\n"); > + doe->busy_retries = 0; > + goto err_busy; > + } > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > + return; > + } > + if (rc) > + goto err_abort; > + doe->busy_retries = 0; > + > + doe->state = DOE_WAIT_RESP; > + doe->timeout_jiffies = jiffies + HZ; > + /* Now poll or wait for IRQ with timeout */ > + if (doe->irq > 0) > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > + else > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + > + case DOE_WAIT_RESP: > + /* Not possible to get here with NULL task */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + rc = -EIO; > + goto err_abort; > + } > + > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > + /* If not yet at timeout reschedule otherwise abort */ > + if (time_after(jiffies, doe->timeout_jiffies)) { > + rc = -ETIMEDOUT; > + goto err_abort; > + } > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + } > + > + rc = pci_doe_recv_resp(doe, task->ex); > + if (rc < 0) > + goto err_abort; > + > + doe->state = DOE_IDLE; > + > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* Set the return value to the length of received payload */ > + task->rv = rc; > + task->cb(task->private); > + > + return; > + > + case DOE_WAIT_ABORT: > + case DOE_WAIT_ABORT_ON_ERR: > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > + /* Back to normal state - carry on */ > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* > + * For deliberately triggered abort, someone is > + * waiting. > + */ > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + > + doe->state = DOE_IDLE; > + return; > + } > + if (time_after(jiffies, doe->timeout_jiffies)) { > + /* Task has timed out and is dead - abort */ > + pci_err(pdev, "DOE ABORT timed out\n"); > + mutex_lock(&doe->state_lock); > + doe->dead = true; > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + } > + return; > + } > + > +err_abort: > + doe->state = DOE_WAIT_ABORT_ON_ERR; > + pci_doe_abort_start(doe); > +err_busy: > + task->rv = rc; > + task->cb(task->private); > + /* If here via err_busy, signal the task done. */ > + if (doe->state == DOE_IDLE) { > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + } > +} > + > +/** > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > + * @doe: DOE mailbox state structure > + * @ex: Description of the buffers and Vendor ID + type used in this > + * request/response pair > + * > + * Excess data will be discarded. 
> + * > + * RETURNS: payload in bytes on success, < 0 on error > + */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + struct pci_doe_task task; > + DECLARE_COMPLETION_ONSTACK(c); > + > + if (!doe) > + return -EAGAIN; > + > + /* DOE requests must be a whole number of DW */ > + if (ex->request_pl_sz % sizeof(u32)) > + return -EINVAL; > + > + task.ex = ex; > + task.cb = pci_doe_task_complete; > + task.private = &c; > + > +again: Hmm. Whether having this code at this layer makes sense hinges on whether we want to easily support async use of the DOE in future. In v4 some of the async handling had ended up in this function and should probably have been factored out to give us a 'queue up work' then 'wait for completion' sequence. Given there is now more to be done in here perhaps we need to think about such a separation to keep it clear that this is fundamentally a synchronous wrapper around an asynchronous operation. > + mutex_lock(&doe->state_lock); > + if (doe->cur_task) { > + mutex_unlock(&doe->state_lock); > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > + goto again; > + } > + > + if (doe->dead) { > + mutex_unlock(&doe->state_lock); > + return -EIO; > + } > + doe->cur_task = &task; > + schedule_delayed_work(&doe->statemachine, 0); > + mutex_unlock(&doe->state_lock); > + > + wait_for_completion(&c); > + > + return task.rv; > +} > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > + > +/** > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > + * @pdev: Device on which to find the DOE instance > + * @vid: Protocol Vendor ID > + * @type: protocol type > + * > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > + * exchange through that DOE mailbox. > + * > + * RETURNS: True if the DOE device supports the protocol specified > + */ > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + int i; > + > + if (!doe) > + return false; How would this happen? I don't think it can... Probably false paranoia from me... > + > + for (i = 0; i < doe->num_prots; i++) > + if ((doe->prots[i].vid == vid) && > + (doe->prots[i].type == type)) > + return true; > + > + return false; > +} > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); ... > +static void pci_doe_release_irq(struct pci_doe *doe) > +{ > + if (doe->irq > 0) > + free_irq(doe->irq, doe); Is this trivial wrapper worth bothering with? Maybe just put the code inline? > +} > + ... > + > +static void pci_doe_unregister(struct pci_doe *doe) > +{ > + pci_doe_release_irq(doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); This makes me wonder if we should be doing the get_device() earlier in probe? Limited harm in moving it to near the start and then ending up with it being 'obviously' correct... 
> +} > + > +/* > + * pci_doe_probe() - Set up the Mailbox > + * @aux_dev: Auxiliary Device > + * @id: Auxiliary device ID > + * > + * Probe the mailbox found for all protocols and set up the Mailbox > + * > + * RETURNS: 0 on success, < 0 on error > + */ > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > + const struct auxiliary_device_id *id) > +{ > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > + struct pci_doe_dev, > + adev); > + struct pci_doe *doe; > + int rc; > + > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); Could go devm_ for this I think, though may not be worthwhile. > + if (!doe) > + return -ENOMEM; > + > + mutex_init(&doe->state_lock); > + init_completion(&doe->abort_c); > + doe->doe_dev = doe_dev; > + init_waitqueue_head(&doe->wq); > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > + dev_set_drvdata(&aux_dev->dev, doe); > + > + rc = pci_doe_register(doe); > + if (rc) > + goto err_free; > + > + rc = pci_doe_cache_protocols(doe); > + if (rc) { > + pci_doe_unregister(doe); Mixture of different forms of error handling here. I'd move this below and add an err_unregister label. > + goto err_free; > + } > + > + return 0; > + > +err_free: > + kfree(doe); > + return rc; > +} > + > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > +{ > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > + > + /* First halt the state machine */ > + cancel_delayed_work_sync(&doe->statemachine); > + kfree(doe->prots); Logical flow to me is unregister first, free protocols second (to reverse what we do in probe) > + pci_doe_unregister(doe); > + kfree(doe); > +} > + > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > + {.name = "cxl_pci.doe", }, I'd like to hear from Bjorn on whether registering this from the CXL device is the right approach or if we should perhaps just do it directly from somewhere in PCI. (really applies to patch 3) I'll talk more about this there. > + {}, > +}; > + > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > + > +struct auxiliary_driver pci_doe_auxiliary_drv = { > + .name = "pci_doe_drv", I would assume this is only used in contexts where the _drv is obvious? I would go with "pci_doe". > + .id_table = pci_doe_auxiliary_id_table, > + .probe = pci_doe_probe, > + .remove = pci_doe_remove > +}; > + > +static int __init pci_doe_init_module(void) > +{ > + int ret; > + > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > + if (ret) { > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > + ret); > + return ret; > + } > + > + return 0; > +} > + > +static void __exit pci_doe_exit_module(void) > +{ > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > +} > + > +module_init(pci_doe_init_module); > +module_exit(pci_doe_exit_module); Seems like the auxiliary bus would benefit from a module_auxiliary_driver() macro to cover this simple registration stuff similar to module_i2c_driver() etc. Mind you, looking at 5.15 this would be the only user, so maybe one for the 'next' case on basis two instances proves it's 'common' ;) > +MODULE_LICENSE("GPL v2"); > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > new file mode 100644 > index 000000000000..8380b7ad33d4 > --- /dev/null > +++ b/include/linux/pci-doe.h > @@ -0,0 +1,63 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. 
> + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/completion.h> > +#include <linux/list.h> > +#include <linux/mutex.h> Not used in this header that I can see, so push down to the c files. > +#include <linux/auxiliary_bus.h> > + > +#ifndef LINUX_PCI_DOE_H > +#define LINUX_PCI_DOE_H > + > +#define DOE_DEV_NAME "doe" Not sure this is used?
On Mon, Nov 08, 2021 at 12:15:46PM +0000, Jonathan Cameron wrote: > On Fri, 5 Nov 2021 16:50:53 -0700 > <ira.weiny@intel.com> wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > with standard protocol discovery. Each mailbox is accessed through a > > DOE Extended Capability. > > > > Define an auxiliary device driver which control DOE auxiliary devices > > registered on the auxiliary bus. > > > > A DOE mailbox is allowed to support any number of protocols while some > > DOE protocol specifications apply additional restrictions. > > > > The protocols supported are queried and cached. pci_doe_supports_prot() > > can be used to determine if the DOE device supports the protocol > > specified. > > > > A synchronous interface is provided in pci_doe_exchange_sync() to > > perform a single query / response exchange from the driver through the > > device specified. > > > > Testing was conducted against QEMU using: > > > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > > > This code is based on Jonathan's V4 series here: > > > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > > Data Object Exchange (DOE) - Approved 12 March 2020 > > > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Hi Ira, > > Thanks for taking this on! NP I'm just sorry I'm so slow to get it moving. > > I'm sure at least half the comments below are about things I wrote > then forgot about. I'm not sure if it's a good thing but I've ignored > this for long enough I'm almost reviewing it as fresh code :( > > I was carrying a local patch for the interrupt handler having > figured out I'd missread the spec. Note that I've since concluded > my local patch has it's own issues (it was unnecessarily complex) > so I've made some suggestions below that I'm fairly sure > fix things up. Note these paths are hard to test and require adding > some fiddly state machines to QEMU to open up race windows... > > > > > --- > > Changes from Jonathan's V4 > > Move the DOE MB code into the DOE auxiliary driver > > Remove Task List in favor of a wait queue > > > > Changes from Ben > > remove CXL references > > propagate rc from pci functions on error > > ... 
> > > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > > new file mode 100644 > > index 000000000000..2e702fdc7879 > > --- /dev/null > > +++ b/drivers/pci/doe.c > > @@ -0,0 +1,701 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Data Object Exchange ECN > > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > > + * > > + * Copyright (C) 2021 Huawei > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > + */ > > + > > +#include <linux/bitfield.h> > > +#include <linux/delay.h> > > +#include <linux/jiffies.h> > > +#include <linux/list.h> > > +#include <linux/mutex.h> > > +#include <linux/pci.h> > > +#include <linux/pci-doe.h> > > +#include <linux/workqueue.h> > > +#include <linux/module.h> > > + > > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > > + > > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > > + > > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > > +#define PCI_DOE_TIMEOUT HZ > > + > > +enum pci_doe_state { > > + DOE_IDLE, > > + DOE_WAIT_RESP, > > + DOE_WAIT_ABORT, > > + DOE_WAIT_ABORT_ON_ERR, > > +}; > > + > > +/* > > /** > > Given it's in kernel-doc syntax, we might as well mark it as such. Yep done. > > > + * struct pci_doe_task - description of a query / response task > > + * @ex: The details of the task to be done > > + * @rv: Return value. Length of received response or error > > + * @cb: Callback for completion of task > > + * @private: Private data passed to callback on completion > > + */ > > +struct pci_doe_task { > > + struct pci_doe_exchange *ex; > > + int rv; > > + void (*cb)(void *private); > > + void *private; > > +}; > > + > > +/** > > + * struct pci_doe - A single DOE mailbox driver > > + * > > + * @doe_dev: The DOE Auxiliary device being driven > > + * @abort_c: Completion used for initial abort handling > > + * @irq: Interrupt used for signaling DOE ready or abort > > + * @irq_name: Name used to identify the irq for a particular DOE > > + * @prots: Array of identifiers for protocols supported > > + * @num_prots: Size of prots array > > + * @cur_task: Current task the state machine is working on > > + * @wq: Wait queue to wait on if a query is in progress > > + * @state_lock: Protect the state of cur_task, abort, and dead > > + * @statemachine: Work item for the DOE state machine > > + * @state: Current state of this DOE > > + * @timeout_jiffies: 1 second after GO set > > + * @busy_retries: Count of retry attempts > > + * @abort: Request a manual abort (e.g. on init) > > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages > > + * will immediately be aborted with error > > + */ > > +struct pci_doe { > > + struct pci_doe_dev *doe_dev; > > + struct completion abort_c; > > + int irq; > > + char *irq_name; > > + struct pci_doe_protocol *prots; > > + int num_prots; > > + > > + struct pci_doe_task *cur_task; > > + wait_queue_head_t wq; > > + struct mutex state_lock; > > + struct delayed_work statemachine; > > + enum pci_doe_state state; > > + unsigned long timeout_jiffies; > > + unsigned int busy_retries; > > + unsigned int abort:1; > > + unsigned int dead:1; > > +}; > > + > > +static irqreturn_t pci_doe_irq(int irq, void *data) > > I was carrying a rework of this locally because I managed > to convince myself this is wrong. It's been a while and naturally > I didn't write a comprehensive set of notes on why it was wrong... 
> (Note you can't trigger the problem paths in QEMU without some > nasty hacks as it relies on opening up race windows that make > limited sense for the QEMU implementation). > > It's all centered on some details of exactly what causes an interrupt > on a DOE. Section 6.xx.3 Interrupt Generation states: > > If enabled, an interrupt message must be triggered every time the > logical AND of the following conditions transitions from FALSE to TRUE: > > * The associated vector is unmasked ... > * The value of the DOE interrupt enable bit is 1b > * The value of the DOE interrupt status bit is 1b > (only last one really maters to us I think). > > The interrupt status bit is an OR conditional. > > Must be set.. Data Object Read bit or DOE error bit set or DOE busy bit cleared. > > > +{ > > + struct pci_doe *doe = data; > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + u32 val; > > + > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > > So this bit is set on any of: BUSY dropped, READY or ERROR. > If it's set on BUSY drop, but then in between the read above and this clear > READY becomes true, then my reading is that we will not get another interrupt. > That is fine because we will read it again in the state machine and see the > new state. We could do more of the dance in the interrupt controller by doing > a reread after clear of INT_STATUS but I think it's cleaner to leave > it in the state machine. > > It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as > all the rest of the register is RO. > > > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + /* Leave the error case to be handled outside IRQ */ > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > I don't think we can get here because int status already true. > So should do this before the above general check to avoid clearning > the interrupt (we don't want more interrupts during the abort though > I'd hope the hardware wouldn't generate them). > > So move this before the previous check. > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + > > + /* > > + * Busy being cleared can result in an interrupt, but as > > + * the original Busy may not have been detected, there is no > > + * way to separate such an interrupt from a spurious interrupt. > > + */ > > This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above > (that was a misread of the spec from me in v4). > So I don't think we can get here in any valid path. > > return IRQ_NONE; should be safe. > > > > + return IRQ_HANDLED; > > +} > > Summary of above suggested changes: > 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one > 2) Possibly uses > pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); > to be explicit on the write one to clear bit. > 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. > Done. But just to ensure that I understand. If STATUS_ERROR is indicated we are basically not clearing the irq because we are resetting the mailbox? Because with this new code I don't see a pci_write_config_dword to clear INT_STATUS. But if we are resetting the mailbox I think that is ok. > ... 
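Just to be sure we are looking at the same thing, here is roughly where the handler ends up after folding in 1-3 above (untested sketch, same register and field names as the original handler):

static irqreturn_t pci_doe_irq(int irq, void *data)
{
	struct pci_doe *doe = data;
	struct pci_dev *pdev = doe->doe_dev->pdev;
	int offset = doe->doe_dev->cap_offset;
	u32 val;

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);

	/* Error is handled outside the IRQ; leave INT_STATUS set for the abort */
	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
		/* Explicit RW1C of only the interrupt status bit */
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	return IRQ_NONE;
}

With the error case ordered first, INT_STATUS is deliberately left set on that path and everything, including the abort, is deferred to the state machine, which is what my question above is getting at.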
> > > + > > +static void pci_doe_task_complete(void *private) > > +{ > > + complete(private); > > +} > > I wonder why this is up here? I'd move it down to just above the _sync() > function where it's used. This one was definitely one of mine :) Done. > > > + > > +static void doe_statemachine_work(struct work_struct *work) > > I developed an interesting "relationship" with this state machine during > the original development ;) I've just walked the paths and convinced > myself it works so all good. Sweet! :-D > > > +{ > > + struct delayed_work *w = to_delayed_work(work); > > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + struct pci_doe_task *task; > > + bool abort; > > + u32 val; > > + int rc; > > + > > + mutex_lock(&doe->state_lock); > > + task = doe->cur_task; > > + abort = doe->abort; > > + doe->abort = false; > > + mutex_unlock(&doe->state_lock); > > + > > + if (abort) { > > + /* > > + * Currently only used during init - care needed if > > + * pci_doe_abort() is generally exposed as it would impact > > + * queries in flight. > > + */ > > + WARN_ON(task); > > + doe->state = DOE_WAIT_ABORT; > > + pci_doe_abort_start(doe); > > + return; > > + } > > + > > + switch (doe->state) { > > + case DOE_IDLE: > > + if (task == NULL) > > + return; > > + > > + /* Nothing currently in flight so queue a task */ > > + rc = pci_doe_send_req(doe, task->ex); > > + /* > > + * The specification does not provide any guidance on how long > > + * some other entity could keep the DOE busy, so try for 1 > > + * second then fail. Busy handling is best effort only, because > > + * there is no way of avoiding racing against another user of > > + * the DOE. 
> > + */ > > + if (rc == -EBUSY) { > > + doe->busy_retries++; > > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > + /* Long enough, fail this request */ > > + pci_WARN(pdev, true, "DOE busy for too long\n"); > > + doe->busy_retries = 0; > > + goto err_busy; > > + } > > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > > + return; > > + } > > + if (rc) > > + goto err_abort; > > + doe->busy_retries = 0; > > + > > + doe->state = DOE_WAIT_RESP; > > + doe->timeout_jiffies = jiffies + HZ; > > + /* Now poll or wait for IRQ with timeout */ > > + if (doe->irq > 0) > > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > > + else > > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > > + return; > > + > > + case DOE_WAIT_RESP: > > + /* Not possible to get here with NULL task */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > + rc = -EIO; > > + goto err_abort; > > + } > > + > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > + /* If not yet at timeout reschedule otherwise abort */ > > + if (time_after(jiffies, doe->timeout_jiffies)) { > > + rc = -ETIMEDOUT; > > + goto err_abort; > > + } > > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > > + return; > > + } > > + > > + rc = pci_doe_recv_resp(doe, task->ex); > > + if (rc < 0) > > + goto err_abort; > > + > > + doe->state = DOE_IDLE; > > + > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + /* Set the return value to the length of received payload */ > > + task->rv = rc; > > + task->cb(task->private); > > + > > + return; > > + > > + case DOE_WAIT_ABORT: > > + case DOE_WAIT_ABORT_ON_ERR: > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + > > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > > + /* Back to normal state - carry on */ > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + /* > > + * For deliberately triggered abort, someone is > > + * waiting. > > + */ > > + if (doe->state == DOE_WAIT_ABORT) > > + complete(&doe->abort_c); > > + > > + doe->state = DOE_IDLE; > > + return; > > + } > > + if (time_after(jiffies, doe->timeout_jiffies)) { > > + /* Task has timed out and is dead - abort */ > > + pci_err(pdev, "DOE ABORT timed out\n"); > > + mutex_lock(&doe->state_lock); > > + doe->dead = true; > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + > > + if (doe->state == DOE_WAIT_ABORT) > > + complete(&doe->abort_c); > > + } > > + return; > > + } > > + > > +err_abort: > > + doe->state = DOE_WAIT_ABORT_ON_ERR; > > + pci_doe_abort_start(doe); > > +err_busy: > > + task->rv = rc; > > + task->cb(task->private); > > + /* If here via err_busy, signal the task done. */ > > + if (doe->state == DOE_IDLE) { > > + mutex_lock(&doe->state_lock); > > + doe->cur_task = NULL; > > + mutex_unlock(&doe->state_lock); > > + wake_up_interruptible(&doe->wq); > > + } > > +} > > + > > +/** > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > > + * @doe: DOE mailbox state structure > > + * @ex: Description of the buffers and Vendor ID + type used in this > > + * request/response pair > > + * > > + * Excess data will be discarded. 
> > + * > > + * RETURNS: payload in bytes on success, < 0 on error > > + */ > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > + struct pci_doe_task task; > > + DECLARE_COMPLETION_ONSTACK(c); > > + > > + if (!doe) > > + return -EAGAIN; > > + > > + /* DOE requests must be a whole number of DW */ > > + if (ex->request_pl_sz % sizeof(u32)) > > + return -EINVAL; > > + > > + task.ex = ex; > > + task.cb = pci_doe_task_complete; > > + task.private = &c; > > + > > +again: > > Hmm. Whether having this code at this layer makes sense hinges on > whether we want to easily support async use of the DOE in future. I struggled with this. I was trying to strike a balance with making this a synchronous call with only 1 outstanding task while leaving the statemachine alone. FWIW I think the queue you had was just fine even though there was only this synchronous call. > > In v4 some of the async handling had ended up in this function and > should probably have been factored out to give us a > 'queue up work' then 'wait for completion' sequence. > > Given there is now more to be done in here perhaps we need to think > about such a separation to keep it clear that this is fundamentally > a synchronous wrapper around an asynchronous operation. I think that would be moving back in a direction of having a queue like you defined in V4. Eliminating the queue really defined this function to sleep waiting for the state machine to be available. Doing anything more would have messed with the state machine you wrote and I did not want to do that. Dan should we move back to having a queue_task/wait_task like Jonathan had before? > > > + mutex_lock(&doe->state_lock); > > + if (doe->cur_task) { > > + mutex_unlock(&doe->state_lock); > > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > > + goto again; > > + } > > + > > + if (doe->dead) { > > + mutex_unlock(&doe->state_lock); > > + return -EIO; > > + } > > + doe->cur_task = &task; > > + schedule_delayed_work(&doe->statemachine, 0); > > + mutex_unlock(&doe->state_lock); > > + > > + wait_for_completion(&c); > > + > > + return task.rv; > > +} > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > > + > > +/** > > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > > + * @pdev: Device on which to find the DOE instance > > + * @vid: Protocol Vendor ID > > + * @type: protocol type > > + * > > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > > + * exchange through that DOE mailbox. > > + * > > + * RETURNS: True if the DOE device supports the protocol specified > > + */ > > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > + int i; > > + > > + if (!doe) > > + return false; > > How would this happen? I don't think it can... Probably > false paranoia from me... The driver may not be loaded at this point. The call operates on the aux device not the driver. Without a driver loaded I don't think we should return any protocol support. Even if the driver was loaded and there were some protocols previously supported. > > > + > > + for (i = 0; i < doe->num_prots; i++) > > + if ((doe->prots[i].vid == vid) && > > + (doe->prots[i].type == type)) > > + return true; > > + > > + return false; > > +} > > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); > > ... 
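For what it's worth, the calling pattern I have in mind for users of these two exports is roughly the following (the vendor ID, type and buffer sizes are made up here, purely to show how pci_doe_supports_prot() and pci_doe_exchange_sync() pair up):

	u32 req[2] = { 0 };
	u32 resp[32];
	struct pci_doe_exchange ex = {
		.prot.vid = 0x1234,	/* hypothetical vendor ID */
		.prot.type = 2,		/* hypothetical protocol type */
		.request_pl = req,
		.request_pl_sz = sizeof(req),
		.response_pl = resp,
		.response_pl_sz = sizeof(resp),
	};
	int rc;

	if (!pci_doe_supports_prot(doe_dev, ex.prot.vid, ex.prot.type))
		return -EOPNOTSUPP;

	rc = pci_doe_exchange_sync(doe_dev, &ex);
	if (rc < 0)
		return rc;
	/* On success rc is the length of the received payload in bytes */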
> > > +static void pci_doe_release_irq(struct pci_doe *doe) > > +{ > > + if (doe->irq > 0) > > + free_irq(doe->irq, doe); > > Is this trivial wrapper worth bothering with? Maybe just > put the code inline? Personally I like it this way because it is called in 2 places. > > > +} > > + > > ... > > > + > > +static void pci_doe_unregister(struct pci_doe *doe) > > +{ > > + pci_doe_release_irq(doe); > > + kfree(doe->irq_name); > > + put_device(&doe->doe_dev->pdev->dev); > > This makes me wonder if we should be doing the get_device() > earlier in probe? Limited harm in moving it to near the start > and then ending up with it being 'obviously' correct... Well... get_device() is in pci_doe_register... And it does it's own irq unwinding. I guess we could call pci_doe_unregister() from that if we refactored this... How about this? (Diff to this code) diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c index 76acf4063b6b..6f2a419b3c93 100644 --- a/drivers/pci/doe.c +++ b/drivers/pci/doe.c @@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe) return 0; } -static void pci_doe_release_irq(struct pci_doe *doe) +static void pci_doe_unregister(struct pci_doe *doe) { if (doe->irq > 0) free_irq(doe->irq, doe); + kfree(doe->irq_name); + put_device(&doe->doe_dev->pdev->dev); } static int pci_doe_register(struct pci_doe *doe) @@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe) int rc, irq; u32 val; + /* Ensure the pci device remains until this driver is done with it */ + get_device(&pdev->dev); + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); - if (irq < 0) - return irq; + if (irq < 0) { + rc = irq; + goto unregister; + } doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", doe->doe_dev->adev.name); - if (!doe->irq_name) - return -ENOMEM; + if (!doe->irq_name) { + rc = -ENOMEM; + goto unregister; + } rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); if (rc) - goto err_free_name; + goto unregister; doe->irq = irq; pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, @@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe) /* Reset the mailbox by issuing an abort */ rc = pci_doe_abort(doe); if (rc) - goto err_free_irq; - - /* Ensure the pci device remains until this driver is done with it */ - get_device(&pdev->dev); + goto unregister; return 0; -err_free_irq: - pci_doe_release_irq(doe); -err_free_name: - kfree(doe->irq_name); +unregister: + pci_doe_unregister(doe); return rc; } -static void pci_doe_unregister(struct pci_doe *doe) -{ - pci_doe_release_irq(doe); - kfree(doe->irq_name); - put_device(&doe->doe_dev->pdev->dev); -} - /* * pci_doe_probe() - Set up the Mailbox * @aux_dev: Auxiliary Device > > > +} > > + > > +/* > > + * pci_doe_probe() - Set up the Mailbox > > + * @aux_dev: Auxiliary Device > > + * @id: Auxiliary device ID > > + * > > + * Probe the mailbox found for all protocols and set up the Mailbox > > + * > > + * RETURNS: 0 on success, < 0 on error > > + */ > > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > > + const struct auxiliary_device_id *id) > > +{ > > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > > + struct pci_doe_dev, > > + adev); > > + struct pci_doe *doe; > > + int rc; > > + > > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > > Could go devm_ for this I think, though may not be worthwhile. Yes I think it is worth it... I should use it more. BTW why did you not use devm_krealloc() for the protocols? 
I did not realize that call existed before you mentioned it in the other patch review. Any issue with using it there? > > > + if (!doe) > > + return -ENOMEM; > > + > > + mutex_init(&doe->state_lock); > > + init_completion(&doe->abort_c); > > + doe->doe_dev = doe_dev; > > + init_waitqueue_head(&doe->wq); > > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > > + dev_set_drvdata(&aux_dev->dev, doe); > > + > > + rc = pci_doe_register(doe); > > + if (rc) > > + goto err_free; > > + > > + rc = pci_doe_cache_protocols(doe); > > + if (rc) { > > + pci_doe_unregister(doe); > > Mixture of different forms of error handling here. > I'd move this below and add an err_unregister label. Actually with the devm_kzalloc() we don't need the goto at all. We can just return. I _think_? Right? > > > + goto err_free; > > + } > > + > > + return 0; > > + > > +err_free: > > + kfree(doe); > > + return rc; > > +} > > + > > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > > +{ > > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > > + > > + /* First halt the state machine */ > > + cancel_delayed_work_sync(&doe->statemachine); > > + kfree(doe->prots); > > Logical flow to me is unregister first, free protocols second > (to reverse what we do in probe) No this is the reverse of the probe order I think. Order is register cache protocols Then we free 'uncache' protocols unregister Right? > > > + pci_doe_unregister(doe); > > + kfree(doe); > > +} > > + > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > > + {.name = "cxl_pci.doe", }, > > I'd like to hear from Bjorn on whether registering this from the CXL > device is the right approach or if we should perhaps just do it directly from > somewhere in PCI. (really applies to patch 3) I'll talk more about this there. Actually I think this could be left blank until the next patch... It's just odd to define an empty table in the next few structures. But technically this is not needed until the devices are defined. I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement though. > > > + {}, > > +}; > > + > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > > + > > +struct auxiliary_driver pci_doe_auxiliary_drv = { > > + .name = "pci_doe_drv", > > I would assume this is only used in contexts where the _drv is > obvious? I would go with "pci_doe". Sure. done. > > > + .id_table = pci_doe_auxiliary_id_table, > > + .probe = pci_doe_probe, > > + .remove = pci_doe_remove > > +}; > > + > > +static int __init pci_doe_init_module(void) > > +{ > > + int ret; > > + > > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > > + if (ret) { > > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > > + ret); > > + return ret; > > + } > > + > > + return 0; > > +} > > + > > +static void __exit pci_doe_exit_module(void) > > +{ > > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > > +} > > + > > +module_init(pci_doe_init_module); > > +module_exit(pci_doe_exit_module); > > Seems like the auxiliary bus would benefit from a > module_auxiliary_driver() macro to cover this simple registration stuff > similar to module_i2c_driver() etc. > > Mind you, looking at 5.15 this would be the only user, so maybe one > for the 'next' case on basis two instances proves it's 'common' ;) I'm inclined to leave this alone ATM. I tried to clean up the auxiliary device documentation and got a bunch more work asked of me by Greg KH. So I'm behind on that ATM. Later we can investigate that a bit I think. 
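If a second user does show up, I'd expect the helper to mirror the other module_*_driver() wrappers, i.e. something along these lines (sketch only, no such macro exists in 5.15):

/* In auxiliary_bus.h, hypothetically */
#define module_auxiliary_driver(__auxiliary_driver) \
	module_driver(__auxiliary_driver, auxiliary_driver_register, \
		      auxiliary_driver_unregister)

/* which would collapse the init/exit boilerplate above into */
module_auxiliary_driver(pci_doe_auxiliary_drv);

That would drop the pr_err() on registration failure, which I think is acceptable for a plain registration call.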
> > > +MODULE_LICENSE("GPL v2"); > > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > > new file mode 100644 > > index 000000000000..8380b7ad33d4 > > --- /dev/null > > +++ b/include/linux/pci-doe.h > > @@ -0,0 +1,63 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. > > + * > > + * Copyright (C) 2021 Huawei > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > + */ > > + > > +#include <linux/completion.h> > > +#include <linux/list.h> > > +#include <linux/mutex.h> > > Not used in this header that I can see, so push down to the c files. oops... thanks. > > > +#include <linux/auxiliary_bus.h> > > + > > +#ifndef LINUX_PCI_DOE_H > > +#define LINUX_PCI_DOE_H > > + > > +#define DOE_DEV_NAME "doe" > > Not sure this is used? Used in the next patch... and it kind of goes along with the table_id name... I'll see about moving both of those to the next patch where it makes more sense for now. Thanks for the review, Ira
On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > with standard protocol discovery. Each mailbox is accessed through a > DOE Extended Capability. > > Define an auxiliary device driver which control DOE auxiliary devices > registered on the auxiliary bus. What do we gain by making this an auxiliary driver? This doesn't really feel like a "driver," and apparently it used to be a library. I'd like to see the rationale and benefits of the driver approach (in the eventual commit log as well as the current email thread). > A DOE mailbox is allowed to support any number of protocols while some > DOE protocol specifications apply additional restrictions. This sounds something like a fancy version of VPD, and VPD has been a huge headache. I hope DOE avoids that ;) > The protocols supported are queried and cached. pci_doe_supports_prot() > can be used to determine if the DOE device supports the protocol > specified. > > A synchronous interface is provided in pci_doe_exchange_sync() to > perform a single query / response exchange from the driver through the > device specified. > > Testing was conducted against QEMU using: > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@avery-design.com/ > > This code is based on Jonathan's V4 series here: > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/ > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > Data Object Exchange (DOE) - Approved 12 March 2020 > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> I think these sign-offs are out of order, per https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=v5.14#n365 Last sign-off should be from the person who posted this. > --- > Changes from Jonathan's V4 > Move the DOE MB code into the DOE auxiliary driver > Remove Task List in favor of a wait queue > > Changes from Ben > remove CXL references > propagate rc from pci functions on error > --- > drivers/pci/Kconfig | 10 + > drivers/pci/Makefile | 3 + > drivers/pci/doe.c | 701 ++++++++++++++++++++++++++++++++++ > include/linux/pci-doe.h | 63 +++ > include/uapi/linux/pci_regs.h | 29 +- > 5 files changed, 805 insertions(+), 1 deletion(-) > create mode 100644 drivers/pci/doe.c > create mode 100644 include/linux/pci-doe.h > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > index 0c473d75e625..b512295538ba 100644 > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND > The PCI device frontend driver allows the kernel to import arbitrary > PCI devices from a PCI backend to support PCI driver domains. > > +config PCI_DOE_DRIVER > + tristate "PCI Data Object Exchange (DOE) driver" > + select AUXILIARY_BUS > + help > + Driver for DOE auxiliary devices. > + > + DOE provides a simple mailbox in PCI config space that is used by a > + number of different protocols. DOE is defined in the Data Object > + Exchange ECN to the PCIe r5.0 spec. 
> + > config PCI_ATS > bool > > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile > index d62c4ac4ae1b..afd9d7bd2b82 100644 > --- a/drivers/pci/Makefile > +++ b/drivers/pci/Makefile > @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB) += pci-stub.o > obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o > obj-$(CONFIG_PCI_ECAM) += ecam.o > obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o > +obj-$(CONFIG_PCI_DOE_DRIVER) += pci-doe.o > obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o > > +pci-doe-y := doe.o > + > # Endpoint library must be initialized before its users > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > new file mode 100644 > index 000000000000..2e702fdc7879 > --- /dev/null > +++ b/drivers/pci/doe.c > @@ -0,0 +1,701 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Data Object Exchange ECN > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/bitfield.h> > +#include <linux/delay.h> > +#include <linux/jiffies.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/pci.h> > +#include <linux/pci-doe.h> > +#include <linux/workqueue.h> > +#include <linux/module.h> > + > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > + > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > + > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > +#define PCI_DOE_TIMEOUT HZ > + > +enum pci_doe_state { > + DOE_IDLE, > + DOE_WAIT_RESP, > + DOE_WAIT_ABORT, > + DOE_WAIT_ABORT_ON_ERR, > +}; > + > +/* > + * struct pci_doe_task - description of a query / response task > + * @ex: The details of the task to be done > + * @rv: Return value. Length of received response or error > + * @cb: Callback for completion of task > + * @private: Private data passed to callback on completion > + */ > +struct pci_doe_task { > + struct pci_doe_exchange *ex; > + int rv; > + void (*cb)(void *private); > + void *private; > +}; > + > +/** > + * struct pci_doe - A single DOE mailbox driver > + * > + * @doe_dev: The DOE Auxiliary device being driven > + * @abort_c: Completion used for initial abort handling > + * @irq: Interrupt used for signaling DOE ready or abort > + * @irq_name: Name used to identify the irq for a particular DOE > + * @prots: Array of identifiers for protocols supported > + * @num_prots: Size of prots array > + * @cur_task: Current task the state machine is working on > + * @wq: Wait queue to wait on if a query is in progress > + * @state_lock: Protect the state of cur_task, abort, and dead > + * @statemachine: Work item for the DOE state machine > + * @state: Current state of this DOE > + * @timeout_jiffies: 1 second after GO set > + * @busy_retries: Count of retry attempts > + * @abort: Request a manual abort (e.g. on init) > + * @dead: Used to mark a DOE for which an ABORT has timed out. 
Further messages > + * will immediately be aborted with error > + */ > +struct pci_doe { > + struct pci_doe_dev *doe_dev; > + struct completion abort_c; > + int irq; > + char *irq_name; > + struct pci_doe_protocol *prots; > + int num_prots; > + > + struct pci_doe_task *cur_task; > + wait_queue_head_t wq; > + struct mutex state_lock; > + struct delayed_work statemachine; > + enum pci_doe_state state; > + unsigned long timeout_jiffies; > + unsigned int busy_retries; > + unsigned int abort:1; > + unsigned int dead:1; > +}; > + > +static irqreturn_t pci_doe_irq(int irq, void *data) > +{ > + struct pci_doe *doe = data; > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + /* Leave the error case to be handled outside IRQ */ > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + mod_delayed_work(system_wq, &doe->statemachine, 0); > + return IRQ_HANDLED; > + } > + > + /* > + * Busy being cleared can result in an interrupt, but as > + * the original Busy may not have been detected, there is no > + * way to separate such an interrupt from a spurious interrupt. > + */ > + return IRQ_HANDLED; > +} > + > +/* > + * Only call when safe to directly access the DOE, either because no tasks yet > + * queued, or called from doe_statemachine_work() which has exclusive access to > + * the DOE config space. > + */ > +static void pci_doe_abort_start(struct pci_doe *doe) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + > + val = PCI_DOE_CTRL_ABORT; > + if (doe->irq) > + val |= PCI_DOE_CTRL_INT_EN; > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > + > + doe->timeout_jiffies = jiffies + HZ; > + schedule_delayed_work(&doe->statemachine, HZ); > +} > + > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + u32 val; > + int i; > + > + /* > + * Check the DOE busy bit is not set. If it is set, this could indicate > + * someone other than Linux (e.g. firmware) is using the mailbox. Note > + * it is expected that firmware and OS will negotiate access rights via > + * an, as yet to be defined method. 
> + */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) > + return -EBUSY; > + > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > + return -EIO; > + > + /* Write DOE Header */ > + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) | > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type); > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); > + /* Length is 2 DW of header + length of payload in DW */ > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, > + 2 + ex->request_pl_sz / sizeof(u32))); > + for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++) > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > + ex->request_pl[i]); > + > + val = PCI_DOE_CTRL_GO; > + if (doe->irq) > + val |= PCI_DOE_CTRL_INT_EN; > + > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > + /* Request is sent - now wait for poll or IRQ */ > + return 0; > +} > + > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + size_t length; > + u32 val; > + int i; > + > + /* Read the first dword to get the protocol */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) || > + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) { > + pci_err(pdev, > + "Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n", Maybe "%#x" so this is less ambiguous? > + ex->prot.vid, ex->prot.type, > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); > + return -EIO; > + } > + > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + /* Read the second dword to get the length */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + > + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); > + if (length > SZ_1M || length < 2) > + return -EIO; > + > + /* First 2 dwords have already been read */ > + length -= 2; > + /* Read the rest of the response payload */ > + for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, > + &ex->response_pl[i]); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + > + /* Flush excess length */ > + for (; i < length; i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + /* Final error check to pick up on any since Data Object Ready */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > + return -EIO; > + > + return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32); > +} > + > +static void pci_doe_task_complete(void *private) > +{ > + complete(private); > +} > + > +static void doe_statemachine_work(struct work_struct *work) > +{ > + struct delayed_work *w = to_delayed_work(work); > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > + struct pci_dev *pdev = doe->doe_dev->pdev; > + int offset = doe->doe_dev->cap_offset; > + struct pci_doe_task *task; > + bool abort; > + u32 val; > + int rc; > + > + mutex_lock(&doe->state_lock); > + task = doe->cur_task; > + abort = doe->abort; > + doe->abort = false; > + mutex_unlock(&doe->state_lock); > + > + 
if (abort) { > + /* > + * Currently only used during init - care needed if > + * pci_doe_abort() is generally exposed as it would impact > + * queries in flight. > + */ > + WARN_ON(task); > + doe->state = DOE_WAIT_ABORT; > + pci_doe_abort_start(doe); > + return; > + } > + > + switch (doe->state) { > + case DOE_IDLE: > + if (task == NULL) > + return; > + > + /* Nothing currently in flight so queue a task */ > + rc = pci_doe_send_req(doe, task->ex); > + /* > + * The specification does not provide any guidance on how long > + * some other entity could keep the DOE busy, so try for 1 > + * second then fail. Busy handling is best effort only, because > + * there is no way of avoiding racing against another user of > + * the DOE. > + */ > + if (rc == -EBUSY) { > + doe->busy_retries++; > + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > + /* Long enough, fail this request */ > + pci_WARN(pdev, true, "DOE busy for too long\n"); I think pci_WARN() (as opposed to pci_warn()) gives us a register dump or stacktrace. That's useful if it might be a software problem. But this looks like an issue with the hardware or firmware on the adapter, where the registers or stacktrace don't seem useful. Maybe the busy duration would be useful? > + doe->busy_retries = 0; > + goto err_busy; > + } > + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); > + return; > + } > + if (rc) > + goto err_abort; > + doe->busy_retries = 0; > + > + doe->state = DOE_WAIT_RESP; > + doe->timeout_jiffies = jiffies + HZ; > + /* Now poll or wait for IRQ with timeout */ > + if (doe->irq > 0) > + schedule_delayed_work(w, PCI_DOE_TIMEOUT); > + else > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + > + case DOE_WAIT_RESP: > + /* Not possible to get here with NULL task */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + rc = -EIO; > + goto err_abort; > + } > + > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > + /* If not yet at timeout reschedule otherwise abort */ > + if (time_after(jiffies, doe->timeout_jiffies)) { > + rc = -ETIMEDOUT; > + goto err_abort; > + } > + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); > + return; > + } > + > + rc = pci_doe_recv_resp(doe, task->ex); > + if (rc < 0) > + goto err_abort; > + > + doe->state = DOE_IDLE; > + > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* Set the return value to the length of received payload */ > + task->rv = rc; > + task->cb(task->private); > + > + return; > + > + case DOE_WAIT_ABORT: > + case DOE_WAIT_ABORT_ON_ERR: > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { > + /* Back to normal state - carry on */ > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + /* > + * For deliberately triggered abort, someone is > + * waiting. 
> + */ > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + > + doe->state = DOE_IDLE; > + return; > + } > + if (time_after(jiffies, doe->timeout_jiffies)) { > + /* Task has timed out and is dead - abort */ > + pci_err(pdev, "DOE ABORT timed out\n"); > + mutex_lock(&doe->state_lock); > + doe->dead = true; > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + > + if (doe->state == DOE_WAIT_ABORT) > + complete(&doe->abort_c); > + } > + return; > + } > + > +err_abort: > + doe->state = DOE_WAIT_ABORT_ON_ERR; > + pci_doe_abort_start(doe); > +err_busy: > + task->rv = rc; > + task->cb(task->private); > + /* If here via err_busy, signal the task done. */ > + if (doe->state == DOE_IDLE) { > + mutex_lock(&doe->state_lock); > + doe->cur_task = NULL; > + mutex_unlock(&doe->state_lock); > + wake_up_interruptible(&doe->wq); > + } > +} > + > +/** > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response Wrap to fit in 80 columns. There are a few more. > + * @doe: DOE mailbox state structure > + * @ex: Description of the buffers and Vendor ID + type used in this > + * request/response pair > + * > + * Excess data will be discarded. > + * > + * RETURNS: payload in bytes on success, < 0 on error > + */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + struct pci_doe_task task; > + DECLARE_COMPLETION_ONSTACK(c); > + > + if (!doe) > + return -EAGAIN; > + > + /* DOE requests must be a whole number of DW */ > + if (ex->request_pl_sz % sizeof(u32)) > + return -EINVAL; > + > + task.ex = ex; > + task.cb = pci_doe_task_complete; > + task.private = &c; > + > +again: > + mutex_lock(&doe->state_lock); > + if (doe->cur_task) { > + mutex_unlock(&doe->state_lock); > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > + goto again; > + } > + > + if (doe->dead) { > + mutex_unlock(&doe->state_lock); > + return -EIO; > + } > + doe->cur_task = &task; > + schedule_delayed_work(&doe->statemachine, 0); > + mutex_unlock(&doe->state_lock); > + > + wait_for_completion(&c); > + > + return task.rv; > +} > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > + > +/** > + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol > + * @pdev: Device on which to find the DOE instance > + * @vid: Protocol Vendor ID > + * @type: protocol type > + * > + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox > + * exchange through that DOE mailbox. 
> + * > + * RETURNS: True if the DOE device supports the protocol specified > + */ > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) > +{ > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > + int i; > + > + if (!doe) > + return false; > + > + for (i = 0; i < doe->num_prots; i++) > + if ((doe->prots[i].vid == vid) && > + (doe->prots[i].type == type)) > + return true; > + > + return false; > +} > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); > + > +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid, > + u8 *protocol) > +{ > + u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index); > + u32 response_pl; > + struct pci_doe_exchange ex = { > + .prot.vid = PCI_VENDOR_ID_PCI_SIG, > + .prot.type = PCI_DOE_PROTOCOL_DISCOVERY, > + .request_pl = &request_pl, > + .request_pl_sz = sizeof(request_pl), > + .response_pl = &response_pl, > + .response_pl_sz = sizeof(response_pl), > + }; > + int ret; > + > + ret = pci_doe_exchange_sync(doe->doe_dev, &ex); > + if (ret < 0) > + return ret; > + > + if (ret != sizeof(response_pl)) > + return -EIO; > + > + *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl); > + *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl); > + *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl); > + > + return 0; > +} > + > +static int pci_doe_cache_protocols(struct pci_doe *doe) > +{ > + u8 index = 0; > + int rc; > + > + /* Discovery protocol must always be supported and must report itself */ > + doe->num_prots = 1; > + doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL); > + if (doe->prots == NULL) > + return -ENOMEM; > + > + do { > + struct pci_doe_protocol *prot; > + > + prot = &doe->prots[doe->num_prots - 1]; > + rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type); > + if (rc) > + goto err_free_prots; > + > + if (index) { > + struct pci_doe_protocol *prot_new; > + > + doe->num_prots++; > + prot_new = krealloc(doe->prots, > + sizeof(*doe->prots) * doe->num_prots, > + GFP_KERNEL); > + if (prot_new == NULL) { > + rc = -ENOMEM; > + goto err_free_prots; > + } > + doe->prots = prot_new; > + } > + } while (index); > + > + return 0; > + > +err_free_prots: > + kfree(doe->prots); > + doe->num_prots = 0; > + doe->prots = NULL; > + return rc; > +} > + > +static int pci_doe_abort(struct pci_doe *doe) > +{ > + reinit_completion(&doe->abort_c); > + mutex_lock(&doe->state_lock); > + doe->abort = true; > + mutex_unlock(&doe->state_lock); > + schedule_delayed_work(&doe->statemachine, 0); > + wait_for_completion(&doe->abort_c); > + > + if (doe->dead) > + return -EIO; > + > + return 0; > +} > + > +static void pci_doe_release_irq(struct pci_doe *doe) > +{ > + if (doe->irq > 0) > + free_irq(doe->irq, doe); > +} > + > +static int pci_doe_register(struct pci_doe *doe) > +{ > + struct pci_dev *pdev = doe->doe_dev->pdev; > + bool poll = !pci_dev_msi_enabled(pdev); > + int offset = doe->doe_dev->cap_offset; > + int rc, irq; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); > + > + if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { > + irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); > + if (irq < 0) > + return irq; > + > + doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", > + doe->doe_dev->adev.name); > + if (!doe->irq_name) > + return -ENOMEM; > + > + rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); > + if (rc) > + goto err_free_name; > + > + doe->irq = irq; > + pci_write_config_dword(pdev, offset + 
PCI_DOE_CTRL, > + PCI_DOE_CTRL_INT_EN); > + } > + > + /* Reset the mailbox by issuing an abort */ > + rc = pci_doe_abort(doe); > + if (rc) > + goto err_free_irqs; > + > + /* Ensure the pci device remains until this driver is done with it */ s/pci/PCI/ > + get_device(&pdev->dev); pci_dev_get() There's kind of a mix of both in drivers/pci/, but since it exists, we might as well use it. > + > + return 0; > + > +err_free_irqs: > + pci_doe_release_irq(doe); > +err_free_name: > + kfree(doe->irq_name); > + return rc; > +} > + > +static void pci_doe_unregister(struct pci_doe *doe) > +{ > + pci_doe_release_irq(doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); pci_dev_put() > +} > + > +/* > + * pci_doe_probe() - Set up the Mailbox > + * @aux_dev: Auxiliary Device > + * @id: Auxiliary device ID > + * > + * Probe the mailbox found for all protocols and set up the Mailbox > + * > + * RETURNS: 0 on success, < 0 on error > + */ > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > + const struct auxiliary_device_id *id) > +{ > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > + struct pci_doe_dev, > + adev); > + struct pci_doe *doe; > + int rc; > + > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > + if (!doe) > + return -ENOMEM; > + > + mutex_init(&doe->state_lock); > + init_completion(&doe->abort_c); > + doe->doe_dev = doe_dev; > + init_waitqueue_head(&doe->wq); > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > + dev_set_drvdata(&aux_dev->dev, doe); > + > + rc = pci_doe_register(doe); > + if (rc) > + goto err_free; > + > + rc = pci_doe_cache_protocols(doe); > + if (rc) { > + pci_doe_unregister(doe); > + goto err_free; > + } > + > + return 0; > + > +err_free: > + kfree(doe); > + return rc; > +} > + > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > +{ > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > + > + /* First halt the state machine */ > + cancel_delayed_work_sync(&doe->statemachine); > + kfree(doe->prots); > + pci_doe_unregister(doe); > + kfree(doe); > +} > + > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > + {.name = "cxl_pci.doe", }, > + {}, > +}; > + > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > + > +struct auxiliary_driver pci_doe_auxiliary_drv = { > + .name = "pci_doe_drv", > + .id_table = pci_doe_auxiliary_id_table, > + .probe = pci_doe_probe, > + .remove = pci_doe_remove > +}; > + > +static int __init pci_doe_init_module(void) > +{ > + int ret; > + > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > + if (ret) { > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > + ret); > + return ret; > + } > + > + return 0; > +} > + > +static void __exit pci_doe_exit_module(void) > +{ > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > +} > + > +module_init(pci_doe_init_module); > +module_exit(pci_doe_exit_module); > +MODULE_LICENSE("GPL v2"); What is the benefit of this being loadable as opposed to being static? > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > new file mode 100644 > index 000000000000..8380b7ad33d4 > --- /dev/null > +++ b/include/linux/pci-doe.h > @@ -0,0 +1,63 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. 
> + * > + * Copyright (C) 2021 Huawei > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > + */ > + > +#include <linux/completion.h> > +#include <linux/list.h> > +#include <linux/mutex.h> > +#include <linux/auxiliary_bus.h> > + > +#ifndef LINUX_PCI_DOE_H > +#define LINUX_PCI_DOE_H > + > +#define DOE_DEV_NAME "doe" This isn't used here and should move to the patch that needs it. > +struct pci_doe_protocol { > + u16 vid; > + u8 type; > +}; > + > +/** > + * struct pci_doe_exchange - represents a single query/response > + * > + * @prot: DOE Protocol > + * @request_pl: The request payload > + * @request_pl_sz: Size of the request payload > + * @response_pl: The response payload > + * @response_pl_sz: Size of the response payload > + */ > +struct pci_doe_exchange { > + struct pci_doe_protocol prot; > + u32 *request_pl; > + size_t request_pl_sz; > + u32 *response_pl; > + size_t response_pl_sz; > +}; > + > +/** > + * struct pci_doe_dev - DOE mailbox device > + * > + * @adrv: Auxiliary Driver data > + * @pdev: PCI device this belongs to > + * @offset: Capability offset > + * > + * This represents a single DOE mailbox device. Devices should create this > + * device and register it on the Auxiliary bus for the DOE driver to maintain. > + * > + */ > +struct pci_doe_dev { > + struct auxiliary_device adev; > + struct pci_dev *pdev; > + int cap_offset; > +}; > + > +/* Library operations */ > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, > + struct pci_doe_exchange *ex); > +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type); > + > +#endif > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h > index e709ae8235e7..1073cd1916e1 100644 > --- a/include/uapi/linux/pci_regs.h > +++ b/include/uapi/linux/pci_regs.h > @@ -730,7 +730,8 @@ > #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ > #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ > #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ > -#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT > +#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ > +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE > > #define PCI_EXT_CAP_DSN_SIZEOF 12 > #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40 > @@ -1092,4 +1093,30 @@ > #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 > #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 > > +/* Data Object Exchange */ > +#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ > +#define PCI_DOE_CAP_INT 0x00000001 /* Interrupt Support */ > +#define PCI_DOE_CAP_IRQ 0x00000ffe /* Interrupt Message Number */ > +#define PCI_DOE_CTRL 0x08 /* DOE Control Register */ > +#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */ > +#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */ > +#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */ > +#define PCI_DOE_STATUS 0x0c /* DOE Status Register */ > +#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */ > +#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */ > +#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */ > +#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */ > +#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */ > +#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */ > + > +/* DOE Data Object - note not actually registers */ > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000 > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff > + > +#define 
PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000 > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000 Conventional indentation is via tabs when possible; the above all appear to be spaces.
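For example (illustrative only, with a tab between the name and the mask, matching the rest of the file):

#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff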
.... Sorry for the delay, I managed to miss this email. Only realized I hadn't replied because I wanted to point out the docs issue I'd missed in the original review... See below. > > I was carrying a rework of this locally because I managed > > to convince myself this is wrong. It's been a while and naturally > > I didn't write a comprehensive set of notes on why it was wrong... > > (Note you can't trigger the problem paths in QEMU without some > > nasty hacks as it relies on opening up race windows that make > > limited sense for the QEMU implementation). > > > > It's all centered on some details of exactly what causes an interrupt > > on a DOE. Section 6.xx.3 Interrupt Generation states: > > > > If enabled, an interrupt message must be triggered every time the > > logical AND of the following conditions transitions from FALSE to TRUE: > > > > * The associated vector is unmasked ... > > * The value of the DOE interrupt enable bit is 1b > > * The value of the DOE interrupt status bit is 1b > > (only the last one really matters to us I think). > > > > The interrupt status bit is an OR conditional. > > > > Must be set.. Data Object Ready bit or DOE error bit set or DOE busy bit cleared. > > > > > +{ > > > + struct pci_doe *doe = data; > > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > > + int offset = doe->doe_dev->cap_offset; > > > + u32 val; > > > + > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > > > > So this bit is set on any of: BUSY dropped, READY or ERROR. > > If it's set on BUSY drop, but then in between the read above and this clear > > READY becomes true, then my reading is that we will not get another interrupt. > > That is fine because we will read it again in the state machine and see the > > new state. We could do more of the dance in the interrupt handler by doing > > a reread after clear of INT_STATUS but I think it's cleaner to leave > > it in the state machine. > > > > It might look nicer here to only write BIT(1) - RW1C, but that doesn't matter as > > all the rest of the register is RO. > > > > > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > > + return IRQ_HANDLED; > > > + } > > > + /* Leave the error case to be handled outside IRQ */ > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > > > I don't think we can get here because int status already true. > > So should do this before the above general check to avoid clearing > > the interrupt (we don't want more interrupts during the abort though > > I'd hope the hardware wouldn't generate them). > > > > So move this before the previous check. > > > > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > > + return IRQ_HANDLED; > > > + } > > > + > > > + /* > > > + * Busy being cleared can result in an interrupt, but as > > > + * the original Busy may not have been detected, there is no > > > + * way to separate such an interrupt from a spurious interrupt. > > > + */ > > > > This is misleading - as Busy bit clear would have resulted in INT_STATUS being true above > > (that was a misread of the spec from me in v4). > > So I don't think we can get here in any valid path. > > > > return IRQ_NONE; should be safe.
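Putting those suggestions together, the handler would end up looking something like this (an untested sketch of the shape only, not necessarily what the next revision will contain):

static irqreturn_t pci_doe_irq(int irq, void *data)
{
	struct pci_doe *doe = data;
	struct pci_dev *pdev = doe->doe_dev->pdev;
	int offset = doe->doe_dev->cap_offset;
	u32 val;

	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);

	/* Leave the error case to be handled outside IRQ; the abort resets the mailbox */
	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
		/* Interrupt status is the only RW1C bit; be explicit about what is cleared */
		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
				       PCI_DOE_STATUS_INT_STATUS);
		mod_delayed_work(system_wq, &doe->statemachine, 0);
		return IRQ_HANDLED;
	}

	/* No valid route here, so treat anything else as spurious */
	return IRQ_NONE;
}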
> > > > > > > + return IRQ_HANDLED; > > > +} > > > > Summary of above suggested changes: > > 1) Move the DOE_STATUS_ERROR block before the DOE_STATUS_INT_STATUS one > > 2) Possibly uses > > pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, PCI_DOE_STATUS_INT_STATUS); > > to be explicit on the write one to clear bit. > > 3) IRQ_NONE for the final return path as I'm fairly sure there is no valid route to that. > > > > Done. > > But just to ensure that I understand. If STATUS_ERROR is indicated we are > basically not clearing the irq because we are resetting the mailbox? Because > with this new code I don't see a pci_write_config_dword to clear INT_STATUS. Exactly. > > But if we are resetting the mailbox I think that is ok. > > > ... > > ... > > > +/** > > > + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response > > > + * @doe: DOE mailbox state structure This should be doe_dev, another thing during build tests of a rebase as I wanted to put the CMA stuff on top of this. > > > + * @ex: Description of the buffers and Vendor ID + type used in this > > > + * request/response pair > > > + * > > > + * Excess data will be discarded. > > > + * > > > + * RETURNS: payload in bytes on success, < 0 on error > > > + */ > > > +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) > > > +{ > > > + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); > > > + struct pci_doe_task task; > > > + DECLARE_COMPLETION_ONSTACK(c); > > > + > > > + if (!doe) > > > + return -EAGAIN; > > > + > > > + /* DOE requests must be a whole number of DW */ > > > + if (ex->request_pl_sz % sizeof(u32)) > > > + return -EINVAL; > > > + > > > + task.ex = ex; > > > + task.cb = pci_doe_task_complete; > > > + task.private = &c; > > > + > > > +again: > > > > Hmm. Whether having this code at this layer makes sense hinges on > > whether we want to easily support async use of the DOE in future. > > I struggled with this. I was trying to strike a balance with making this a > synchronous call with only 1 outstanding task while leaving the statemachine > alone. > > FWIW I think the queue you had was just fine even though there was only this > synchronous call. We can put it back easily if we ever need it. Until then this is fine. I'm not convinced any DOE use will be sufficiently high bandwidth that it really matters if we support async accessors > > > > > In v4 some of the async handling had ended up in this function and > > should probably have been factored out to give us a > > 'queue up work' then 'wait for completion' sequence. > > > > Given there is now more to be done in here perhaps we need to think > > about such a separation to keep it clear that this is fundamentally > > a synchronous wrapper around an asynchronous operation. > > I think that would be moving back in a direction of having a queue like you > defined in V4. Eliminating the queue really defined this function to sleep > waiting for the state machine to be available. Doing anything more would have > messed with the state machine you wrote and I did not want to do that. > > Dan should we move back to having a queue_task/wait_task like Jonathan had > before? 
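(To be concrete about the shape being discussed, and with purely made-up names rather than the v4 API, the synchronous wrapper would reduce to something like:

	rc = pci_doe_queue_task(doe, &task);	/* hypothetical helper: enqueue and kick the state machine */
	if (rc)
		return rc;
	wait_for_completion(&c);		/* the sync wrapper only waits */
	return task.rv;

with the cur_task/wait_event handling moved into the queue helper.)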
> > > > > > + mutex_lock(&doe->state_lock); > > > + if (doe->cur_task) { > > > + mutex_unlock(&doe->state_lock); > > > + wait_event_interruptible(doe->wq, doe->cur_task == NULL); > > > + goto again; > > > + } > > > + > > > + if (doe->dead) { > > > + mutex_unlock(&doe->state_lock); > > > + return -EIO; > > > + } > > > + doe->cur_task = &task; > > > + schedule_delayed_work(&doe->statemachine, 0); > > > + mutex_unlock(&doe->state_lock); > > > + > > > + wait_for_completion(&c); > > > + > > > + return task.rv; > > > +} > > > +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); > > ... > > > > > + > > > +static void pci_doe_unregister(struct pci_doe *doe) > > > +{ > > > + pci_doe_release_irq(doe); > > > + kfree(doe->irq_name); > > > + put_device(&doe->doe_dev->pdev->dev); > > > > This makes me wonder if we should be doing the get_device() > > earlier in probe? Limited harm in moving it to near the start > > and then ending up with it being 'obviously' correct... > > Well... get_device() is in pci_doe_register... And it does it's own irq > unwinding. > > I guess we could call pci_doe_unregister() from that if we refactored this... > > How about this? (Diff to this code) While it should work, I think I'd keep the error handling paths explicit and not rely on irq_name == 0 and doe->irq == 0 making the error path calls of pci_doe_unregister() safe. That takes some thought when reviewing (a little bit) whereas explicit error handling doesn't take as much. I just don't like unregister having put_device() last when it isn't the first thing done in register(). So moving that first perhaps gets a reference before we strictly speaking need one, but it makes it clear we can definitely release that reference where it's done in unregister. > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > index 76acf4063b6b..6f2a419b3c93 100644 > --- a/drivers/pci/doe.c > +++ b/drivers/pci/doe.c > @@ -545,10 +545,12 @@ static int pci_doe_abort(struct pci_doe *doe) > return 0; > } > > -static void pci_doe_release_irq(struct pci_doe *doe) > +static void pci_doe_unregister(struct pci_doe *doe) > { > if (doe->irq > 0) > free_irq(doe->irq, doe); > + kfree(doe->irq_name); > + put_device(&doe->doe_dev->pdev->dev); > } > > static int pci_doe_register(struct pci_doe *doe) > @@ -559,21 +561,28 @@ static int pci_doe_register(struct pci_doe *doe) > int rc, irq; > u32 val; > > + /* Ensure the pci device remains until this driver is done with it */ > + get_device(&pdev->dev); > + > pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); > > if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { > irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); > - if (irq < 0) > - return irq; > + if (irq < 0) { > + rc = irq; > + goto unregister; > + } > > doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", > doe->doe_dev->adev.name); > - if (!doe->irq_name) > - return -ENOMEM; > + if (!doe->irq_name) { > + rc = -ENOMEM; > + goto unregister; > + } > > rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); > if (rc) > - goto err_free_name; > + goto unregister; > > doe->irq = irq; > pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, > @@ -583,27 +592,15 @@ static int pci_doe_register(struct pci_doe *doe) > /* Reset the mailbox by issuing an abort */ > rc = pci_doe_abort(doe); > if (rc) > - goto err_free_irq; > - > - /* Ensure the pci device remains until this driver is done with it */ > - get_device(&pdev->dev); > + goto unregister; > > return 0; > > -err_free_irq: > - pci_doe_release_irq(doe); > -err_free_name: > - kfree(doe->irq_name); > +unregister: 
> + pci_doe_unregister(doe); > return rc; > } > > -static void pci_doe_unregister(struct pci_doe *doe) > -{ > - pci_doe_release_irq(doe); > - kfree(doe->irq_name); > - put_device(&doe->doe_dev->pdev->dev); > -} > - > /* > * pci_doe_probe() - Set up the Mailbox > * @aux_dev: Auxiliary Device > > > > > > > +} > > > + > > > +/* > > > + * pci_doe_probe() - Set up the Mailbox > > > + * @aux_dev: Auxiliary Device > > > + * @id: Auxiliary device ID > > > + * > > > + * Probe the mailbox found for all protocols and set up the Mailbox > > > + * > > > + * RETURNS: 0 on success, < 0 on error > > > + */ > > > +static int pci_doe_probe(struct auxiliary_device *aux_dev, > > > + const struct auxiliary_device_id *id) > > > +{ > > > + struct pci_doe_dev *doe_dev = container_of(aux_dev, > > > + struct pci_doe_dev, > > > + adev); > > > + struct pci_doe *doe; > > > + int rc; > > > + > > > + doe = kzalloc(sizeof(*doe), GFP_KERNEL); > > > > Could go devm_ for this I think, though may not be worthwhile. > > Yes I think it is worth it... I should use it more. > > BTW why did you not use devm_krealloc() for the protocols? I have a sneaky suspicion this has been around long enough it predates devm_krealloc. > > I did not realize that call existed before you mentioned it in the other patch > review. It is rather new and shiny :) > > Any issue with using it there? Should be fine I think. > > > > > > + if (!doe) > > > + return -ENOMEM; > > > + > > > + mutex_init(&doe->state_lock); > > > + init_completion(&doe->abort_c); > > > + doe->doe_dev = doe_dev; > > > + init_waitqueue_head(&doe->wq); > > > + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); > > > + dev_set_drvdata(&aux_dev->dev, doe); > > > + > > > + rc = pci_doe_register(doe); > > > + if (rc) > > > + goto err_free; > > > + > > > + rc = pci_doe_cache_protocols(doe); > > > + if (rc) { > > > + pci_doe_unregister(doe); > > > > Mixture of different forms of error handling here. > > I'd move this below and add an err_unregister label. > > Actually with the devm_kzalloc() we don't need the goto at all. We can just > return. I _think_? Right? yes > > > > > > + goto err_free; > > > + } > > > + > > > + return 0; > > > + > > > +err_free: > > > + kfree(doe); > > > + return rc; > > > +} > > > + > > > +static void pci_doe_remove(struct auxiliary_device *aux_dev) > > > +{ > > > + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); > > > + > > > + /* First halt the state machine */ > > > + cancel_delayed_work_sync(&doe->statemachine); > > > + kfree(doe->prots); > > > > Logical flow to me is unregister first, free protocols second > > (to reverse what we do in probe) > > No this is the reverse of the probe order I think. > > Order is > register > cache protocols > > Then we > free 'uncache' protocols > unregister > > Right? Huh. No idea what I was going on about. > > > > > > + pci_doe_unregister(doe); > > > + kfree(doe); > > > +} > > > + > > > +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { > > > + {.name = "cxl_pci.doe", }, > > > > I'd like to hear from Bjorn on whether registering this from the CXL > > device is the right approach or if we should perhaps just do it directly from > > somewhere in PCI. (really applies to patch 3) I'll talk more about this there. > > Actually I think this could be left blank until the next patch... It's just > odd to define an empty table in the next few structures. But technically this > is not needed until the devices are defined. 
> > I'm ok waiting to see what Bjorn thinks regarding the CXL vs PCI placement > though. > > > > > > + {}, > > > +}; > > > + > > > +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); > > > + > > > +struct auxiliary_driver pci_doe_auxiliary_drv = { > > > + .name = "pci_doe_drv", > > > > I would assume this is only used in contexts where the _drv is > > obvious? I would go with "pci_doe". > > Sure. done. > > > > > > + .id_table = pci_doe_auxiliary_id_table, > > > + .probe = pci_doe_probe, > > > + .remove = pci_doe_remove > > > +}; > > > + > > > +static int __init pci_doe_init_module(void) > > > +{ > > > + int ret; > > > + > > > + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); > > > + if (ret) { > > > + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", > > > + ret); > > > + return ret; > > > + } > > > + > > > + return 0; > > > +} > > > + > > > +static void __exit pci_doe_exit_module(void) > > > +{ > > > + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); > > > +} > > > + > > > +module_init(pci_doe_init_module); > > > +module_exit(pci_doe_exit_module); > > > > Seems like the auxiliary bus would benefit from a > > module_auxiliary_driver() macro to cover this simple registration stuff > > similar to module_i2c_driver() etc. > > > > Mind you, looking at 5.15 this would be the only user, so maybe one > > for the 'next' case on basis two instances proves it's 'common' ;) > > I'm inclined to leave this alone ATM. I tried to clean up the auxiliary device > documentation and got a bunch more work asked of me by Greg KH. So I'm behind > on that ATM. > > Later we can investigate that a bit I think. Sure. No rush on that. > > > > > > +MODULE_LICENSE("GPL v2"); > > > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h > > > new file mode 100644 > > > index 000000000000..8380b7ad33d4 > > > --- /dev/null > > > +++ b/include/linux/pci-doe.h > > > @@ -0,0 +1,63 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +/* > > > + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. > > > + * > > > + * Copyright (C) 2021 Huawei > > > + * Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > + */ > > > + > > > +#include <linux/completion.h> > > > +#include <linux/list.h> > > > +#include <linux/mutex.h> > > > > Not used in this header that I can see, so push down to the c files. > > oops... thanks. > > > > > > +#include <linux/auxiliary_bus.h> > > > + > > > +#ifndef LINUX_PCI_DOE_H > > > +#define LINUX_PCI_DOE_H > > > + > > > +#define DOE_DEV_NAME "doe" > > > > Not sure this is used? > > Used in the next patch... and it kind of goes along with the table_id name... > > I'll see about moving both of those to the next patch where it makes more sense > for now. > > Thanks for the review, > Ira > Thanks, Jonathan
On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > with standard protocol discovery. Each mailbox is accessed through a > > DOE Extended Capability. > > > > Define an auxiliary device driver which control DOE auxiliary devices > > registered on the auxiliary bus. > > What do we gain by making this an auxiliary driver? > > This doesn't really feel like a "driver," and apparently it used to be > a library. I'd like to see the rationale and benefits of the driver > approach (in the eventual commit log as well as the current email > thread). > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it offers for userspace to manage kernel vs userspace access to a device. CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not clobber mmio space that is actively claimed by a kernel driver. I submit that DOE merits the same protection for DOE instances that the kernel consumes. Unlike other PCI configuration registers that root userspace has no reason to touch unless it wants to actively break things, DOE is a mechanism that root userspace may need to access directly in some cases. There are a few examples that come to mind. CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) offers a mechanism to set different test modes for a DOE device. The kernel has no reason to ever use that interface, and it has strong reasons to want to block access to it in production. However, hardware vendors also use debug Linux builds for hardware bringup. So I would like to be able to say that the mechanism to gain access to the compliance DOE is to detach the aux DOE driver from the right aux DOE device. Could we build a custom way to do the same for the DOE library, sure, but why re-invent the wheel when udev and the driver model can handle this type of policy question already? Another use case is SPDM where an agent can establish a secure message passing channel to a device, or paravirtualized device to exchange protected messages with the hypervisor. My expectation is that in addition to the kernel establishing SPDM sessions for PCI IDE and CXL.cachemem IDE (link Integrity and Data Encryption) there will be use cases for root userspace to establish their own SPDM session. In that scenario as well the kernel can be told to give up control of a specific DOE instance by detaching the aux device for its driver, but otherwise the kernel driver can be assured that userspace will not clobber its communications with its own attempts to talk over the DOE. Lastly, and perhaps this is minor, the PCI core is a built-in object and aux-bus allows for breaking out device library functionality like this into a dedicated module. But yes, that's not a good reason unto itself because you could "auxify" almost anything past the point of reason just to get more modules. > > A DOE mailbox is allowed to support any number of protocols while some > > DOE protocol specifications apply additional restrictions. > > This sounds something like a fancy version of VPD, and VPD has been a > huge headache. I hope DOE avoids that ;) Please say a bit more, I think DOE is a rather large headache as evidenced by us fine folks grappling with how to architect the Linux enabling.
On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > with standard protocol discovery. Each mailbox is accessed through a > > > DOE Extended Capability. > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > registered on the auxiliary bus. > > > > What do we gain by making this an auxiliary driver? > > > > This doesn't really feel like a "driver," and apparently it used to be > > a library. I'd like to see the rationale and benefits of the driver > > approach (in the eventual commit log as well as the current email > > thread). > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > offers for userspace to manage kernel vs userspace access to a device. > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > clobber mmio space that is actively claimed by a kernel driver. I > submit that DOE merits the same protection for DOE instances that the > kernel consumes. > > Unlike other PCI configuration registers that root userspace has no > reason to touch unless it wants to actively break things, DOE is a > mechanism that root userspace may need to access directly in some > cases. There are a few examples that come to mind. It's useful for root to read/write config registers with setpci, e.g., to test ASPM configuration, test power management behavior, etc. That can certainly break things and interfere with kernel access (and IMO should taint the kernel) but we have so far accepted that risk. I think the same will be true for DOE. In addition, I would think you might want a safe userspace interface via sysfs, e.g., something like the "vpd" file, but I missed that if it was in this series. > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > offers a mechanism to set different test modes for a DOE device. The > kernel has no reason to ever use that interface, and it has strong > reasons to want to block access to it in production. However, hardware > vendors also use debug Linux builds for hardware bringup. So I would > like to be able to say that the mechanism to gain access to the > compliance DOE is to detach the aux DOE driver from the right aux DOE > device. Could we build a custom way to do the same for the DOE > library, sure, but why re-invent the wheel when udev and the driver > model can handle this type of policy question already? > > Another use case is SPDM where an agent can establish a secure message > passing channel to a device, or paravirtualized device to exchange > protected messages with the hypervisor. My expectation is that in > addition to the kernel establishing SPDM sessions for PCI IDE and > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > use cases for root userspace to establish their own SPDM session. In > that scenario as well the kernel can be told to give up control of a > specific DOE instance by detaching the aux device for its driver, but > otherwise the kernel driver can be assured that userspace will not > clobber its communications with its own attempts to talk over the DOE. I assume the kernel needs to control access to DOE in all cases, doesn't it? For example, DOE can generate interrupts, and only the kernel can field them. 
Maybe if I saw the userspace interface this would make more sense to me. I'm hoping there's a reasonable "send this query and give me the response" primitive that can be implemented in the kernel, used by drivers, and exposed safely to userspace. > Lastly, and perhaps this is minor, the PCI core is a built-in object > and aux-bus allows for breaking out device library functionality like > this into a dedicated module. But yes, that's not a good reason unto > itself because you could "auxify" almost anything past the point of > reason just to get more modules. > > > > A DOE mailbox is allowed to support any number of protocols while some > > > DOE protocol specifications apply additional restrictions. > > > > This sounds something like a fancy version of VPD, and VPD has been a > > huge headache. I hope DOE avoids that ;) > > Please say a bit more, I think DOE is a rather large headache as > evidenced by us fine folks grappling with how to architect the Linux > enabling. VPD is not widely used, so gets poor testing. The device doesn't tell us how much VPD it has, so we have to read until the end. The data is theoretically self-describing (series of elements, each containing a type and size), but of course some devices don't format it correctly, so we read to the absolute limit, which takes a long time. Typical contents of unimplemented or uninitialized VPD are all 0x00 or all 0xff, neither of which is defined as a valid "end of VPD" signal, so we have hacky "this doesn't look like VPD" code. The PCI core doesn't need VPD itself and should only need to look for the size of each element and the "end" tag, but it works around some issues by doing more interpretation of the data. The spec is a little ambiguous and leaves room for vendors to use types not mentioned in the spec. Some devices share VPD hardware across functions but don't protect it correctly (a hardware defect, granted). VPD has address/data registers, so it requires locking, polling for completion, and timeouts. Bjorn
Typical contents of unimplemented or uninitialized VPD are all 0x00 or all 0xff, neither of which is defined as a valid "end of VPD" signal, so we have hacky "this doesn't look like VPD" code.

The PCI core doesn't need VPD itself and should only need to look for the size of each element and the "end" tag, but it works around some issues by doing more interpretation of the data. The spec is a little ambiguous and leaves room for vendors to use types not mentioned in the spec.

Some devices share VPD hardware across functions but don't protect it correctly (a hardware defect, granted).

VPD has address/data registers, so it requires locking, polling for completion, and timeouts.

Bjorn
On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > > with standard protocol discovery. Each mailbox is accessed through a > > > > DOE Extended Capability. > > > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > > registered on the auxiliary bus. > > > > > > What do we gain by making this an auxiliary driver? > > > > > > This doesn't really feel like a "driver," and apparently it used to be > > > a library. I'd like to see the rationale and benefits of the driver > > > approach (in the eventual commit log as well as the current email > > > thread). > > > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > > offers for userspace to manage kernel vs userspace access to a device. > > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > > clobber mmio space that is actively claimed by a kernel driver. I > > submit that DOE merits the same protection for DOE instances that the > > kernel consumes. > > > > Unlike other PCI configuration registers that root userspace has no > > reason to touch unless it wants to actively break things, DOE is a > > mechanism that root userspace may need to access directly in some > > cases. There are a few examples that come to mind. > > It's useful for root to read/write config registers with setpci, e.g., > to test ASPM configuration, test power management behavior, etc. That > can certainly break things and interfere with kernel access (and IMO > should taint the kernel) but we have so far accepted that risk. I > think the same will be true for DOE. I think DOE is a demonstrable step worse than those examples and pushes into the unacceptable risk category. It invites a communication protocol with an unbounded range of side effects (especially when controlling a device like a CXL memory expander that affects "System RAM" directly). Part of what drives platform / device vendors to the standards negotiation table is the OS encouraging common interfaces. If Linux provides an unfettered DOE interface it reduces an incentive for device vendors to collaborate with the kernel community. I do like the taint proposal though, if I can't convince you that DOE merits explicit root userspace lockout beyond CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel warning loudly about DOE usage that falls outside the kernel's expectations. > In addition, I would think you might want a safe userspace interface > via sysfs, e.g., something like the "vpd" file, but I missed that if > it was in this series. I do not think the kernel is well served by a generic userspace passthrough for DOE for the same reason that there is no generic passthrough for the ACPI _DSM facility. When the kernel becomes a generic pipe to a vendor specific interface it impedes the kernel from developing standard interfaces across vendors. Each vendor will ship their own quirky feature and corresponding vendor-specific tool with minimal incentive to coordinate with other vendors doing similar things. 
Each vendor will ship their own quirky feature and corresponding vendor-specific tool with minimal incentive to coordinate with other vendors doing similar things.
At a minimum the userspace interface for DOE should be at a level above the raw transport and be enabled per standardized / published DOE protocol. I.e. a userspace interface to read the CDAT table retrieved over DOE, or a userspace interface to enumerate IDE capabilities, etc. > > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > > offers a mechanism to set different test modes for a DOE device. The > > kernel has no reason to ever use that interface, and it has strong > > reasons to want to block access to it in production. However, hardware > > vendors also use debug Linux builds for hardware bringup. So I would > > like to be able to say that the mechanism to gain access to the > > compliance DOE is to detach the aux DOE driver from the right aux DOE > > device. Could we build a custom way to do the same for the DOE > > library, sure, but why re-invent the wheel when udev and the driver > > model can handle this type of policy question already? > > > > Another use case is SPDM where an agent can establish a secure message > > passing channel to a device, or paravirtualized device to exchange > > protected messages with the hypervisor. My expectation is that in > > addition to the kernel establishing SPDM sessions for PCI IDE and > > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > > use cases for root userspace to establish their own SPDM session. In > > that scenario as well the kernel can be told to give up control of a > > specific DOE instance by detaching the aux device for its driver, but > > otherwise the kernel driver can be assured that userspace will not > > clobber its communications with its own attempts to talk over the DOE. > > I assume the kernel needs to control access to DOE in all cases, > doesn't it? For example, DOE can generate interrupts, and only the > kernel can field them. Maybe if I saw the userspace interface this > would make more sense to me. I'm hoping there's a reasonable "send > this query and give me the response" primitive that can be implemented > in the kernel, used by drivers, and exposed safely to userspace. A DOE can generate interrupts, but I have yet to see a protocol that demands that. The userspace interface in the patches is just a binary attribute to dump the "CDAT" table retrieved over DOE. No generic passthrough is provided per the concerns above. > > Lastly, and perhaps this is minor, the PCI core is a built-in object > > and aux-bus allows for breaking out device library functionality like > > this into a dedicated module. But yes, that's not a good reason unto > > itself because you could "auxify" almost anything past the point of > > reason just to get more modules. > > > > > > A DOE mailbox is allowed to support any number of protocols while some > > > > DOE protocol specifications apply additional restrictions. > > > > > > This sounds something like a fancy version of VPD, and VPD has been a > > > huge headache. I hope DOE avoids that ;) > > > > Please say a bit more, I think DOE is a rather large headache as > > evidenced by us fine folks grappling with how to architect the Linux > > enabling. > > VPD is not widely used, so gets poor testing. The device doesn't tell > us how much VPD it has, so we have to read until the end. The data is > theoretically self-describing (series of elements, each containing a > type and size), but of course some devices don't format it correctly, > so we read to the absolute limit, which takes a long time. 
> > Typical contents of unimplemented or uninitialized VPD are all 0x00 or > all 0xff, neither of which is defined as a valid "end of VPD" signal, > so we have hacky "this doesn't look like VPD" code. > > The PCI core doesn't need VPD itself and should only need to look for > the size of each element and the "end" tag, but it works around some > issues by doing more interpretation of the data. The spec is a little > ambiguous and leaves room for vendors to use types not mentioned in > the spec. > > Some devices share VPD hardware across functions but don't protect it > correctly (a hardware defect, granted). > > VPD has address/data registers, so it requires locking, polling for > completion, and timeouts. Yikes, yes, that sounds like a headache.
On Sat, 4 Dec 2021 07:47:59 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote: > > > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@intel.com wrote: > > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > > > > with standard protocol discovery. Each mailbox is accessed through a > > > > > DOE Extended Capability. > > > > > > > > > > Define an auxiliary device driver which control DOE auxiliary devices > > > > > registered on the auxiliary bus. > > > > > > > > What do we gain by making this an auxiliary driver? > > > > > > > > This doesn't really feel like a "driver," and apparently it used to be > > > > a library. I'd like to see the rationale and benefits of the driver > > > > approach (in the eventual commit log as well as the current email > > > > thread). > > > > > > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it > > > offers for userspace to manage kernel vs userspace access to a device. > > > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not > > > clobber mmio space that is actively claimed by a kernel driver. I > > > submit that DOE merits the same protection for DOE instances that the > > > kernel consumes. > > > > > > Unlike other PCI configuration registers that root userspace has no > > > reason to touch unless it wants to actively break things, DOE is a > > > mechanism that root userspace may need to access directly in some > > > cases. There are a few examples that come to mind. > > > > It's useful for root to read/write config registers with setpci, e.g., > > to test ASPM configuration, test power management behavior, etc. That > > can certainly break things and interfere with kernel access (and IMO > > should taint the kernel) but we have so far accepted that risk. I > > think the same will be true for DOE. > > I think DOE is a demonstrable step worse than those examples and > pushes into the unacceptable risk category. It invites a communication > protocol with an unbounded range of side effects (especially when > controlling a device like a CXL memory expander that affects "System > RAM" directly). Part of what drives platform / device vendors to the > standards negotiation table is the OS encouraging common interfaces. > If Linux provides an unfettered DOE interface it reduces an incentive > for device vendors to collaborate with the kernel community. > > I do like the taint proposal though, if I can't convince you that DOE > merits explicit root userspace lockout beyond > CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel > warning loudly about DOE usage that falls outside the kernel's > expectations. > > > In addition, I would think you might want a safe userspace interface > > via sysfs, e.g., something like the "vpd" file, but I missed that if > > it was in this series. > > I do not think the kernel is well served by a generic userspace > passthrough for DOE for the same reason that there is no generic > passthrough for the ACPI _DSM facility. When the kernel becomes a > generic pipe to a vendor specific interface it impedes the kernel from > developing standard interfaces across vendors. 
Each vendor will ship > their own quirky feature and corresponding vendor-specific tool with > minimal incentive to coordinate with other vendors doing similar > things. At a minimum the userspace interface for DOE should be at a > level above the raw transport and be enabled per standardized / > published DOE protocol. I.e. a userspace interface to read the CDAT > table retrieved over DOE, or a userspace interface to enumerate IDE > capabilities, etc. I agree with Dan that a generic pass-through is a bad idea, but we do have code for one in an earlier version... https://lore.kernel.org/linux-pci/20210524133938.2815206-5-Jonathan.Cameron@huawei.com/ We could take the approach of an allow list for this, if we can figure out an appropriate way to manage that list. > > > > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE) > > > offers a mechanism to set different test modes for a DOE device. The > > > kernel has no reason to ever use that interface, and it has strong > > > reasons to want to block access to it in production. However, hardware > > > vendors also use debug Linux builds for hardware bringup. So I would > > > like to be able to say that the mechanism to gain access to the > > > compliance DOE is to detach the aux DOE driver from the right aux DOE > > > device. Could we build a custom way to do the same for the DOE > > > library, sure, but why re-invent the wheel when udev and the driver > > > model can handle this type of policy question already? > > > > > > Another use case is SPDM where an agent can establish a secure message > > > passing channel to a device, or paravirtualized device to exchange > > > protected messages with the hypervisor. My expectation is that in > > > addition to the kernel establishing SPDM sessions for PCI IDE and > > > CXL.cachemem IDE (link Integrity and Data Encryption) there will be > > > use cases for root userspace to establish their own SPDM session. In > > > that scenario as well the kernel can be told to give up control of a > > > specific DOE instance by detaching the aux device for its driver, but > > > otherwise the kernel driver can be assured that userspace will not > > > clobber its communications with its own attempts to talk over the DOE. > > > > I assume the kernel needs to control access to DOE in all cases, > > doesn't it? For example, DOE can generate interrupts, and only the > > kernel can field them. Maybe if I saw the userspace interface this > > would make more sense to me. I'm hoping there's a reasonable "send > > this query and give me the response" primitive that can be implemented > > in the kernel, used by drivers, and exposed safely to userspace. > > A DOE can generate interrupts, but I have yet to see a protocol that > demands that. The userspace interface in the patches is just a binary > attribute to dump the "CDAT" table retrieved over DOE. No generic > passthrough is provided per the concerns above. I don't think it would be that hard to set up a protocol-specific interface for establishment of secure channels to cover that particular case. As long as we ensure userspace can see / manage the crypto elements it won't matter if the data is passed through another interface on its way to the DOE. Jonathan
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 0c473d75e625..b512295538ba 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND The PCI device frontend driver allows the kernel to import arbitrary PCI devices from a PCI backend to support PCI driver domains. +config PCI_DOE_DRIVER + tristate "PCI Data Object Exchange (DOE) driver" + select AUXILIARY_BUS + help + Driver for DOE auxiliary devices. + + DOE provides a simple mailbox in PCI config space that is used by a + number of different protocols. DOE is defined in the Data Object + Exchange ECN to the PCIe r5.0 spec. + config PCI_ATS bool diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index d62c4ac4ae1b..afd9d7bd2b82 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB) += pci-stub.o obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o obj-$(CONFIG_PCI_ECAM) += ecam.o obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o +obj-$(CONFIG_PCI_DOE_DRIVER) += pci-doe.o obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o +pci-doe-y := doe.o + # Endpoint library must be initialized before its users obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c new file mode 100644 index 000000000000..2e702fdc7879 --- /dev/null +++ b/drivers/pci/doe.c @@ -0,0 +1,701 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Data Object Exchange ECN + * https://members.pcisig.com/wg/PCI-SIG/document/14143 + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + */ + +#include <linux/bitfield.h> +#include <linux/delay.h> +#include <linux/jiffies.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/pci.h> +#include <linux/pci-doe.h> +#include <linux/workqueue.h> +#include <linux/module.h> + +#define PCI_DOE_PROTOCOL_DISCOVERY 0 + +#define PCI_DOE_BUSY_MAX_RETRIES 16 +#define PCI_DOE_POLL_INTERVAL (HZ / 128) + +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ +#define PCI_DOE_TIMEOUT HZ + +enum pci_doe_state { + DOE_IDLE, + DOE_WAIT_RESP, + DOE_WAIT_ABORT, + DOE_WAIT_ABORT_ON_ERR, +}; + +/* + * struct pci_doe_task - description of a query / response task + * @ex: The details of the task to be done + * @rv: Return value. Length of received response or error + * @cb: Callback for completion of task + * @private: Private data passed to callback on completion + */ +struct pci_doe_task { + struct pci_doe_exchange *ex; + int rv; + void (*cb)(void *private); + void *private; +}; + +/** + * struct pci_doe - A single DOE mailbox driver + * + * @doe_dev: The DOE Auxiliary device being driven + * @abort_c: Completion used for initial abort handling + * @irq: Interrupt used for signaling DOE ready or abort + * @irq_name: Name used to identify the irq for a particular DOE + * @prots: Array of identifiers for protocols supported + * @num_prots: Size of prots array + * @cur_task: Current task the state machine is working on + * @wq: Wait queue to wait on if a query is in progress + * @state_lock: Protect the state of cur_task, abort, and dead + * @statemachine: Work item for the DOE state machine + * @state: Current state of this DOE + * @timeout_jiffies: 1 second after GO set + * @busy_retries: Count of retry attempts + * @abort: Request a manual abort (e.g. on init) + * @dead: Used to mark a DOE for which an ABORT has timed out. 
Further messages + * will immediately be aborted with error + */ +struct pci_doe { + struct pci_doe_dev *doe_dev; + struct completion abort_c; + int irq; + char *irq_name; + struct pci_doe_protocol *prots; + int num_prots; + + struct pci_doe_task *cur_task; + wait_queue_head_t wq; + struct mutex state_lock; + struct delayed_work statemachine; + enum pci_doe_state state; + unsigned long timeout_jiffies; + unsigned int busy_retries; + unsigned int abort:1; + unsigned int dead:1; +}; + +static irqreturn_t pci_doe_irq(int irq, void *data) +{ + struct pci_doe *doe = data; + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, val); + mod_delayed_work(system_wq, &doe->statemachine, 0); + return IRQ_HANDLED; + } + /* Leave the error case to be handled outside IRQ */ + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { + mod_delayed_work(system_wq, &doe->statemachine, 0); + return IRQ_HANDLED; + } + + /* + * Busy being cleared can result in an interrupt, but as + * the original Busy may not have been detected, there is no + * way to separate such an interrupt from a spurious interrupt. + */ + return IRQ_HANDLED; +} + +/* + * Only call when safe to directly access the DOE, either because no tasks yet + * queued, or called from doe_statemachine_work() which has exclusive access to + * the DOE config space. + */ +static void pci_doe_abort_start(struct pci_doe *doe) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + + val = PCI_DOE_CTRL_ABORT; + if (doe->irq) + val |= PCI_DOE_CTRL_INT_EN; + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); + + doe->timeout_jiffies = jiffies + HZ; + schedule_delayed_work(&doe->statemachine, HZ); +} + +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + u32 val; + int i; + + /* + * Check the DOE busy bit is not set. If it is set, this could indicate + * someone other than Linux (e.g. firmware) is using the mailbox. Note + * it is expected that firmware and OS will negotiate access rights via + * an, as yet to be defined method. 
+ */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) + return -EBUSY; + + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + /* Write DOE Header */ + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) | + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type); + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); + /* Length is 2 DW of header + length of payload in DW */ + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, + 2 + ex->request_pl_sz / sizeof(u32))); + for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++) + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + ex->request_pl[i]); + + val = PCI_DOE_CTRL_GO; + if (doe->irq) + val |= PCI_DOE_CTRL_INT_EN; + + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); + /* Request is sent - now wait for poll or IRQ */ + return 0; +} + +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + size_t length; + u32 val; + int i; + + /* Read the first dword to get the protocol */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) || + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) { + pci_err(pdev, + "Expected [VID, Protocol] = [%x, %x], got [%x, %x]\n", + ex->prot.vid, ex->prot.type, + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); + return -EIO; + } + + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + /* Read the second dword to get the length */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); + if (length > SZ_1M || length < 2) + return -EIO; + + /* First 2 dwords have already been read */ + length -= 2; + /* Read the rest of the response payload */ + for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, + &ex->response_pl[i]); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + + /* Flush excess length */ + for (; i < length; i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + /* Final error check to pick up on any since Data Object Ready */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32); +} + +static void pci_doe_task_complete(void *private) +{ + complete(private); +} + +static void doe_statemachine_work(struct work_struct *work) +{ + struct delayed_work *w = to_delayed_work(work); + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); + struct pci_dev *pdev = doe->doe_dev->pdev; + int offset = doe->doe_dev->cap_offset; + struct pci_doe_task *task; + bool abort; + u32 val; + int rc; + + mutex_lock(&doe->state_lock); + task = doe->cur_task; + abort = doe->abort; + doe->abort = false; + mutex_unlock(&doe->state_lock); + + if (abort) { + /* + * Currently only used during init - care needed if + * pci_doe_abort() is generally exposed as it would impact + * queries in flight. 
+ */ + WARN_ON(task); + doe->state = DOE_WAIT_ABORT; + pci_doe_abort_start(doe); + return; + } + + switch (doe->state) { + case DOE_IDLE: + if (task == NULL) + return; + + /* Nothing currently in flight so queue a task */ + rc = pci_doe_send_req(doe, task->ex); + /* + * The specification does not provide any guidance on how long + * some other entity could keep the DOE busy, so try for 1 + * second then fail. Busy handling is best effort only, because + * there is no way of avoiding racing against another user of + * the DOE. + */ + if (rc == -EBUSY) { + doe->busy_retries++; + if (doe->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { + /* Long enough, fail this request */ + pci_WARN(pdev, true, "DOE busy for too long\n"); + doe->busy_retries = 0; + goto err_busy; + } + schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES); + return; + } + if (rc) + goto err_abort; + doe->busy_retries = 0; + + doe->state = DOE_WAIT_RESP; + doe->timeout_jiffies = jiffies + HZ; + /* Now poll or wait for IRQ with timeout */ + if (doe->irq > 0) + schedule_delayed_work(w, PCI_DOE_TIMEOUT); + else + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); + return; + + case DOE_WAIT_RESP: + /* Not possible to get here with NULL task */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { + rc = -EIO; + goto err_abort; + } + + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { + /* If not yet at timeout reschedule otherwise abort */ + if (time_after(jiffies, doe->timeout_jiffies)) { + rc = -ETIMEDOUT; + goto err_abort; + } + schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL); + return; + } + + rc = pci_doe_recv_resp(doe, task->ex); + if (rc < 0) + goto err_abort; + + doe->state = DOE_IDLE; + + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + /* Set the return value to the length of received payload */ + task->rv = rc; + task->cb(task->private); + + return; + + case DOE_WAIT_ABORT: + case DOE_WAIT_ABORT_ON_ERR: + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) { + /* Back to normal state - carry on */ + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + /* + * For deliberately triggered abort, someone is + * waiting. + */ + if (doe->state == DOE_WAIT_ABORT) + complete(&doe->abort_c); + + doe->state = DOE_IDLE; + return; + } + if (time_after(jiffies, doe->timeout_jiffies)) { + /* Task has timed out and is dead - abort */ + pci_err(pdev, "DOE ABORT timed out\n"); + mutex_lock(&doe->state_lock); + doe->dead = true; + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + + if (doe->state == DOE_WAIT_ABORT) + complete(&doe->abort_c); + } + return; + } + +err_abort: + doe->state = DOE_WAIT_ABORT_ON_ERR; + pci_doe_abort_start(doe); +err_busy: + task->rv = rc; + task->cb(task->private); + /* If here via err_busy, signal the task done. */ + if (doe->state == DOE_IDLE) { + mutex_lock(&doe->state_lock); + doe->cur_task = NULL; + mutex_unlock(&doe->state_lock); + wake_up_interruptible(&doe->wq); + } +} + +/** + * pci_doe_exchange_sync() - Send a request, then wait for and receive a response + * @doe: DOE mailbox state structure + * @ex: Description of the buffers and Vendor ID + type used in this + * request/response pair + * + * Excess data will be discarded. 
+ * + * RETURNS: payload in bytes on success, < 0 on error + */ +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, struct pci_doe_exchange *ex) +{ + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); + struct pci_doe_task task; + DECLARE_COMPLETION_ONSTACK(c); + + if (!doe) + return -EAGAIN; + + /* DOE requests must be a whole number of DW */ + if (ex->request_pl_sz % sizeof(u32)) + return -EINVAL; + + task.ex = ex; + task.cb = pci_doe_task_complete; + task.private = &c; + +again: + mutex_lock(&doe->state_lock); + if (doe->cur_task) { + mutex_unlock(&doe->state_lock); + wait_event_interruptible(doe->wq, doe->cur_task == NULL); + goto again; + } + + if (doe->dead) { + mutex_unlock(&doe->state_lock); + return -EIO; + } + doe->cur_task = &task; + schedule_delayed_work(&doe->statemachine, 0); + mutex_unlock(&doe->state_lock); + + wait_for_completion(&c); + + return task.rv; +} +EXPORT_SYMBOL_GPL(pci_doe_exchange_sync); + +/** + * pci_doe_supports_prot() - Return if the DOE instance supports the given protocol + * @pdev: Device on which to find the DOE instance + * @vid: Protocol Vendor ID + * @type: protocol type + * + * This device can then be passed to pci_doe_exchange_sync() to execute a mailbox + * exchange through that DOE mailbox. + * + * RETURNS: True if the DOE device supports the protocol specified + */ +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type) +{ + struct pci_doe *doe = dev_get_drvdata(&doe_dev->adev.dev); + int i; + + if (!doe) + return false; + + for (i = 0; i < doe->num_prots; i++) + if ((doe->prots[i].vid == vid) && + (doe->prots[i].type == type)) + return true; + + return false; +} +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); + +static int pci_doe_discovery(struct pci_doe *doe, u8 *index, u16 *vid, + u8 *protocol) +{ + u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, *index); + u32 response_pl; + struct pci_doe_exchange ex = { + .prot.vid = PCI_VENDOR_ID_PCI_SIG, + .prot.type = PCI_DOE_PROTOCOL_DISCOVERY, + .request_pl = &request_pl, + .request_pl_sz = sizeof(request_pl), + .response_pl = &response_pl, + .response_pl_sz = sizeof(response_pl), + }; + int ret; + + ret = pci_doe_exchange_sync(doe->doe_dev, &ex); + if (ret < 0) + return ret; + + if (ret != sizeof(response_pl)) + return -EIO; + + *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl); + *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, response_pl); + *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, response_pl); + + return 0; +} + +static int pci_doe_cache_protocols(struct pci_doe *doe) +{ + u8 index = 0; + int rc; + + /* Discovery protocol must always be supported and must report itself */ + doe->num_prots = 1; + doe->prots = kcalloc(doe->num_prots, sizeof(*doe->prots), GFP_KERNEL); + if (doe->prots == NULL) + return -ENOMEM; + + do { + struct pci_doe_protocol *prot; + + prot = &doe->prots[doe->num_prots - 1]; + rc = pci_doe_discovery(doe, &index, &prot->vid, &prot->type); + if (rc) + goto err_free_prots; + + if (index) { + struct pci_doe_protocol *prot_new; + + doe->num_prots++; + prot_new = krealloc(doe->prots, + sizeof(*doe->prots) * doe->num_prots, + GFP_KERNEL); + if (prot_new == NULL) { + rc = -ENOMEM; + goto err_free_prots; + } + doe->prots = prot_new; + } + } while (index); + + return 0; + +err_free_prots: + kfree(doe->prots); + doe->num_prots = 0; + doe->prots = NULL; + return rc; +} + +static int pci_doe_abort(struct pci_doe *doe) +{ + reinit_completion(&doe->abort_c); + mutex_lock(&doe->state_lock); + 
doe->abort = true; + mutex_unlock(&doe->state_lock); + schedule_delayed_work(&doe->statemachine, 0); + wait_for_completion(&doe->abort_c); + + if (doe->dead) + return -EIO; + + return 0; +} + +static void pci_doe_release_irq(struct pci_doe *doe) +{ + if (doe->irq > 0) + free_irq(doe->irq, doe); +} + +static int pci_doe_register(struct pci_doe *doe) +{ + struct pci_dev *pdev = doe->doe_dev->pdev; + bool poll = !pci_dev_msi_enabled(pdev); + int offset = doe->doe_dev->cap_offset; + int rc, irq; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); + + if (!poll && FIELD_GET(PCI_DOE_CAP_INT, val)) { + irq = pci_irq_vector(pdev, FIELD_GET(PCI_DOE_CAP_IRQ, val)); + if (irq < 0) + return irq; + + doe->irq_name = kasprintf(GFP_KERNEL, "DOE[%s]", + doe->doe_dev->adev.name); + if (!doe->irq_name) + return -ENOMEM; + + rc = request_irq(irq, pci_doe_irq, 0, doe->irq_name, doe); + if (rc) + goto err_free_name; + + doe->irq = irq; + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, + PCI_DOE_CTRL_INT_EN); + } + + /* Reset the mailbox by issuing an abort */ + rc = pci_doe_abort(doe); + if (rc) + goto err_free_irqs; + + /* Ensure the pci device remains until this driver is done with it */ + get_device(&pdev->dev); + + return 0; + +err_free_irqs: + pci_doe_release_irq(doe); +err_free_name: + kfree(doe->irq_name); + return rc; +} + +static void pci_doe_unregister(struct pci_doe *doe) +{ + pci_doe_release_irq(doe); + kfree(doe->irq_name); + put_device(&doe->doe_dev->pdev->dev); +} + +/* + * pci_doe_probe() - Set up the Mailbox + * @aux_dev: Auxiliary Device + * @id: Auxiliary device ID + * + * Probe the mailbox found for all protocols and set up the Mailbox + * + * RETURNS: 0 on success, < 0 on error + */ +static int pci_doe_probe(struct auxiliary_device *aux_dev, + const struct auxiliary_device_id *id) +{ + struct pci_doe_dev *doe_dev = container_of(aux_dev, + struct pci_doe_dev, + adev); + struct pci_doe *doe; + int rc; + + doe = kzalloc(sizeof(*doe), GFP_KERNEL); + if (!doe) + return -ENOMEM; + + mutex_init(&doe->state_lock); + init_completion(&doe->abort_c); + doe->doe_dev = doe_dev; + init_waitqueue_head(&doe->wq); + INIT_DELAYED_WORK(&doe->statemachine, doe_statemachine_work); + dev_set_drvdata(&aux_dev->dev, doe); + + rc = pci_doe_register(doe); + if (rc) + goto err_free; + + rc = pci_doe_cache_protocols(doe); + if (rc) { + pci_doe_unregister(doe); + goto err_free; + } + + return 0; + +err_free: + kfree(doe); + return rc; +} + +static void pci_doe_remove(struct auxiliary_device *aux_dev) +{ + struct pci_doe *doe = dev_get_drvdata(&aux_dev->dev); + + /* First halt the state machine */ + cancel_delayed_work_sync(&doe->statemachine); + kfree(doe->prots); + pci_doe_unregister(doe); + kfree(doe); +} + +static const struct auxiliary_device_id pci_doe_auxiliary_id_table[] = { + {.name = "cxl_pci.doe", }, + {}, +}; + +MODULE_DEVICE_TABLE(auxiliary, pci_doe_auxiliary_id_table); + +struct auxiliary_driver pci_doe_auxiliary_drv = { + .name = "pci_doe_drv", + .id_table = pci_doe_auxiliary_id_table, + .probe = pci_doe_probe, + .remove = pci_doe_remove +}; + +static int __init pci_doe_init_module(void) +{ + int ret; + + ret = auxiliary_driver_register(&pci_doe_auxiliary_drv); + if (ret) { + pr_err("Failed pci_doe auxiliary_driver_register() ret=%d\n", + ret); + return ret; + } + + return 0; +} + +static void __exit pci_doe_exit_module(void) +{ + auxiliary_driver_unregister(&pci_doe_auxiliary_drv); +} + +module_init(pci_doe_init_module); +module_exit(pci_doe_exit_module); +MODULE_LICENSE("GPL 
v2"); diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h new file mode 100644 index 000000000000..8380b7ad33d4 --- /dev/null +++ b/include/linux/pci-doe.h @@ -0,0 +1,63 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Data Object Exchange was added as an ECN to the PCIe r5.0 spec. + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + */ + +#include <linux/completion.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/auxiliary_bus.h> + +#ifndef LINUX_PCI_DOE_H +#define LINUX_PCI_DOE_H + +#define DOE_DEV_NAME "doe" + +struct pci_doe_protocol { + u16 vid; + u8 type; +}; + +/** + * struct pci_doe_exchange - represents a single query/response + * + * @prot: DOE Protocol + * @request_pl: The request payload + * @request_pl_sz: Size of the request payload + * @response_pl: The response payload + * @response_pl_sz: Size of the response payload + */ +struct pci_doe_exchange { + struct pci_doe_protocol prot; + u32 *request_pl; + size_t request_pl_sz; + u32 *response_pl; + size_t response_pl_sz; +}; + +/** + * struct pci_doe_dev - DOE mailbox device + * + * @adrv: Auxiliary Driver data + * @pdev: PCI device this belongs to + * @offset: Capability offset + * + * This represents a single DOE mailbox device. Devices should create this + * device and register it on the Auxiliary bus for the DOE driver to maintain. + * + */ +struct pci_doe_dev { + struct auxiliary_device adev; + struct pci_dev *pdev; + int cap_offset; +}; + +/* Library operations */ +int pci_doe_exchange_sync(struct pci_doe_dev *doe_dev, + struct pci_doe_exchange *ex); +bool pci_doe_supports_prot(struct pci_doe_dev *doe_dev, u16 vid, u8 type); + +#endif diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index e709ae8235e7..1073cd1916e1 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -730,7 +730,8 @@ #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ -#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT +#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE #define PCI_EXT_CAP_DSN_SIZEOF 12 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40 @@ -1092,4 +1093,30 @@ #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 +/* Data Object Exchange */ +#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ +#define PCI_DOE_CAP_INT 0x00000001 /* Interrupt Support */ +#define PCI_DOE_CAP_IRQ 0x00000ffe /* Interrupt Message Number */ +#define PCI_DOE_CTRL 0x08 /* DOE Control Register */ +#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */ +#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */ +#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */ +#define PCI_DOE_STATUS 0x0c /* DOE Status Register */ +#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */ +#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */ +#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */ +#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */ +#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */ +#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */ + +/* DOE Data Object - note not actually registers */ +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000 +#define 
PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff + +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000 +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000 + #endif /* LINUX_PCI_REGS_H */
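[Editor's note: for reference, a rough sketch of the other side of the
auxiliary-bus contract described in pci-doe.h above: the parent PCI driver
allocates one struct pci_doe_dev per discovered DOE capability and
registers it so this driver can bind. This is an illustration, not part
of the patch; the function names, the error-handling granularity and the
@id scheme are assumptions, and matching the "cxl_pci.doe" id table entry
additionally assumes the registering code lives in the cxl_pci module.]

#include <linux/auxiliary_bus.h>
#include <linux/pci.h>
#include <linux/pci-doe.h>
#include <linux/slab.h>

/* Free the wrapper once the auxiliary core drops its last reference. */
static void example_doe_dev_release(struct device *dev)
{
	struct auxiliary_device *adev =
		container_of(dev, struct auxiliary_device, dev);

	kfree(container_of(adev, struct pci_doe_dev, adev));
}

/*
 * Illustrative registration of a single DOE mailbox found at @cap_offset.
 * @id must be unique per registering module for the "doe" name.
 */
static int example_register_doe(struct pci_dev *pdev, int cap_offset, u32 id)
{
	struct pci_doe_dev *doe_dev;
	int rc;

	doe_dev = kzalloc(sizeof(*doe_dev), GFP_KERNEL);
	if (!doe_dev)
		return -ENOMEM;

	doe_dev->pdev = pdev;
	doe_dev->cap_offset = cap_offset;

	/* Matched by the id table above as "<KBUILD_MODNAME>.doe" */
	doe_dev->adev.name = DOE_DEV_NAME;
	doe_dev->adev.id = id;
	doe_dev->adev.dev.parent = &pdev->dev;
	doe_dev->adev.dev.release = example_doe_dev_release;

	rc = auxiliary_device_init(&doe_dev->adev);
	if (rc) {
		kfree(doe_dev);
		return rc;
	}

	rc = auxiliary_device_add(&doe_dev->adev);
	if (rc)
		/* Drops the init reference; frees doe_dev via release(). */
		auxiliary_device_uninit(&doe_dev->adev);

	return rc;
}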