diff mbox

pci-passthrough loses msi-x interrupts ability after domain destroy

Message ID CACMJ4GbEp3sEQsc93OsiRh3o1T+5F_byH9yWcKGa0Bn2e9XGUg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Christopher Clark Sept. 22, 2017, 2:09 a.m. UTC
On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
<linux@eikelenboom.it> wrote:
>
> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
>
>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
>>>
>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
>>> device from Dom0.
>>>
>>> When creating the HVM domain after an hypervisor cold boot, the HVM
>>> domain can access and use the PCIe controller without problem.
>>>
>>> However, if the HVM domain is destroyed then restarted, it won't be
>>> able to use the pass-through PCI device anymore. The PCI device is
>>> seen and can be mapped, however, the interrupts will not be passed to
>>> the HVM domain anymore (this is visible under a Linux guest as
>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
>>> is the same.
>>>
>>> A few interesting hints I noticed:
>>>
>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
>>> the "muted interrupts" states, I noted a difference between the
>>> MSI-X caps:
>>>
>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
>>>                                             ^^^^^^^
>
>> IMHO it seems that either your device is not able to perform a reset
>> successfully, or Linux is not correctly performing such reset. I don't
>> think there's a lot that can be done from the Xen side.
>
> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
> but also almost none support a real pci FLR.
>
> In the distant past Konrad has made a patchset that implemented a bus reset and
> reseting config space. (It piggy backed on already existing libxl mechanism of
> trying to call on a syfs "do_flr" attribute which triggers pciback to perform
> the busreset and rewrite of config space for the device.
>
> I use that patchset ever since for my pci-passtrough needs and it works pretty
> well. I can shutdown an restart VM's with pci devices passed trhough (also AMD
> Radeon graphic cards).

Just to confirm the utility of that piece of work: OpenXT also uses an
extended version of that same patch to perform device reset for
passthrough.

I've attached a copy of that OpenXT patch to this message and it can
also be obtained from our git repository:
https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
This version creates a sysfs node named "reset_device" and the OpenXT
libxl toolstack is patched to use that node instead of "do_flr".

Konrad's original work encountered pushback on upstream acceptance at
the time it was developed. I'm not sure I've found where that
discussion ended. Is there any prospect of a more comprehensive reset
mechanism being accepted into xen-pciback, or elsewhere in the kernel?

As noted in the original LKML threads, vfio has similar relevant pci
device reset retry logic. (Thanks to Rich Persaud for this pointer:)
http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353

libvirt also performs similar reset logic, using a direct low level
interface to config space (Thanks to Marek for this pointer, libvirt
is used by Qubes:)
https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
I thinks this indicates that it would be possible to extend libxl to
do something similar, but that seems less satisfactory compared to
performing the work in a kernel-provided implementation.

Is there a way forward to providing this functionality within Xen
software or Linux?

Christopher
--

openxt.org

Comments

Pasi Kärkkäinen Sept. 22, 2017, 6:58 a.m. UTC | #1
Hi,

On Thu, Sep 21, 2017 at 07:09:12PM -0700, Christopher Clark wrote:
> On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> <linux@eikelenboom.it> wrote:
> >
> > On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
> >
> >> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> >>>
> >>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> >>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> >>> device from Dom0.
> >>>
> >>> When creating the HVM domain after an hypervisor cold boot, the HVM
> >>> domain can access and use the PCIe controller without problem.
> >>>
> >>> However, if the HVM domain is destroyed then restarted, it won't be
> >>> able to use the pass-through PCI device anymore. The PCI device is
> >>> seen and can be mapped, however, the interrupts will not be passed to
> >>> the HVM domain anymore (this is visible under a Linux guest as
> >>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> >>> is the same.
> >>>
> >>> A few interesting hints I noticed:
> >>>
> >>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> >>> the "muted interrupts" states, I noted a difference between the
> >>> MSI-X caps:
> >>>
> >>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> >>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> >>>                                             ^^^^^^^
> >
> >> IMHO it seems that either your device is not able to perform a reset
> >> successfully, or Linux is not correctly performing such reset. I don't
> >> think there's a lot that can be done from the Xen side.
> >
> > Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
> > but also almost none support a real pci FLR.
> >
> > In the distant past Konrad has made a patchset that implemented a bus reset and
> > reseting config space. (It piggy backed on already existing libxl mechanism of
> > trying to call on a syfs "do_flr" attribute which triggers pciback to perform
> > the busreset and rewrite of config space for the device.
> >
> > I use that patchset ever since for my pci-passtrough needs and it works pretty
> > well. I can shutdown an restart VM's with pci devices passed trhough (also AMD
> > Radeon graphic cards).
> 
> Just to confirm the utility of that piece of work: OpenXT also uses an
> extended version of that same patch to perform device reset for
> passthrough.
> 
> I've attached a copy of that OpenXT patch to this message and it can
> also be obtained from our git repository:
> https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> This version creates a sysfs node named "reset_device" and the OpenXT
> libxl toolstack is patched to use that node instead of "do_flr".
> 
> Konrad's original work encountered pushback on upstream acceptance at
> the time it was developed. I'm not sure I've found where that
> discussion ended. Is there any prospect of a more comprehensive reset
> mechanism being accepted into xen-pciback, or elsewhere in the kernel?
> 
> As noted in the original LKML threads, vfio has similar relevant pci
> device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> 
> libvirt also performs similar reset logic, using a direct low level
> interface to config space (Thanks to Marek for this pointer, libvirt
> is used by Qubes:)
> https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> I thinks this indicates that it would be possible to extend libxl to
> do something similar, but that seems less satisfactory compared to
> performing the work in a kernel-provided implementation.
> 
> Is there a way forward to providing this functionality within Xen
> software or Linux?
> 

Adding Konrad to CC ..


-- Pasi

> Christopher
> --
> 
> openxt.org

> From d686351d8ea4a1ea1d755d0a10f6f14d1c870911 Mon Sep 17 00:00:00 2001
> From: Kyle Temkin <ktemkin@binghamton.edu>
> Date: Wed, 8 Apr 2015 00:58:24 -0400
> Subject: [PATCH] Add thorough reset interface to pciback's sysfs.
> 
> --------------------------------------------------------------------------------
> SHORT DESCRIPTION:
> --------------------------------------------------------------------------------
> Adds an interface that allows "more thorough" resets to be performed
> on devices which don't support Function Level Resets (FLRs). This
> interface should allow the toolstack to ensure that a PCI device is in a
> known state prior to passing it through to a VM.
> 
> --------------------------------------------------------------------------------
> LONG DESCRIPTION:
> --------------------------------------------------------------------------------
> 
> From Konrad Rzeszutek Wilk's original post to xen-devel and the LKML:
> 
>   The life-cycle of a PCI device in Xen pciback is complex
>   and is constrained by the PCI generic locking mechanism.
> 
>   It starts with the device being binded to us - for which
>   we do a device function reset (and done via SysFS
>   so the PCI lock is held)
> 
>   If the device is unbinded from us - we also do a function
>   reset (also done via SysFS so the PCI lock is held).
> 
>   If the device is un-assigned from a guest - we do a function
>   reset (no PCI lock).
> 
>   All on the individual PCI function level (so bus:device:function).
> 
>   Unfortunatly a function reset is not adequate for certain
>   PCIe devices. The reset for an individual PCI function "means
>   device must support FLR (PCIe or AF), PM reset on D3hot->D0
>   device specific reset, or be a singleton device on a bus
>   a secondary bus reset.  FLR does not have widespread support,
>   reset is not very reliable, and bus topology is dictated by the
>   and device design.  We need to provide a means for a user to
>   a bus reset in cases where the existing mechanisms are not
>    or not reliable. " (Adam Williamson, 'vfio-pci: PCI hot reset
>   interface' commit 8b27ee60bfd6bbb84d2df28fa706c5c5081066ca).
> 
>   As such to do a slot or a bus reset is we need another mechanism.
>   This is not exposed SysFS as there is no good way of exposing
>   a bus topology there.
> 
>   This is due to the complexity - we MUST know that the different
>   functions off a PCIe device are not in use by other drivers, or
>   if they are in use (say one of them is assigned to a guest
>   and the other is idle) - it is still OK to reset the slot
>   (assuming both of them are owned by Xen pciback).
> 
>   This patch does that by doing an slot or bus reset (if
>   slot not supported) if all of the functions of a PCIe
>   device belong to Xen PCIback. We do not care if the device is
>   in-use as we depend on the toolstack to be aware of this -
>   however if it is we will WARN the user.
> 
>   Due to the complexity with the PCI lock we cannot do
>   the reset when a device is binded ('echo $BDF > bind')
>   or when unbinded ('echo $BDF > unbind') as the pci_[slot|bus]_reset
>   also take the same lock resulting in a dead-lock.
> 
>   Putting the reset function in a workqueue or thread
>   won't work either - as we have to do the reset function
>   outside the 'unbind' context (it holds the PCI lock).
>   But once you 'unbind' a device the device is no longer
>   under the ownership of Xen pciback and the pci_set_drvdata
>   has been reset so we cannot use a thread for this.
> 
>   Instead of doing all this complex dance, we depend on the toolstack
>   doing the right thing. As such implement [... a SysFS attribute]
>   which [... the toolstack]  uses when a device is detached or attached
>   from/to a guest. It bypasses the need to worry about the PCI lock.
> 
>   To not inadvertly do a bus reset that would affect devices that
>   are in use by other drivers (other than Xen pciback) prior
>   to the reset we check that all of the devices under the bridge
>   are owned by Xen pciback. If they are not we do not do
>   the bus (or slot) reset.
> 
>   We also warn the user if the device is in use - but still
>   continue with the reset. This should not happen as the toolstack
>   also does the check.
> 
> --
> 
> Our version of the patch has been modified to use a less confusing
> sysfs name. The original name ('do_flr') is inappropriate, as it
> implies a function level reset; it is entirely possible that the patch
> code will use a bus-level reset when appropriate.
> 
> The new sysfs entry is located at:
> 
>   /sys/bus/pci/drivers/pciback/reset_device
> 
> and can be activated by writing a domain:bus:device:function device
> identifier into the sysfs file. As an example:
> 
>   echo "0000:01:00.0" > /sys/bus/pci/drivers/pciback/reset_device
> 
> would reset the device matching the D:BDF descriptor above.
> 
> --------------------------------------------------------------------------------
> CHANGELOG:
> --------------------------------------------------------------------------------
> This is a port of a patch that likely had many authors, including:
>     -Konrad Rzeszutek Wilk
>     -Alex Williamson
>     -Ross Phillipson <rphilipson@ainfosec.com>
> Ported to OpenXT by: Kyle J. Temkin <temkink@ainfosec.com>, 4/8/15
> Rewrite by:          Kyle J. Temkin <temkink@ainfosec.com>, 4/10/15
> 
> --------------------------------------------------------------------------------
> DEPENDENCIES
> --------------------------------------------------------------------------------
> This patch requires ONE of the following:
>   -A relatively modern linux kernel (3.18+) as a base; which provides
>    the PCI functions used; or
>   -Our PCI reset backports patch (backport-pci-reset-functionality.patch),
>    which backports the relevant functionality to 3.11.
> 
> To take advantage of this patch, the utilized toolstack should be
> changed to:
>   -Use the provided "reset_device" property, rather than the PCI
>    device's sysfs "reset" entry. This enables resets beyond a FLR to be
>    used.
>   -Ensure that all functions of a given device are passed through
>    together. This allows us to use some of the more thorugh resetting
>    techniques, when possible.
> 
> --------------------------------------------------------------------------------
> REMOVAL
> --------------------------------------------------------------------------------
> This patch provides a service which is necessary for proper passthrough
> of many PCI cards: a generalized ability to reset PCI devices, without
> requiring that the device support FLR or power-management based resets.
> 
> This patch will be necessary until either the Linux PCI subsystem or Xen
> PCIback drivers are modified to provide this support; or until cards
> without proper FLR support are no longer supported.
> 
> --------------------------------------------------------------------------------
> UPSTREAM PLAN
> --------------------------------------------------------------------------------
> 
> This code is taken from a patch which was originally proposed and
> rejected from upstream on the LKML and xen-devel. An upstream
> implementation of the functionality of this patch is still necessary;
> and can and should be implemented.
> 
> This patch will hopefully be replaced with an upstream version when
> community concensus has produced a single "blessed" method of
> accomplishing its functionality.
> 
> --------------------------------------------------------------------------------
> PATCHES
> --------------------------------------------------------------------------------
> ---
>  drivers/xen/xen-pciback/pci_stub.c | 338 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 312 insertions(+), 26 deletions(-)
> 
> Index: linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
> ===================================================================
> --- linux-4.9.40.orig/drivers/xen/xen-pciback/pci_stub.c
> +++ linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
> @@ -102,10 +102,9 @@ static void pcistub_device_release(struc
>  
>  	xen_unregister_device_domain_owner(dev);
>  
> -	/* Call the reset function which does not take lock as this
> -	 * is called from "unbind" which takes a device_lock mutex.
> -	 */
> -	__pci_reset_function_locked(dev);
> +
> +	/* Reset is done by the toolstack by using 'reset_device' on the
> +	 * SysFS. */
>  	if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
>  		dev_info(&dev->dev, "Could not reload PCI state\n");
>  	else
> @@ -125,9 +124,6 @@ static void pcistub_device_release(struc
>  				 err);
>  	}
>  
> -	/* Disable the device */
> -	xen_pcibk_reset_device(dev);
> -
>  	kfree(dev_data);
>  	pci_set_drvdata(dev, NULL);
>  
> @@ -224,6 +220,271 @@ struct pci_dev *pcistub_get_pci_dev_by_s
>  	return found_dev;
>  }
>  
> +
> +/**
> + * Returns true iff the given device supports PCIe FLRs.
> + */
> +static bool __device_supports_pcie_flr(struct pci_dev *dev)
> +{
> +	u32 cap;
> +
> +	/*
> +         * Read the device's capabilities. Note that this can be used even on legacy
> +	 * PCI devices (and not just on PCIe devices)-- it indicates that no capabilities
> +	 * are supported if the device is legacy PCI by setting cap to 0.
> +	 */
> +	 pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
> +
> +	/* Return true iff the device advertises supporting an FLR. */
> +	return (cap & PCI_EXP_DEVCAP_FLR);
> +}
> +
> +
> +/**
> + * Returns true iff the given device supports PCI Advanced Functionality (AF) FLRs.
> + */
> +static bool __device_supports_pci_af_flr(struct pci_dev *dev)
> +{
> +	int pos;
> +	u8 capability_flags;
> +
> +	/* First, try to find the location of the PCI Advanced Functionality capability byte. */
> +	pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> +
> +	/*
> +	 * If we weren't able to find the capability byte, this device doesn't support
> +	 * the Advanced Functionality extensions, and thus won't support AF FLR.
> +	 */
> +	if (!pos)
> +		return false;
> +
> +	/* Read the capabilities advertised in the AF capability byte. */
> +	pci_read_config_byte(dev, pos + PCI_AF_CAP, &capability_flags);
> +
> +	/*
> +	 * If the device does support AF, it will advertise FLR support via the
> +	 * PCI_AF_CAP_FLR bit. We'll also check for the Transactions Pending (TP)
> +	 * mechanism, as the kernel requires this extension to issue an AF FLR.
> +	 * (Internally, the PCI reset code needs to be able to wait for all
> +	 * pending transactions to complete prior to issuing the AF FLR.)
> +	 */
> +	return (capability_flags & PCI_AF_CAP_TP) && (capability_flags & PCI_AF_CAP_FLR);
> +}
> +
> +
> +/**
> + * Returns true iff the given device adverstises supporting function-
> + * level-reset (FLR).
> + */
> +static bool device_supports_flr(struct pci_dev *dev)
> +{
> +	return __device_supports_pci_af_flr(dev) || __device_supports_pcie_flr(dev);
> +}
> +
> +
> +/**
> + * Returns true iff the given device is located in a slot that
> + * supports hotplugging slot resets.
> + */
> +static bool device_supports_slot_reset(struct pci_dev *dev)
> +{
> +	return !pci_probe_reset_slot(dev->slot);
> +}
> +
> +
> +/**
> + * Returns true iff the given device is located on a bus that
> + * we can reset. Note that root bridges are excluded, as this
> + * would cause more than just an SBR.
> + */
> +static bool device_supports_bus_reset(struct pci_dev *dev)
> +{
> +	return !pci_is_root_bus(dev->bus) && !pci_probe_reset_bus(dev->bus);
> +}
> +
> +
> +/**
> + * Out argument for the __safe_to_sbr_device_callback function.
> + */
> +struct safe_to_sbr_arguments {
> +
> +	//Stores the most recently encountered PCI device that does
> +	//not belong to pciback. As used below, this is the result of a
> +	//search for a non-pciback device on a bus; we stop upon finding
> +	//the first non-pciback device.
> +	struct pci_dev *last_non_pciback_device;
> +
> +	//Stores the number of pciback devices that appear to be in use
> +	//on the bus in question.
> +	int use_count;
> +
> +};
> +
> +
> +/**
> + *	A callback function which determines if a given PCI device is owned by pciback,
> + *	and whether the given device is in use. Used by safe_to_sbr_device.
> + *
> + *	@param dev The PCI device to be checked.
> + *	@param data An out argument of type struct safe_to_sbr_device_callback_arguments.
> + *			Updated to indicate the result of the search. See the struct's definition
> + *			for more details.
> + *
> + */
> +static int __safe_to_sbr_device_callback(struct pci_dev *dev, void *data)
> +{
> +
> +	struct pcistub_device *psdev;
> +
> +	bool device_owned_by_pciback = false;
> +	struct safe_to_sbr_arguments *arg = data;
> +
> +	unsigned long flags;
> +
> +	//Ensure that we have exclusive access to the list of PCI devices,
> +	//so we can traverse it.
> +	spin_lock_irqsave(&pcistub_devices_lock, flags);
> +
> +	//Iterate over all PCI devices owned by the pci stub.
> +	list_for_each_entry(psdev, &pcistub_devices, dev_list) {
> +
> +		//If the given device is owned by pciback...
> +		if (psdev->dev == dev) {
> +
> +			//mark it as a pciback device.
> +			device_owned_by_pciback = true;
> +
> +			//If we have a physical device associated with the pciback device,
> +			//mark this device as in-use.
> +			if (psdev->pdev)
> +				arg->use_count++;
> +
> +			//Stop searching; we've found a the PCIback device associated with this one.
> +			break;
> +		}
> +	}
> +
> +	//Release the PCI device lock...
> +	spin_unlock_irqrestore(&pcistub_devices_lock, flags);
> +
> +	//... and report if we've found a device that's not owned by pciback.
> +	dev_dbg(&dev->dev, "%s\n", device_owned_by_pciback ? "is owned by pciback, and can be reset if not in use."
> +			: "not owned by pciback, and thus cannot be reset.");
> +
> +	//If we've found a device that's not owned by pciback, update our data
> +	//argument so it points to the most recent unowned device. (We check
> +	//this like a flag, later: if it's never set, no one owns the device!)
> +	if (!device_owned_by_pciback)
> +		arg->last_non_pciback_device = dev;
> +
> +	//If we've found a device that's not owned by pciback, return false--
> +	//this indicates that pci_walk_bus should cease its walk.
> +	return !device_owned_by_pciback;
> +}
> +
> +
> +/**
> + * Returns true iff it should be safe to issue a secondary bus reset
> + * to the device; that is, if an SBR can be issued without disrupting
> + * other devices.
> + */
> +static bool safe_to_sbr_device(struct pci_dev *dev)
> +{
> +	struct safe_to_sbr_arguments walk_result = { .last_non_pciback_device = NULL, .use_count = 0 };
> +
> +	//Walk the PCI bus, attempting to find if any of the given devices
> +	pci_walk_bus(dev->bus, __safe_to_sbr_device_callback, &walk_result);
> +
> +	//If the device is in use, emit a warning error.
> +	if(walk_result.use_count > 0)
> +		dev_dbg(&dev->dev, "is in use; currently not safe to SBR device.\n");
> +
> +	//Return true iff we did not pick up any other devices
> +	//that were either in use, or not owned by pciback.
> +	return (walk_result.last_non_pciback_device == NULL) && (walk_result.use_count == 0);
> +}
> +
> +
> +/**
> + * Attempt a raw reset of the provided PCI device-- via any
> + * method available to us. This method prefers the gentlest
> + * possible reset method-- currently an FLR, which many
> + * PCIe devices should support.
> + *
> + * @param dev The pci device to be reset.
> + * @return Zero on success, or the error code generated by the reset method on failure.
> + */
> +static int __pcistub_raw_device_reset(struct pci_dev *dev)
> +{
> +	//Determine if bus resetting techniques (SBR, slot resets)
> +	//are safe, and thus should be allowed.
> +	int allow_bus_reset = safe_to_sbr_device(dev);
> +
> +	//If FLRs are supported; we'll try to let the linux kernel
> +	//manually reset the device.
> +	if(device_supports_flr(dev)) {
> +		dev_dbg(&dev->dev, "Resetting device using an FLR.");
> +		return pci_reset_function(dev);
> +	}
> +
> +	//Next, we'll try the next gentlest: a hotplugging reset
> +	//of the PCI slot.
> +	if(allow_bus_reset && device_supports_slot_reset(dev)) {
> +		dev_dbg(&dev->dev, "Resetting device using a slot reset.");
> +		return pci_try_reset_slot(dev->slot);
> +	}
> +
> +	//Finally, we'll try the most drastic: resetting the parent
> +	//PCI bus-- which we can only do conditionally.
> +	if(allow_bus_reset && device_supports_bus_reset(dev)) {
> +		dev_dbg(&dev->dev, "Resetting device using an SBR.");
> +		return pci_try_reset_bus(dev->bus);
> +	}
> +
> +	//If we weren't able to reset the device by any of our known-good methods,
> +	//fall back to the linux kernel's reset function. Unfortunately, this considers a
> +	//power management reset to be a valid reset; though this doesn't work for many devices--
> +	//especially GPUs.
> +	dev_err(&dev->dev, "No reset methods available for %s. Falling back to kernel reset.", pci_name(dev));
> +	pci_reset_function(dev);
> +
> +	//Return an error code, indicating that we likely did not reset the device correctly.
> +	return -ENOTTY;
> +}
> +
> +
> +/**
> + * Resets the target (pciback-owned) PCI device. Primarily intended
> + * for use by the toolstack, so it can ensure a consistent PCI device
> + * state on VM startup.
> + *
> + * @param dev The device to be reset.
> + * @return Zero on success, or a negated error code on failure.
> + */
> +static int pcistub_reset_pci_dev(struct pci_dev *dev)
> +{
> +	int rc;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	/*
> +	 * Takes the PCI lock. OK to do it as we are never called
> +	 * from 'unbind' state and don't deadlock.
> +	 */
> +	rc =__pcistub_raw_device_reset(dev);
> +	pci_restore_state(dev);
> +
> +	/* This disables the device. */
> +	xen_pcibk_reset_device(dev);
> +
> +	/* And cleanup up our emulated fields. */
> +	xen_pcibk_config_reset_dev(dev);
> +	return rc;
> +}
> +
> +
> +
>  struct pci_dev *pcistub_get_pci_dev(struct xen_pcibk_device *pdev,
>  				    struct pci_dev *dev)
>  {
> @@ -279,11 +540,13 @@ void pcistub_put_pci_dev(struct pci_dev
>  	* pcistub and xen_pcibk when AER is in processing
>  	*/
>  	down_write(&pcistub_sem);
> -	/* Cleanup our device
> -	 * (so it's ready for the next domain)
> -	 */
>  	device_lock_assert(&dev->dev);
> -	__pci_reset_function_locked(dev);
> +	/*
> +	 * Reset is up to the toolstack.
> +	 * The toolstack has to call 'reset_device' before
> +	 * providing the PCI device to a guest (see pcistub_reset_device).
> +	 */
> +	//__pci_reset_function_locked(dev);
>  
>  	dev_data = pci_get_drvdata(dev);
>  	ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
> @@ -1460,6 +1723,41 @@ static ssize_t restrictive_add(struct de
>  }
>  static DRIVER_ATTR(restrictive, S_IWUSR, NULL, restrictive_add);
>  
> +/**
> + * Handles the "reset_device" sysfs attribute. This is the primary reset interface
> + * utilized by the toolstack.
> + */
> +static ssize_t pcistub_sysfs_reset_device(struct device_driver *drv, const char *buf, size_t count)
> +{
> +	int domain, bus, slot, func, err;
> +	struct pcistub_device *psdev;
> +
> +	//Attempt to convert the user's string to a BDF/slot.
> +	err = str_to_slot(buf, &domain, &bus, &slot, &func);
> +	if (err)
> +		return -ENODEV;
> +
> +	//... and then use that slot to find the pciback device.
> +	psdev = pcistub_device_find(domain, bus, slot, func);
> +
> +	//If we have a device, attempt to reset it using our internal reset path.
> +	if (psdev) {
> +		err = pcistub_reset_pci_dev(psdev->dev);
> +		pcistub_device_put(psdev);
> +
> +		//If we were not able to reset the device, return the relevant error code.
> +		if(err)
> +			err = -ENODEV;
> +	}
> +	//Otherwise, indicate that there's no such device.
> +	else {
> +		err = -ENODEV;
> +	}
> +
> +	return err ? err : count;
> +
> +}
> +static DRIVER_ATTR(reset_device, S_IWUSR, NULL, pcistub_sysfs_reset_device);
>  
>  static void pcistub_exit(void)
>  {
> @@ -1476,6 +1774,8 @@ static void pcistub_exit(void)
>  			   &driver_attr_irq_handlers);
>  	driver_remove_file(&xen_pcibk_pci_driver.driver,
>  			   &driver_attr_irq_handler_state);
> +	driver_remove_file(&xen_pcibk_pci_driver.driver,
> +			   &driver_attr_reset_device);
>  	pci_unregister_driver(&xen_pcibk_pci_driver);
>  }
>  
> @@ -1572,6 +1872,9 @@ static int __init pcistub_init(void)
>  	if (!err)
>  		err = driver_create_file(&xen_pcibk_pci_driver.driver,
>  					&driver_attr_irq_handler_state);
> +	if (!err)
> +		err = driver_create_file(&xen_pcibk_pci_driver.driver,
> +					&driver_attr_reset_device);
>  	if (err)
>  		pcistub_exit();
>  

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Sander Eikelenboom Sept. 22, 2017, 7:35 a.m. UTC | #2
On 22/09/17 04:09, Christopher Clark wrote:
> On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> <linux@eikelenboom.it> wrote:
>>
>> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
>>
>>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
>>>>
>>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
>>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
>>>> device from Dom0.
>>>>
>>>> When creating the HVM domain after an hypervisor cold boot, the HVM
>>>> domain can access and use the PCIe controller without problem.
>>>>
>>>> However, if the HVM domain is destroyed then restarted, it won't be
>>>> able to use the pass-through PCI device anymore. The PCI device is
>>>> seen and can be mapped, however, the interrupts will not be passed to
>>>> the HVM domain anymore (this is visible under a Linux guest as
>>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
>>>> is the same.
>>>>
>>>> A few interesting hints I noticed:
>>>>
>>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
>>>> the "muted interrupts" states, I noted a difference between the
>>>> MSI-X caps:
>>>>
>>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
>>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
>>>>                                             ^^^^^^^
>>
>>> IMHO it seems that either your device is not able to perform a reset
>>> successfully, or Linux is not correctly performing such reset. I don't
>>> think there's a lot that can be done from the Xen side.
>>
>> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
>> but also almost none support a real pci FLR.
>>
>> In the distant past Konrad has made a patchset that implemented a bus reset and
>> reseting config space. (It piggy backed on already existing libxl mechanism of
>> trying to call on a syfs "do_flr" attribute which triggers pciback to perform
>> the busreset and rewrite of config space for the device.
>>
>> I use that patchset ever since for my pci-passtrough needs and it works pretty
>> well. I can shutdown an restart VM's with pci devices passed trhough (also AMD
>> Radeon graphic cards).
> 
> Just to confirm the utility of that piece of work: OpenXT also uses an
> extended version of that same patch to perform device reset for
> passthrough.
> 
> I've attached a copy of that OpenXT patch to this message and it can
> also be obtained from our git repository:
> https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> This version creates a sysfs node named "reset_device" and the OpenXT
> libxl toolstack is patched to use that node instead of "do_flr".

Nice to hear there are more users of this patch. On #xen on IRC there were from time to time
also users who tried pci-passtrough and ran into this issue (and probably abandonning the idea
since having to restart your host before being able to use your pass throughed device again
defies much of the use case).
 
> Konrad's original work encountered pushback on upstream acceptance at
> the time it was developed. I'm not sure I've found where that
> discussion ended. Is there any prospect of a more comprehensive reset
> mechanism being accepted into xen-pciback, or elsewhere in the kernel?

Yeah it was nacked by David Vrabel and the discussion somewhat bleeded to death. 
From what i remember the main issue was with the naming, since it doesn't do a FLR,
the sysfs hook shouldn't be called "do_flr".

Some other perhaps minor issues i can think of are:
- No way to excempt pci-devices from this new way of resetting them.
  Perhaps there could be pci devices/topologies were this way of
  resetting causes more problems than it solves and could cause a
  regression. Unfortunately auto detecting what works doesn't seem to
  be possible. On the other hand (though only with my n=10) i haven't encountered
  such a device yet.

- The communication path between libxl and the kernel via sysfs.
  I think the preference was for a:
  a) having it use a more common used Xen communication channel or
  b) having it all self-contained in pci-back. (from my memory and the openxt patch description
     there could be some locking issue when trying to implement it this way,
     but the vfio guys had that solved for there reset implementation if i
     from one of the comments in there source code (patches by Alex Williamson
     if i remember correctly).

- Not an issue back then when the patch was made, but as the question earlier to Roger,
  the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
  If and hoow does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?


So i think David's NACK was mostly for the patchset having some hackish cosmetics.

On the upside one can conclude that this patchset is now pretty well tested over the years ;)

Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
(CC'ed them as well)

> As noted in the original LKML threads, vfio has similar relevant pci
> device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> 
> libvirt also performs similar reset logic, using a direct low level
> interface to config space (Thanks to Marek for this pointer, libvirt
> is used by Qubes:)
> https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> I thinks this indicates that it would be possible to extend libxl to
> do something similar, but that seems less satisfactory compared to
> performing the work in a kernel-provided implementation.
> 
> Is there a way forward to providing this functionality within Xen
> software or Linux> Christopher
> --
> 
> openxt.org
>
Konrad Rzeszutek Wilk Sept. 22, 2017, 7:25 p.m. UTC | #3
On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> On 22/09/17 04:09, Christopher Clark wrote:
> > On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> > <linux@eikelenboom.it> wrote:
> >>
> >> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
> >>
> >>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> >>>>
> >>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> >>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> >>>> device from Dom0.
> >>>>
> >>>> When creating the HVM domain after an hypervisor cold boot, the HVM
> >>>> domain can access and use the PCIe controller without problem.
> >>>>
> >>>> However, if the HVM domain is destroyed then restarted, it won't be
> >>>> able to use the pass-through PCI device anymore. The PCI device is
> >>>> seen and can be mapped, however, the interrupts will not be passed to
> >>>> the HVM domain anymore (this is visible under a Linux guest as
> >>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> >>>> is the same.
> >>>>
> >>>> A few interesting hints I noticed:
> >>>>
> >>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> >>>> the "muted interrupts" states, I noted a difference between the
> >>>> MSI-X caps:
> >>>>
> >>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> >>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> >>>>                                             ^^^^^^^
> >>
> >>> IMHO it seems that either your device is not able to perform a reset
> >>> successfully, or Linux is not correctly performing such reset. I don't
> >>> think there's a lot that can be done from the Xen side.
> >>
> >> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
> >> but also almost none support a real pci FLR.
> >>
> >> In the distant past Konrad has made a patchset that implemented a bus reset and
> >> reseting config space. (It piggy backed on already existing libxl mechanism of
> >> trying to call on a syfs "do_flr" attribute which triggers pciback to perform
> >> the busreset and rewrite of config space for the device.
> >>
> >> I use that patchset ever since for my pci-passtrough needs and it works pretty
> >> well. I can shutdown an restart VM's with pci devices passed trhough (also AMD
> >> Radeon graphic cards).
> > 
> > Just to confirm the utility of that piece of work: OpenXT also uses an
> > extended version of that same patch to perform device reset for
> > passthrough.
> > 
> > I've attached a copy of that OpenXT patch to this message and it can
> > also be obtained from our git repository:
> > https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> > This version creates a sysfs node named "reset_device" and the OpenXT
> > libxl toolstack is patched to use that node instead of "do_flr".
> 
> Nice to hear there are more users of this patch. On #xen on IRC there were from time to time
> also users who tried pci-passtrough and ran into this issue (and probably abandonning the idea
> since having to restart your host before being able to use your pass throughed device again
> defies much of the use case).
>  
> > Konrad's original work encountered pushback on upstream acceptance at
> > the time it was developed. I'm not sure I've found where that
> > discussion ended. Is there any prospect of a more comprehensive reset
> > mechanism being accepted into xen-pciback, or elsewhere in the kernel?
> 
> Yeah it was nacked by David Vrabel and the discussion somewhat bleeded to death. 
> >From what i remember the main issue was with the naming, since it doesn't do a FLR,
> the sysfs hook shouldn't be called "do_flr".
> 
> Some other perhaps minor issues i can think of are:
> - No way to excempt pci-devices from this new way of resetting them.
>   Perhaps there could be pci devices/topologies were this way of
>   resetting causes more problems than it solves and could cause a
>   regression. Unfortunately auto detecting what works doesn't seem to
>   be possible. On the other hand (though only with my n=10) i haven't encountered
>   such a device yet.
> 
> - The communication path between libxl and the kernel via sysfs.
>   I think the preference was for a:
>   a) having it use a more common used Xen communication channel or
>   b) having it all self-contained in pci-back. (from my memory and the openxt patch description
>      there could be some locking issue when trying to implement it this way,
>      but the vfio guys had that solved for there reset implementation if i
>      from one of the comments in there source code (patches by Alex Williamson
>      if i remember correctly).
> 
> - Not an issue back then when the patch was made, but as the question earlier to Roger,
>   the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
>   If and hoow does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?
> 
> 
> So i think David's NACK was mostly for the patchset having some hackish cosmetics.

He didn't like 'do_flr' which made sense as the patchset did not do FLR. It made a bus-reset
for more than one device (if those devices were assigned to pciback).

> 
> On the upside one can conclude that this patchset is now pretty well tested over the years ;)
> 
> Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
> (CC'ed them as well)

I've asked Govinda (CC-ed) to refresh the patchset against the lastest kernel and
repost it and see where it goes.

> 
> > As noted in the original LKML threads, vfio has similar relevant pci
> > device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> > http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> > 
> > libvirt also performs similar reset logic, using a direct low level
> > interface to config space (Thanks to Marek for this pointer, libvirt
> > is used by Qubes:)
> > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> > I thinks this indicates that it would be possible to extend libxl to
> > do something similar, but that seems less satisfactory compared to
> > performing the work in a kernel-provided implementation.
> > 
> > Is there a way forward to providing this functionality within Xen
> > software or Linux> Christopher
> > --
> > 
> > openxt.org
> > 
>
Roger Pau Monné Sept. 25, 2017, 9:54 a.m. UTC | #4
On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> - Not an issue back then when the patch was made, but as the question earlier to Roger,
>   the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
>   If and hoow does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?

pci-back will still be required in order to do pci-passthrough to PV
guests. The aim is to use the new pci-emulation layer to perform
pci-passthrough to both HVM and PVH guests, deprecating the
passthrough code inside of QEMU.

I would prefer to have the reset mechanism inside of Xen, but I'm not
sure how feasible it is to put this code inside of Xen because I
haven't looked at it yet.

Roger.
Ross Philipson Sept. 25, 2017, 2:41 p.m. UTC | #5
>
> ​[snip]​
>
>
> > So i think David's NACK was mostly for the patchset having some hackish
> cosmetics.
>
> He didn't like 'do_flr' which made sense as the patchset did not do FLR.
> It made a bus-reset
> for more than one device (if those devices were assigned to pciback).
>

​When I first wrote this, FLR was not yet implemented in the kernel so I
was actually adding FLR support; thus the name do_flr. So that is just
cargo from years ago :)​



>
> >
> > On the upside one can conclude that this patchset is now pretty well
> tested over the years ;)
> >
> > Since David has left, perhaps Jurgen/Boris/Konrad could express their
> views (again) ?
> > (CC'ed them as well)
>
> I've asked Govinda (CC-ed) to refresh the patchset against the lastest
> kernel and
> repost it and see where it goes.
>
> >
> > > As noted in the original LKML threads, vfio has similar relevant pci
> > > device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> > > http://elixir.free-electrons.com/linux/v4.14-rc1/source/
> drivers/vfio/pci/vfio_pci.c#L1353
> > >
> > > libvirt also performs similar reset logic, using a direct low level
> > > interface to config space (Thanks to Marek for this pointer, libvirt
> > > is used by Qubes:)
> > > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> > > I thinks this indicates that it would be possible to extend libxl to
> > > do something similar, but that seems less satisfactory compared to
> > > performing the work in a kernel-provided implementation.
> > >
> > > Is there a way forward to providing this functionality within Xen
> > > software or Linux> Christopher
> > > --
> > >
> > > openxt.org
> > >
> >
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
>
Pasi Kärkkäinen Nov. 2, 2017, 4:59 p.m. UTC | #6
Hi,

On Fri, Sep 22, 2017 at 03:25:00PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> > On 22/09/17 04:09, Christopher Clark wrote:
> > > On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> > > <linux@eikelenboom.it> wrote:
> > >>
> > >> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
> > >>
> > >>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> > >>>>
> > >>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> > >>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> > >>>> device from Dom0.
> > >>>>
> > >>>> When creating the HVM domain after an hypervisor cold boot, the HVM
> > >>>> domain can access and use the PCIe controller without problem.
> > >>>>
> > >>>> However, if the HVM domain is destroyed then restarted, it won't be
> > >>>> able to use the pass-through PCI device anymore. The PCI device is
> > >>>> seen and can be mapped, however, the interrupts will not be passed to
> > >>>> the HVM domain anymore (this is visible under a Linux guest as
> > >>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> > >>>> is the same.
> > >>>>
> > >>>> A few interesting hints I noticed:
> > >>>>
> > >>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> > >>>> the "muted interrupts" states, I noted a difference between the
> > >>>> MSI-X caps:
> > >>>>
> > >>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> > >>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> > >>>>                                             ^^^^^^^
> > >>
> > >>> IMHO it seems that either your device is not able to perform a reset
> > >>> successfully, or Linux is not correctly performing such reset. I don't
> > >>> think there's a lot that can be done from the Xen side.
> > >>
> > >> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
> > >> but also almost none support a real pci FLR.
> > >>
> > >> In the distant past Konrad has made a patchset that implemented a bus reset and
> > >> reseting config space. (It piggy backed on already existing libxl mechanism of
> > >> trying to call on a syfs "do_flr" attribute which triggers pciback to perform
> > >> the busreset and rewrite of config space for the device.
> > >>
> > >> I use that patchset ever since for my pci-passtrough needs and it works pretty
> > >> well. I can shutdown an restart VM's with pci devices passed trhough (also AMD
> > >> Radeon graphic cards).
> > > 
> > > Just to confirm the utility of that piece of work: OpenXT also uses an
> > > extended version of that same patch to perform device reset for
> > > passthrough.
> > > 
> > > I've attached a copy of that OpenXT patch to this message and it can
> > > also be obtained from our git repository:
> > > https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> > > This version creates a sysfs node named "reset_device" and the OpenXT
> > > libxl toolstack is patched to use that node instead of "do_flr".
> > 
> > Nice to hear there are more users of this patch. On #xen on IRC there were from time to time
> > also users who tried pci-passtrough and ran into this issue (and probably abandonning the idea
> > since having to restart your host before being able to use your pass throughed device again
> > defies much of the use case).
> >  
> > > Konrad's original work encountered pushback on upstream acceptance at
> > > the time it was developed. I'm not sure I've found where that
> > > discussion ended. Is there any prospect of a more comprehensive reset
> > > mechanism being accepted into xen-pciback, or elsewhere in the kernel?
> > 
> > Yeah it was nacked by David Vrabel and the discussion somewhat bleeded to death. 
> > >From what i remember the main issue was with the naming, since it doesn't do a FLR,
> > the sysfs hook shouldn't be called "do_flr".
> > 
> > Some other perhaps minor issues i can think of are:
> > - No way to excempt pci-devices from this new way of resetting them.
> >   Perhaps there could be pci devices/topologies were this way of
> >   resetting causes more problems than it solves and could cause a
> >   regression. Unfortunately auto detecting what works doesn't seem to
> >   be possible. On the other hand (though only with my n=10) i haven't encountered
> >   such a device yet.
> > 
> > - The communication path between libxl and the kernel via sysfs.
> >   I think the preference was for a:
> >   a) having it use a more common used Xen communication channel or
> >   b) having it all self-contained in pci-back. (from my memory and the openxt patch description
> >      there could be some locking issue when trying to implement it this way,
> >      but the vfio guys had that solved for there reset implementation if i
> >      from one of the comments in there source code (patches by Alex Williamson
> >      if i remember correctly).
> > 
> > - Not an issue back then when the patch was made, but as the question earlier to Roger,
> >   the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
> >   If and hoow does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?
> > 
> > 
> > So i think David's NACK was mostly for the patchset having some hackish cosmetics.
> 
> He didn't like 'do_flr' which made sense as the patchset did not do FLR. It made a bus-reset
> for more than one device (if those devices were assigned to pciback).
> 
> > 
> > On the upside one can conclude that this patchset is now pretty well tested over the years ;)
> > 
> > Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
> > (CC'ed them as well)
> 
> I've asked Govinda (CC-ed) to refresh the patchset against the lastest kernel and
> repost it and see where it goes.
> 

Nice. Looking forward to seeing the refreshed patchset hit the mailinglist! :)


Thanks,

-- Pasi

> > 
> > > As noted in the original LKML threads, vfio has similar relevant pci
> > > device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> > > http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> > > 
> > > libvirt also performs similar reset logic, using a direct low level
> > > interface to config space (Thanks to Marek for this pointer, libvirt
> > > is used by Qubes:)
> > > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> > > I thinks this indicates that it would be possible to extend libxl to
> > > do something similar, but that seems less satisfactory compared to
> > > performing the work in a kernel-provided implementation.
> > > 
> > > Is there a way forward to providing this functionality within Xen
> > > software or Linux> Christopher
> > > --
> > > 
> > > openxt.org
> > > 
> >
diff mbox

Patch

From d686351d8ea4a1ea1d755d0a10f6f14d1c870911 Mon Sep 17 00:00:00 2001
From: Kyle Temkin <ktemkin@binghamton.edu>
Date: Wed, 8 Apr 2015 00:58:24 -0400
Subject: [PATCH] Add thorough reset interface to pciback's sysfs.

--------------------------------------------------------------------------------
SHORT DESCRIPTION:
--------------------------------------------------------------------------------
Adds an interface that allows "more thorough" resets to be performed
on devices which don't support Function Level Resets (FLRs). This
interface should allow the toolstack to ensure that a PCI device is in a
known state prior to passing it through to a VM.

--------------------------------------------------------------------------------
LONG DESCRIPTION:
--------------------------------------------------------------------------------

From Konrad Rzeszutek Wilk's original post to xen-devel and the LKML:

  The life-cycle of a PCI device in Xen pciback is complex
  and is constrained by the PCI generic locking mechanism.

  It starts with the device being binded to us - for which
  we do a device function reset (and done via SysFS
  so the PCI lock is held)

  If the device is unbinded from us - we also do a function
  reset (also done via SysFS so the PCI lock is held).

  If the device is un-assigned from a guest - we do a function
  reset (no PCI lock).

  All on the individual PCI function level (so bus:device:function).

  Unfortunatly a function reset is not adequate for certain
  PCIe devices. The reset for an individual PCI function "means
  device must support FLR (PCIe or AF), PM reset on D3hot->D0
  device specific reset, or be a singleton device on a bus
  a secondary bus reset.  FLR does not have widespread support,
  reset is not very reliable, and bus topology is dictated by the
  and device design.  We need to provide a means for a user to
  a bus reset in cases where the existing mechanisms are not
   or not reliable. " (Adam Williamson, 'vfio-pci: PCI hot reset
  interface' commit 8b27ee60bfd6bbb84d2df28fa706c5c5081066ca).

  As such to do a slot or a bus reset is we need another mechanism.
  This is not exposed SysFS as there is no good way of exposing
  a bus topology there.

  This is due to the complexity - we MUST know that the different
  functions off a PCIe device are not in use by other drivers, or
  if they are in use (say one of them is assigned to a guest
  and the other is idle) - it is still OK to reset the slot
  (assuming both of them are owned by Xen pciback).

  This patch does that by doing an slot or bus reset (if
  slot not supported) if all of the functions of a PCIe
  device belong to Xen PCIback. We do not care if the device is
  in-use as we depend on the toolstack to be aware of this -
  however if it is we will WARN the user.

  Due to the complexity with the PCI lock we cannot do
  the reset when a device is binded ('echo $BDF > bind')
  or when unbinded ('echo $BDF > unbind') as the pci_[slot|bus]_reset
  also take the same lock resulting in a dead-lock.

  Putting the reset function in a workqueue or thread
  won't work either - as we have to do the reset function
  outside the 'unbind' context (it holds the PCI lock).
  But once you 'unbind' a device the device is no longer
  under the ownership of Xen pciback and the pci_set_drvdata
  has been reset so we cannot use a thread for this.

  Instead of doing all this complex dance, we depend on the toolstack
  doing the right thing. As such implement [... a SysFS attribute]
  which [... the toolstack]  uses when a device is detached or attached
  from/to a guest. It bypasses the need to worry about the PCI lock.

  To not inadvertly do a bus reset that would affect devices that
  are in use by other drivers (other than Xen pciback) prior
  to the reset we check that all of the devices under the bridge
  are owned by Xen pciback. If they are not we do not do
  the bus (or slot) reset.

  We also warn the user if the device is in use - but still
  continue with the reset. This should not happen as the toolstack
  also does the check.

--

Our version of the patch has been modified to use a less confusing
sysfs name. The original name ('do_flr') is inappropriate, as it
implies a function level reset; it is entirely possible that the patch
code will use a bus-level reset when appropriate.

The new sysfs entry is located at:

  /sys/bus/pci/drivers/pciback/reset_device

and can be activated by writing a domain:bus:device:function device
identifier into the sysfs file. As an example:

  echo "0000:01:00.0" > /sys/bus/pci/drivers/pciback/reset_device

would reset the device matching the D:BDF descriptor above.

--------------------------------------------------------------------------------
CHANGELOG:
--------------------------------------------------------------------------------
This is a port of a patch that likely had many authors, including:
    -Konrad Rzeszutek Wilk
    -Alex Williamson
    -Ross Phillipson <rphilipson@ainfosec.com>
Ported to OpenXT by: Kyle J. Temkin <temkink@ainfosec.com>, 4/8/15
Rewrite by:          Kyle J. Temkin <temkink@ainfosec.com>, 4/10/15

--------------------------------------------------------------------------------
DEPENDENCIES
--------------------------------------------------------------------------------
This patch requires ONE of the following:
  -A relatively modern linux kernel (3.18+) as a base; which provides
   the PCI functions used; or
  -Our PCI reset backports patch (backport-pci-reset-functionality.patch),
   which backports the relevant functionality to 3.11.

To take advantage of this patch, the utilized toolstack should be
changed to:
  -Use the provided "reset_device" property, rather than the PCI
   device's sysfs "reset" entry. This enables resets beyond a FLR to be
   used.
  -Ensure that all functions of a given device are passed through
   together. This allows us to use some of the more thorugh resetting
   techniques, when possible.

--------------------------------------------------------------------------------
REMOVAL
--------------------------------------------------------------------------------
This patch provides a service which is necessary for proper passthrough
of many PCI cards: a generalized ability to reset PCI devices, without
requiring that the device support FLR or power-management based resets.

This patch will be necessary until either the Linux PCI subsystem or Xen
PCIback drivers are modified to provide this support; or until cards
without proper FLR support are no longer supported.

--------------------------------------------------------------------------------
UPSTREAM PLAN
--------------------------------------------------------------------------------

This code is taken from a patch which was originally proposed and
rejected from upstream on the LKML and xen-devel. An upstream
implementation of the functionality of this patch is still necessary;
and can and should be implemented.

This patch will hopefully be replaced with an upstream version when
community concensus has produced a single "blessed" method of
accomplishing its functionality.

--------------------------------------------------------------------------------
PATCHES
--------------------------------------------------------------------------------
---
 drivers/xen/xen-pciback/pci_stub.c | 338 ++++++++++++++++++++++++++++++++++---
 1 file changed, 312 insertions(+), 26 deletions(-)

Index: linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
===================================================================
--- linux-4.9.40.orig/drivers/xen/xen-pciback/pci_stub.c
+++ linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
@@ -102,10 +102,9 @@  static void pcistub_device_release(struc
 
 	xen_unregister_device_domain_owner(dev);
 
-	/* Call the reset function which does not take lock as this
-	 * is called from "unbind" which takes a device_lock mutex.
-	 */
-	__pci_reset_function_locked(dev);
+
+	/* Reset is done by the toolstack by using 'reset_device' on the
+	 * SysFS. */
 	if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
 		dev_info(&dev->dev, "Could not reload PCI state\n");
 	else
@@ -125,9 +124,6 @@  static void pcistub_device_release(struc
 				 err);
 	}
 
-	/* Disable the device */
-	xen_pcibk_reset_device(dev);
-
 	kfree(dev_data);
 	pci_set_drvdata(dev, NULL);
 
@@ -224,6 +220,271 @@  struct pci_dev *pcistub_get_pci_dev_by_s
 	return found_dev;
 }
 
+
+/**
+ * Returns true iff the given device supports PCIe FLRs.
+ */
+static bool __device_supports_pcie_flr(struct pci_dev *dev)
+{
+	u32 cap;
+
+	/*
+         * Read the device's capabilities. Note that this can be used even on legacy
+	 * PCI devices (and not just on PCIe devices)-- it indicates that no capabilities
+	 * are supported if the device is legacy PCI by setting cap to 0.
+	 */
+	 pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
+
+	/* Return true iff the device advertises supporting an FLR. */
+	return (cap & PCI_EXP_DEVCAP_FLR);
+}
+
+
+/**
+ * Returns true iff the given device supports PCI Advanced Functionality (AF) FLRs.
+ */
+static bool __device_supports_pci_af_flr(struct pci_dev *dev)
+{
+	int pos;
+	u8 capability_flags;
+
+	/* First, try to find the location of the PCI Advanced Functionality capability byte. */
+	pos = pci_find_capability(dev, PCI_CAP_ID_AF);
+
+	/*
+	 * If we weren't able to find the capability byte, this device doesn't support
+	 * the Advanced Functionality extensions, and thus won't support AF FLR.
+	 */
+	if (!pos)
+		return false;
+
+	/* Read the capabilities advertised in the AF capability byte. */
+	pci_read_config_byte(dev, pos + PCI_AF_CAP, &capability_flags);
+
+	/*
+	 * If the device does support AF, it will advertise FLR support via the
+	 * PCI_AF_CAP_FLR bit. We'll also check for the Transactions Pending (TP)
+	 * mechanism, as the kernel requires this extension to issue an AF FLR.
+	 * (Internally, the PCI reset code needs to be able to wait for all
+	 * pending transactions to complete prior to issuing the AF FLR.)
+	 */
+	return (capability_flags & PCI_AF_CAP_TP) && (capability_flags & PCI_AF_CAP_FLR);
+}
+
+
+/**
+ * Returns true iff the given device adverstises supporting function-
+ * level-reset (FLR).
+ */
+static bool device_supports_flr(struct pci_dev *dev)
+{
+	return __device_supports_pci_af_flr(dev) || __device_supports_pcie_flr(dev);
+}
+
+
+/**
+ * Returns true iff the given device is located in a slot that
+ * supports hotplugging slot resets.
+ */
+static bool device_supports_slot_reset(struct pci_dev *dev)
+{
+	return !pci_probe_reset_slot(dev->slot);
+}
+
+
+/**
+ * Returns true iff the given device is located on a bus that
+ * we can reset. Note that root bridges are excluded, as this
+ * would cause more than just an SBR.
+ */
+static bool device_supports_bus_reset(struct pci_dev *dev)
+{
+	return !pci_is_root_bus(dev->bus) && !pci_probe_reset_bus(dev->bus);
+}
+
+
+/**
+ * Out argument for the __safe_to_sbr_device_callback function.
+ */
+struct safe_to_sbr_arguments {
+
+	//Stores the most recently encountered PCI device that does
+	//not belong to pciback. As used below, this is the result of a
+	//search for a non-pciback device on a bus; we stop upon finding
+	//the first non-pciback device.
+	struct pci_dev *last_non_pciback_device;
+
+	//Stores the number of pciback devices that appear to be in use
+	//on the bus in question.
+	int use_count;
+
+};
+
+
+/**
+ *	A callback function which determines if a given PCI device is owned by pciback,
+ *	and whether the given device is in use. Used by safe_to_sbr_device.
+ *
+ *	@param dev The PCI device to be checked.
+ *	@param data An out argument of type struct safe_to_sbr_device_callback_arguments.
+ *			Updated to indicate the result of the search. See the struct's definition
+ *			for more details.
+ *
+ */
+static int __safe_to_sbr_device_callback(struct pci_dev *dev, void *data)
+{
+
+	struct pcistub_device *psdev;
+
+	bool device_owned_by_pciback = false;
+	struct safe_to_sbr_arguments *arg = data;
+
+	unsigned long flags;
+
+	//Ensure that we have exclusive access to the list of PCI devices,
+	//so we can traverse it.
+	spin_lock_irqsave(&pcistub_devices_lock, flags);
+
+	//Iterate over all PCI devices owned by the pci stub.
+	list_for_each_entry(psdev, &pcistub_devices, dev_list) {
+
+		//If the given device is owned by pciback...
+		if (psdev->dev == dev) {
+
+			//mark it as a pciback device.
+			device_owned_by_pciback = true;
+
+			//If we have a physical device associated with the pciback device,
+			//mark this device as in-use.
+			if (psdev->pdev)
+				arg->use_count++;
+
+			//Stop searching; we've found a the PCIback device associated with this one.
+			break;
+		}
+	}
+
+	//Release the PCI device lock...
+	spin_unlock_irqrestore(&pcistub_devices_lock, flags);
+
+	//... and report if we've found a device that's not owned by pciback.
+	dev_dbg(&dev->dev, "%s\n", device_owned_by_pciback ? "is owned by pciback, and can be reset if not in use."
+			: "not owned by pciback, and thus cannot be reset.");
+
+	//If we've found a device that's not owned by pciback, update our data
+	//argument so it points to the most recent unowned device. (We check
+	//this like a flag, later: if it's never set, no one owns the device!)
+	if (!device_owned_by_pciback)
+		arg->last_non_pciback_device = dev;
+
+	//If we've found a device that's not owned by pciback, return false--
+	//this indicates that pci_walk_bus should cease its walk.
+	return !device_owned_by_pciback;
+}
+
+
+/**
+ * Returns true iff it should be safe to issue a secondary bus reset
+ * to the device; that is, if an SBR can be issued without disrupting
+ * other devices.
+ */
+static bool safe_to_sbr_device(struct pci_dev *dev)
+{
+	struct safe_to_sbr_arguments walk_result = { .last_non_pciback_device = NULL, .use_count = 0 };
+
+	//Walk the PCI bus, attempting to find if any of the given devices
+	pci_walk_bus(dev->bus, __safe_to_sbr_device_callback, &walk_result);
+
+	//If the device is in use, emit a warning error.
+	if(walk_result.use_count > 0)
+		dev_dbg(&dev->dev, "is in use; currently not safe to SBR device.\n");
+
+	//Return true iff we did not pick up any other devices
+	//that were either in use, or not owned by pciback.
+	return (walk_result.last_non_pciback_device == NULL) && (walk_result.use_count == 0);
+}
+
+
+/**
+ * Attempt a raw reset of the provided PCI device-- via any
+ * method available to us. This method prefers the gentlest
+ * possible reset method-- currently an FLR, which many
+ * PCIe devices should support.
+ *
+ * @param dev The pci device to be reset.
+ * @return Zero on success, or the error code generated by the reset method on failure.
+ */
+static int __pcistub_raw_device_reset(struct pci_dev *dev)
+{
+	//Determine if bus resetting techniques (SBR, slot resets)
+	//are safe, and thus should be allowed.
+	int allow_bus_reset = safe_to_sbr_device(dev);
+
+	//If FLRs are supported; we'll try to let the linux kernel
+	//manually reset the device.
+	if(device_supports_flr(dev)) {
+		dev_dbg(&dev->dev, "Resetting device using an FLR.");
+		return pci_reset_function(dev);
+	}
+
+	//Next, we'll try the next gentlest: a hotplugging reset
+	//of the PCI slot.
+	if(allow_bus_reset && device_supports_slot_reset(dev)) {
+		dev_dbg(&dev->dev, "Resetting device using a slot reset.");
+		return pci_try_reset_slot(dev->slot);
+	}
+
+	//Finally, we'll try the most drastic: resetting the parent
+	//PCI bus-- which we can only do conditionally.
+	if(allow_bus_reset && device_supports_bus_reset(dev)) {
+		dev_dbg(&dev->dev, "Resetting device using an SBR.");
+		return pci_try_reset_bus(dev->bus);
+	}
+
+	//If we weren't able to reset the device by any of our known-good methods,
+	//fall back to the linux kernel's reset function. Unfortunately, this considers a
+	//power management reset to be a valid reset; though this doesn't work for many devices--
+	//especially GPUs.
+	dev_err(&dev->dev, "No reset methods available for %s. Falling back to kernel reset.", pci_name(dev));
+	pci_reset_function(dev);
+
+	//Return an error code, indicating that we likely did not reset the device correctly.
+	return -ENOTTY;
+}
+
+
+/**
+ * Resets the target (pciback-owned) PCI device. Primarily intended
+ * for use by the toolstack, so it can ensure a consistent PCI device
+ * state on VM startup.
+ *
+ * @param dev The device to be reset.
+ * @return Zero on success, or a negated error code on failure.
+ */
+static int pcistub_reset_pci_dev(struct pci_dev *dev)
+{
+	int rc;
+
+	if (!dev)
+		return -EINVAL;
+
+	/*
+	 * Takes the PCI lock. OK to do it as we are never called
+	 * from 'unbind' state and don't deadlock.
+	 */
+	rc =__pcistub_raw_device_reset(dev);
+	pci_restore_state(dev);
+
+	/* This disables the device. */
+	xen_pcibk_reset_device(dev);
+
+	/* And cleanup up our emulated fields. */
+	xen_pcibk_config_reset_dev(dev);
+	return rc;
+}
+
+
+
 struct pci_dev *pcistub_get_pci_dev(struct xen_pcibk_device *pdev,
 				    struct pci_dev *dev)
 {
@@ -279,11 +540,13 @@  void pcistub_put_pci_dev(struct pci_dev
 	* pcistub and xen_pcibk when AER is in processing
 	*/
 	down_write(&pcistub_sem);
-	/* Cleanup our device
-	 * (so it's ready for the next domain)
-	 */
 	device_lock_assert(&dev->dev);
-	__pci_reset_function_locked(dev);
+	/*
+	 * Reset is up to the toolstack.
+	 * The toolstack has to call 'reset_device' before
+	 * providing the PCI device to a guest (see pcistub_reset_device).
+	 */
+	//__pci_reset_function_locked(dev);
 
 	dev_data = pci_get_drvdata(dev);
 	ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
@@ -1460,6 +1723,41 @@  static ssize_t restrictive_add(struct de
 }
 static DRIVER_ATTR(restrictive, S_IWUSR, NULL, restrictive_add);
 
+/**
+ * Handles the "reset_device" sysfs attribute. This is the primary reset interface
+ * utilized by the toolstack.
+ */
+static ssize_t pcistub_sysfs_reset_device(struct device_driver *drv, const char *buf, size_t count)
+{
+	int domain, bus, slot, func, err;
+	struct pcistub_device *psdev;
+
+	//Attempt to convert the user's string to a BDF/slot.
+	err = str_to_slot(buf, &domain, &bus, &slot, &func);
+	if (err)
+		return -ENODEV;
+
+	//... and then use that slot to find the pciback device.
+	psdev = pcistub_device_find(domain, bus, slot, func);
+
+	//If we have a device, attempt to reset it using our internal reset path.
+	if (psdev) {
+		err = pcistub_reset_pci_dev(psdev->dev);
+		pcistub_device_put(psdev);
+
+		//If we were not able to reset the device, return the relevant error code.
+		if(err)
+			err = -ENODEV;
+	}
+	//Otherwise, indicate that there's no such device.
+	else {
+		err = -ENODEV;
+	}
+
+	return err ? err : count;
+
+}
+static DRIVER_ATTR(reset_device, S_IWUSR, NULL, pcistub_sysfs_reset_device);
 
 static void pcistub_exit(void)
 {
@@ -1476,6 +1774,8 @@  static void pcistub_exit(void)
 			   &driver_attr_irq_handlers);
 	driver_remove_file(&xen_pcibk_pci_driver.driver,
 			   &driver_attr_irq_handler_state);
+	driver_remove_file(&xen_pcibk_pci_driver.driver,
+			   &driver_attr_reset_device);
 	pci_unregister_driver(&xen_pcibk_pci_driver);
 }
 
@@ -1572,6 +1872,9 @@  static int __init pcistub_init(void)
 	if (!err)
 		err = driver_create_file(&xen_pcibk_pci_driver.driver,
 					&driver_attr_irq_handler_state);
+	if (!err)
+		err = driver_create_file(&xen_pcibk_pci_driver.driver,
+					&driver_attr_reset_device);
 	if (err)
 		pcistub_exit();