From patchwork Fri Feb 13 18:22:45 2009 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wright X-Patchwork-Id: 7055 Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by demeter.kernel.org (8.14.2/8.14.2) with ESMTP id n1DIMrab003201 for ; Fri, 13 Feb 2009 18:22:53 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752518AbZBMSWv (ORCPT ); Fri, 13 Feb 2009 13:22:51 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752638AbZBMSWu (ORCPT ); Fri, 13 Feb 2009 13:22:50 -0500 Received: from mx2.redhat.com ([66.187.237.31]:56994 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752518AbZBMSWt (ORCPT ); Fri, 13 Feb 2009 13:22:49 -0500 Received: from int-mx2.corp.redhat.com (int-mx2.corp.redhat.com [172.16.27.26]) by mx2.redhat.com (8.13.8/8.13.8) with ESMTP id n1DIMncH001432; Fri, 13 Feb 2009 13:22:49 -0500 Received: from ns3.rdu.redhat.com (ns3.rdu.redhat.com [10.11.255.199]) by int-mx2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id n1DIMnUr015140; Fri, 13 Feb 2009 13:22:49 -0500 Received: from x200.localdomain (vpn-12-36.rdu.redhat.com [10.11.12.36]) by ns3.rdu.redhat.com (8.13.8/8.13.8) with ESMTP id n1DIMlEc005737; Fri, 13 Feb 2009 13:22:47 -0500 Date: Fri, 13 Feb 2009 10:22:45 -0800 From: Chris Wright To: Matthew Wilcox Cc: Mark McLoughlin , kvm , "linux-pci@vger.kernel.org" , Chris Wright , "Dugger, Donald D" , "Kay, Allen M" Subject: Re: KVM PCI device assignment issues Message-ID: <20090213182245.GB9898@x200.localdomain> References: <1234542767.23746.81.camel@blaa> <20090213173628.GC16841@parisc-linux.org> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20090213173628.GC16841@parisc-linux.org> User-Agent: Mutt/1.5.18 (2008-05-17) X-Scanned-By: MIMEDefang 2.58 on 172.16.27.26 Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org * Matthew Wilcox (matthew@wil.cx) wrote: > On Fri, Feb 13, 2009 at 04:32:47PM +0000, Mark McLoughlin wrote: > > - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same > > bridge must be assigned to the same VT-d domain - i.e given device > > A (0000:0f:1.0) and device B (and 0000:0f:2.0), if you assign > > device A to guest, you cannot then use device B in the host or > > another guest. > > Is that a limitation of the VT-d / IOMMU setup? Yes. The source id will essentially show up as the bridge. > > - Some newer PCIe devices (and newer conventional PCI devices too via > > PCI Advanced Features) support Function Level Reset (FLR). This > > allows a PCI function to be reset without affecting any other > > functions on that device, or any other devices. This feature is not > > widespread yet AFAIK - e.g. I've seen it on an audio controller, > > and it must also be supported by SR-IOV devices. > > Yes, that's definitely not very widespread yet. OTOH, we don't need to > worry about disturbing other functions if all devices behind the same > bridge have to be mapped to the same guest. FLR (when it exists) would work fine for devices not behind a conventional pci bridge. > > Driver Unbinding > > ================ > > > > Before a device is assigned to a guest, we should make sure that no host > > device driver is currently bound to the device. > > > > We can do that with e.g. > > > > $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id > > $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/e1000e/unbind > > $> echo -n 0000:00:19.0 > /sys/bus/pci/drivers/pci-stub/bind > > > > One minor problem with this scheme is that at this point you can't > > unbind from pci-stub and trigger a re-probe and have e1000e bind to it. > > In order to support that, we need a "remove_id" interface to remove the > > dynamic ID. > > It sounds like you'd be OK with a 'remove_id' interface that only > removes subsequently-added interfaces. > > I might suggest a second approach which would be to have an explicit > echo to the bind file ignore the list of ids. Then you wouldn't need to > 'echo -n "8086 10de"' to begin with. I tried that first, and it dips into the driver logic, where it wants to filter via ->match. Untested patch below _should_ be enough to avoid adding the id to begin with. > > Furthermore, it should be possible to do this without actually affecting > > any of the devices - i.e. a "try to unbind and see if we oops" approach > > clearly isn't great. > > Well, yes. I'd even be upset if my network or storage flickered away > briefly while another using was starting to run KVM. > > > This last constraint is the most difficult and points to the logic > > needing to be in userland management libraries. Possibly the only sane > > kernel space support would be "try to unbind and reset; if it works then > > the device is assignable". > > If we expose a 'reset' file in the /sys/bus/pci/devices/*/ directories > for devices that are resettable, that should be enough, I would think. Yes, currently it's only internal and it's not robust given the reset constraints. thanks, -chris --- drivers/base/base.h | 1 + drivers/base/bus.c | 2 +- drivers/base/dd.c | 15 +++++++++++++-- 3 files changed, 15 insertions(+), 3 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/base/base.h b/drivers/base/base.h index 0a5f055..60dc346 100644 --- a/drivers/base/base.h +++ b/drivers/base/base.h @@ -86,6 +86,7 @@ extern void bus_remove_driver(struct device_driver *drv); extern void driver_detach(struct device_driver *drv); extern int driver_probe_device(struct device_driver *drv, struct device *dev); +extern int driver_bind_probe_device(struct device_driver *drv, struct device *dev); extern void sysdev_shutdown(void); extern int sysdev_suspend(pm_message_t state); diff --git a/drivers/base/bus.c b/drivers/base/bus.c index 83f32b8..ad28338 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -202,7 +202,7 @@ static ssize_t driver_bind(struct device_driver *drv, if (dev->parent) /* Needed for USB */ down(&dev->parent->sem); down(&dev->sem); - err = driver_probe_device(drv, dev); + err = driver_bind_probe_device(drv, dev); up(&dev->sem); if (dev->parent) up(&dev->parent->sem); diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 315bed8..fba6463 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -184,13 +184,14 @@ int driver_probe_done(void) * This function must be called with @dev->sem held. When called for a * USB interface, @dev->parent->sem must be held as well. */ -int driver_probe_device(struct device_driver *drv, struct device *dev) +static int __driver_probe_device(struct device_driver *drv, struct device *dev, + bool force) { int ret = 0; if (!device_is_registered(dev)) return -ENODEV; - if (drv->bus->match && !drv->bus->match(dev, drv)) + if (!force && drv->bus->match && !drv->bus->match(dev, drv)) goto done; pr_debug("bus: '%s': %s: matched device %s with driver %s\n", @@ -202,6 +203,16 @@ done: return ret; } +int driver_probe_bind_device(struct device_driver *drv, struct device *dev) +{ + return __driver_probe_device(drv, dev, 1); +} + +int driver_probe_device(struct device_driver *drv, struct device *dev) +{ + return __driver_probe_device(drv, dev, 0); +} + static int __device_attach(struct device_driver *drv, void *data) { struct device *dev = data;