Message ID | 20170627172150.2630-1-venu.busireddy@oracle.com (mailing list archive) |
---|---|
State | New, archived |
On 27/06/17 19:21, Venu Busireddy wrote:
> This patch set is part of a set of patches that together allow containment
> of unrecoverable AER errors from PCIe devices assigned to guests in
> passthrough mode. The containment is achieved by killing the guest and
> hiding the device upon receiving the fatal AER error.
>
> The Xen patch related to this patch is:
>
> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
>
>
> This patch stores in xenstore the <s:b:d.f> of the passed through device
> that triggered the AER unrecoverable error. This will allow xen (with a
> watcher setup to watch "aerFailedSBDF") to make the device unassignable
> until the next reboot or operator intervention using the xen tool stack.
>
> Note:
> When unrecoverable AER errors are detected from the PCIe devices
> assigned to guests in passthrough mode, BIOS's bring down the server,
> thus bringing down the entire hypervisor. For this patch set to work,
> the AER error handling needs to be delegated to the host operating system.
>
> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

Thanks,

Juergen

On 07/27/2017 11:31 AM, Juergen Gross wrote:
> On 27/06/17 19:21, Venu Busireddy wrote:
>> This patch set is part of a set of patches that together allow containment
>> of unrecoverable AER errors from PCIe devices assigned to guests in
>> passthrough mode. The containment is achieved by killing the guest and
>> hiding the device upon receiving the fatal AER error.
>>
>> The Xen patch related to this patch is:
>>
>> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
>>
>>
>> This patch stores in xenstore the <s:b:d.f> of the passed through device
>> that triggered the AER unrecoverable error. This will allow xen (with a
>> watcher setup to watch "aerFailedSBDF") to make the device unassignable
>> until the next reboot or operator intervention using the xen tool stack.
>>
>> Note:
>> When unrecoverable AER errors are detected from the PCIe devices
>> assigned to guests in passthrough mode, BIOS's bring down the server,
>> thus bringing down the entire hypervisor. For this patch set to work,
>> the AER error handling needs to be delegated to the host operating system.
>>
>> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Reviewed-by: Juergen Gross <jgross@suse.com>
>

Isn't this dependent on Xen-side patches? I think we should wait until
those are accepted (if nothing else, the watch name should be agreed upon).

-boris

On 27/07/17 17:41, Boris Ostrovsky wrote:
> On 07/27/2017 11:31 AM, Juergen Gross wrote:
>> On 27/06/17 19:21, Venu Busireddy wrote:
>>> This patch set is part of a set of patches that together allow containment
>>> of unrecoverable AER errors from PCIe devices assigned to guests in
>>> passthrough mode. The containment is achieved by killing the guest and
>>> hiding the device upon receiving the fatal AER error.
>>>
>>> The Xen patch related to this patch is:
>>>
>>> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
>>>
>>>
>>> This patch stores in xenstore the <s:b:d.f> of the passed through device
>>> that triggered the AER unrecoverable error. This will allow xen (with a
>>> watcher setup to watch "aerFailedSBDF") to make the device unassignable
>>> until the next reboot or operator intervention using the xen tool stack.
>>>
>>> Note:
>>> When unrecoverable AER errors are detected from the PCIe devices
>>> assigned to guests in passthrough mode, BIOS's bring down the server,
>>> thus bringing down the entire hypervisor. For this patch set to work,
>>> the AER error handling needs to be delegated to the host operating system.
>>>
>>> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Reviewed-by: Juergen Gross <jgross@suse.com>
>>
>
> Isn't this dependent on Xen-side patches? I think we should wait until
> those are accepted (if nothing else, the watch name should be agreed upon).

Right. I wouldn't commit the patch until the Xen side is ready. Just
wanted to give a "go ahead" for the Xen side to avoid a deadlock. :-)

Juergen

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 6331a95..5a4bae5 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -656,11 +656,13 @@ static const struct pci_device_id pcistub_ids[] = {
 };
 
 #define PCI_NODENAME_MAX 40
+#define PCI_DEVICENAME_MAX 14
 static void kill_domain_by_device(struct pcistub_device *psdev)
 {
 	struct xenbus_transaction xbt;
 	int err;
 	char nodename[PCI_NODENAME_MAX];
+	char devicename[PCI_DEVICENAME_MAX];
 
 	BUG_ON(!psdev);
 	snprintf(nodename, PCI_NODENAME_MAX, "/local/domain/0/backend/pci/%d/0",
@@ -675,6 +677,18 @@ static void kill_domain_by_device(struct pcistub_device *psdev)
 	}
 	/*PV AER handlers will set this flag*/
 	xenbus_printf(xbt, nodename, "aerState" , "aerfail");
+
+	/*
+	 * Xend versions <= 4.4 depend on "aerState" and expect its value
+	 * to be set to "aerfail". Therefore, add a new node "aerFailedSBDF"
+	 * to set the device name.
+	 */
+	snprintf(devicename, PCI_DEVICENAME_MAX, "%04x:%02x:%02x.%x",
+		pci_domain_nr(psdev->dev->bus),
+		psdev->dev->bus->number,
+		PCI_SLOT(psdev->dev->devfn), PCI_FUNC(psdev->dev->devfn));
+	xenbus_printf(xbt, nodename, "aerFailedSBDF" , devicename);
+
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
 		if (err == -EAGAIN)
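
For anyone checking the buffer sizing in the hunk above: the "%04x:%02x:%02x.%x" format yields 12 characters for a 16-bit segment, so 14 bytes leaves room for the terminating NUL. The following is only a userspace sketch of that formatting and of how a consumer of "aerFailedSBDF" might parse the value back. The helper names format_sbdf and parse_sbdf and the SBDF_MAX constant are hypothetical and are not part of the patch or of any Xen interface.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors PCI_DEVICENAME_MAX from the patch; the name SBDF_MAX is an assumption. */
#define SBDF_MAX 14

/*
 * Hypothetical helper: format segment:bus:device.function with the same
 * format string the patch uses. Returns 0 on success, -1 if the result
 * would not fit in the buffer.
 */
static int format_sbdf(char *buf, size_t len, unsigned int seg,
                       unsigned int bus, unsigned int dev, unsigned int fn)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "%04x:%02x:%02x.%x", seg, bus, dev, fn);
    /* snprintf returns the length it needed; n >= len means truncation. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: parse the string back into its four fields, as a
 * watcher on the node might do. Returns 0 on success, -1 on malformed input.
 */
static int parse_sbdf(const char *s, unsigned int *seg, unsigned int *bus,
                      unsigned int *dev, unsigned int *fn)
{
    if (strlen(s) >= SBDF_MAX)
        return -1;
    return sscanf(s, "%x:%x:%x.%x", seg, bus, dev, fn) == 4 ? 0 : -1;
}
```

Unlike the kernel code, this sketch reports truncation instead of silently writing a shortened name; whether the driver should do the same is a separate question for review.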