
xen-pciback: Mark a PCIe device to be hidden on AER error

Message ID 20170627172150.2630-1-venu.busireddy@oracle.com (mailing list archive)
State New, archived

Commit Message

Venu Busireddy June 27, 2017, 5:21 p.m. UTC
This patch is part of a set of patches that together allow containment
of unrecoverable AER errors from PCIe devices assigned to guests in
passthrough mode. Containment is achieved by killing the guest and
hiding the device when the fatal AER error is received.

The Xen patch related to this patch is:

https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html


This patch stores in xenstore the <s:b:d.f> of the passed-through device
that triggered the unrecoverable AER error. This allows Xen (with a
watcher set up to watch "aerFailedSBDF") to make the device unassignable
until the next reboot or operator intervention using the Xen toolstack.
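
As a rough illustration of the <s:b:d.f> string described above, the
following user-space sketch formats a PCI address with the same
"%04x:%02x:%02x.%x" layout the patch uses. The helper name format_sbdf()
and the SBDF_STR_LEN constant are assumptions made for this example and
are not part of the patch or the kernel.

```c
#include <stdio.h>

/* "dddd:bb:ss.f" is 12 characters; one more byte holds the NUL. */
#define SBDF_STR_LEN 13

/*
 * Hypothetical helper (not in the patch): format segment, bus, slot and
 * function into buf. Returns 0 on success, -1 if the text was truncated.
 */
static int format_sbdf(char *buf, size_t len, unsigned int domain,
                       unsigned int bus, unsigned int slot, unsigned int func)
{
    int n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, slot, func);

    /* snprintf reports the length it wanted; compare it with the buffer. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling format_sbdf(buf, sizeof(buf), 0, 3, 0, 1) with a 13-byte buffer
yields "0000:03:00.1". The 14-byte buffer in the patch therefore leaves
one spare byte for the usual four-digit segment form.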

Note:
When unrecoverable AER errors are detected from PCIe devices
assigned to guests in passthrough mode, the BIOS brings down the server,
thus bringing down the entire hypervisor. For this patch set to work,
AER error handling needs to be delegated to the host operating system.

Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
---

 drivers/xen/xen-pciback/pci_stub.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Jürgen Groß July 27, 2017, 3:31 p.m. UTC | #1
On 27/06/17 19:21, Venu Busireddy wrote:
> This patch is part of a set of patches that together allow containment
> of unrecoverable AER errors from PCIe devices assigned to guests in
> passthrough mode. The containment is achieved by killing the guest and
> hiding the device upon receiving the fatal AER error.
> 
> The Xen patch related to this patch is:
> 
> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
> 
> 
> This patch stores in xenstore the <s:b:d.f> of the passed through device
> that triggered the AER unrecoverable error. This will allow xen (with a
> watcher setup to watch "aerFailedSBDF") to make the device unassignable
> until the next reboot or operator intervention using the xen tool stack.
> 
> Note:
> When unrecoverable AER errors are detected from the PCIe devices
> assigned to guests in passthrough mode, BIOS's bring down the server,
> thus bringing down the entire hypervisor. For this patch set to work,
> the AER error handling needs to be delegated to the host operating system.
> 
> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Thanks,

Juergen
Boris Ostrovsky July 27, 2017, 3:41 p.m. UTC | #2
On 07/27/2017 11:31 AM, Juergen Gross wrote:
> On 27/06/17 19:21, Venu Busireddy wrote:
>> This patch is part of a set of patches that together allow containment
>> of unrecoverable AER errors from PCIe devices assigned to guests in
>> passthrough mode. The containment is achieved by killing the guest and
>> hiding the device upon receiving the fatal AER error.
>>
>> The Xen patch related to this patch is:
>>
>> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
>>
>>
>> This patch stores in xenstore the <s:b:d.f> of the passed through device
>> that triggered the AER unrecoverable error. This will allow xen (with a
>> watcher setup to watch "aerFailedSBDF") to make the device unassignable
>> until the next reboot or operator intervention using the xen tool stack.
>>
>> Note:
>> When unrecoverable AER errors are detected from the PCIe devices
>> assigned to guests in passthrough mode, BIOS's bring down the server,
>> thus bringing down the entire hypervisor. For this patch set to work,
>> the AER error handling needs to be delegated to the host operating system.
>>
>> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
> Reviewed-by: Juergen Gross <jgross@suse.com>
>

Isn't this dependent on Xen-side patches? I think we should wait until
those are accepted (if nothing else, the watch name should be agreed upon).

-boris
Jürgen Groß July 27, 2017, 3:46 p.m. UTC | #3
On 27/07/17 17:41, Boris Ostrovsky wrote:
> On 07/27/2017 11:31 AM, Juergen Gross wrote:
>> On 27/06/17 19:21, Venu Busireddy wrote:
>>> This patch is part of a set of patches that together allow containment
>>> of unrecoverable AER errors from PCIe devices assigned to guests in
>>> passthrough mode. The containment is achieved by killing the guest and
>>> hiding the device upon receiving the fatal AER error.
>>>
>>> The Xen patch related to this patch is:
>>>
>>> https://lists.xen.org/archives/html/xen-devel/2017-06/msg03269.html
>>>
>>>
>>> This patch stores in xenstore the <s:b:d.f> of the passed through device
>>> that triggered the AER unrecoverable error. This will allow xen (with a
>>> watcher setup to watch "aerFailedSBDF") to make the device unassignable
>>> until the next reboot or operator intervention using the xen tool stack.
>>>
>>> Note:
>>> When unrecoverable AER errors are detected from the PCIe devices
>>> assigned to guests in passthrough mode, BIOS's bring down the server,
>>> thus bringing down the entire hypervisor. For this patch set to work,
>>> the AER error handling needs to be delegated to the host operating system.
>>>
>>> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com>
>>> Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
>> Reviewed-by: Juergen Gross <jgross@suse.com>
>>
> 
> Isn't this dependent on Xen-side patches? I think we should wait until
> those are accepted (if nothing else, the watch name should be agreed upon).

Right. I wouldn't commit the patch until the Xen side is ready. Just
wanted to give a "go ahead" for the Xen side to avoid a deadlock. :-)


Juergen

Patch

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 6331a95..5a4bae5 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -656,11 +656,13 @@  static const struct pci_device_id pcistub_ids[] = {
 };
 
 #define PCI_NODENAME_MAX 40
+#define PCI_DEVICENAME_MAX 14
 static void kill_domain_by_device(struct pcistub_device *psdev)
 {
 	struct xenbus_transaction xbt;
 	int err;
 	char nodename[PCI_NODENAME_MAX];
+	char devicename[PCI_DEVICENAME_MAX];
 
 	BUG_ON(!psdev);
 	snprintf(nodename, PCI_NODENAME_MAX, "/local/domain/0/backend/pci/%d/0",
@@ -675,6 +677,18 @@  static void kill_domain_by_device(struct pcistub_device *psdev)
 	}
 	/*PV AER handlers will set this flag*/
 	xenbus_printf(xbt, nodename, "aerState" , "aerfail");
+
+	/*
+	 * Xend versions <= 4.4 depend on "aerState" and expect its value
+	 * to be set to "aerfail". Therefore, add a new node "aerFailedSBDF"
+	 * to set the device name.
+	 */
+	snprintf(devicename, PCI_DEVICENAME_MAX, "%04x:%02x:%02x.%x",
+		 pci_domain_nr(psdev->dev->bus),
+		 psdev->dev->bus->number,
+		 PCI_SLOT(psdev->dev->devfn), PCI_FUNC(psdev->dev->devfn));
+	xenbus_printf(xbt, nodename, "aerFailedSBDF", "%s", devicename);
+
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
 		if (err == -EAGAIN)
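
A consumer of the "aerFailedSBDF" node written by the hunk above would
need to turn the string back into its numeric fields. The following
user-space sketch shows one way to do that with sscanf(). The struct
sbdf type and the parse_sbdf() helper are hypothetical names introduced
for this example; the actual Xen-side watcher is in the separate Xen
patch and may work differently.

```c
#include <stdio.h>

/* Hypothetical container for a parsed PCI address. */
struct sbdf {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int func;
};

/*
 * Hypothetical helper: parse "dddd:bb:ss.f" (hex fields) into *out.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_sbdf(const char *s, struct sbdf *out)
{
    char trailing;
    int n = sscanf(s, "%x:%x:%x.%x%c", &out->domain, &out->bus,
                   &out->slot, &out->func, &trailing);

    /* Exactly four fields must convert; a fifth means trailing junk. */
    if (n != 4)
        return -1;

    /* Bus is 8 bits, slot is 5 bits, function is 3 bits. */
    if (out->bus > 0xff || out->slot > 0x1f || out->func > 0x7)
        return -1;

    return 0;
}
```

The range checks mirror the PCI field widths, so a string such as
"0000:03:00.9" is rejected instead of being passed on as a device
address.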