libxl / xen-pciback interaction

Message ID c9bf77ec-8a82-5a6e-c0eb-36e4cc373b23@suse.com (mailing list archive)
State New, archived
Series libxl / xen-pciback interaction

Commit Message

Jan Beulich March 16, 2021, 2:27 p.m. UTC
All,

while trying to test a pciback fix for how SR-IOV VFs get placed in the
guest PCI topology I noticed that my test VM would only ever see the 1st
out of 3 VFs that I passed to it. As it turns out, the driver watches
its own (backend) node, and upon first receiving notification it
evaluates state found in xenstore to set up the backend device.
Subsequently it switches the device to Initialised. After this switch,
no further instances of the watch triggering would do anything.

In all instances I observed the watch event getting processed when the
"num_devs" node still reported 1. Trying to deal with this in libxl, by
delaying the writing of the "num_devs" node, led to a fatal error
("num_devs" not being available to read) in the driver, causing the
device to move to Closing state. Therefore I decided that the issue has
to be addressed in the driver, resulting in a patch (reproduced below)
that I'm not overly happy with. I think the present libxl behavior is
wrong - it shouldn't trigger driver initialization without having fully
populated the information the driver is supposed to consume for its
device initialization. The only solution that I can think of, however,
doesn't look very appealing either: Instead of putting all pieces of the
data for one device in a transaction, make a single transaction cover
all devices collectively.
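
Just to illustrate the shape of what I mean - this is not libxl code,
merely a minimal sketch against the raw libxenstore API, with the usual
pciback key names assumed and most per-device keys as well as error
reporting omitted:

/* Sketch only: populate one PCI backend in a single xenstore
 * transaction, so the backend watch can never observe a partially
 * populated tree (in particular never num_devs without its dev-N). */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

static bool populate_pci_backend(struct xs_handle *xs, const char *be_path,
                                 const char **bdfs, unsigned int count)
{
    xs_transaction_t t;
    char path[256], val[16];
    unsigned int i;

again:
    t = xs_transaction_start(xs);
    if (t == XBT_NULL)
        return false;

    for (i = 0; i < count; i++) {
        /* Further per-device keys (vdevfn, opts, ...) elided. */
        snprintf(path, sizeof(path), "%s/dev-%u", be_path, i);
        if (!xs_write(xs, t, path, bdfs[i], strlen(bdfs[i])))
            goto fail;
    }

    /* num_devs becomes visible only together with all dev-N nodes. */
    snprintf(path, sizeof(path), "%s/num_devs", be_path);
    snprintf(val, sizeof(val), "%u", count);
    if (!xs_write(xs, t, path, val, strlen(val)))
        goto fail;

    if (!xs_transaction_end(xs, t, false)) {
        if (errno == EAGAIN)
            goto again;        /* conflicting write - retry */
        return false;
    }
    return true;

fail:
    xs_transaction_end(xs, t, true);
    return false;
}

(How such a retry loop would fit libxl's async machinery is a separate
question.)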

Are there opinions about where to address the issue, or suggestions as
to better approaches than the ones shown / outlined?

While doing this it also occurred to me as odd how "num_devs" is
actually used: It's not really a "number of devices" indicator, but
instead a "next device has this number" one: libxl reads the present
value and increments it by one for every new device. Destroying
(hot-unplugging) of devices doesn't have any effect on the value. If
addition / removal of a device happens a number of times for a VM,
quite a few leftover, no longer used entries would accumulate afaict.
This isn't only consuming space in xenstore for no good reason, but
also means pciback has to do an increasing amount of processing every
time a reconfigure event happens.
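
To illustrate (node values made up, other keys omitted): after three VFs
assigned at boot, one of them later hot-unplugged, and a further device
hot-added afterwards, my reading of libxl suggests the backend subtree
would end up looking something like

  <backend>/num_devs = "4"             (the next free index, not a count)
  <backend>/dev-0    = "0000:04:10.0"
  <backend>/dev-1    = "0000:04:10.1"  (long since unplugged, yet lingering)
  <backend>/dev-2    = "0000:04:10.2"
  <backend>/dev-3    = "0000:04:10.3"  (hot-added; index 1 never gets reused)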

Jan

xen-pciback: reconfigure also from backend watch handler

When multiple PCI devices get assigned to a guest right at boot, libxl
incrementally populates the backend tree. The writes for the first of
the devices trigger the backend watch. In turn xen_pcibk_setup_backend()
will set the XenBus state to Initialised, at which point no further
reconfigures would happen unless a device got hotplugged. Arrange for
reconfigure to also get triggered from the backend watch handler.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: stable@vger.kernel.org
---
I will admit that this isn't entirely race-free (with the guest actually
attaching in parallel), but from the looks of it such a race ought to be
benign (not least thanks to the mutex). Ideally the tool stack
wouldn't write num_devs until all devices had their information
populated. I tried doing so in libxl, but xen_pcibk_setup_backend()
calling xenbus_dev_fatal() when not being able to read that node
prohibits such an approach (or else libxl and driver changes would need
to go into use in lock-step).

I wonder why the watch isn't simply on the num_devs node. Right
now the watch triggers quite frequently without anything relevant
actually having changed (I suppose in at least some cases in response
to writes by the backend itself).

Comments

Jürgen Groß March 16, 2021, 3:03 p.m. UTC | #1
On 16.03.21 15:27, Jan Beulich wrote:
> All,
> 
> while trying to test a pciback fix for how SR-IOV VFs get placed in the
> guest PCI topology I noticed that my test VM would only ever see the 1st
> out of 3 VFs that I passed to it. As it turns out, the driver watches
> its own (backend) node, and upon first receiving notification it
> evaluates state found in xenstore to set up the backend device.
> Subsequently it switches the device to Initialised. After this switch,
> no further instances of the watch triggering would do anything.
> 
> In all instances I observed the watch event getting processed when the
> "num_devs" node still reported 1. Trying to deal with this in libxl, by
> delaying the writing of the "num_devs" node, led to a fatal error
> ("num_devs" not being available to read) in the driver, causing the
> device to move to Closing state. Therefore I decided that the issue has
> to be addressed in the driver, resulting in a patch (reproduced below)
> that I'm not overly happy with. I think the present libxl behavior is
> wrong - it shouldn't trigger driver initialization without having fully
> populated the information the driver is supposed to consume for its
> device initialization. The only solution that I can think of, however,
> doesn't look very appealing either: Instead of putting all pieces of the
> data for one device in a transaction, make a single transaction cover
> all devices collectively.

Any reason why you don't like this solution? It's not as if a large
problem would be expected from using a single transaction for
all PCI devices passed through (assuming you didn't mean to pack
literally all devices of the guest into that single transaction).


Juergen
Jan Beulich March 16, 2021, 3:20 p.m. UTC | #2
On 16.03.2021 16:03, Jürgen Groß wrote:
> On 16.03.21 15:27, Jan Beulich wrote:
>> All,
>>
>> while trying to test a pciback fix for how SR-IOV VFs get placed in the
>> guest PCI topology I noticed that my test VM would only ever see the 1st
>> out of 3 VFs that I passed to it. As it turns out, the driver watches
>> its own (backend) node, and upon first receiving notification it
>> evaluates state found in xenstore to set up the backend device.
>> Subsequently it switches the device to Initialised. After this switch,
>> no further instances of the watch triggering would do anything.
>>
>> In all instances I observed the watch event getting processed when the
>> "num_devs" node still reported 1. Trying to deal with this in libxl, by
>> delaying the writing of the "num_devs" node, led to a fatal error
>> ("num_devs" not being available to read) in the driver, causing the
>> device to move to Closing state. Therefore I decided that the issue has
>> to be addressed in the driver, resulting in a patch (reproduced below)
>> that I'm not overly happy with. I think the present libxl behavior is
>> wrong - it shouldn't trigger driver initialization without having fully
>> populated the information the driver is supposed to consume for its
>> device initialization. The only solution that I can think of, however,
>> doesn't look very appealing either: Instead of putting all pieces of the
>> data for one device in a transaction, make a single transaction cover
>> all devices collectively.
> 
> Any reason why you don't like this solution?

It would be quite a bit of code churn afaict (i.e. not an undertaking I
would like to start for code I'm not really familiar with), and error
cleanup might also become quite ugly.

> It's not as if a large
> problem would be expected from using a single transaction for
> all PCI devices passed through (assuming you didn't mean to pack
> literally all devices of the guest into that single transaction).

I meant all PCI devices; I can't even see a remote reason why all devices
regardless of type would want to be packed into a single transaction.

Jan
Ian Jackson March 17, 2021, 3:12 p.m. UTC | #3
Jan Beulich writes ("libxl / xen-pciback interaction"):
> while trying to test a pciback fix for how SR-IOV VFs get placed in the
> guest PCI topology I noticed that my test VM would only ever see the 1st
> out of 3 VFs that I passed to it. As it turns out, the driver watches
> its own (backend) node, and upon first receiving notification it
> evaluates state found in xenstore to set up the backend device.
> Subsequently it switches the device to Initialised. After this switch,
> no further instances of the watch triggering would do anything.

This makes it sound like this driver does not support hotplug.

> While doing this it also occurred to me as odd how "num_devs" is
> actually used: It's not really a "number of devices" indicator, but
> instead a "next device has this number" one: libxl reads the present
> value and increments it by one for every new device. Destroying
> (hot-unplugging) of devices doesn't have any effect on the value.

But this makes it sound like the driver *does* support hotplug.

How does what libxl is doing differ from a setup, immediately followed
by a hot-add ?

>   If addition / removal of a device happens a number of times for a
> VM, quite a few leftover, no longer used entries would accumulate
> afaict.  This isn't only consuming space in xenstore for no good
> reason, but also means pciback has to do an increasing amount of
> processing every time a reconfigure event happens.

I'm kind of surprised that num_devs is used this way by the driver.  I
guess the obvious approach of just listing the directory to find the
devices would often be accidentally-quadratic in the number of
simultaneous PCI devices but that hardly seems like a problem.
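
For concreteness, something along these lines is what I have in mind on
the driver side - purely a sketch, the helper name is invented and the
includes / integration with pciback are left out:

/* Sketch only: derive the device count by listing the backend
 * directory rather than trusting num_devs. */
static int xen_pcibk_count_devs(struct xen_pcibk_device *pdev)
{
	char **entries;
	unsigned int i, n, count = 0;

	entries = xenbus_directory(XBT_NIL, pdev->xdev->nodename, "", &n);
	if (IS_ERR(entries))
		return PTR_ERR(entries);

	for (i = 0; i < n; i++)
		if (!strncmp(entries[i], "dev-", 4))
			count++;

	kfree(entries);
	return count;
}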

I wonder if there is some race/reuse hazard that would prevent libxl
knowing when to reuse a dev number.

Ian.
Jan Beulich March 17, 2021, 3:44 p.m. UTC | #4
On 17.03.2021 16:12, Ian Jackson wrote:
> Jan Beulich writes ("libxl / xen-pciback interaction"):
>> while trying to test a pciback fix for how SR-IOV VFs get placed in the
>> guest PCI topology I noticed that my test VM would only ever see the 1st
>> out of 3 VFs that I passed to it. As it turns out, the driver watches
>> its own (backend) node, and upon first receiving notification it
>> evaluates state found in xenstore to set up the backend device.
>> Subsequently it switches the device to Initialised. After this switch,
>> no further instances of the watch triggering would do anything.
> 
> This makes it sound like this driver does not support hotplug.
> 
>> While doing this it also occurred to me as odd how "num_devs" is
>> actually used: It's not really a "number of devices" indicator, but
>> instead a "next device has this number" one: libxl reads the present
>> value and increments it by one for every new device. Destroying
>> (hot-unplugging) of devices doesn't have any effect on the value.
> 
> But this makes it sound like the driver *does* support hotplug.
> 
> How does what libxl is doing differ from a setup, immediately followed
> by a hot-add ?

In the hot-add case libxl drives things through Reconfiguring state.
I'm not sure this would be an appropriate (and backwards compatible)
thing to do when initially populating xenstore.

Jan
Ian Jackson March 17, 2021, 3:55 p.m. UTC | #5
Jan Beulich writes ("Re: libxl / xen-pciback interaction [and 1 more messages]"):
> On 17.03.2021 16:12, Ian Jackson wrote:
> > How does what libxl is doing differ from a setup, immediately followed
> > by a hot-add ?
> 
> In the hot-add case libxl drives things through Reconfiguring state.
> I'm not sure this would be an appropriate (and backwards compatible)
> thing to do when initially populating xenstore.

Ah.  Thanks, that is precisely the answer to my question.

I think that means, therefore, populating the whole lot in one
transaction.

(From what you say it doesn't sound like it's possible to write only a
subset, perhaps with a "not ready yet" state, and then set them all to
"go" at the end.)

Ian.
Jan Beulich March 18, 2021, 9:08 a.m. UTC | #6
On 17.03.2021 16:55, Ian Jackson wrote:
> Jan Beulich writes ("Re: libxl / xen-pciback interaction [and 1 more messages]"):
>> On 17.03.2021 16:12, Ian Jackson wrote:
>>> How does what libxl is doing differ from a setup, immediately followed
>>> by a hot-add ?
>>
>> In the hot-add case libxl drives things through Reconfiguring state.
>> I'm not sure this would be an appropriate (and backwards compatible)
>> thing to do when initially populating xenstore.
> 
> Ah.  Thanks, that is precisely the answer to my question.
> 
> I think that means, therefore, populating the whole lot in one
> transaction.
> 
> (From what you say it doesn't sound like it's possible to write only a
> subset, perhaps with a "not ready yet" state, and then set them all to
> "go" at the end.)

Indeed, that's my understanding and a consequence of pciback's watch
covering the entire backend subtree, rather than e.g. just the
num_devs node.

Jan

Patch

--- a/drivers/xen/xen-pciback/xenbus.c
+++ b/drivers/xen/xen-pciback/xenbus.c
@@ -359,7 +359,8 @@  out:
 	return err;
 }
 
-static int xen_pcibk_reconfigure(struct xen_pcibk_device *pdev)
+static int xen_pcibk_reconfigure(struct xen_pcibk_device *pdev,
+				 enum xenbus_state state)
 {
 	int err = 0;
 	int num_devs;
@@ -374,8 +375,7 @@  static int xen_pcibk_reconfigure(struct
 
 	mutex_lock(&pdev->dev_lock);
 	/* Make sure we only reconfigure once */
-	if (xenbus_read_driver_state(pdev->xdev->nodename) !=
-	    XenbusStateReconfiguring)
+	if (xenbus_read_driver_state(pdev->xdev->nodename) != state)
 		goto out;
 
 	err = xenbus_scanf(XBT_NIL, pdev->xdev->nodename, "num_devs", "%d",
@@ -500,6 +500,9 @@  static int xen_pcibk_reconfigure(struct
 		}
 	}
 
+	if (state != XenbusStateReconfiguring)
+		goto out;
+
 	err = xenbus_switch_state(pdev->xdev, XenbusStateReconfigured);
 	if (err) {
 		xenbus_dev_fatal(pdev->xdev, err,
@@ -525,7 +528,7 @@  static void xen_pcibk_frontend_changed(s
 		break;
 
 	case XenbusStateReconfiguring:
-		xen_pcibk_reconfigure(pdev);
+		xen_pcibk_reconfigure(pdev, XenbusStateReconfiguring);
 		break;
 
 	case XenbusStateConnected:
@@ -664,6 +667,10 @@  static void xen_pcibk_be_watch(struct xe
 		xen_pcibk_setup_backend(pdev);
 		break;
 
+	case XenbusStateInitialised:
+		xen_pcibk_reconfigure(pdev, XenbusStateInitialised);
+		break;
+
 	default:
 		break;
 	}