
[RFC,0/2] Attempt to implement the standby feature for assigned network devices

Message ID: 20181025140631.634922-1-sameeh@daynix.com

Message

Sameeh Jubran Oct. 25, 2018, 2:06 p.m. UTC
From: Sameeh Jubran <sjubran@redhat.com>

Hi all,

Background:

There have been a few attempts to implement the standby feature for
vfio-assigned devices, which aims to enable the migration of such devices.
This is another attempt.

The series implements an infrastructure for hiding devices from the bus
upon boot. What it does is the following:

* In the first patch the infrastructure for hiding the device is added
  to the qbus and qdev APIs. A "hidden" boolean is added to the device
  state, and it is set via a callback to the standby device, which
  registers itself to answer the question "should the primary device
  be hidden?" by cross-validating the ids of the two devices (a rough
  sketch of the idea follows this list).

* In the second patch virtio-net uses this API to hide the vfio
  device and unhides it once the guest acks the standby feature.
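
A rough sketch of the callback idea, for illustration only; the names
used here (qdev_register_hide_check, ShouldHideDevice, primary_id) are
assumptions and not the actual code from the patches:

    /* The standby (virtio-net) device registers itself as the arbiter of
     * whether a newly added device should start out hidden. */
    typedef bool (*ShouldHideDevice)(void *opaque, const char *dev_id);
    void qdev_register_hide_check(ShouldHideDevice cb, void *opaque);

    /* virtio-net's callback: hide the device whose qdev id matches this
     * instance's "primary" property. Device creation would consult the
     * registered callback and mark the device as hidden instead of
     * realizing it on the bus. */
    static bool virtio_net_should_hide(void *opaque, const char *dev_id)
    {
        VirtIONet *n = opaque;

        return g_strcmp0(dev_id, n->primary_id) == 0;
    }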

Disclaimers:

* I have only lightly tested this, and from the qemu side it seems to
  be working.
* This is an RFC, so it lacks proper error handling in a few cases as
  well as proper resource freeing. I wanted to get some feedback
  before finalizing it.

Command line example:

/home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
-netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
-netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
-device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
-device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
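
For illustration, the new "primary"/"standby" options in the example
above would presumably be plain qdev string properties holding the peer
device's id, roughly as follows (the struct field names here are
assumptions):

    /* virtio-net side: qdev id of the primary (passthrough) device */
    DEFINE_PROP_STRING("primary", VirtIONet, primary_id),

    /* primary NIC side: qdev id of the standby virtio-net device */
    DEFINE_PROP_STRING("standby", PCIDevice, standby_id),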

Migration support:

Before migration, or during the setup phase of the migration, we should
send an unplug request to the guest to unplug the primary device. I
haven't had the chance to implement that part yet but should do so soon.
What would be the best approach here? I wanted to have a callback into
the virtio-net device which tries to send an unplug request to the guest;
if that succeeds, the migration continues. It also needs to handle the
case where the migration fails, in which case the primary device has to
be replugged.

The following terms are used interchangeably:
standby - virtio-net
primary - vfio-device - physical device - assigned device

Please share your thoughts and suggestions,
Thanks!

Sameeh Jubran (2):
  qdev/qbus: Add hidden device support
  virtio-net: Implement VIRTIO_NET_F_STANDBY feature

 hw/core/qdev.c                 | 48 +++++++++++++++++++++++++---
 hw/net/virtio-net.c            | 25 +++++++++++++++
 hw/pci/pci.c                   |  1 +
 include/hw/pci/pci.h           |  2 ++
 include/hw/qdev-core.h         | 11 ++++++-
 include/hw/virtio/virtio-net.h |  5 +++
 qdev-monitor.c                 | 58 ++++++++++++++++++++++++++++++++--
 7 files changed, 142 insertions(+), 8 deletions(-)

Comments

Sameeh Jubran Oct. 25, 2018, 6:01 p.m. UTC | #1
On Thu, Oct 25, 2018 at 5:06 PM Sameeh Jubran <sameeh@daynix.com> wrote:
>
> From: Sameeh Jubran <sjubran@redhat.com>
>
> Hi all,
>
> Background:
>
> There has been a few attempts to implement the standby feature for vfio
> assigned devices which aims to enable the migration of such devices. This
> is another attempt.
>
> The series implements an infrastructure for hiding devices from the bus
> upon boot. What it does is the following:
>
> * In the first patch the infrastructure for hiding the device is added
>   for the qbus and qdev APIs. A "hidden" boolean is added to the device
>   state and it is set based on a callback to the standby device which
>   registers itself for handling the assessment: "should the primary device
>   be hidden?" by cross validating the ids of the devices.
>
> * In the second patch the virtio-net uses the API to hide the vfio
>   device and unhides it when the feature is acked.
>
> Disclaimers:
>
> * I have only scratch tested this and from qemu side, it seems to be
>   working.
> * This is an RFC so it lacks some proper error handling in few cases
>   and proper resource freeing. I wanted to get some feedback first
>   before it is finalized.
>
> Command line example:
>
> /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
>
> Migration support:
>
> Pre migration or during setup phase of the migration we should send an
> unplug request to the guest to unplug the primary device. I haven't had
> the chance to implement that part yet but should do soon. Do you know
> what's the best approach to do so? I wanted to have a callback to the
> virtio-net device which tries to send an unplug request to the guest and
> if succeeds then the migration continues. It needs to handle the case where
> the migration fails and then it has to replug the primary device back.
I think that the "add_migration_state_change_notifier" API call can be used
from within the virtio-net device to achieve this; what do you think?
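
Roughly, something like the following could work. This is a sketch only:
add_migration_state_change_notifier(), migration_in_setup() and
migration_has_failed() are existing QEMU APIs, but the notifier field in
VirtIONet and the unplug/replug helpers are assumptions:

    #include "migration/misc.h"

    static void virtio_net_migration_state_changed(Notifier *notifier,
                                                   void *data)
    {
        MigrationState *s = data;
        VirtIONet *n = container_of(notifier, VirtIONet, migration_state);

        if (migration_in_setup(s)) {
            /* Ask the guest to release the primary (VFIO) device. */
            virtio_net_request_primary_unplug(n);  /* assumed helper */
        } else if (migration_has_failed(s)) {
            /* Migration failed: replug the primary device. */
            virtio_net_replug_primary(n);          /* assumed helper */
        }
    }

    /* Registered once, e.g. from virtio_net_device_realize(): */
    n->migration_state.notify = virtio_net_migration_state_changed;
    add_migration_state_change_notifier(&n->migration_state);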
>
> The following terms are used as interchangeable:
> standby - virtio-net
> primary - vfio-device - physical device - assigned device
>
> Please share your thoughts and suggestions,
> Thanks!
>
> Sameeh Jubran (2):
>   qdev/qbus: Add hidden device support
>   virtio-net: Implement VIRTIO_NET_F_STANDBY feature
>
>  hw/core/qdev.c                 | 48 +++++++++++++++++++++++++---
>  hw/net/virtio-net.c            | 25 +++++++++++++++
>  hw/pci/pci.c                   |  1 +
>  include/hw/pci/pci.h           |  2 ++
>  include/hw/qdev-core.h         | 11 ++++++-
>  include/hw/virtio/virtio-net.h |  5 +++
>  qdev-monitor.c                 | 58 ++++++++++++++++++++++++++++++++--
>  7 files changed, 142 insertions(+), 8 deletions(-)
>
> --
> 2.17.0
>
Michael S. Tsirkin Oct. 25, 2018, 10:17 p.m. UTC | #2
On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> From: Sameeh Jubran <sjubran@redhat.com>
> 
> Hi all,
> 
> Background:
> 
> There has been a few attempts to implement the standby feature for vfio
> assigned devices which aims to enable the migration of such devices. This
> is another attempt.
> 
> The series implements an infrastructure for hiding devices from the bus
> upon boot. What it does is the following:
> 
> * In the first patch the infrastructure for hiding the device is added
>   for the qbus and qdev APIs. A "hidden" boolean is added to the device
>   state and it is set based on a callback to the standby device which
>   registers itself for handling the assessment: "should the primary device
>   be hidden?" by cross validating the ids of the devices.
> 
> * In the second patch the virtio-net uses the API to hide the vfio
>   device and unhides it when the feature is acked.
> 
> Disclaimers:
> 
> * I have only scratch tested this and from qemu side, it seems to be
>   working.
> * This is an RFC so it lacks some proper error handling in few cases
>   and proper resource freeing. I wanted to get some feedback first
>   before it is finalized.
> 
> Command line example:
> 
> /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> 
> Migration support:
> 
> Pre migration or during setup phase of the migration we should send an
> unplug request to the guest to unplug the primary device. I haven't had
> the chance to implement that part yet but should do soon. Do you know
> what's the best approach to do so? I wanted to have a callback to the
> virtio-net device which tries to send an unplug request to the guest and
> if succeeds then the migration continues. It needs to handle the case where
> the migration fails and then it has to replug the primary device back.
> 
> The following terms are used as interchangeable:
> standby - virtio-net
> primary - vfio-device - physical device - assigned device
> 
> Please share your thoughts and suggestions,
> Thanks!

Didn't have time to look at the code yet. Could you test with a VF, please?
That's the real test, isn't it?

> Sameeh Jubran (2):
>   qdev/qbus: Add hidden device support
>   virtio-net: Implement VIRTIO_NET_F_STANDBY feature
> 
>  hw/core/qdev.c                 | 48 +++++++++++++++++++++++++---
>  hw/net/virtio-net.c            | 25 +++++++++++++++
>  hw/pci/pci.c                   |  1 +
>  include/hw/pci/pci.h           |  2 ++
>  include/hw/qdev-core.h         | 11 ++++++-
>  include/hw/virtio/virtio-net.h |  5 +++
>  qdev-monitor.c                 | 58 ++++++++++++++++++++++++++++++++--
>  7 files changed, 142 insertions(+), 8 deletions(-)
> 
> -- 
> 2.17.0
Michael Roth Dec. 5, 2018, 4:18 p.m. UTC | #3
Quoting Sameeh Jubran (2018-10-25 13:01:10)
> On Thu, Oct 25, 2018 at 5:06 PM Sameeh Jubran <sameeh@daynix.com> wrote:
> >
> > From: Sameeh Jubran <sjubran@redhat.com>
> >
> > Hi all,
> >
> > Background:
> >
> > There has been a few attempts to implement the standby feature for vfio
> > assigned devices which aims to enable the migration of such devices. This
> > is another attempt.
> >
> > The series implements an infrastructure for hiding devices from the bus
> > upon boot. What it does is the following:
> >
> > * In the first patch the infrastructure for hiding the device is added
> >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> >   state and it is set based on a callback to the standby device which
> >   registers itself for handling the assessment: "should the primary device
> >   be hidden?" by cross validating the ids of the devices.
> >
> > * In the second patch the virtio-net uses the API to hide the vfio
> >   device and unhides it when the feature is acked.
> >
> > Disclaimers:
> >
> > * I have only scratch tested this and from qemu side, it seems to be
> >   working.
> > * This is an RFC so it lacks some proper error handling in few cases
> >   and proper resource freeing. I wanted to get some feedback first
> >   before it is finalized.
> >
> > Command line example:
> >
> > /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> > -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> > -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> > -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> > -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> >
> > Migration support:
> >
> > Pre migration or during setup phase of the migration we should send an
> > unplug request to the guest to unplug the primary device. I haven't had
> > the chance to implement that part yet but should do soon. Do you know
> > what's the best approach to do so? I wanted to have a callback to the
> > virtio-net device which tries to send an unplug request to the guest and
> > if succeeds then the migration continues. It needs to handle the case where
> > the migration fails and then it has to replug the primary device back.
> I think that the "add_migration_state_change_notifier" API call can be used
> from within the virtio-net device to achieve this, what do you think?

I think it would be good to hear from the libvirt folks (on Cc:) on this, as
having QEMU unplug a device without libvirt's involvement seems like it
could cause issues. Personally I think it seems cleaner to just have QEMU
handle the 'hidden' aspects of the device and leave it to QMP/libvirt to do
the unplug beforehand. On the libvirt side I could imagine adding an option
like virsh migrate --switch-to-standby-networking or something along
those lines to do it automatically (if we decide doing it automatically is
even needed on that end).

> >
> > The following terms are used as interchangeable:
> > standby - virtio-net
> > primary - vfio-device - physical device - assigned device
> >
> > Please share your thoughts and suggestions,
> > Thanks!
> >
> > Sameeh Jubran (2):
> >   qdev/qbus: Add hidden device support
> >   virtio-net: Implement VIRTIO_NET_F_STANDBY feature
> >
> >  hw/core/qdev.c                 | 48 +++++++++++++++++++++++++---
> >  hw/net/virtio-net.c            | 25 +++++++++++++++
> >  hw/pci/pci.c                   |  1 +
> >  include/hw/pci/pci.h           |  2 ++
> >  include/hw/qdev-core.h         | 11 ++++++-
> >  include/hw/virtio/virtio-net.h |  5 +++
> >  qdev-monitor.c                 | 58 ++++++++++++++++++++++++++++++++--
> >  7 files changed, 142 insertions(+), 8 deletions(-)
> >
> > --
> > 2.17.0
> >
> 
> 
> -- 
> Respectfully,
> Sameeh Jubran
> Linkedin
> Software Engineer @ Daynix.
>
Peter Krempa Dec. 5, 2018, 5:09 p.m. UTC | #4
On Wed, Dec 05, 2018 at 10:18:29 -0600, Michael Roth wrote:
> Quoting Sameeh Jubran (2018-10-25 13:01:10)
> > On Thu, Oct 25, 2018 at 5:06 PM Sameeh Jubran <sameeh@daynix.com> wrote:
> > > From: Sameeh Jubran <sjubran@redhat.com>
> > > Migration support:
> > >
> > > Pre migration or during setup phase of the migration we should send an
> > > unplug request to the guest to unplug the primary device. I haven't had
> > > the chance to implement that part yet but should do soon. Do you know
> > > what's the best approach to do so? I wanted to have a callback to the
> > > virtio-net device which tries to send an unplug request to the guest and
> > > if succeeds then the migration continues. It needs to handle the case where
> > > the migration fails and then it has to replug the primary device back.
> > I think that the "add_migration_state_change_notifier" API call can be used
> > from within the virtio-net device to achieve this, what do you think?
> 
> I think it would be good to hear from the libvirt folks (on Cc:) on this as
> having QEMU unplug a device without libvirt's involvement seems like it
> could cause issues. Personally I think it seems cleaner to just have QEMU
> handle the 'hidden' aspects of the device and leave it to QMP/libvirt to do
> the unplug beforehand. On the libvirt side I could imagine adding an option
> like virsh migrate --switch-to-standby-networking or something along
> that line to do it automatically (if we decide doing it automatically is
> even needed on that end).

I remember talking about this approach some time ago.

In general the migration itself is a very complex process which has too
many places where it can fail. The same applies to device hotunplug.
This series proposes to merge those two together into an even more
complex behemoth.

A few scenarios which don't have a clear solution come to mind:
- Since the unplug request time is actually unbounded (the guest OS may
  arbitrarily reject it or execute it at any later time), migration may get
  stuck in a halfway state without any clear rollback or failure scenario.

- After migration, device hotplug may fail for whatever reason, leaving
  networking crippled, and again there is no clear single-case rollback scenario.

Then there's stuff which requires libvirt/management cooperation:
- picking the network device on the destination
- making sure that the device is present, etc.

From management's point of view, bundling all this together is really not
a good idea since it creates a very big matrix of failure scenarios. In
general even libvirt will prefer that upper-layer management drives this
externally, since any rollback scenario will result in a policy decision
about what to do in certain cases and what timeouts to pick.
Daniel P. Berrangé Dec. 5, 2018, 5:18 p.m. UTC | #5
On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> From: Sameeh Jubran <sjubran@redhat.com>
> 
> Hi all,
> 
> Background:
> 
> There has been a few attempts to implement the standby feature for vfio
> assigned devices which aims to enable the migration of such devices. This
> is another attempt.
> 
> The series implements an infrastructure for hiding devices from the bus
> upon boot. What it does is the following:
> 
> * In the first patch the infrastructure for hiding the device is added
>   for the qbus and qdev APIs. A "hidden" boolean is added to the device
>   state and it is set based on a callback to the standby device which
>   registers itself for handling the assessment: "should the primary device
>   be hidden?" by cross validating the ids of the devices.
> 
> * In the second patch the virtio-net uses the API to hide the vfio
>   device and unhides it when the feature is acked.

IIUC, the general idea is that we want to provide a pair of associated NIC
devices to the guest, one emulated, one physical PCI device. The guest would
put them in a bonded pair. Before migration the PCI device is unplugged and a
new PCI device is plugged in on the target after migration. The guest traffic
continues without interruption thanks to the emulated device.

This kind of conceptual approach can already be implemented today by management
apps. The only hard problem that exists today is how the guest OS can figure
out that a particular pair of its devices is intended to be used together.

With this series, IIUC, the virtio-net device is getting a given property which
defines the qdev ID of the associated VFIO device. When the guest OS activates
the virtio-net device and acknowledges the STANDBY feature bit, qdev then
unhides the associated VFIO device.

AFAICT the guest has to infer that the device which suddenly appears is the one
associated with the virtio-net device it just initialized, for purposes of
setting up the NIC bonding. There doesn't appear to be any explicit association
between the devices exposed to the guest.

This feels pretty fragile for a guest needing to match up devices when there
are many pairs of devices exposed to a single guest.

Unless I'm mis-reading the patches, it looks like the VFIO device always has
to be available at the time QEMU is started. There's no way to boot a guest
and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
Or similarly after migration there might not be any VFIO device available
initially when QEMU is started to accept the incoming migration. So it might
need to run in degraded mode for an extended period of time until one becomes
available for hotplugging. The use of qdev IDs makes this troublesome, as the
qdev ID of the future VFIO device would need to be decided upfront before it
even exists.

So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
would much prefer to see some way to expose an explicit relationship between
the devices to the guest.

> Disclaimers:
> 
> * I have only scratch tested this and from qemu side, it seems to be
>   working.
> * This is an RFC so it lacks some proper error handling in few cases
>   and proper resource freeing. I wanted to get some feedback first
>   before it is finalized.
> 
> Command line example:
> 
> /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> 
> Migration support:
> 
> Pre migration or during setup phase of the migration we should send an
> unplug request to the guest to unplug the primary device. I haven't had
> the chance to implement that part yet but should do soon. Do you know
> what's the best approach to do so? I wanted to have a callback to the
> virtio-net device which tries to send an unplug request to the guest and
> if succeeds then the migration continues. It needs to handle the case where
> the migration fails and then it has to replug the primary device back.

Having QEMU do this internally gets into a world of pain when you have
multiple devices in the guest.

Consider if we have 2 pairs of devices. We unplug one VFIO device, but
unplugging the second VFIO device fails, thus we try to replug the first
VFIO device but this now fails too. We don't even get as far as starting
the migration before we have to return an error.

The mgmt app will just see that the migration failed, but it will not
be sure which devices are now actually exposed to the guest OS correctly.

A similar problem hits if we have started the migration data stream but
then have to abort, and so need to try to replug on the source side but
fail for some reason.

Doing the VFIO device plugging/unplugging explicitly from the mgmt app
gives that mgmt app direct information about which devices have been
successfully made available to the guest at all times, because the mgmt
app can see the errors from each step of the process.  Trying to do
this inside QEMU doesn't achieve anything the mgmt app can't already
do, but it obscures what happens during failures.  The same applies at
the libvirt level too, which is why mgmt apps today will do the VFIO
unplug/replug on either side of migration themselves.


Regards,
Daniel
Michael S. Tsirkin Dec. 5, 2018, 5:22 p.m. UTC | #6
On Wed, Dec 05, 2018 at 06:09:16PM +0100, Peter Krempa wrote:
> From managements point of view, bundling all this together is really not
> a good idea since it creates a very big matrix of failure scenarios.

I think this is clear. This is why we are doing it in QEMU where we can
actually do all the rollbacks transparently.

> In
> general even libvirt will prefer that upper layer management drives this
> externally, since any rolback scenario will result in a policy decision
> of what to do in certain cases, and what timeouts to pick.

Architectural ugliness of implementing what is, from the user's perspective, a
mechanism and not a policy aside, experience teaches that this isn't
going to happen. People have been talking about the idea of doing
this at the upper layers for years.
Michael S. Tsirkin Dec. 5, 2018, 5:26 p.m. UTC | #7
On Wed, Dec 05, 2018 at 05:18:18PM +0000, Daniel P. Berrangé wrote:
> On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > From: Sameeh Jubran <sjubran@redhat.com>
> > 
> > Hi all,
> > 
> > Background:
> > 
> > There has been a few attempts to implement the standby feature for vfio
> > assigned devices which aims to enable the migration of such devices. This
> > is another attempt.
> > 
> > The series implements an infrastructure for hiding devices from the bus
> > upon boot. What it does is the following:
> > 
> > * In the first patch the infrastructure for hiding the device is added
> >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> >   state and it is set based on a callback to the standby device which
> >   registers itself for handling the assessment: "should the primary device
> >   be hidden?" by cross validating the ids of the devices.
> > 
> > * In the second patch the virtio-net uses the API to hide the vfio
> >   device and unhides it when the feature is acked.
> 
> IIUC, the general idea is that we want to provide a pair of associated NIC
> devices to the guest, one emulated, one physical PCI device. The guest would
> put them in a bonded pair. Before migration the PCI device is unplugged & a
> new PCI device plugged on target after migration. The guest traffic continues
> without interuption due to the emulate device.
> 
> This kind of conceptual approach can already be implemented today by management
> apps. The only hard problem that exists today is how the guest OS can figure
> out that a particular pair of devices it has are intended to be used together. 
> 
> With this series, IIUC, the virtio-net device is getting a given property which
> defines the qdev ID of the associated VFIO device. When the guest OS activates
> the virtio-net device and acknowledges the STANDBY feature bit, qdev then
> unhides the associated VFIO device.
> 
> AFAICT the guest has to infer that the device which suddenly appears is the one
> associated with the virtio-net device it just initialized, for purposes of
> setting up the NIC bonding. There doesn't appear to be any explicit assocation
> between the devices exposed to the guest.
> 
> This feels pretty fragile for a guest needing to match up devices when there
> are many pairs of devices exposed to a single guest.
> 
> Unless I'm mis-reading the patches, it looks like the VFIO device always has
> to be available at the time QEMU is started. There's no way to boot a guest
> and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.

That should be supported.

> Or similarly after migration there might not be any VFIO device available
> initially when QEMU is started to accept the incoming migration. So it might
> need to run in degraded mode for an extended period of time until one becomes
> available for hotplugging.

That should work too.

> The use of qdev IDs makes this troublesome, as the
> qdev ID of the future VFIO device would need to be decided upfront before it
> even exists.

I agree this sounds problematic.

> 
> So overall I'm not really a fan of the dynamic hiding/unhiding of devices.

Dynamic hiding is an orthogonal issue though. It's needed for
error handling in case of migration failure: we do not want to close the
VFIO device, but we do need to hide it from the guest. libvirt should not
be involved in this aspect.

> I
> would much prefer to see some way to expose an explicit relationship between
> the devices to the guest.
> 
> > Disclaimers:
> > 
> > * I have only scratch tested this and from qemu side, it seems to be
> >   working.
> > * This is an RFC so it lacks some proper error handling in few cases
> >   and proper resource freeing. I wanted to get some feedback first
> >   before it is finalized.
> > 
> > Command line example:
> > 
> > /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> > -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> > -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> > -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> > -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> > 
> > Migration support:
> > 
> > Pre migration or during setup phase of the migration we should send an
> > unplug request to the guest to unplug the primary device. I haven't had
> > the chance to implement that part yet but should do soon. Do you know
> > what's the best approach to do so? I wanted to have a callback to the
> > virtio-net device which tries to send an unplug request to the guest and
> > if succeeds then the migration continues. It needs to handle the case where
> > the migration fails and then it has to replug the primary device back.
> 
> Having QEMU do this internally gets into a world of pain when you have
> multiple devices in the guest.
> 
> Consider if we have 2 pairs of devices. We unplug one VFIO device, but
> unplugging the second VFIO device fails, thus we try to replug the first
> VFIO device but this now fails too. We don't even get as far as starting
> the migration before we have to return an error.
> 
> The mgmt app will just see that the migration failed, but it will not
> be sure which devices are now actually exposed to the guest OS correctly.
> 
> The similar problem hits if we started the migration data stream, but
> then had to abort and so need to tear try to replug in the source but
> failed for some reasons.
> 
> Doing the VFIO device plugging/unplugging explicitly from the mgmt app
> gives that mgmt app direct information about which devices have been
> successfully made available to the guest at all time, becuase the mgmt
> app can see the errors from each step of the process.  Trying to do
> this inside QEMU doesn't achieve anything the mgmt app can't already
> do, but it obscures what happens during failures.  The same applies at
> the libvirt level too, which is why mgmt apps today will do the VFIO
> unplug/replug either side of migration themselves.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Daniel P. Berrangé Dec. 5, 2018, 5:26 p.m. UTC | #8
On Wed, Dec 05, 2018 at 12:22:18PM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 05, 2018 at 06:09:16PM +0100, Peter Krempa wrote:
> > From managements point of view, bundling all this together is really not
> > a good idea since it creates a very big matrix of failure scenarios.
> 
> I think this is clear. This is why we are doing it in QEMU where we can
> actually do all the rollbacks transparently.
> 
> > In
> > general even libvirt will prefer that upper layer management drives this
> > externally, since any rolback scenario will result in a policy decision
> > of what to do in certain cases, and what timeouts to pick.
> 
> Architectural ugliness of implementing what is from users perspective a
> mechanism and not a policy aside, experience teaches that this isn't
> going to happen.  People have been talking about the idea of doing
> this at the upper layers for years.

The ability to unplug+replug VFIO devices on either side of migration
has existed in OpenStack for a long time. They also have metadata
that can be exposed to the guest to describe which pairs
of (emulated, vfio) devices should be used together.

Regards,
Daniel
Daniel P. Berrangé Dec. 5, 2018, 5:43 p.m. UTC | #9
On Wed, Dec 05, 2018 at 06:09:16PM +0100, Peter Krempa wrote:
> On Wed, Dec 05, 2018 at 10:18:29 -0600, Michael Roth wrote:
> > Quoting Sameeh Jubran (2018-10-25 13:01:10)
> > > On Thu, Oct 25, 2018 at 5:06 PM Sameeh Jubran <sameeh@daynix.com> wrote:
> > > > From: Sameeh Jubran <sjubran@redhat.com>
> > > > Migration support:
> > > >
> > > > Pre migration or during setup phase of the migration we should send an
> > > > unplug request to the guest to unplug the primary device. I haven't had
> > > > the chance to implement that part yet but should do soon. Do you know
> > > > what's the best approach to do so? I wanted to have a callback to the
> > > > virtio-net device which tries to send an unplug request to the guest and
> > > > if succeeds then the migration continues. It needs to handle the case where
> > > > the migration fails and then it has to replug the primary device back.
> > > I think that the "add_migration_state_change_notifier" API call can be used
> > > from within the virtio-net device to achieve this, what do you think?
> > 
> > I think it would be good to hear from the libvirt folks (on Cc:) on this as
> > having QEMU unplug a device without libvirt's involvement seems like it
> > could cause issues. Personally I think it seems cleaner to just have QEMU
> > handle the 'hidden' aspects of the device and leave it to QMP/libvirt to do
> > the unplug beforehand. On the libvirt side I could imagine adding an option
> > like virsh migrate --switch-to-standby-networking or something along
> > that line to do it automatically (if we decide doing it automatically is
> > even needed on that end).
> 
> I remember talking about this approach some time ago.
> 
> In general the migration itself is a very complex process which has too
> many places where it can fail. The same applies to device hotunplug.
> This series proposes to merge those two together into an even more
> complex behemoth.
> 
> Few scenarios which don't have clear solution come into my mind:
> - Since unplug request time is actually unbounded. The guest OS may
>   arbitrarily reject it or execute it at any later time, migration may get
>   stuck in a halfway state without any clear rollback or failure scenario.

IMHO this is the really big deal. Doing this inside QEMU can arbitrarily
delay the start of migration, but this is opaque to mgmt apps because it
all becomes hidden behind the migrate command. It is common for mgmt apps
to serialize migration operations, otherwise they compete for limited
network bandwidth, making it less likely that any will complete. If we're
waiting for a guest OS to do the unplug though, we don't want to be
stopping other migrations from being started in the meantime. Having
the unplugs done from the mgmt app explicitly gives it the flexibility
to decide how to order and serialize things to suit its needs.

> - After migration, device hotplug may fail for whatever reason, leaving
>   networking crippled and again no clear single-case rollback scenario.

I'd say s/crippled/degraded/. Anyway, depending on the reason that
triggered the migration, you may not even want to roll back to the source
host, despite the VFIO hotplug failing on the target.

If the original host was being evacuated in order to upgrade it to the latest
security patches, or due to hardware problems, it can be preferable to just
let the VM start running on the target host with emulated NICs only and
worry about getting working VFIO later.

> Then there's stuff which requires libvirt/management cooperation
> - picking of the network device on destination
> - making sure that the device is present etc.
> 
> From managements point of view, bundling all this together is really not
> a good idea since it creates a very big matrix of failure scenarios. In
> general even libvirt will prefer that upper layer management drives this
> externally, since any rolback scenario will result in a policy decision
> of what to do in certain cases, and what timeouts to pick.

Indeed, leaving these policy decisions to the mgmt app has generally been
the better approach, as the view of what's the best way to approach a
problem has changed over time.

Regards,
Daniel
Michael Roth Dec. 5, 2018, 8:24 p.m. UTC | #10
Quoting Daniel P. Berrangé (2018-12-05 11:18:18)
> On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > From: Sameeh Jubran <sjubran@redhat.com>
> > 
> > Hi all,
> > 
> > Background:
> > 
> > There has been a few attempts to implement the standby feature for vfio
> > assigned devices which aims to enable the migration of such devices. This
> > is another attempt.
> > 
> > The series implements an infrastructure for hiding devices from the bus
> > upon boot. What it does is the following:
> > 
> > * In the first patch the infrastructure for hiding the device is added
> >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> >   state and it is set based on a callback to the standby device which
> >   registers itself for handling the assessment: "should the primary device
> >   be hidden?" by cross validating the ids of the devices.
> > 
> > * In the second patch the virtio-net uses the API to hide the vfio
> >   device and unhides it when the feature is acked.
> 
> IIUC, the general idea is that we want to provide a pair of associated NIC
> devices to the guest, one emulated, one physical PCI device. The guest would
> put them in a bonded pair. Before migration the PCI device is unplugged & a
> new PCI device plugged on target after migration. The guest traffic continues
> without interuption due to the emulate device.
> 
> This kind of conceptual approach can already be implemented today by management
> apps. The only hard problem that exists today is how the guest OS can figure
> out that a particular pair of devices it has are intended to be used together. 
> 
> With this series, IIUC, the virtio-net device is getting a given property which
> defines the qdev ID of the associated VFIO device. When the guest OS activates
> the virtio-net device and acknowledges the STANDBY feature bit, qdev then
> unhides the associated VFIO device.
> 
> AFAICT the guest has to infer that the device which suddenly appears is the one
> associated with the virtio-net device it just initialized, for purposes of
> setting up the NIC bonding. There doesn't appear to be any explicit assocation
> between the devices exposed to the guest.
> 
> This feels pretty fragile for a guest needing to match up devices when there
> are many pairs of devices exposed to a single guest.

The impression I get from linux.git:Documentation/networking/net_failover.rst
is that the matching is done based on the primary/standby NICs having
the same MAC address. In theory you pass both to a guest, and based on the
MAC it essentially does the matching automatically; if you additionally add
STANDBY it'll know to use the virtio-net device specifically for failover.

None of this requires any sort of hiding/plugging of devices from
QEMU/libvirt (except for the VFIO unplug we'd need to initiate live migration
and the VFIO hotplug on the other end to switch back over).

That simplifies things greatly, but also introduces the problem of how
an older guest will handle seeing 2 NICs with the same MAC, which IIUC
is why this series is looking at hotplugging the VFIO device only after
we confirm STANDBY is supported by the virtio-net device, and why it's
being done transparently to management.

> 
> Unless I'm mis-reading the patches, it looks like the VFIO device always has
> to be available at the time QEMU is started. There's no way to boot a guest
> and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
> Or similarly after migration there might not be any VFIO device available
> initially when QEMU is started to accept the incoming migration. So it might
> need to run in degraded mode for an extended period of time until one becomes
> available for hotplugging. The use of qdev IDs makes this troublesome, as the
> qdev ID of the future VFIO device would need to be decided upfront before it
> even exists.

> 
> So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
> would much prefer to see some way to expose an explicit relationship between
> the devices to the guest.

If we place the burden of determining whether the guest supports STANDBY
on users/management, a lot of this complexity goes away. For
instance, one possible implementation is to simply fail migration and say
"sorry your VFIO device is still there" if the VFIO device is still around
at the start of migration (whether due to unplug failure or a
user/management forgetting to do it manually beforehand).

So how important is it that setting the F_STANDBY cap doesn't break older
guests? If the idea is to support live migration with VFs then aren't
we still dead in the water if the guest boots okay but doesn't have
the requisite functionality to be migrated later? Shouldn't that all
be sorted out as early as possible? Is a very clear QEMU error message
in this case insufficient?

And if backward compatibility is important, are there alternative
approaches? Like maybe starting off with a dummy MAC and switching over
to the duplicate MAC only after F_STANDBY is negotiated? In that case
we could still warn users/management about it but still have the guest
be otherwise functional.

> 
> > Disclaimers:
> > 
> > * I have only scratch tested this and from qemu side, it seems to be
> >   working.
> > * This is an RFC so it lacks some proper error handling in few cases
> >   and proper resource freeing. I wanted to get some feedback first
> >   before it is finalized.
> > 
> > Command line example:
> > 
> > /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> > -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> > -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> > -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> > -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> > 
> > Migration support:
> > 
> > Pre migration or during setup phase of the migration we should send an
> > unplug request to the guest to unplug the primary device. I haven't had
> > the chance to implement that part yet but should do soon. Do you know
> > what's the best approach to do so? I wanted to have a callback to the
> > virtio-net device which tries to send an unplug request to the guest and
> > if succeeds then the migration continues. It needs to handle the case where
> > the migration fails and then it has to replug the primary device back.
> 
> Having QEMU do this internally gets into a world of pain when you have
> multiple devices in the guest.
> 
> Consider if we have 2 pairs of devices. We unplug one VFIO device, but
> unplugging the second VFIO device fails, thus we try to replug the first
> VFIO device but this now fails too. We don't even get as far as starting
> the migration before we have to return an error.
> 
> The mgmt app will just see that the migration failed, but it will not
> be sure which devices are now actually exposed to the guest OS correctly.
> 
> The similar problem hits if we started the migration data stream, but
> then had to abort and so need to tear try to replug in the source but
> failed for some reasons.
> 
> Doing the VFIO device plugging/unplugging explicitly from the mgmt app
> gives that mgmt app direct information about which devices have been
> successfully made available to the guest at all time, becuase the mgmt
> app can see the errors from each step of the process.  Trying to do
> this inside QEMU doesn't achieve anything the mgmt app can't already
> do, but it obscures what happens during failures.  The same applies at
> the libvirt level too, which is why mgmt apps today will do the VFIO
> unplug/replug either side of migration themselves.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Michael Roth Dec. 5, 2018, 8:44 p.m. UTC | #11
Quoting Michael Roth (2018-12-05 14:24:32)
> Quoting Daniel P. Berrangé (2018-12-05 11:18:18)
> > On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > > From: Sameeh Jubran <sjubran@redhat.com>
> > > 
> > > Hi all,
> > > 
> > > Background:
> > > 
> > > There has been a few attempts to implement the standby feature for vfio
> > > assigned devices which aims to enable the migration of such devices. This
> > > is another attempt.
> > > 
> > > The series implements an infrastructure for hiding devices from the bus
> > > upon boot. What it does is the following:
> > > 
> > > * In the first patch the infrastructure for hiding the device is added
> > >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > >   state and it is set based on a callback to the standby device which
> > >   registers itself for handling the assessment: "should the primary device
> > >   be hidden?" by cross validating the ids of the devices.
> > > 
> > > * In the second patch the virtio-net uses the API to hide the vfio
> > >   device and unhides it when the feature is acked.
> > 
> > IIUC, the general idea is that we want to provide a pair of associated NIC
> > devices to the guest, one emulated, one physical PCI device. The guest would
> > put them in a bonded pair. Before migration the PCI device is unplugged & a
> > new PCI device plugged on target after migration. The guest traffic continues
> > without interuption due to the emulate device.
> > 
> > This kind of conceptual approach can already be implemented today by management
> > apps. The only hard problem that exists today is how the guest OS can figure
> > out that a particular pair of devices it has are intended to be used together. 
> > 
> > With this series, IIUC, the virtio-net device is getting a given property which
> > defines the qdev ID of the associated VFIO device. When the guest OS activates
> > the virtio-net device and acknowledges the STANDBY feature bit, qdev then
> > unhides the associated VFIO device.
> > 
> > AFAICT the guest has to infer that the device which suddenly appears is the one
> > associated with the virtio-net device it just initialized, for purposes of
> > setting up the NIC bonding. There doesn't appear to be any explicit assocation
> > between the devices exposed to the guest.
> > 
> > This feels pretty fragile for a guest needing to match up devices when there
> > are many pairs of devices exposed to a single guest.
> 
> The impression I get from linux.git:Documentation/networking/net_failover.rst 
> is that the matching is done based on the primary/standby NICs having
> the same MAC address. In theory you pass both to a guest and based on
> MAC it essentially does automatic, and if you additionally add STANDBY
> it'll know to use a virtio-net device specifically for failover.
> 
> None of this requires any sort of hiding/plugging of devices from
> QEMU/libvirt (except for the VFIO unplug we'd need to initiate live migration
> and the VFIO hotplug on the other end to switch back over).
> 
> That simplifies things greatly, but also introduces the problem of how
> an older guest will handle seeing 2 NICs with the same MAC, which IIUC
> is why this series is looking at hotplugging the VFIO device only after
> we confirm STANDBY is supported by the virtio-net device, and why it's
> being done transparent to management.
> 
> > 
> > Unless I'm mis-reading the patches, it looks like the VFIO device always has
> > to be available at the time QEMU is started. There's no way to boot a guest
> > and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
> > Or similarly after migration there might not be any VFIO device available
> > initially when QEMU is started to accept the incoming migration. So it might
> > need to run in degraded mode for an extended period of time until one becomes
> > available for hotplugging. The use of qdev IDs makes this troublesome, as the
> > qdev ID of the future VFIO device would need to be decided upfront before it
> > even exists.
> 
> > 
> > So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
> > would much prefer to see some way to expose an explicit relationship between
> > the devices to the guest.
> 
> If we place the burden of determining whether the guest supports STANDBY
> on the part of users/management, a lot of this complexity goes away. For
> instance, one possible implementation is to simply fail migration and say
> "sorry your VFIO device is still there" if the VFIO device is still around
> at the start of migration (whether due to unplug failure or a
> user/management forgetting to do it manually beforehand).
> 
> So how important is it that setting F_STANDBY cap doesn't break older
> guests? If the idea is to support live migration with VFs then aren't
> we still dead in the water if the guest boots okay but doesn't have
> the requisite functionality to be migrated later? Shouldn't that all

Well, I guess that's not really the scenario with this approach. Instead
they'd run with degraded network performance but could still at least be
migrated.
Michael S. Tsirkin Dec. 5, 2018, 8:57 p.m. UTC | #12
On Wed, Dec 05, 2018 at 02:24:32PM -0600, Michael Roth wrote:
> Quoting Daniel P. Berrangé (2018-12-05 11:18:18)
> > On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > > From: Sameeh Jubran <sjubran@redhat.com>
> > > 
> > > Hi all,
> > > 
> > > Background:
> > > 
> > > There has been a few attempts to implement the standby feature for vfio
> > > assigned devices which aims to enable the migration of such devices. This
> > > is another attempt.
> > > 
> > > The series implements an infrastructure for hiding devices from the bus
> > > upon boot. What it does is the following:
> > > 
> > > * In the first patch the infrastructure for hiding the device is added
> > >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > >   state and it is set based on a callback to the standby device which
> > >   registers itself for handling the assessment: "should the primary device
> > >   be hidden?" by cross validating the ids of the devices.
> > > 
> > > * In the second patch the virtio-net uses the API to hide the vfio
> > >   device and unhides it when the feature is acked.
> > 
> > IIUC, the general idea is that we want to provide a pair of associated NIC
> > devices to the guest, one emulated, one physical PCI device. The guest would
> > put them in a bonded pair. Before migration the PCI device is unplugged & a
> > new PCI device plugged on target after migration. The guest traffic continues
> > without interuption due to the emulate device.
> > 
> > This kind of conceptual approach can already be implemented today by management
> > apps. The only hard problem that exists today is how the guest OS can figure
> > out that a particular pair of devices it has are intended to be used together. 
> > 
> > With this series, IIUC, the virtio-net device is getting a given property which
> > defines the qdev ID of the associated VFIO device. When the guest OS activates
> > the virtio-net device and acknowledges the STANDBY feature bit, qdev then
> > unhides the associated VFIO device.
> > 
> > AFAICT the guest has to infer that the device which suddenly appears is the one
> > associated with the virtio-net device it just initialized, for purposes of
> > setting up the NIC bonding. There doesn't appear to be any explicit assocation
> > between the devices exposed to the guest.
> > 
> > This feels pretty fragile for a guest needing to match up devices when there
> > are many pairs of devices exposed to a single guest.
> 
> The impression I get from linux.git:Documentation/networking/net_failover.rst 
> is that the matching is done based on the primary/standby NICs having
> the same MAC address. In theory you pass both to a guest and based on
> MAC it essentially does automatic, and if you additionally add STANDBY
> it'll know to use a virtio-net device specifically for failover.
> 
> None of this requires any sort of hiding/plugging of devices from
> QEMU/libvirt (except for the VFIO unplug we'd need to initiate live migration
> and the VFIO hotplug on the other end to switch back over).
> 
> That simplifies things greatly, but also introduces the problem of how
> an older guest will handle seeing 2 NICs with the same MAC, which IIUC
> is why this series is looking at hotplugging the VFIO device only after
> we confirm STANDBY is supported by the virtio-net device, and why it's
> being done transparent to management.

Exactly, thanks for the summary.

> > 
> > Unless I'm mis-reading the patches, it looks like the VFIO device always has
> > to be available at the time QEMU is started. There's no way to boot a guest
> > and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
> > Or similarly after migration there might not be any VFIO device available
> > initially when QEMU is started to accept the incoming migration. So it might
> > need to run in degraded mode for an extended period of time until one becomes
> > available for hotplugging. The use of qdev IDs makes this troublesome, as the
> > qdev ID of the future VFIO device would need to be decided upfront before it
> > even exists.
> 
> > 
> > So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
> > would much prefer to see some way to expose an explicit relationship between
> > the devices to the guest.
> 
> If we place the burden of determining whether the guest supports STANDBY
> on the part of users/management, a lot of this complexity goes away. For
> instance, one possible implementation is to simply fail migration and say
> "sorry your VFIO device is still there" if the VFIO device is still around
> at the start of migration (whether due to unplug failure or a
> user/management forgetting to do it manually beforehand).

It's a bit different. What happens is that migration just doesn't finish,
same as it sometimes doesn't when the guest dirties too much memory.
Upper layers usually handle that in a way similar to what you describe.
If it's desirable that the reason for migration not finishing is
reported to the user, we can add that information for sure. Though most
users likely won't care.

> So how important is it that setting F_STANDBY cap doesn't break older
> guests? If the idea is to support live migration with VFs then aren't
> we still dead in the water if the guest boots okay but doesn't have
> the requisite functionality to be migrated later?

No, because such a legacy guest will never see the PT device at all, so it
can migrate.

> Shouldn't that all
> be sorted out as early as possible? Is a very clear QEMU error message
> in this case insufficient?
> 
> And if backward compatibility is important, are there alternative
> approaches? Like maybe starting off with a dummy MAC and switching over
> to the duplicate MAC only after F_STANDBY is negotiated? In that case
> we could still warn users/management about it but still have the guest
> be otherwise functional.
> 
> > 
> > > Disclaimers:
> > > 
> > > * I have only scratch tested this and from qemu side, it seems to be
> > >   working.
> > > * This is an RFC so it lacks some proper error handling in few cases
> > >   and proper resource freeing. I wanted to get some feedback first
> > >   before it is finalized.
> > > 
> > > Command line example:
> > > 
> > > /home/sameeh/Builds/failover/qemu/x86_64-softmmu/qemu-system-x86_64 \
> > > -netdev tap,id=hostnet0,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_71 \
> > > -netdev tap,vhost=on,id=hostnet1,script=world_bridge_standalone.sh,downscript=no,ifname=cc1_72,queues=4 \
> > > -device virtio-net,host_mtu=1500,netdev=hostnet1,id=cc1_72,vectors=10,mq=on,primary=cc1_71 \
> > > -device e1000,netdev=hostnet0,id=cc1_71,standby=cc1_72 \
> > > 
> > > Migration support:
> > > 
> > > Pre migration or during setup phase of the migration we should send an
> > > unplug request to the guest to unplug the primary device. I haven't had
> > > the chance to implement that part yet but should do soon. Do you know
> > > what's the best approach to do so? I wanted to have a callback to the
> > > virtio-net device which tries to send an unplug request to the guest and
> > > if succeeds then the migration continues. It needs to handle the case where
> > > the migration fails and then it has to replug the primary device back.
> > 
> > Having QEMU do this internally gets into a world of pain when you have
> > multiple devices in the guest.
> > 
> > Consider if we have 2 pairs of devices. We unplug one VFIO device, but
> > unplugging the second VFIO device fails, thus we try to replug the first
> > VFIO device but this now fails too. We don't even get as far as starting
> > the migration before we have to return an error.
> > 
> > The mgmt app will just see that the migration failed, but it will not
> > be sure which devices are now actually exposed to the guest OS correctly.
> > 
> > The similar problem hits if we started the migration data stream, but
> > then had to abort and so need to tear try to replug in the source but
> > failed for some reasons.
> > 
> > Doing the VFIO device plugging/unplugging explicitly from the mgmt app
> > gives that mgmt app direct information about which devices have been
> > successfully made available to the guest at all time, becuase the mgmt
> > app can see the errors from each step of the process.  Trying to do
> > this inside QEMU doesn't achieve anything the mgmt app can't already
> > do, but it obscures what happens during failures.  The same applies at
> > the libvirt level too, which is why mgmt apps today will do the VFIO
> > unplug/replug either side of migration themselves.
> > 
> > 
> > Regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> >
Michael S. Tsirkin Dec. 5, 2018, 8:58 p.m. UTC | #13
On Wed, Dec 05, 2018 at 02:44:38PM -0600, Michael Roth wrote:
> > So how important is it that setting F_STANDBY cap doesn't break older
> > guests? If the idea is to support live migration with VFs then aren't
> > we still dead in the water if the guest boots okay but doesn't have
> > the requisite functionality to be migrated later? Shouldn't that all
> 
> Well, I guess that's not really the scenario with this approach. Instead
> they'd run with degraded network performance but could still at least be
> migrated.

Thanks, that's a good summary. And instead of degraded we call it
un-accelerated.
Daniel P. Berrangé Dec. 6, 2018, 10:01 a.m. UTC | #14
On Wed, Dec 05, 2018 at 03:57:14PM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 05, 2018 at 02:24:32PM -0600, Michael Roth wrote:
> > Quoting Daniel P. Berrangé (2018-12-05 11:18:18)
> > > 
> > > Unless I'm mis-reading the patches, it looks like the VFIO device always has
> > > to be available at the time QEMU is started. There's no way to boot a guest
> > > and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
> > > Or similarly after migration there might not be any VFIO device available
> > > initially when QEMU is started to accept the incoming migration. So it might
> > > need to run in degraded mode for an extended period of time until one becomes
> > > available for hotplugging. The use of qdev IDs makes this troublesome, as the
> > > qdev ID of the future VFIO device would need to be decided upfront before it
> > > even exists.
> > 
> > > 
> > > So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
> > > would much prefer to see some way to expose an explicit relationship between
> > > the devices to the guest.
> > 
> > If we place the burden of determining whether the guest supports STANDBY
> > on the part of users/management, a lot of this complexity goes away. For
> > instance, one possible implementation is to simply fail migration and say
> > "sorry your VFIO device is still there" if the VFIO device is still around
> > at the start of migration (whether due to unplug failure or a
> > user/management forgetting to do it manually beforehand).
> 
> It's a bit different. What happens is that migration just doesn't
> finish. Same as it sometimes doesn't when guest dirties too much memory.
> Upper layers usually handle that in a way similar to what you describe.
> If it's desirable that the reason for migration not finishing is
> reported to user, we can add that information for sure. Though most
> users likely won't care.

Users absolutely *do* care why migration is not finishing. A migration that
does not finish is a major problem for mgmt apps in many of the use
cases for migration. Especially important when evacuating VMs from a host
in order to do a software upgrade or replace faulty hardware. As mentioned
previously, they will also often serialize migrations to prevent the network
being overutilized, so a migration that runs indefinitely will stall
evacuation of additional VMs too.  Predictable execution of migration and
clear error reporting/handling are critical features. IMHO this is the key
reason VFIO unplug/plug needs to be done explicitly by the mgmt app, so it
can be in control over when each part of the process takes place.
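
For illustration, the explicit flow driven by the mgmt app could look
roughly like this (QEMU monitor commands; the "hostdev0" id and the host
PCI address are just placeholders, not something defined by this series):

  (qemu) device_del hostdev0
     ... wait for the DEVICE_DELETED event before going further ...
  (qemu) migrate -d tcp:dst-host:4444
     ... if the migration fails or is cancelled, replug on the source ...
  (qemu) device_add vfio-pci,host=02:00.0,id=hostdev0

Each step either succeeds or fails in a way the mgmt app can observe and
react to, which is exactly what gets obscured if QEMU does the
unplug/replug internally.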

> > So how important is it that setting F_STANDBY cap doesn't break older
> > guests? If the idea is to support live migration with VFs then aren't
> > we still dead in the water if the guest boots okay but doesn't have
> > the requisite functionality to be migrated later?
> 
> No, because such a legacy guest will never see the PT device at all.  So it
> can migrate.

PCI devices are a precious, finite resource. If a guest is not going to use
the VFIO device, we must never add it to QEMU in the first place. Adding a
PCI device that is never activated wastes precious resources, preventing
other guests that need PCI devices from being launched on the host.

Regards,
Daniel
Daniel P. Berrangé Dec. 6, 2018, 10:06 a.m. UTC | #15
On Wed, Dec 05, 2018 at 02:24:32PM -0600, Michael Roth wrote:
> Quoting Daniel P. Berrangé (2018-12-05 11:18:18)
> > On Thu, Oct 25, 2018 at 05:06:29PM +0300, Sameeh Jubran wrote:
> > > From: Sameeh Jubran <sjubran@redhat.com>
> > > 
> > > Hi all,
> > > 
> > > Background:
> > > 
> > > There has been a few attempts to implement the standby feature for vfio
> > > assigned devices which aims to enable the migration of such devices. This
> > > is another attempt.
> > > 
> > > The series implements an infrastructure for hiding devices from the bus
> > > upon boot. What it does is the following:
> > > 
> > > * In the first patch the infrastructure for hiding the device is added
> > >   for the qbus and qdev APIs. A "hidden" boolean is added to the device
> > >   state and it is set based on a callback to the standby device which
> > >   registers itself for handling the assessment: "should the primary device
> > >   be hidden?" by cross validating the ids of the devices.
> > > 
> > > * In the second patch the virtio-net uses the API to hide the vfio
> > >   device and unhides it when the feature is acked.
> > 
> > IIUC, the general idea is that we want to provide a pair of associated NIC
> > devices to the guest, one emulated, one physical PCI device. The guest would
> > put them in a bonded pair. Before migration the PCI device is unplugged & a
> > new PCI device plugged on target after migration. The guest traffic continues
> > without interuption due to the emulate device.
> > 
> > This kind of conceptual approach can already be implemented today by management
> > apps. The only hard problem that exists today is how the guest OS can figure
> > out that a particular pair of devices it has are intended to be used together. 
> > 
> > With this series, IIUC, the virtio-net device is getting a given property which
> > defines the qdev ID of the associated VFIO device. When the guest OS activates
> > the virtio-net device and acknowledges the STANDBY feature bit, qdev then
> > unhides the associated VFIO device.
> > 
> > AFAICT the guest has to infer that the device which suddenly appears is the one
> > associated with the virtio-net device it just initialized, for purposes of
> > setting up the NIC bonding. There doesn't appear to be any explicit association
> > between the devices exposed to the guest.
> > 
> > This feels pretty fragile for a guest needing to match up devices when there
> > are many pairs of devices exposed to a single guest.
> 
> The impression I get from linux.git:Documentation/networking/net_failover.rst 
> is that the matching is done based on the primary/standby NICs having
> the same MAC address. In theory you pass both to a guest and based on
> MAC it essentially does automatic, and if you additionally add STANDBY
> it'll know to use a virtio-net device specifically for failover.
> 
> None of this requires any sort of hiding/plugging of devices from
> QEMU/libvirt (except for the VFIO unplug we'd need to initiate live migration
> and the VFIO hotplug on the other end to switch back over).
> 
> That simplifies things greatly, but also introduces the problem of how
> an older guest will handle seeing 2 NICs with the same MAC, which IIUC
> is why this series is looking at hotplugging the VFIO device only after
> we confirm STANDBY is supported by the virtio-net device, and why it's
> being done transparently to management.
>
> > 
> > Unless I'm mis-reading the patches, it looks like the VFIO device always has
> > to be available at the time QEMU is started. There's no way to boot a guest
> > and then later hotplug a VFIO device to accelerate the existing virtio-net NIC.
> > Or similarly after migration there might not be any VFIO device available
> > initially when QEMU is started to accept the incoming migration. So it might
> > need to run in degraded mode for an extended period of time until one becomes
> > available for hotplugging. The use of qdev IDs makes this troublesome, as the
> > qdev ID of the future VFIO device would need to be decided upfront before it
> > even exists.
> 
> > 
> > So overall I'm not really a fan of the dynamic hiding/unhiding of devices. I
> > would much prefer to see some way to expose an explicit relationship between
> > the devices to the guest.
> 
> If we place the burden of determining whether the guest supports STANDBY
> on the part of users/management, a lot of this complexity goes away. For
> instance, one possible implementation is to simply fail migration and say
> "sorry your VFIO device is still there" if the VFIO device is still around
> at the start of migration (whether due to unplug failure or a
> user/management forgetting to do it manually beforehand).
> 
> So how important is it that setting F_STANDBY cap doesn't break older
> guests? If the idea is to support live migration with VFs then aren't
> we still dead in the water if the guest boots okay but doesn't have
> the requisite functionality to be migrated later? Shouldn't that all
> be sorted out as early as possible? Is a very clear QEMU error message
> in this case insufficient?
> 
> And if backward compatibility is important, are there alternative
> approaches? Like maybe starting off with a dummy MAC and switching over
> to the duplicate MAC only after F_STANDBY is negotiated? In that case
> we could still warn users/management about it but still have the guest
> be otherwise functional.

Relying on F_STANDBY negotiation to decide whether to activate the VFIO
device is a bad idea. PCI devices are precious, so if the guest OS does
not support this standby feature, we must never add the VFIO device to
QEMU in the first place.

We have the libosinfo project which provides metadata on what features
different guest OS versions support. This can be used to indicate whether
a guest OS version supports the standby NIC concept and thus avoid needing
to allocate PCI devices to guests that will never use them.

F_STANDBY is still useful as a flag to inform the guest OS that it should
pair up NICs with identical MACs, as opposed to configuring them separately.
It shouldn't be used to show/hide the device though, we should simply never
add the 2nd device if we know it won't be used by a given guest OS version.
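
As a rough sketch of that model (the MAC and host PCI address below are
placeholders only): the host admin gives the VF the same MAC that the
standby virtio-net device will use, and QEMU simply exposes both NICs,
leaving the pairing to the guest's net_failover logic:

  # host: assign the VF the MAC the standby NIC will use
  ip link set <pf-ifname> vf 0 mac 52:54:00:11:22:33

  # QEMU: expose both devices, no hiding involved
  -device virtio-net-pci,netdev=hostnet1,mac=52:54:00:11:22:33 \
  -device vfio-pci,host=02:10.0,id=hostdev0 \

The virtio-net device would additionally offer F_STANDBY (however that
ends up being spelled on the command line) so the guest knows the
duplicate MAC is intentional.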


Regards,
Daniel
Eduardo Habkost Dec. 7, 2018, 4:36 p.m. UTC | #16
On Thu, Dec 06, 2018 at 10:06:18AM +0000, Daniel P. Berrangé wrote:
> [...]
> 
> Relying on F_STANDBY negotiation to decide whether to activate the VFIO
> device is a bad idea. PCI devices are precious, so if the guest OS does
> not support this standby feature, we must never add the VFIO device to
> QEMU in the first place.
> 
> We have the libosinfo project which provides metadata on what features
> different guest OS versions support. This can be used to indicate whether
> a guest OS version supports the standby NIC concept and thus avoid needing
> to allocate PCI devices to guests that will never use them.
> 
> F_STANDBY is still useful as a flag to inform the guest OS that it should
> pair up NICs with identical MACs, as opposed to configuring them separately.
> It shouldn't be used to show/hide the device though, we should simply never
> add the 2nd device if we know it won't be used by a given guest OS version.

The two mechanisms are not exclusive.  Not wasting a PCI device
if the guest OS won't use it is a good idea.  Making the guest
behave gracefully even when an older driver is loaded is also
useful.
Daniel P. Berrangé Dec. 7, 2018, 4:46 p.m. UTC | #17
On Fri, Dec 07, 2018 at 02:36:07PM -0200, Eduardo Habkost wrote:
> [...]
> 
> The two mechanisms are not exclusive.  Not wasting a PCI device
> if the guest OS won't use it is a good idea.  Making the guest
> behave gracefully even when an older driver is loaded is also
> useful.

I'm not convinced it is useful enough to justify playing games in qdev
with dynamically hiding devices. This adds complexity to the code which
will make it harder to maintain and debug at runtime.


Regards,
Daniel
Roman Kagan Dec. 7, 2018, 5:50 p.m. UTC | #18
On Thu, Dec 06, 2018 at 10:06:18AM +0000, Daniel P. Berrangé wrote:
> [...]
> 
> Relying on F_STANDBY negotiation to decide whether to activate the VFIO
> device is a bad idea.  PCI devices are precious, so if the guest OS does
> not support this standby feature, we must never add the VFIO device to
> QEMU in the first place.

But it can be a good idea if the upper layer can see the result of this
negotiation and only grab the VFIO device and plug it in when it sees
the guest acknowledge the support for it.

There was an attempt to expose acknowledged virtio features in QMP/HMP
about a year ago (purely for debugging back then); it can be resurrected
and adapted for this purpose.
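
Just to illustrate (the command name and output below are made up, nothing
like this exists today): the upper layer would query something like

  (qemu) info virtio-net-features cc1_72
  guest features: VIRTIO_NET_F_MQ VIRTIO_NET_F_STANDBY ...

and only issue the device_add for the VFIO device once it sees F_STANDBY
in the acked feature set.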

> We have the libosinfo project which provides metadata on what features
> different guest OS versions support. This can be used to indicate whether
> a guest OS version supports the standby NIC concept and thus avoid needing
> to allocate PCI devices to guests that will never use them.

Not sure how reliable this is.

> F_STANDBY is still useful as a flag to inform the guest OS that it should
> pair up NICs with identical MACs, as opposed to configuring them separately.
> It shouldn't be used to show/hide the device though, we should simply never
> add the 2nd device if we know it won't be used by a given guest OS version.

Agreed.

Roman.
Michael S. Tsirkin Dec. 7, 2018, 6:20 p.m. UTC | #19
On Thu, Dec 06, 2018 at 10:06:18AM +0000, Daniel P. Berrangé wrote:
> [...]
> 
> Relying on F_STANDBY negotiation to decide whether to activate the VFIO
> device is a bad idea. PCI devices are precious, so if the guest OS does
> not support this standby feature, we must never add the VFIO device to
> QEMU in the first place.
> 
> We have the libosinfo project which provides metadata on what features
> different guest OS versions support. This can be used to indicate whether
> a guest OS version supports the standby NIC concept and thus avoid needing
> to allocate PCI devices to guests that will never use them.
> 
> F_STANDBY is still useful as a flag to inform the guest OS that it should
> pair up NICs with identical MACs, as opposed to configuring them separately.
> It shouldn't be used to show/hide the device though, we should simply never
> add the 2nd device if we know it won't be used by a given guest OS version.
> 
> 
> Regards,
> Daniel

I think what you say about using libosinfo to preserve PCI resources
is a very good idea.

I do however disagree with relying on it for correctness.

For example, it seems very reasonable to only use virtio
during boot and then only enable a PT device once the OS is active.
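
Whichever layer ends up driving it, enabling the PT device after boot is
essentially just a hotplug (id and host address below are made up):

  (qemu) device_add vfio-pci,host=02:10.0,id=hostdev0

issued once the guest has acked F_STANDBY.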

In short, we need to minimise the guest and management smarts required for
basic functionality.  What you propose seems to be going back to the
original ideas involving everything from guest firmware up to the
higher-level management stack. I don't see a problem with someone working on
such designs but it didn't happen in the 5 years since it was proposed
for Fedora.



Michael S. Tsirkin Dec. 7, 2018, 6:26 p.m. UTC | #20
On Fri, Dec 07, 2018 at 04:46:29PM +0000, Daniel P. Berrangé wrote:
> I'm not convinced it is useful enough to justify playing games in qdev
> with dynamically hiding devices. This adds complexity to the code which
> will make it harder to maintain and debug at runtime.

I actually think a hidden device is a useful concept to model.
E.g. you can have a powered-off slot, and a PCI device in
such a slot isn't visible but isn't gone either.

Right now we force-eject such devices.

But it sounds reasonable that e.g. a bunch of guests could cooperate
to share an assigned device, with whoever wants to use it powering it
up. These patches do not implement this, of course, but they are a step
in that direction.