qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
diff mbox

Message ID 1525734594-11134-1-git-send-email-sridhar.samudrala@intel.com
State New
Headers show

Commit Message

Samudrala, Sridhar May 7, 2018, 11:09 p.m. UTC
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

I tested this with a small change to the patch to mark the STANDBY feature 'true'
by default as i am using libvirt to start the VMs.
Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
XML file?

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 hw/net/virtio-net.c                         | 2 ++
 include/standard-headers/linux/virtio_net.h | 3 +++
 2 files changed, 5 insertions(+)

Comments

Samudrala, Sridhar June 5, 2018, 1:41 a.m. UTC | #1
Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
https://patchwork.ozlabs.org/cover/920005/


On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a standby for another device with the same MAC address.
>
> I tested this with a small change to the patch to mark the STANDBY feature 'true'
> by default as i am using libvirt to start the VMs.
> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> XML file?
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---
>   hw/net/virtio-net.c                         | 2 ++
>   include/standard-headers/linux/virtio_net.h | 3 +++
>   2 files changed, 5 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 90502fca7c..38b3140670 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>                        true),
>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> +                      false),
>       DEFINE_PROP_END_OF_LIST(),
>   };
>   
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index e9f255ea3f..01ec09684c 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>   					 * Steering */
>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>   
> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> +                                         * with the same MAC.
> +                                         */
>   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>   
>   #ifndef VIRTIO_NET_NO_LEGACY
Jason Wang June 5, 2018, 2:06 a.m. UTC | #2
On 2018年06月05日 09:41, Samudrala, Sridhar wrote:
> Ping on this patch now that the kernel patches are accepted into 
> davem's net-next tree.
> https://patchwork.ozlabs.org/cover/920005/
>
>
> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net 
>> device to
>> act as a standby for another device with the same MAC address.
>>
>> I tested this with a small change to the patch to mark the STANDBY 
>> feature 'true'
>> by default as i am using libvirt to start the VMs.
>> Is there a way to pass the newly added feature bit 'standby' to qemu 
>> via libvirt
>> XML file?
>>

Maybe you can try qemu command line passthrough:

https://libvirt.org/drvqemu.html#qemucommand

The patch looks good to me. Let's see if Michael have any comment.

Thanks

>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>>   hw/net/virtio-net.c                         | 2 ++
>>   include/standard-headers/linux/virtio_net.h | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 90502fca7c..38b3140670 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>                        true),
>>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, 
>> SPEED_UNKNOWN),
>>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, 
>> VIRTIO_NET_F_STANDBY,
>> +                      false),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   diff --git a/include/standard-headers/linux/virtio_net.h 
>> b/include/standard-headers/linux/virtio_net.h
>> index e9f255ea3f..01ec09684c 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>>                        * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>>   +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for 
>> another device
>> +                                         * with the same MAC.
>> +                                         */
>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set linkspeed and 
>> duplex */
>>     #ifndef VIRTIO_NET_NO_LEGACY
>
Michael S. Tsirkin June 5, 2018, 12:33 p.m. UTC | #3
I don't think this is sufficient.

If both primary and standby devices are present, a legacy guest without
support for the feature might see two devices with same mac and get
confused.

I think that we should only make primary visible after guest acked the
backup feature bit.

And on reset or when backup is cleared in some other way, unplug the
primary.

Something like the below will do the job:

Primary device is added with a special "primary-failover" flag.
A virtual machine is then initialized with just a standby virtio
device. Primary is not yet added.

Later QEMU detects that guest driver device set DRIVER_OK.
It then exposes the primary device to the guest, and triggers 
a device addition event (hot-plug event) for it.

If QEMU detects guest driver removal, it initiates a hot-unplug sequence
to remove the primary driver.  In particular, if QEMU detects guest
re-initialization (e.g. by detecting guest reset) it immediately removes
the primary device.

We can move some of this code to management as well, architecturally it
does not make too much sense but it might be easier implementation-wise.

HTH

On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
> https://patchwork.ozlabs.org/cover/920005/
> 
> 
> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > This feature bit can be used by hypervisor to indicate virtio_net device to
> > act as a standby for another device with the same MAC address.
> > 
> > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > by default as i am using libvirt to start the VMs.
> > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > XML file?
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> >   hw/net/virtio-net.c                         | 2 ++
> >   include/standard-headers/linux/virtio_net.h | 3 +++
> >   2 files changed, 5 insertions(+)
> > 
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index 90502fca7c..38b3140670 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> >                        true),
> >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > +                      false),
> >       DEFINE_PROP_END_OF_LIST(),
> >   };
> > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > index e9f255ea3f..01ec09684c 100644
> > --- a/include/standard-headers/linux/virtio_net.h
> > +++ b/include/standard-headers/linux/virtio_net.h
> > @@ -57,6 +57,9 @@
> >   					 * Steering */
> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > +                                         * with the same MAC.
> > +                                         */
> >   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> >   #ifndef VIRTIO_NET_NO_LEGACY
Samudrala, Sridhar June 5, 2018, 8:20 p.m. UTC | #4
On 6/5/2018 5:33 AM, Michael S. Tsirkin wrote:
> I don't think this is sufficient.

Sure. This is not sufficient for a complete solution, but is Qemu the right place
to manage primary/standby interfaces?

I think the other steps including plugging/unplugging the primary interface needs
to handled by some orchestration layer.


>
> If both primary and standby devices are present, a legacy guest without
> support for the feature might see two devices with same mac and get
> confused.

Yes. Isn't this possible today when a VM is started with a mis-configured domain
XML file that passes a virtio-net interface and a VF with the same MAC?


>
> I think that we should only make primary visible after guest acked the
> backup feature bit.

The primary can be plugged/unplugged at any time by the management layer. So
i guess this needs to be handled in the libvirt layer.

>
> And on reset or when backup is cleared in some other way, unplug the
> primary.
>
> Something like the below will do the job:
>
> Primary device is added with a special "primary-failover" flag.

Are you suggesting an extension to Qemu to add another flag for
primary device too?

> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
>
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.
>
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.
>
> We can move some of this code to management as well, architecturally it
> does not make too much sense but it might be easier implementation-wise.
>
> HTH
>
> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>> https://patchwork.ozlabs.org/cover/920005/
>>
>>
>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>> act as a standby for another device with the same MAC address.
>>>
>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>> by default as i am using libvirt to start the VMs.
>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>> XML file?
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> ---
>>>    hw/net/virtio-net.c                         | 2 ++
>>>    include/standard-headers/linux/virtio_net.h | 3 +++
>>>    2 files changed, 5 insertions(+)
>>>
>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>> index 90502fca7c..38b3140670 100644
>>> --- a/hw/net/virtio-net.c
>>> +++ b/hw/net/virtio-net.c
>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>                         true),
>>>        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>> +                      false),
>>>        DEFINE_PROP_END_OF_LIST(),
>>>    };
>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>> index e9f255ea3f..01ec09684c 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> +++ b/include/standard-headers/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>    					 * Steering */
>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>> +                                         * with the same MAC.
>>> +                                         */
>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>    #ifndef VIRTIO_NET_NO_LEGACY
Michael S. Tsirkin June 5, 2018, 8:37 p.m. UTC | #5
On Tue, Jun 05, 2018 at 01:20:33PM -0700, Samudrala, Sridhar wrote:
> 
> On 6/5/2018 5:33 AM, Michael S. Tsirkin wrote:
> > I don't think this is sufficient.
> 
> Sure. This is not sufficient for a complete solution, but is Qemu the right place
> to manage primary/standby interfaces?
> 
> I think the other steps including plugging/unplugging the primary interface needs
> to handled by some orchestration layer.

I don't see any real benefit to it but yes, you can push it up the stack.
Unfortunately the patch still won't be sufficient then, see below :)

> 
> > 
> > If both primary and standby devices are present, a legacy guest without
> > support for the feature might see two devices with same mac and get
> > confused.
> 
> Yes. Isn't this possible today when a VM is started with a mis-configured domain
> XML file that passes a virtio-net interface and a VF with the same MAC?

Yes, so if you misconfigure the hypervisor then your guest won't run.
Not sure what you are trying to say by this though.

> 
> > 
> > I think that we should only make primary visible after guest acked the
> > backup feature bit.
> 
> The primary can be plugged/unplugged at any time by the management layer.
> So i guess this needs to be handled in the libvirt layer.

management just wants to give the primary to guest and later take it back,
it really does not care about the details of the process,
so I don't see what does pushing it up the stack buy you.

So I don't think it *needs* to be done in libvirt. It probably can if you
add a bunch of hooks so it knows whenever vm reboots, driver binds and
unbinds from device, and can check that backup flag was set.
If you are pushing for a setup like that please get a buy-in
from libvirt maintainers or better write a patch.

> > 
> > And on reset or when backup is cleared in some other way, unplug the
> > primary.
> > 
> > Something like the below will do the job:
> > 
> > Primary device is added with a special "primary-failover" flag.
> 
> Are you suggesting an extension to Qemu to add another flag for
> primary device too?

Exactly.

> > A virtual machine is then initialized with just a standby virtio
> > device. Primary is not yet added.
> > 
> > Later QEMU detects that guest driver device set DRIVER_OK.
> > It then exposes the primary device to the guest, and triggers
> > a device addition event (hot-plug event) for it.
> > 
> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> > to remove the primary driver.  In particular, if QEMU detects guest
> > re-initialization (e.g. by detecting guest reset) it immediately removes
> > the primary device.
> > 
> > We can move some of this code to management as well, architecturally it
> > does not make too much sense but it might be easier implementation-wise.
> > 
> > HTH
> > 
> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> > > Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
> > > https://patchwork.ozlabs.org/cover/920005/
> > > 
> > > 
> > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > act as a standby for another device with the same MAC address.
> > > > 
> > > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > > by default as i am using libvirt to start the VMs.
> > > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > > XML file?
> > > > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > ---
> > > >    hw/net/virtio-net.c                         | 2 ++
> > > >    include/standard-headers/linux/virtio_net.h | 3 +++
> > > >    2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > index 90502fca7c..38b3140670 100644
> > > > --- a/hw/net/virtio-net.c
> > > > +++ b/hw/net/virtio-net.c
> > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > >                         true),
> > > >        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > >        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > > +                      false),
> > > >        DEFINE_PROP_END_OF_LIST(),
> > > >    };
> > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > index e9f255ea3f..01ec09684c 100644
> > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >    					 * Steering */
> > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > > +                                         * with the same MAC.
> > > > +                                         */
> > > >    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > > >    #ifndef VIRTIO_NET_NO_LEGACY
Siwei Liu June 5, 2018, 9:16 p.m. UTC | #6
Good to see this discussion going. I share the same feeling that the
decision of plugging the primary (passthrough) should only be made
until guest driver acknowledges DRIVER_OK and _F_STANDBY.
Architecturally this intelligence should be baken to QEMU itself
rather than moving up to management stack, such as libvirt.

A few questions in line below.

On Tue, Jun 5, 2018 at 5:33 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> I don't think this is sufficient.
>
> If both primary and standby devices are present, a legacy guest without
> support for the feature might see two devices with same mac and get
> confused.
>
> I think that we should only make primary visible after guest acked the
> backup feature bit.
>
> And on reset or when backup is cleared in some other way, unplug the
> primary.
>
> Something like the below will do the job:
>
> Primary device is added with a special "primary-failover" flag.

I wonder if you envision this flag as a user interface or some
internal attribute/property to QEMU device?

Since QEMU needs to associate this particular primary-failover device
with a virtio standby instance for checking DRIVER_OK as you indicate
below, I wonder if the group ID thing will be helpful to set
"primary-failover" flag internally without having to add another
option in CLI?

> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
>
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.

Sounds OK. While there might be a small window the guest already
starts to use virtio interface before the passthrough gets plugged in,
I think that should be fine.

Another option is to wait until the primary appears and gets
registered in the guest, so it can bring up the upper failover
interface.

>
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.

Agreed.

For legacy guest, user might prefer seeing a single passthrough device
rather than a virtio device. I would hope there's an option to allow
it to happen, instead of removing all passthrough devices.

Regards,
-Siwei

>
> We can move some of this code to management as well, architecturally it
> does not make too much sense but it might be easier implementation-wise.
>
> HTH
>
> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>> https://patchwork.ozlabs.org/cover/920005/
>>
>>
>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> > This feature bit can be used by hypervisor to indicate virtio_net device to
>> > act as a standby for another device with the same MAC address.
>> >
>> > I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> > by default as i am using libvirt to start the VMs.
>> > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> > XML file?
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> >   hw/net/virtio-net.c                         | 2 ++
>> >   include/standard-headers/linux/virtio_net.h | 3 +++
>> >   2 files changed, 5 insertions(+)
>> >
>> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> > index 90502fca7c..38b3140670 100644
>> > --- a/hw/net/virtio-net.c
>> > +++ b/hw/net/virtio-net.c
>> > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>> >                        true),
>> >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>> >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> > +                      false),
>> >       DEFINE_PROP_END_OF_LIST(),
>> >   };
>> > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> > index e9f255ea3f..01ec09684c 100644
>> > --- a/include/standard-headers/linux/virtio_net.h
>> > +++ b/include/standard-headers/linux/virtio_net.h
>> > @@ -57,6 +57,9 @@
>> >                                      * Steering */
>> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23     /* Set MAC address */
>> > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>> > +                                         * with the same MAC.
>> > +                                         */
>> >   #define VIRTIO_NET_F_SPEED_DUPLEX 63      /* Device set linkspeed and duplex */
>> >   #ifndef VIRTIO_NET_NO_LEGACY
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
Michael S. Tsirkin June 5, 2018, 9:32 p.m. UTC | #7
On Tue, Jun 05, 2018 at 02:16:44PM -0700, Siwei Liu wrote:
> Good to see this discussion going. I share the same feeling that the
> decision of plugging the primary (passthrough) should only be made
> until guest driver acknowledges DRIVER_OK and _F_STANDBY.
> Architecturally this intelligence should be baken to QEMU itself
> rather than moving up to management stack, such as libvirt.
> 
> A few questions in line below.
> 
> On Tue, Jun 5, 2018 at 5:33 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > I don't think this is sufficient.
> >
> > If both primary and standby devices are present, a legacy guest without
> > support for the feature might see two devices with same mac and get
> > confused.
> >
> > I think that we should only make primary visible after guest acked the
> > backup feature bit.
> >
> > And on reset or when backup is cleared in some other way, unplug the
> > primary.
> >
> > Something like the below will do the job:
> >
> > Primary device is added with a special "primary-failover" flag.
> 
> I wonder if you envision this flag as a user interface or some
> internal attribute/property to QEMU device?
> 
> Since QEMU needs to associate this particular primary-failover device
> with a virtio standby instance for checking DRIVER_OK as you indicate
> below, I wonder if the group ID thing will be helpful to set
> "primary-failover" flag internally without having to add another
> option in CLI?

I haven't thought about it but it's an option.

> > A virtual machine is then initialized with just a standby virtio
> > device. Primary is not yet added.
> >
> > Later QEMU detects that guest driver device set DRIVER_OK.
> > It then exposes the primary device to the guest, and triggers
> > a device addition event (hot-plug event) for it.
> 
> Sounds OK. While there might be a small window the guest already
> starts to use virtio interface before the passthrough gets plugged in,
> I think that should be fine.
> 
> Another option is to wait until the primary appears and gets
> registered in the guest, so it can bring up the upper failover
> interface.

We can't be sure this will ever happen, can we? We might be asked to
migrate at any time.

> >
> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> > to remove the primary driver.  In particular, if QEMU detects guest
> > re-initialization (e.g. by detecting guest reset) it immediately removes
> > the primary device.
> 
> Agreed.
> 
> For legacy guest, user might prefer seeing a single passthrough device
> rather than a virtio device. I would hope there's an option to allow
> it to happen, instead of removing all passthrough devices.
> 
> Regards,
> -Siwei

I don't see a way to make it work since then you can't migrate, can you?

The way I see it, if you don't need migration, just use PT without
failover.  If you do, then you either use virtio or failover if guest
supports it.

> >
> > We can move some of this code to management as well, architecturally it
> > does not make too much sense but it might be easier implementation-wise.
> >
> > HTH
> >
> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> >> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
> >> https://patchwork.ozlabs.org/cover/920005/
> >>
> >>
> >> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> >> > This feature bit can be used by hypervisor to indicate virtio_net device to
> >> > act as a standby for another device with the same MAC address.
> >> >
> >> > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> >> > by default as i am using libvirt to start the VMs.
> >> > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> >> > XML file?
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---
> >> >   hw/net/virtio-net.c                         | 2 ++
> >> >   include/standard-headers/linux/virtio_net.h | 3 +++
> >> >   2 files changed, 5 insertions(+)
> >> >
> >> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> >> > index 90502fca7c..38b3140670 100644
> >> > --- a/hw/net/virtio-net.c
> >> > +++ b/hw/net/virtio-net.c
> >> > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> >> >                        true),
> >> >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> >> >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> >> > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> >> > +                      false),
> >> >       DEFINE_PROP_END_OF_LIST(),
> >> >   };
> >> > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> >> > index e9f255ea3f..01ec09684c 100644
> >> > --- a/include/standard-headers/linux/virtio_net.h
> >> > +++ b/include/standard-headers/linux/virtio_net.h
> >> > @@ -57,6 +57,9 @@
> >> >                                      * Steering */
> >> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23     /* Set MAC address */
> >> > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> >> > +                                         * with the same MAC.
> >> > +                                         */
> >> >   #define VIRTIO_NET_F_SPEED_DUPLEX 63      /* Device set linkspeed and duplex */
> >> >   #ifndef VIRTIO_NET_NO_LEGACY
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >
Siwei Liu June 5, 2018, 10:09 p.m. UTC | #8
On Tue, Jun 5, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 02:16:44PM -0700, Siwei Liu wrote:
>> Good to see this discussion going. I share the same feeling that the
>> decision of plugging the primary (passthrough) should only be made
>> until guest driver acknowledges DRIVER_OK and _F_STANDBY.
>> Architecturally this intelligence should be baken to QEMU itself
>> rather than moving up to management stack, such as libvirt.
>>
>> A few questions in line below.
>>
>> On Tue, Jun 5, 2018 at 5:33 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > I don't think this is sufficient.
>> >
>> > If both primary and standby devices are present, a legacy guest without
>> > support for the feature might see two devices with same mac and get
>> > confused.
>> >
>> > I think that we should only make primary visible after guest acked the
>> > backup feature bit.
>> >
>> > And on reset or when backup is cleared in some other way, unplug the
>> > primary.
>> >
>> > Something like the below will do the job:
>> >
>> > Primary device is added with a special "primary-failover" flag.
>>
>> I wonder if you envision this flag as a user interface or some
>> internal attribute/property to QEMU device?
>>
>> Since QEMU needs to associate this particular primary-failover device
>> with a virtio standby instance for checking DRIVER_OK as you indicate
>> below, I wonder if the group ID thing will be helpful to set
>> "primary-failover" flag internally without having to add another
>> option in CLI?
>
> I haven't thought about it but it's an option.
>
>> > A virtual machine is then initialized with just a standby virtio
>> > device. Primary is not yet added.
>> >
>> > Later QEMU detects that guest driver device set DRIVER_OK.
>> > It then exposes the primary device to the guest, and triggers
>> > a device addition event (hot-plug event) for it.
>>
>> Sounds OK. While there might be a small window the guest already
>> starts to use virtio interface before the passthrough gets plugged in,
>> I think that should be fine.
>>
>> Another option is to wait until the primary appears and gets
>> registered in the guest, so it can bring up the upper failover
>> interface.
>
> We can't be sure this will ever happen, can we? We might be asked to
> migrate at any time.

Right. For that to work, virtio driver needs a complicated state
tracking hand-in-hand with the host for VF's hot plugging activity. If
hot plugging is defered due to migration activity the failover
interface can be brought up immediately.

As said, this is just an option. I don't have strong preference here.

>
>> >
>> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> > to remove the primary driver.  In particular, if QEMU detects guest
>> > re-initialization (e.g. by detecting guest reset) it immediately removes
>> > the primary device.
>>
>> Agreed.
>>
>> For legacy guest, user might prefer seeing a single passthrough device
>> rather than a virtio device. I would hope there's an option to allow
>> it to happen, instead of removing all passthrough devices.
>>
>> Regards,
>> -Siwei
>
> I don't see a way to make it work since then you can't migrate, can you?

The thing is cloud service provider might prefer sticking to the same
level of service agreement (SLA) of keeping VF over migration,
particularly when guest OS or kernel does not have the support.The
footprint in guest OS support might NOT be prevailing in the early
days. There's no point the cloud vendor can switch back and forth, as
there's no way for hypervisor to detect what kernel version even OS
the guest will be running ahead of time, until the guest gets fully
loaded and boots up, right?

-Siwei

>
> The way I see it, if you don't need migration, just use PT without
> failover.  If you do, then you either use virtio or failover if guest
> supports it.
>
>> >
>> > We can move some of this code to management as well, architecturally it
>> > does not make too much sense but it might be easier implementation-wise.
>> >
>> > HTH
>> >
>> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> >> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>> >> https://patchwork.ozlabs.org/cover/920005/
>> >>
>> >>
>> >> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> >> > This feature bit can be used by hypervisor to indicate virtio_net device to
>> >> > act as a standby for another device with the same MAC address.
>> >> >
>> >> > I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> >> > by default as i am using libvirt to start the VMs.
>> >> > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> >> > XML file?
>> >> >
>> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > ---
>> >> >   hw/net/virtio-net.c                         | 2 ++
>> >> >   include/standard-headers/linux/virtio_net.h | 3 +++
>> >> >   2 files changed, 5 insertions(+)
>> >> >
>> >> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> >> > index 90502fca7c..38b3140670 100644
>> >> > --- a/hw/net/virtio-net.c
>> >> > +++ b/hw/net/virtio-net.c
>> >> > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>> >> >                        true),
>> >> >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>> >> >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> >> > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> >> > +                      false),
>> >> >       DEFINE_PROP_END_OF_LIST(),
>> >> >   };
>> >> > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> >> > index e9f255ea3f..01ec09684c 100644
>> >> > --- a/include/standard-headers/linux/virtio_net.h
>> >> > +++ b/include/standard-headers/linux/virtio_net.h
>> >> > @@ -57,6 +57,9 @@
>> >> >                                      * Steering */
>> >> >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23     /* Set MAC address */
>> >> > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>> >> > +                                         * with the same MAC.
>> >> > +                                         */
>> >> >   #define VIRTIO_NET_F_SPEED_DUPLEX 63      /* Device set linkspeed and duplex */
>> >> >   #ifndef VIRTIO_NET_NO_LEGACY
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>> >
Jason Wang June 6, 2018, 2:29 a.m. UTC | #9
On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
> I don't think this is sufficient.
>
> If both primary and standby devices are present, a legacy guest without
> support for the feature might see two devices with same mac and get
> confused.
>
> I think that we should only make primary visible after guest acked the
> backup feature bit.

I think we want exactly the reverse? E.g fail the negotiation when guest 
does not ack backup feature.

Otherwise legacy guest won't even have the chance to see primary device 
in the guest.

>
> And on reset or when backup is cleared in some other way, unplug the
> primary.

What if guest does not support hotplug?

>
> Something like the below will do the job:
>
> Primary device is added with a special "primary-failover" flag.
> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.

A question is how to do the matching? Qemu knows nothing about e.g mac 
address of a pass-through device I believe?

>
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.

Do you mean we won't have primary device in the initial qemu cli?

>
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.

I believe guest failover module should handle this gracefully?

Thanks

>
> We can move some of this code to management as well, architecturally it
> does not make too much sense but it might be easier implementation-wise.
>
> HTH
>
> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>> https://patchwork.ozlabs.org/cover/920005/
>>
>>
>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>> act as a standby for another device with the same MAC address.
>>>
>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>> by default as i am using libvirt to start the VMs.
>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>> XML file?
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> ---
>>>    hw/net/virtio-net.c                         | 2 ++
>>>    include/standard-headers/linux/virtio_net.h | 3 +++
>>>    2 files changed, 5 insertions(+)
>>>
>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>> index 90502fca7c..38b3140670 100644
>>> --- a/hw/net/virtio-net.c
>>> +++ b/hw/net/virtio-net.c
>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>                         true),
>>>        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>> +                      false),
>>>        DEFINE_PROP_END_OF_LIST(),
>>>    };
>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>> index e9f255ea3f..01ec09684c 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> +++ b/include/standard-headers/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>    					 * Steering */
>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>> +                                         * with the same MAC.
>>> +                                         */
>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>    #ifndef VIRTIO_NET_NO_LEGACY
Samudrala, Sridhar June 6, 2018, 6:17 p.m. UTC | #10
On 6/4/2018 7:06 PM, Jason Wang wrote:
>
>
> On 2018年06月05日 09:41, Samudrala, Sridhar wrote:
>> Ping on this patch now that the kernel patches are accepted into 
>> davem's net-next tree.
>> https://patchwork.ozlabs.org/cover/920005/
>>
>>
>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>> This feature bit can be used by hypervisor to indicate virtio_net 
>>> device to
>>> act as a standby for another device with the same MAC address.
>>>
>>> I tested this with a small change to the patch to mark the STANDBY 
>>> feature 'true'
>>> by default as i am using libvirt to start the VMs.
>>> Is there a way to pass the newly added feature bit 'standby' to qemu 
>>> via libvirt
>>> XML file?
>>>
>
> Maybe you can try qemu command line passthrough:
>
> https://libvirt.org/drvqemu.html#qemucommand

It looks like this can be used to pass command line arguments to qemu.
Is it possible to specify a virtio specific attribute via this method?

For ex: to say mrg_rxbuf is off we can add the following line to virtio
section of the domain xml file.
   <host mrg_rxbuf='off'/>

I think libvirt needs to be extended to to support the new 'standby' attribute
via this mechanism.
Adding Liane Stump and libvirt to the CC list.

Michael,
Can we start with getting this patch into Qemu and an update to libvirt to
support the 'standby' feature so that this feature can be enabled via
some scripts/orchestration layer for now.

We could improve this solution by enhancing Qemu to do automatic management of the
addition/deletion of the primary device based on feature negotiation as a later patch.


>
> The patch looks good to me. Let's see if Michael have any comment.
>
> Thanks
>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> ---
>>>   hw/net/virtio-net.c                         | 2 ++
>>>   include/standard-headers/linux/virtio_net.h | 3 +++
>>>   2 files changed, 5 insertions(+)
>>>
>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>> index 90502fca7c..38b3140670 100644
>>> --- a/hw/net/virtio-net.c
>>> +++ b/hw/net/virtio-net.c
>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>                        true),
>>>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, 
>>> SPEED_UNKNOWN),
>>>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, 
>>> VIRTIO_NET_F_STANDBY,
>>> +                      false),
>>>       DEFINE_PROP_END_OF_LIST(),
>>>   };
>>>   diff --git a/include/standard-headers/linux/virtio_net.h 
>>> b/include/standard-headers/linux/virtio_net.h
>>> index e9f255ea3f..01ec09684c 100644
>>> --- a/include/standard-headers/linux/virtio_net.h
>>> +++ b/include/standard-headers/linux/virtio_net.h
>>> @@ -57,6 +57,9 @@
>>>                        * Steering */
>>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>>>   +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for 
>>> another device
>>> +                                         * with the same MAC.
>>> +                                         */
>>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set linkspeed 
>>> and duplex */
>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>
>
Ján Tomko June 6, 2018, 6:52 p.m. UTC | #11
On Wed, Jun 06, 2018 at 11:17:36AM -0700, Samudrala, Sridhar wrote:
>On 6/4/2018 7:06 PM, Jason Wang wrote:
>>
>>
>> On 2018年06月05日 09:41, Samudrala, Sridhar wrote:
>>> Ping on this patch now that the kernel patches are accepted into
>>> davem's net-next tree.
>>> https://patchwork.ozlabs.org/cover/920005/
>>>
>>>
>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>> device to
>>>> act as a standby for another device with the same MAC address.
>>>>
>>>> I tested this with a small change to the patch to mark the STANDBY
>>>> feature 'true'
>>>> by default as i am using libvirt to start the VMs.
>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>> via libvirt
>>>> XML file?
>>>>
>>
>> Maybe you can try qemu command line passthrough:
>>
>> https://libvirt.org/drvqemu.html#qemucommand
>
>It looks like this can be used to pass command line arguments to qemu.
>Is it possible to specify a virtio specific attribute via this method?
>

Yes, for testing purposes you should be able to do this via using QEMU's
-set command line argument:
http://blog.vmsplice.net/2011/04/how-to-pass-qemu-command-line-options.html
i.e.:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  ...
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.net0.standby=on'/>
  </qemu:commandline>
</domain>


>For ex: to say mrg_rxbuf is off we can add the following line to virtio
>section of the domain xml file.
>   <host mrg_rxbuf='off'/>
>
>I think libvirt needs to be extended to to support the new 'standby' attribute
>via this mechanism.
>Adding Liane Stump and libvirt to the CC list.

*Laine

>
>Michael,
>Can we start with getting this patch into Qemu and an update to libvirt to
>support the 'standby' feature so that this feature can be enabled via
>some scripts/orchestration layer for now.
>
>We could improve this solution by enhancing Qemu to do automatic management of the
>addition/deletion of the primary device based on feature negotiation as a later patch.
>

If that means the libvirt attribute would no longer be needed, I don't
see the reason to add it to libvirt in the first place.

Jano
Michael S. Tsirkin June 6, 2018, 6:53 p.m. UTC | #12
On Wed, Jun 06, 2018 at 11:17:36AM -0700, Samudrala, Sridhar wrote:
> On 6/4/2018 7:06 PM, Jason Wang wrote:
> > 
> > 
> > On 2018年06月05日 09:41, Samudrala, Sridhar wrote:
> > > Ping on this patch now that the kernel patches are accepted into
> > > davem's net-next tree.
> > > https://patchwork.ozlabs.org/cover/920005/
> > > 
> > > 
> > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > > > This feature bit can be used by hypervisor to indicate
> > > > virtio_net device to
> > > > act as a standby for another device with the same MAC address.
> > > > 
> > > > I tested this with a small change to the patch to mark the
> > > > STANDBY feature 'true'
> > > > by default as i am using libvirt to start the VMs.
> > > > Is there a way to pass the newly added feature bit 'standby' to
> > > > qemu via libvirt
> > > > XML file?
> > > > 
> > 
> > Maybe you can try qemu command line passthrough:
> > 
> > https://libvirt.org/drvqemu.html#qemucommand
> 
> It looks like this can be used to pass command line arguments to qemu.
> Is it possible to specify a virtio specific attribute via this method?
> 
> For ex: to say mrg_rxbuf is off we can add the following line to virtio
> section of the domain xml file.
>   <host mrg_rxbuf='off'/>
> 
> I think libvirt needs to be extended to to support the new 'standby' attribute
> via this mechanism.
> Adding Liane Stump and libvirt to the CC list.
> 
> Michael,
> Can we start with getting this patch into Qemu and an update to libvirt to
> support the 'standby' feature so that this feature can be enabled via
> some scripts/orchestration layer for now.

Unfortunately we don't give the hypothetical orchestration layer
enough info about driver being bound, so it does not know
when is it safe to add a primary.  And a similar issue around reset.

We could add events for that which would be a small patch,
but the issue then is that guest can trigger a storm of
these events.

> We could improve this solution by enhancing Qemu to do automatic management of the
> addition/deletion of the primary device based on feature negotiation as a later patch.

I'm not sure what the point of the back and forth would be though.
Typically after people invest in a specific config and get it working,
it's very hard to move them to another solution.

> 
> > 
> > The patch looks good to me. Let's see if Michael have any comment.
> > 
> > Thanks
> > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > ---
> > > >   hw/net/virtio-net.c                         | 2 ++
> > > >   include/standard-headers/linux/virtio_net.h | 3 +++
> > > >   2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > index 90502fca7c..38b3140670 100644
> > > > --- a/hw/net/virtio-net.c
> > > > +++ b/hw/net/virtio-net.c
> > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > >                        true),
> > > >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
> > > > SPEED_UNKNOWN),
> > > >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
> > > > VIRTIO_NET_F_STANDBY,
> > > > +                      false),
> > > >       DEFINE_PROP_END_OF_LIST(),
> > > >   };
> > > >   diff --git a/include/standard-headers/linux/virtio_net.h
> > > > b/include/standard-headers/linux/virtio_net.h
> > > > index e9f255ea3f..01ec09684c 100644
> > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >                        * Steering */
> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
> > > >   +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for
> > > > another device
> > > > +                                         * with the same MAC.
> > > > +                                         */
> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set
> > > > linkspeed and duplex */
> > > >     #ifndef VIRTIO_NET_NO_LEGACY
> > > 
> >
Samudrala, Sridhar June 6, 2018, 7:39 p.m. UTC | #13
On 6/6/2018 11:52 AM, Ján Tomko wrote:
> On Wed, Jun 06, 2018 at 11:17:36AM -0700, Samudrala, Sridhar wrote:
>> On 6/4/2018 7:06 PM, Jason Wang wrote:
>>>
>>>
>>> On 2018年06月05日 09:41, Samudrala, Sridhar wrote:
>>>> Ping on this patch now that the kernel patches are accepted into
>>>> davem's net-next tree.
>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>
>>>>
>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>> device to
>>>>> act as a standby for another device with the same MAC address.
>>>>>
>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>> feature 'true'
>>>>> by default as i am using libvirt to start the VMs.
>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>> via libvirt
>>>>> XML file?
>>>>>
>>>
>>> Maybe you can try qemu command line passthrough:
>>>
>>> https://libvirt.org/drvqemu.html#qemucommand
>>
>> It looks like this can be used to pass command line arguments to qemu.
>> Is it possible to specify a virtio specific attribute via this method?
>>
>
> Yes, for testing purposes you should be able to do this via using QEMU's
> -set command line argument:
> http://blog.vmsplice.net/2011/04/how-to-pass-qemu-command-line-options.html 
>
> i.e.:
>
> <domain type='kvm' 
> xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
>  ...
>  <qemu:commandline>
>    <qemu:arg value='-set'/>
>    <qemu:arg value='device.net0.standby=on'/>
>  </qemu:commandline>
> </domain>

Thanks. Will try this.

>
>> For ex: to say mrg_rxbuf is off we can add the following line to virtio
>> section of the domain xml file.
>>   <host mrg_rxbuf='off'/>
>>
>> I think libvirt needs to be extended to to support the new 'standby' 
>> attribute
>> via this mechanism.
>> Adding Liane Stump and libvirt to the CC list.
>
> *Laine
>
>>
>> Michael,
>> Can we start with getting this patch into Qemu and an update to 
>> libvirt to
>> support the 'standby' feature so that this feature can be enabled via
>> some scripts/orchestration layer for now.
>>
>> We could improve this solution by enhancing Qemu to do automatic 
>> management of the
>> addition/deletion of the primary device based on feature negotiation 
>> as a later patch.
>>
>
> If that means the libvirt attribute would no longer be needed, I don't
> see the reason to add it to libvirt in the first place.

I think we still need this attribute supported via libvirt so that a user/admin
can enable this feature via domain XML specification.
Michael S. Tsirkin June 11, 2018, 5:26 p.m. UTC | #14
On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a standby for another device with the same MAC address.
> 
> I tested this with a small change to the patch to mark the STANDBY feature 'true'
> by default as i am using libvirt to start the VMs.
> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> XML file?
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>

So I do not think we can commit to this interface: we
really need to control visibility of the primary device.

However just for testing purposes, we could add a non-stable
interface "x-standby" with the understanding that as any
x- prefix it's unstable and will be changed down the road,
likely in the next release.


> ---
>  hw/net/virtio-net.c                         | 2 ++
>  include/standard-headers/linux/virtio_net.h | 3 +++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 90502fca7c..38b3140670 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>                       true),
>      DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>      DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> +                      false),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index e9f255ea3f..01ec09684c 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>  
> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> +                                         * with the same MAC.
> +                                         */
>  #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>  
>  #ifndef VIRTIO_NET_NO_LEGACY
> -- 
> 2.14.3
Jason Wang June 12, 2018, 1:54 a.m. UTC | #15
On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> This feature bit can be used by hypervisor to indicate virtio_net device to
>> act as a standby for another device with the same MAC address.
>>
>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>> by default as i am using libvirt to start the VMs.
>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>> XML file?
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> So I do not think we can commit to this interface: we
> really need to control visibility of the primary device.

The problem is legacy guest won't use primary device at all if we do this.

How about control the visibility of standby device?

Thanks

> However just for testing purposes, we could add a non-stable
> interface "x-standby" with the understanding that as any
> x- prefix it's unstable and will be changed down the road,
> likely in the next release.
>
>
>> ---
>>   hw/net/virtio-net.c                         | 2 ++
>>   include/standard-headers/linux/virtio_net.h | 3 +++
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 90502fca7c..38b3140670 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>                        true),
>>       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>> +                      false),
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>>   
>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>> index e9f255ea3f..01ec09684c 100644
>> --- a/include/standard-headers/linux/virtio_net.h
>> +++ b/include/standard-headers/linux/virtio_net.h
>> @@ -57,6 +57,9 @@
>>   					 * Steering */
>>   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>   
>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>> +                                         * with the same MAC.
>> +                                         */
>>   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>   
>>   #ifndef VIRTIO_NET_NO_LEGACY
>> -- 
>> 2.14.3
Michael S. Tsirkin June 12, 2018, 2:17 a.m. UTC | #16
On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> > On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > act as a standby for another device with the same MAC address.
> > > 
> > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > by default as i am using libvirt to start the VMs.
> > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > XML file?
> > > 
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > So I do not think we can commit to this interface: we
> > really need to control visibility of the primary device.
> 
> The problem is legacy guest won't use primary device at all if we do this.

And that's by design - I think it's the only way to ensure the
legacy guest isn't confused.

> How about control the visibility of standby device?
> 
> Thanks

standy the always there to guarantee no downtime.

> > However just for testing purposes, we could add a non-stable
> > interface "x-standby" with the understanding that as any
> > x- prefix it's unstable and will be changed down the road,
> > likely in the next release.
> > 
> > 
> > > ---
> > >   hw/net/virtio-net.c                         | 2 ++
> > >   include/standard-headers/linux/virtio_net.h | 3 +++
> > >   2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > index 90502fca7c..38b3140670 100644
> > > --- a/hw/net/virtio-net.c
> > > +++ b/hw/net/virtio-net.c
> > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > >                        true),
> > >       DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > >       DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > +                      false),
> > >       DEFINE_PROP_END_OF_LIST(),
> > >   };
> > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > index e9f255ea3f..01ec09684c 100644
> > > --- a/include/standard-headers/linux/virtio_net.h
> > > +++ b/include/standard-headers/linux/virtio_net.h
> > > @@ -57,6 +57,9 @@
> > >   					 * Steering */
> > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > +                                         * with the same MAC.
> > > +                                         */
> > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > >   #ifndef VIRTIO_NET_NO_LEGACY
> > > -- 
> > > 2.14.3
Samudrala, Sridhar June 12, 2018, 5:02 a.m. UTC | #17
On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>
>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>> act as a standby for another device with the same MAC address.
>>>>
>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>> by default as i am using libvirt to start the VMs.
>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>> XML file?
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>> So I do not think we can commit to this interface: we
>>> really need to control visibility of the primary device.
>> The problem is legacy guest won't use primary device at all if we do this.
> And that's by design - I think it's the only way to ensure the
> legacy guest isn't confused.

Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
of the primary device. The primary device may not be specified as an argument to Qemu. It
may be plugged in later.
The cloud service provider is providing a feature that enables low latency datapath and live
migration capability.
A tenant can use this feature only if he is running a VM that has virtio-net with failover support.

I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
an upper layer indicating if the STANDBY feature is successfully negotiated or not.
The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
If VF is successfully hot plugged, virtio-net link should be disabled.


>
>> How about control the visibility of standby device?
>>
>> Thanks
> standy the always there to guarantee no downtime.
>
>>> However just for testing purposes, we could add a non-stable
>>> interface "x-standby" with the understanding that as any
>>> x- prefix it's unstable and will be changed down the road,
>>> likely in the next release.
>>>
>>>
>>>> ---
>>>>    hw/net/virtio-net.c                         | 2 ++
>>>>    include/standard-headers/linux/virtio_net.h | 3 +++
>>>>    2 files changed, 5 insertions(+)
>>>>
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 90502fca7c..38b3140670 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>                         true),
>>>>        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>> +                      false),
>>>>        DEFINE_PROP_END_OF_LIST(),
>>>>    };
>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>> index e9f255ea3f..01ec09684c 100644
>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>> @@ -57,6 +57,9 @@
>>>>    					 * Steering */
>>>>    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>> +                                         * with the same MAC.
>>>> +                                         */
>>>>    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>    #ifndef VIRTIO_NET_NO_LEGACY
>>>> -- 
>>>> 2.14.3
Michael S. Tsirkin June 12, 2018, 11:34 a.m. UTC | #18
On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
> > > 
> > > On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> > > > On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> > > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > > act as a standby for another device with the same MAC address.
> > > > > 
> > > > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > > > by default as i am using libvirt to start the VMs.
> > > > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > > > XML file?
> > > > > 
> > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > So I do not think we can commit to this interface: we
> > > > really need to control visibility of the primary device.
> > > The problem is legacy guest won't use primary device at all if we do this.
> > And that's by design - I think it's the only way to ensure the
> > legacy guest isn't confused.
> 
> Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
> of the primary device. The primary device may not be specified as an argument to Qemu. It
> may be plugged in later.
> The cloud service provider is providing a feature that enables low latency datapath and live
> migration capability.
> A tenant can use this feature only if he is running a VM that has virtio-net with failover support.

Well live migration is there already. The new feature is low latency
data path.

And it's the guest that needs failover support not the VM.


> I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
> an upper layer indicating if the STANDBY feature is successfully negotiated or not.
> The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
> If VF is successfully hot plugged, virtio-net link should be disabled.

Did you even talk to upper layer management about it?
Just list the steps they need to do and you will see
that's a lot of machinery to manage by the upper layer.

What do we gain in flexibility? As far as I can see the
only gain is some resources saved for legacy VMs.

That's not a lot as tenant of the upper layer probably already has
at least a hunch that it's a new guest otherwise
why bother specifying the feature at all - you
save even more resources without it.




> 
> > 
> > > How about control the visibility of standby device?
> > > 
> > > Thanks
> > standy the always there to guarantee no downtime.
> > 
> > > > However just for testing purposes, we could add a non-stable
> > > > interface "x-standby" with the understanding that as any
> > > > x- prefix it's unstable and will be changed down the road,
> > > > likely in the next release.
> > > > 
> > > > 
> > > > > ---
> > > > >    hw/net/virtio-net.c                         | 2 ++
> > > > >    include/standard-headers/linux/virtio_net.h | 3 +++
> > > > >    2 files changed, 5 insertions(+)
> > > > > 
> > > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > > index 90502fca7c..38b3140670 100644
> > > > > --- a/hw/net/virtio-net.c
> > > > > +++ b/hw/net/virtio-net.c
> > > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > > >                         true),
> > > > >        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > > >        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > > > +                      false),
> > > > >        DEFINE_PROP_END_OF_LIST(),
> > > > >    };
> > > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > > index e9f255ea3f..01ec09684c 100644
> > > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > > @@ -57,6 +57,9 @@
> > > > >    					 * Steering */
> > > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > > > +                                         * with the same MAC.
> > > > > +                                         */
> > > > >    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > > > >    #ifndef VIRTIO_NET_NO_LEGACY
> > > > > -- 
> > > > > 2.14.3
Michael S. Tsirkin June 12, 2018, 11:47 a.m. UTC | #19
On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
> The thing is cloud service provider might prefer sticking to the same
> level of service agreement (SLA) of keeping VF over migration,

That requirement is trivially satisfied by just a single VF :) If your
SLA does not require live migration, you should do just that.
Michael S. Tsirkin June 12, 2018, 11:54 a.m. UTC | #20
On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
> > I don't think this is sufficient.
> > 
> > If both primary and standby devices are present, a legacy guest without
> > support for the feature might see two devices with same mac and get
> > confused.
> > 
> > I think that we should only make primary visible after guest acked the
> > backup feature bit.
> 
> I think we want exactly the reverse? E.g fail the negotiation when guest
> does not ack backup feature.
> 
> Otherwise legacy guest won't even have the chance to see primary device in
> the guest.

That's by design.

> > 
> > And on reset or when backup is cleared in some other way, unplug the
> > primary.
> 
> What if guest does not support hotplug?

It shouldn't ack the backup feature then and it's a good point.
We should both document this and check kernel config has
hotplug support. Sridhar could you take a look pls?

> > 
> > Something like the below will do the job:
> > 
> > Primary device is added with a special "primary-failover" flag.
> > A virtual machine is then initialized with just a standby virtio
> > device. Primary is not yet added.
> 
> A question is how to do the matching? Qemu knows nothing about e.g mac
> address of a pass-through device I believe?

Supply a flag to the VFIO when it's added, this way QEMU will know.

> > 
> > Later QEMU detects that guest driver device set DRIVER_OK.
> > It then exposes the primary device to the guest, and triggers
> > a device addition event (hot-plug event) for it.
> 
> Do you mean we won't have primary device in the initial qemu cli?

No, that's not what I mean.

I mean management will supply a flag to VFIO and then


- VFIO defers exposing
primary to guest until guest acks the feature bit.
- When we see guest ack, initiate hotplug.
- On reboot, hide it again.
- On reset without reboot, request hot-unplug and on eject hide it again.


> > 
> > If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> > to remove the primary driver.  In particular, if QEMU detects guest
> > re-initialization (e.g. by detecting guest reset) it immediately removes
> > the primary device.
> 
> I believe guest failover module should handle this gracefully?

It can't control enumeration order, if primary is enumerated before
standby then guest will load its driver and it's too late
when failover driver is loaded.

> Thanks
> 
> > 
> > We can move some of this code to management as well, architecturally it
> > does not make too much sense but it might be easier implementation-wise.
> > 
> > HTH
> > 
> > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> > > Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
> > > https://patchwork.ozlabs.org/cover/920005/
> > > 
> > > 
> > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> > > > act as a standby for another device with the same MAC address.
> > > > 
> > > > I tested this with a small change to the patch to mark the STANDBY feature 'true'
> > > > by default as i am using libvirt to start the VMs.
> > > > Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
> > > > XML file?
> > > > 
> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > ---
> > > >    hw/net/virtio-net.c                         | 2 ++
> > > >    include/standard-headers/linux/virtio_net.h | 3 +++
> > > >    2 files changed, 5 insertions(+)
> > > > 
> > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > index 90502fca7c..38b3140670 100644
> > > > --- a/hw/net/virtio-net.c
> > > > +++ b/hw/net/virtio-net.c
> > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > >                         true),
> > > >        DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
> > > >        DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
> > > > +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
> > > > +                      false),
> > > >        DEFINE_PROP_END_OF_LIST(),
> > > >    };
> > > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> > > > index e9f255ea3f..01ec09684c 100644
> > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > @@ -57,6 +57,9 @@
> > > >    					 * Steering */
> > > >    #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
> > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
> > > > +                                         * with the same MAC.
> > > > +                                         */
> > > >    #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
> > > >    #ifndef VIRTIO_NET_NO_LEGACY
Samudrala, Sridhar June 13, 2018, 12:08 a.m. UTC | #21
On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>>> act as a standby for another device with the same MAC address.
>>>>>>
>>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>>> by default as i am using libvirt to start the VMs.
>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>>> XML file?
>>>>>>
>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> So I do not think we can commit to this interface: we
>>>>> really need to control visibility of the primary device.
>>>> The problem is legacy guest won't use primary device at all if we do this.
>>> And that's by design - I think it's the only way to ensure the
>>> legacy guest isn't confused.
>> Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
>> of the primary device. The primary device may not be specified as an argument to Qemu. It
>> may be plugged in later.
>> The cloud service provider is providing a feature that enables low latency datapath and live
>> migration capability.
>> A tenant can use this feature only if he is running a VM that has virtio-net with failover support.
> Well live migration is there already. The new feature is low latency
> data path.

we get live migration with just virtio.  But I meant live migration with VF as
primary device.

>
> And it's the guest that needs failover support not the VM.

Isn't guest and VM synonymous?


>
>
>> I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
>> an upper layer indicating if the STANDBY feature is successfully negotiated or not.
>> The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
>> If VF is successfully hot plugged, virtio-net link should be disabled.
> Did you even talk to upper layer management about it?
> Just list the steps they need to do and you will see
> that's a lot of machinery to manage by the upper layer.
>
> What do we gain in flexibility? As far as I can see the
> only gain is some resources saved for legacy VMs.
>
> That's not a lot as tenant of the upper layer probably already has
> at least a hunch that it's a new guest otherwise
> why bother specifying the feature at all - you
> save even more resources without it.
>

I am not all that familiar with how Qemu manages network devices. If we can do all the
required management of the primary/standby devices within Qemu, that is definitely a better
approach without upper layer involvement.


>
>
>>>> How about control the visibility of standby device?
>>>>
>>>> Thanks
>>> standy the always there to guarantee no downtime.
>>>
>>>>> However just for testing purposes, we could add a non-stable
>>>>> interface "x-standby" with the understanding that as any
>>>>> x- prefix it's unstable and will be changed down the road,
>>>>> likely in the next release.
>>>>>
>>>>>
>>>>>> ---
>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>     2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 90502fca7c..38b3140670 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>                          true),
>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>>> +                      false),
>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>     };
>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>> @@ -57,6 +57,9 @@
>>>>>>     					 * Steering */
>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>>>> +                                         * with the same MAC.
>>>>>> +                                         */
>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>> -- 
>>>>>> 2.14.3
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
Samudrala, Sridhar June 13, 2018, 12:20 a.m. UTC | #22
On 6/12/2018 4:54 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>
>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>> I don't think this is sufficient.
>>>
>>> If both primary and standby devices are present, a legacy guest without
>>> support for the feature might see two devices with same mac and get
>>> confused.
>>>
>>> I think that we should only make primary visible after guest acked the
>>> backup feature bit.
>> I think we want exactly the reverse? E.g fail the negotiation when guest
>> does not ack backup feature.
>>
>> Otherwise legacy guest won't even have the chance to see primary device in
>> the guest.
> That's by design.
>
>>> And on reset or when backup is cleared in some other way, unplug the
>>> primary.
>> What if guest does not support hotplug?
> It shouldn't ack the backup feature then and it's a good point.
> We should both document this and check kernel config has
> hotplug support. Sridhar could you take a look pls?

Right now we select NET_FAILOVER when virtio-net is enabled,  Should we select
PCI_HOTPLUG when NET_FAILOVER is enabled?

>
>>> Something like the below will do the job:
>>>
>>> Primary device is added with a special "primary-failover" flag.
>>> A virtual machine is then initialized with just a standby virtio
>>> device. Primary is not yet added.
>> A question is how to do the matching? Qemu knows nothing about e.g mac
>> address of a pass-through device I believe?
> Supply a flag to the VFIO when it's added, this way QEMU will know.
>
>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>> It then exposes the primary device to the guest, and triggers
>>> a device addition event (hot-plug event) for it.
>> Do you mean we won't have primary device in the initial qemu cli?
> No, that's not what I mean.
>
> I mean management will supply a flag to VFIO and then
>
>
> - VFIO defers exposing
> primary to guest until guest acks the feature bit.
> - When we see guest ack, initiate hotplug.
> - On reboot, hide it again.
> - On reset without reboot, request hot-unplug and on eject hide it again.
>
>
>>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>>> to remove the primary driver.  In particular, if QEMU detects guest
>>> re-initialization (e.g. by detecting guest reset) it immediately removes
>>> the primary device.
>> I believe guest failover module should handle this gracefully?
> It can't control enumeration order, if primary is enumerated before
> standby then guest will load its driver and it's too late
> when failover driver is loaded.

Does it mean that if guest unloads virtio-net driver, Qemu should automatically
unplug the associated primary device?

What about guest unloading/re-loading primary device driver?


>
>> Thanks
>>
>>> We can move some of this code to management as well, architecturally it
>>> does not make too much sense but it might be easier implementation-wise.
>>>
>>> HTH
>>>
>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>
>>>>
>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>> act as a standby for another device with the same MAC address.
>>>>>
>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>> by default as i am using libvirt to start the VMs.
>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>> XML file?
>>>>>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> ---
>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>     2 files changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>> index 90502fca7c..38b3140670 100644
>>>>> --- a/hw/net/virtio-net.c
>>>>> +++ b/hw/net/virtio-net.c
>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>                          true),
>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>> +                      false),
>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>     };
>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>> index e9f255ea3f..01ec09684c 100644
>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>> @@ -57,6 +57,9 @@
>>>>>     					 * Steering */
>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>>> +                                         * with the same MAC.
>>>>> +                                         */
>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
Jason Wang June 13, 2018, 2:38 a.m. UTC | #23
On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>
>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>> I don't think this is sufficient.
>>>
>>> If both primary and standby devices are present, a legacy guest without
>>> support for the feature might see two devices with same mac and get
>>> confused.
>>>
>>> I think that we should only make primary visible after guest acked the
>>> backup feature bit.
>> I think we want exactly the reverse? E.g fail the negotiation when guest
>> does not ack backup feature.
>>
>> Otherwise legacy guest won't even have the chance to see primary device in
>> the guest.
> That's by design.

So management needs to know the capability of guest to set the backup 
feature. This looks a chicken or egg problem to me.

>
>>> And on reset or when backup is cleared in some other way, unplug the
>>> primary.
>> What if guest does not support hotplug?
> It shouldn't ack the backup feature then and it's a good point.
> We should both document this and check kernel config has
> hotplug support. Sridhar could you take a look pls?
>
>>> Something like the below will do the job:
>>>
>>> Primary device is added with a special "primary-failover" flag.
>>> A virtual machine is then initialized with just a standby virtio
>>> device. Primary is not yet added.
>> A question is how to do the matching? Qemu knows nothing about e.g mac
>> address of a pass-through device I believe?
> Supply a flag to the VFIO when it's added, this way QEMU will know.
>
>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>> It then exposes the primary device to the guest, and triggers
>>> a device addition event (hot-plug event) for it.
>> Do you mean we won't have primary device in the initial qemu cli?
> No, that's not what I mean.
>
> I mean management will supply a flag to VFIO and then
>
>
> - VFIO defers exposing
> primary to guest until guest acks the feature bit.
> - When we see guest ack, initiate hotplug.
> - On reboot, hide it again.
> - On reset without reboot, request hot-unplug and on eject hide it again.

This sounds much like a kind of bonding in qemu.

>
>>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>>> to remove the primary driver.  In particular, if QEMU detects guest
>>> re-initialization (e.g. by detecting guest reset) it immediately removes
>>> the primary device.
>> I believe guest failover module should handle this gracefully?
> It can't control enumeration order, if primary is enumerated before
> standby then guest will load its driver and it's too late
> when failover driver is loaded.

Well, even if we can delay the visibility of primary until DRIVER_OK, 
there still be a race I think? And it looks to me it's still a bug of guest:

E.g primary could be probed before failover_register() in guest. Then we 
will miss the enslaving of primary forever.

Thanks

>
>> Thanks
>>
>>> We can move some of this code to management as well, architecturally it
>>> does not make too much sense but it might be easier implementation-wise.
>>>
>>> HTH
>>>
>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>
>>>>
>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>> act as a standby for another device with the same MAC address.
>>>>>
>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>> by default as i am using libvirt to start the VMs.
>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>> XML file?
>>>>>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> ---
>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>     2 files changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>> index 90502fca7c..38b3140670 100644
>>>>> --- a/hw/net/virtio-net.c
>>>>> +++ b/hw/net/virtio-net.c
>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>                          true),
>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>> +                      false),
>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>     };
>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>> index e9f255ea3f..01ec09684c 100644
>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>> @@ -57,6 +57,9 @@
>>>>>     					 * Steering */
>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
>>>>> +                                         * with the same MAC.
>>>>> +                                         */
>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
Jason Wang June 13, 2018, 2:41 a.m. UTC | #24
On 2018年06月13日 08:20, Samudrala, Sridhar wrote:
> On 6/12/2018 4:54 AM, Michael S. Tsirkin wrote:
>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>
>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>> I don't think this is sufficient.
>>>>
>>>> If both primary and standby devices are present, a legacy guest 
>>>> without
>>>> support for the feature might see two devices with same mac and get
>>>> confused.
>>>>
>>>> I think that we should only make primary visible after guest acked the
>>>> backup feature bit.
>>> I think we want exactly the reverse? E.g fail the negotiation when 
>>> guest
>>> does not ack backup feature.
>>>
>>> Otherwise legacy guest won't even have the chance to see primary 
>>> device in
>>> the guest.
>> That's by design.
>>
>>>> And on reset or when backup is cleared in some other way, unplug the
>>>> primary.
>>> What if guest does not support hotplug?
>> It shouldn't ack the backup feature then and it's a good point.
>> We should both document this and check kernel config has
>> hotplug support. Sridhar could you take a look pls?
>
> Right now we select NET_FAILOVER when virtio-net is enabled, Should we 
> select
> PCI_HOTPLUG when NET_FAILOVER is enabled? 

Maybe I was wrong but it looks to me PCI_HOTPLUG is no sufficient since 
it depends on other to work e.g HOTPLUG_PCI_ACPI.

Thanks
Samudrala, Sridhar June 13, 2018, 4:24 a.m. UTC | #25
On 6/12/2018 7:38 PM, Jason Wang wrote:
>
>
> On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>
>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>> I don't think this is sufficient.
>>>>
>>>> If both primary and standby devices are present, a legacy guest 
>>>> without
>>>> support for the feature might see two devices with same mac and get
>>>> confused.
>>>>
>>>> I think that we should only make primary visible after guest acked the
>>>> backup feature bit.
>>> I think we want exactly the reverse? E.g fail the negotiation when 
>>> guest
>>> does not ack backup feature.
>>>
>>> Otherwise legacy guest won't even have the chance to see primary 
>>> device in
>>> the guest.
>> That's by design.
>
> So management needs to know the capability of guest to set the backup 
> feature. This looks a chicken or egg problem to me.

I don't think so. If the tenant requests 'accelerated datapath feature', the management
will set 'standby' feature bit on virtio-net interface and if the guest virtio-net driver
supports this feature, then the tenant VM will get that capability via a hot-plugged
primary device.


>
>>
>>>> And on reset or when backup is cleared in some other way, unplug the
>>>> primary.
>>> What if guest does not support hotplug?
>> It shouldn't ack the backup feature then and it's a good point.
>> We should both document this and check kernel config has
>> hotplug support. Sridhar could you take a look pls?
>>
>>>> Something like the below will do the job:
>>>>
>>>> Primary device is added with a special "primary-failover" flag.
>>>> A virtual machine is then initialized with just a standby virtio
>>>> device. Primary is not yet added.
>>> A question is how to do the matching? Qemu knows nothing about e.g mac
>>> address of a pass-through device I believe?
>> Supply a flag to the VFIO when it's added, this way QEMU will know.
>>
>>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>>> It then exposes the primary device to the guest, and triggers
>>>> a device addition event (hot-plug event) for it.
>>> Do you mean we won't have primary device in the initial qemu cli?
>> No, that's not what I mean.
>>
>> I mean management will supply a flag to VFIO and then
>>
>>
>> - VFIO defers exposing
>> primary to guest until guest acks the feature bit.
>> - When we see guest ack, initiate hotplug.
>> - On reboot, hide it again.
>> - On reset without reboot, request hot-unplug and on eject hide it 
>> again.
>
> This sounds much like a kind of bonding in qemu.
>
>>
>>>> If QEMU detects guest driver removal, it initiates a hot-unplug 
>>>> sequence
>>>> to remove the primary driver.  In particular, if QEMU detects guest
>>>> re-initialization (e.g. by detecting guest reset) it immediately 
>>>> removes
>>>> the primary device.
>>> I believe guest failover module should handle this gracefully?
>> It can't control enumeration order, if primary is enumerated before
>> standby then guest will load its driver and it's too late
>> when failover driver is loaded.
>
> Well, even if we can delay the visibility of primary until DRIVER_OK, 
> there still be a race I think? And it looks to me it's still a bug of 
> guest:
>
> E.g primary could be probed before failover_register() in guest. Then 
> we will miss the enslaving of primary forever.

That is not an issue. Even if the primary is probed before failover driver, it will
enslave the primary via the call to failover_existing_slave_register() as part of
failover_register() routine.

>
> Thanks
>
>>
>>> Thanks
>>>
>>>> We can move some of this code to management as well, 
>>>> architecturally it
>>>> does not make too much sense but it might be easier 
>>>> implementation-wise.
>>>>
>>>> HTH
>>>>
>>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>>> Ping on this patch now that the kernel patches are accepted into 
>>>>> davem's net-next tree.
>>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>>
>>>>>
>>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>>> This feature bit can be used by hypervisor to indicate virtio_net 
>>>>>> device to
>>>>>> act as a standby for another device with the same MAC address.
>>>>>>
>>>>>> I tested this with a small change to the patch to mark the 
>>>>>> STANDBY feature 'true'
>>>>>> by default as i am using libvirt to start the VMs.
>>>>>> Is there a way to pass the newly added feature bit 'standby' to 
>>>>>> qemu via libvirt
>>>>>> XML file?
>>>>>>
>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>> ---
>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>     2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 90502fca7c..38b3140670 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>                          true),
>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, 
>>>>>> SPEED_UNKNOWN),
>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, 
>>>>>> net_conf.duplex_str),
>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, 
>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>> +                      false),
>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>     };
>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h 
>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>> @@ -57,6 +57,9 @@
>>>>>>                          * Steering */
>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for 
>>>>>> another device
>>>>>> +                                         * with the same MAC.
>>>>>> +                                         */
>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set 
>>>>>> linkspeed and duplex */
>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>
Jason Wang June 13, 2018, 5:40 a.m. UTC | #26
On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
> On 6/12/2018 7:38 PM, Jason Wang wrote:
>>
>>
>> On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>>
>>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>>> I don't think this is sufficient.
>>>>>
>>>>> If both primary and standby devices are present, a legacy guest 
>>>>> without
>>>>> support for the feature might see two devices with same mac and get
>>>>> confused.
>>>>>
>>>>> I think that we should only make primary visible after guest acked 
>>>>> the
>>>>> backup feature bit.
>>>> I think we want exactly the reverse? E.g fail the negotiation when 
>>>> guest
>>>> does not ack backup feature.
>>>>
>>>> Otherwise legacy guest won't even have the chance to see primary 
>>>> device in
>>>> the guest.
>>> That's by design.
>>
>> So management needs to know the capability of guest to set the backup 
>> feature. This looks a chicken or egg problem to me.
>
> I don't think so. If the tenant requests 'accelerated datapath 
> feature', the management
> will set 'standby' feature bit on virtio-net interface and if the 
> guest virtio-net driver
> supports this feature, then the tenant VM will get that capability via 
> a hot-plugged
> primary device.

Ok, I thought exactly the reverse because of the commit title is "enable 
virtio_net to act as a standby for a passthru device". But re-read the 
commit log content, I understand the case a little bit. Btw, VF is not 
necessarily faster than virtio-net, especially consider virtio-net may 
have a lot of queues.

>
>
>>
>>>
>>>>> And on reset or when backup is cleared in some other way, unplug the
>>>>> primary.
>>>> What if guest does not support hotplug?
>>> It shouldn't ack the backup feature then and it's a good point.
>>> We should both document this and check kernel config has
>>> hotplug support. Sridhar could you take a look pls?
>>>
>>>>> Something like the below will do the job:
>>>>>
>>>>> Primary device is added with a special "primary-failover" flag.
>>>>> A virtual machine is then initialized with just a standby virtio
>>>>> device. Primary is not yet added.
>>>> A question is how to do the matching? Qemu knows nothing about e.g mac
>>>> address of a pass-through device I believe?
>>> Supply a flag to the VFIO when it's added, this way QEMU will know.
>>>
>>>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>>>> It then exposes the primary device to the guest, and triggers
>>>>> a device addition event (hot-plug event) for it.
>>>> Do you mean we won't have primary device in the initial qemu cli?
>>> No, that's not what I mean.
>>>
>>> I mean management will supply a flag to VFIO and then
>>>
>>>
>>> - VFIO defers exposing
>>> primary to guest until guest acks the feature bit.
>>> - When we see guest ack, initiate hotplug.
>>> - On reboot, hide it again.
>>> - On reset without reboot, request hot-unplug and on eject hide it 
>>> again.
>>
>> This sounds much like a kind of bonding in qemu.
>>
>>>
>>>>> If QEMU detects guest driver removal, it initiates a hot-unplug 
>>>>> sequence
>>>>> to remove the primary driver.  In particular, if QEMU detects guest
>>>>> re-initialization (e.g. by detecting guest reset) it immediately 
>>>>> removes
>>>>> the primary device.
>>>> I believe guest failover module should handle this gracefully?
>>> It can't control enumeration order, if primary is enumerated before
>>> standby then guest will load its driver and it's too late
>>> when failover driver is loaded.
>>
>> Well, even if we can delay the visibility of primary until DRIVER_OK, 
>> there still be a race I think? And it looks to me it's still a bug of 
>> guest:
>>
>> E.g primary could be probed before failover_register() in guest. Then 
>> we will miss the enslaving of primary forever.
>
> That is not an issue. Even if the primary is probed before failover 
> driver, it will
> enslave the primary via the call to failover_existing_slave_register() 
> as part of
> failover_register() routine.

Aha I get it. So the enumeration order is not an issue.

Consider primary may still be seen by guest kernel even if we delay its 
visibility, I wonder whether we can control the lifecycle of primary 
through driver but not qemu. This can simplify a lot of things.

Thanks

>
>>
>> Thanks
>>
>>>
>>>> Thanks
>>>>
>>>>> We can move some of this code to management as well, 
>>>>> architecturally it
>>>>> does not make too much sense but it might be easier 
>>>>> implementation-wise.
>>>>>
>>>>> HTH
>>>>>
>>>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>>>> Ping on this patch now that the kernel patches are accepted into 
>>>>>> davem's net-next tree.
>>>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>>>
>>>>>>
>>>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>>>> This feature bit can be used by hypervisor to indicate 
>>>>>>> virtio_net device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the 
>>>>>>> STANDBY feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to 
>>>>>>> qemu via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>> ---
>>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>>     2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>>                          true),
>>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, 
>>>>>>> SPEED_UNKNOWN),
>>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, 
>>>>>>> net_conf.duplex_str),
>>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, 
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> +                      false),
>>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>>     };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h 
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                          * Steering */
>>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for 
>>>>>>> another device
>>>>>>> +                                         * with the same MAC.
>>>>>>> +                                         */
>>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63    /* Device set 
>>>>>>> linkspeed and duplex */
>>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>
>
Siwei Liu June 14, 2018, 12:56 a.m. UTC | #27
On Tue, Jun 12, 2018 at 4:47 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
>> The thing is cloud service provider might prefer sticking to the same
>> level of service agreement (SLA) of keeping VF over migration,
>
> That requirement is trivially satisfied by just a single VF :) If your
> SLA does not require live migration, you should do just that.

The requirement is enable live migration by default without getting
user too much attention. Migration should be attempted if guest is
live migratable. If not, present a single VF to legacy guest.

You seem to think it's not a valid use case? Or impossible to do so?
What if I have something in mind that can support my theory as well?

-Siwei

>
> --
> MST
Siwei Liu June 14, 2018, 1:02 a.m. UTC | #28
On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>>>
>>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>>> device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>>>> feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>>>> via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>
>>>>>> So I do not think we can commit to this interface: we
>>>>>> really need to control visibility of the primary device.
>>>>>
>>>>> The problem is legacy guest won't use primary device at all if we do
>>>>> this.
>>>>
>>>> And that's by design - I think it's the only way to ensure the
>>>> legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio.  But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.

Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all hot plug/unplug logic.
The management tool can supply passthrough device and virtio with the
same group UUID, QEMU auto-manages the presence of the primary, and
hot plug the device as needed before or after the migration.

-Siwei

>
>
>>
>>
>>>>> How about control the visibility of standby device?
>>>>>
>>>>> Thanks
>>>>
>>>> standy the always there to guarantee no downtime.
>>>>
>>>>>> However just for testing purposes, we could add a non-stable
>>>>>> interface "x-standby" with the understanding that as any
>>>>>> x- prefix it's unstable and will be changed down the road,
>>>>>> likely in the next release.
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>>     hw/net/virtio-net.c                         | 2 ++
>>>>>>>     include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>>     2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>>                          true),
>>>>>>>         DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>>         DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>>> +    DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> +                      false),
>>>>>>>         DEFINE_PROP_END_OF_LIST(),
>>>>>>>     };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>>                                          * Steering */
>>>>>>>     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23       /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for
>>>>>>> another device
>>>>>>> +                                         * with the same MAC.
>>>>>>> +                                         */
>>>>>>>     #define VIRTIO_NET_F_SPEED_DUPLEX 63        /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>>     #ifndef VIRTIO_NET_NO_LEGACY
>>>>>>> --
>>>>>>> 2.14.3
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
>
Cornelia Huck June 14, 2018, 10:02 a.m. UTC | #29
I've been pointed to this discussion (which I had missed previously)
and I'm getting a headache. Let me first summarize how I understand how
this feature is supposed to work, then I'll respond to some individual
points.

The basic idea is to enable guests to migrate seamlessly, while still
making it possible for them to use a passed-through device for more
performance etc. The means to do so is to hook a virtio-net device
together with a network device passed through via vfio. The
vfio-handled device is there for performance, the virtio device for
migratability. We have a new virtio feature bit for that which needs to
be negotiated for that 'combined' device to be available. We have to
consider two cases:

- Older guests that do not support the new feature bit. We presume that
  those guests will be confused if they get two network devices with
  the same MAC, so the idea is to not show them the vfio-handled device
  at all.
- Guests that negotiate the feature bit. We only know positively that
  they (a) know the feature bit and (b) are prepared to handle the
  consequences of negotiating it after they set the FEATURES_OK bit.
  This is therefore the earliest point in time that the vfio-handled
  device should be visible or usable for the guest.

On Wed, 13 Jun 2018 18:02:01 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:  
> >>
> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:  
> >>>
> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:  
> >>>>
> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:  
> >>>>>
> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:  
> >>>>>>
> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:  
> >>>>>>>
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
> >>>>>>> device to
> >>>>>>> act as a standby for another device with the same MAC address.
> >>>>>>>
> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
> >>>>>>> feature 'true'
> >>>>>>> by default as i am using libvirt to start the VMs.
> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
> >>>>>>> via libvirt
> >>>>>>> XML file?
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>  
> >>>>>>
> >>>>>> So I do not think we can commit to this interface: we
> >>>>>> really need to control visibility of the primary device.  
> >>>>>
> >>>>> The problem is legacy guest won't use primary device at all if we do
> >>>>> this.  
> >>>>
> >>>> And that's by design - I think it's the only way to ensure the
> >>>> legacy guest isn't confused.  
> >>>
> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
> >>> the visibility
> >>> of the primary device. The primary device may not be specified as an
> >>> argument to Qemu. It
> >>> may be plugged in later.
> >>> The cloud service provider is providing a feature that enables low
> >>> latency datapath and live
> >>> migration capability.
> >>> A tenant can use this feature only if he is running a VM that has
> >>> virtio-net with failover support.  

So, do you know from the outset that there will be such a coupled
device? I.e., is it a property of the VM definition?

Can there be a 'prepared' virtio-net device that presents the STANDBY
feature even if there currently is no vfio-handled device available --
but making it possible to simply hotplug that device later?

Should it be possible to add a virtio/vfio pair later on?

> >>
> >> Well live migration is there already. The new feature is low latency
> >> data path.  
> >
> >
> > we get live migration with just virtio.  But I meant live migration with VF
> > as
> > primary device.
> >  
> >>
> >> And it's the guest that needs failover support not the VM.  
> >
> >
> > Isn't guest and VM synonymous?

I think we need to be really careful to not mix up the two: The VM
contains the definitions, but it is up to the guest how it uses them.

> >
> >  
> >>
> >>  
> >>> I think Qemu should check if guest virtio-net supports this feature and
> >>> provide a mechanism for
> >>> an upper layer indicating if the STANDBY feature is successfully
> >>> negotiated or not.
> >>> The upper layer can then decide if it should hot plug a VF with the same
> >>> MAC and manage the 2 links.
> >>> If VF is successfully hot plugged, virtio-net link should be disabled.  
> >>
> >> Did you even talk to upper layer management about it?
> >> Just list the steps they need to do and you will see
> >> that's a lot of machinery to manage by the upper layer.
> >>
> >> What do we gain in flexibility? As far as I can see the
> >> only gain is some resources saved for legacy VMs.
> >>
> >> That's not a lot as tenant of the upper layer probably already has
> >> at least a hunch that it's a new guest otherwise
> >> why bother specifying the feature at all - you
> >> save even more resources without it.
> >>  
> >
> > I am not all that familiar with how Qemu manages network devices. If we can
> > do all the
> > required management of the primary/standby devices within Qemu, that is
> > definitely a better
> > approach without upper layer involvement.  
> 
> Right. I would imagine in the extreme case the upper layer doesn't
> have to be involved at all if QEMU manages all hot plug/unplug logic.
> The management tool can supply passthrough device and virtio with the
> same group UUID, QEMU auto-manages the presence of the primary, and
> hot plug the device as needed before or after the migration.

I do not really see how you can manage that kind of stuff in QEMU only.
Have you talked to some libvirt folks? (And I'm not sure what you refer
to with 'group UUID'?)

Also, I think you need to make a distinction between hotplugging a
device and making it visible to the guest. What does 'hotplugging' mean
here? Adding it to the VM definition? Would it be enough to have the
vfio-based device not operational until the virtio feature bit has been
negotiated?

What happens if the guest does not use the vfio-based device after it
has been made available? Will you still disable the virtio-net link?
(All that link handling definitely sounds like a task for libvirt or
the like.)

Regarding hot(un)plugging during migration, I think you also need to
keep in mind that different architectures/busses have different
semantics there. Something that works if there's an unplug handshake may
not work on a platform with surprise removal.

Have you considered guest agents? All of this is punching through
several layers, and I'm not sure if that is a good idea.
Michael S. Tsirkin June 14, 2018, 12:50 p.m. UTC | #30
On Wed, Jun 13, 2018 at 06:02:01PM -0700, Siwei Liu wrote:
> >> And it's the guest that needs failover support not the VM.
> >
> >
> > Isn't guest and VM synonymous?

Guest is whatever software is running on top of the hypervisor.

The virtual machine is the interface between the two.
Siwei Liu June 15, 2018, 1:57 a.m. UTC | #31
Thank you for sharing your thoughts, Cornelia. With questions below, I
think you raised really good points, some of which I don't have answer
yet and would also like to explore here.

First off, I don't want to push the discussion to the extreme at this
point, or sell anything about having QEMU manage everything
automatically. Don't get me wrong, it's not there yet. Let's don't
assume we are tied to a specific or concerte solution. I think the key
for our discussion might be to define or refine the boundary between
VM and guest,  e.g. what each layer is expected to control and manage
exactly.

In my view, there might be possibly 3 different options to represent
the failover device conceipt to QEMU and libvirt (or any upper layer
software):

a. Seperate device: in this model, virtio and passthough remains
separate devices just as today. QEMU exposes the standby feature bit
for virtio, and publish status/event around the negotiation process of
this feature bit for libvirt to react upon. Since Libvirt has the
pairing relationship itself, maybe through MAC address or something
else, it can control the presence of primary by hot plugging or
unplugging the passthrough device, although it has to work tightly
with virtio's feature negotation process. Not just for migration but
also various corner scenarios (driver/feature ok, device reset,
reboot, legacy guest etc) along virtio's feature negotiation.

b. Coupled device: in this model, virtio and passthough devices are
weakly coupled using some group ID, i.e. QEMU match the passthough
device for a standby virtio instance by comparing the group ID value
present behind each device's bridge. Libvirt provides QEMU the group
ID for both type of devices, and only deals with hot plug for
migration, by checking some migration status exposed (e.g. the feature
negotiation status on the virtio device) by QEMU. QEMU manages the
visibility of the primary in guest along virtio's feature negotiation
process.

c. Fully combined device: in this model, virtio and passthough devices
are viewed as a single VM interface altogther. QEMU not just controls
the visibility of the primary in guest, but can also manage the
exposure of the passthrough for migratability. It can be like that
libvirt supplies the group ID to QEMU. Or libvirt does not even have
to provide group ID for grouping the two devices, if just one single
combined device is exposed by QEMU. In either case, QEMU manages all
aspect of such internal construct, including virtio feature
negotiation, presence of the primary, and live migration.

It looks like to me that, in your opinion, you seem to prefer go with
(a). While I'm actually okay with either (b) or (c). Do I understand
your point correctly?

The reason that I feel that (a) might not be ideal, just as Michael
alluded to (quoting below), is that as management stack, it really
doesn't need to care about the detailed process of feature negotiation
(if we view the guest presence of the primary as part of feature
negotiation at an extended level not just virtio). All it needs to be
done is to hand in the required devices to QEMU and that's all. Why do
we need to addd various hooks, events for whichever happens internally
within the guest?

''
Primary device is added with a special "primary-failover" flag.
A virtual machine is then initialized with just a standby virtio
device. Primary is not yet added.

Later QEMU detects that guest driver device set DRIVER_OK.
It then exposes the primary device to the guest, and triggers
a device addition event (hot-plug event) for it.

If QEMU detects guest driver removal, it initiates a hot-unplug sequence
to remove the primary driver.  In particular, if QEMU detects guest
re-initialization (e.g. by detecting guest reset) it immediately removes
the primary device.
''

and,

''
management just wants to give the primary to guest and later take it back,
it really does not care about the details of the process,
so I don't see what does pushing it up the stack buy you.

So I don't think it *needs* to be done in libvirt. It probably can if you
add a bunch of hooks so it knows whenever vm reboots, driver binds and
unbinds from device, and can check that backup flag was set.
If you are pushing for a setup like that please get a buy-in
from libvirt maintainers or better write a patch.
''

Let me know if my clarifications sound clear to you now.


Thanks,
-Siwei


On Thu, Jun 14, 2018 at 3:02 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> I've been pointed to this discussion (which I had missed previously)
> and I'm getting a headache. Let me first summarize how I understand how
> this feature is supposed to work, then I'll respond to some individual
> points.
>
> The basic idea is to enable guests to migrate seamlessly, while still
> making it possible for them to use a passed-through device for more
> performance etc. The means to do so is to hook a virtio-net device
> together with a network device passed through via vfio. The
> vfio-handled device is there for performance, the virtio device for
> migratability. We have a new virtio feature bit for that which needs to
> be negotiated for that 'combined' device to be available. We have to
> consider two cases:
>
> - Older guests that do not support the new feature bit. We presume that
>   those guests will be confused if they get two network devices with
>   the same MAC, so the idea is to not show them the vfio-handled device
>   at all.
> - Guests that negotiate the feature bit. We only know positively that
>   they (a) know the feature bit and (b) are prepared to handle the
>   consequences of negotiating it after they set the FEATURES_OK bit.
>   This is therefore the earliest point in time that the vfio-handled
>   device should be visible or usable for the guest.
>
> On Wed, 13 Jun 2018 18:02:01 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
>> <sridhar.samudrala@intel.com> wrote:
>> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>> >>
>> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> >>>
>> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>> >>>>
>> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>> >>>>>
>> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>> >>>>>>
>> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>> >>>>>>>
>> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>> >>>>>>> device to
>> >>>>>>> act as a standby for another device with the same MAC address.
>> >>>>>>>
>> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
>> >>>>>>> feature 'true'
>> >>>>>>> by default as i am using libvirt to start the VMs.
>> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>> >>>>>>> via libvirt
>> >>>>>>> XML file?
>> >>>>>>>
>> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >>>>>>
>> >>>>>> So I do not think we can commit to this interface: we
>> >>>>>> really need to control visibility of the primary device.
>> >>>>>
>> >>>>> The problem is legacy guest won't use primary device at all if we do
>> >>>>> this.
>> >>>>
>> >>>> And that's by design - I think it's the only way to ensure the
>> >>>> legacy guest isn't confused.
>> >>>
>> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
>> >>> the visibility
>> >>> of the primary device. The primary device may not be specified as an
>> >>> argument to Qemu. It
>> >>> may be plugged in later.
>> >>> The cloud service provider is providing a feature that enables low
>> >>> latency datapath and live
>> >>> migration capability.
>> >>> A tenant can use this feature only if he is running a VM that has
>> >>> virtio-net with failover support.
>
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
>
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
>
> Should it be possible to add a virtio/vfio pair later on?
>
>> >>
>> >> Well live migration is there already. The new feature is low latency
>> >> data path.
>> >
>> >
>> > we get live migration with just virtio.  But I meant live migration with VF
>> > as
>> > primary device.
>> >
>> >>
>> >> And it's the guest that needs failover support not the VM.
>> >
>> >
>> > Isn't guest and VM synonymous?
>
> I think we need to be really careful to not mix up the two: The VM
> contains the definitions, but it is up to the guest how it uses them.
>
>> >
>> >
>> >>
>> >>
>> >>> I think Qemu should check if guest virtio-net supports this feature and
>> >>> provide a mechanism for
>> >>> an upper layer indicating if the STANDBY feature is successfully
>> >>> negotiated or not.
>> >>> The upper layer can then decide if it should hot plug a VF with the same
>> >>> MAC and manage the 2 links.
>> >>> If VF is successfully hot plugged, virtio-net link should be disabled.
>> >>
>> >> Did you even talk to upper layer management about it?
>> >> Just list the steps they need to do and you will see
>> >> that's a lot of machinery to manage by the upper layer.
>> >>
>> >> What do we gain in flexibility? As far as I can see the
>> >> only gain is some resources saved for legacy VMs.
>> >>
>> >> That's not a lot as tenant of the upper layer probably already has
>> >> at least a hunch that it's a new guest otherwise
>> >> why bother specifying the feature at all - you
>> >> save even more resources without it.
>> >>
>> >
>> > I am not all that familiar with how Qemu manages network devices. If we can
>> > do all the
>> > required management of the primary/standby devices within Qemu, that is
>> > definitely a better
>> > approach without upper layer involvement.
>>
>> Right. I would imagine in the extreme case the upper layer doesn't
>> have to be involved at all if QEMU manages all hot plug/unplug logic.
>> The management tool can supply passthrough device and virtio with the
>> same group UUID, QEMU auto-manages the presence of the primary, and
>> hot plug the device as needed before or after the migration.
>
> I do not really see how you can manage that kind of stuff in QEMU only.
> Have you talked to some libvirt folks? (And I'm not sure what you refer
> to with 'group UUID'?)
>
> Also, I think you need to make a distinction between hotplugging a
> device and making it visible to the guest. What does 'hotplugging' mean
> here? Adding it to the VM definition? Would it be enough to have the
> vfio-based device not operational until the virtio feature bit has been
> negotiated?
>
> What happens if the guest does not use the vfio-based device after it
> has been made available? Will you still disable the virtio-net link?
> (All that link handling definitely sounds like a task for libvirt or
> the like.)
>
> Regarding hot(un)plugging during migration, I think you also need to
> keep in mind that different architectures/busses have different
> semantics there. Something that works if there's an unplug handshake may
> not work on a platform with surprise removal.
>
> Have you considered guest agents? All of this is punching through
> several layers, and I'm not sure if that is a good idea.
Michael S. Tsirkin June 15, 2018, 2:34 a.m. UTC | #32
On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> So, do you know from the outset that there will be such a coupled
> device? I.e., is it a property of the VM definition?
> 
> Can there be a 'prepared' virtio-net device that presents the STANDBY
> feature even if there currently is no vfio-handled device available --
> but making it possible to simply hotplug that device later?
> Should it be possible to add a virtio/vfio pair later on?

Yes, that's the plan, more or less.

> > >>> I think Qemu should check if guest virtio-net supports this feature and
> > >>> provide a mechanism for
> > >>> an upper layer indicating if the STANDBY feature is successfully
> > >>> negotiated or not.
> > >>> The upper layer can then decide if it should hot plug a VF with the same
> > >>> MAC and manage the 2 links.
> > >>> If VF is successfully hot plugged, virtio-net link should be disabled.  

BTW I see no reason to do this last part. The role of the
standby device is to be always there.

> > >>
> > >> Did you even talk to upper layer management about it?
> > >> Just list the steps they need to do and you will see
> > >> that's a lot of machinery to manage by the upper layer.
> > >>
> > >> What do we gain in flexibility? As far as I can see the
> > >> only gain is some resources saved for legacy VMs.
> > >>
> > >> That's not a lot as tenant of the upper layer probably already has
> > >> at least a hunch that it's a new guest otherwise
> > >> why bother specifying the feature at all - you
> > >> save even more resources without it.
> > >>  
> > >
> > > I am not all that familiar with how Qemu manages network devices. If we can
> > > do all the
> > > required management of the primary/standby devices within Qemu, that is
> > > definitely a better
> > > approach without upper layer involvement.  
> > 
> > Right. I would imagine in the extreme case the upper layer doesn't
> > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > The management tool can supply passthrough device and virtio with the
> > same group UUID, QEMU auto-manages the presence of the primary, and
> > hot plug the device as needed before or after the migration.
> 
> I do not really see how you can manage that kind of stuff in QEMU only.

So right now failover is limited to pci passthrough devices only.
The idea is to realize the vfio device but not expose it
to guest. Have a separate command to expose it to guest.
Hotunplug would also hide it from guest but not unrealize it.

This will help ensure that e.g. on migration failure we can
re-expose the device without risk of running out of resources.
Cornelia Huck June 15, 2018, 9:32 a.m. UTC | #33
On Fri, 15 Jun 2018 05:34:24 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:

> > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > do all the
> > > > required management of the primary/standby devices within Qemu, that is
> > > > definitely a better
> > > > approach without upper layer involvement.    
> > > 
> > > Right. I would imagine in the extreme case the upper layer doesn't
> > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > The management tool can supply passthrough device and virtio with the
> > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > hot plug the device as needed before or after the migration.  
> > 
> > I do not really see how you can manage that kind of stuff in QEMU only.  
> 
> So right now failover is limited to pci passthrough devices only.
> The idea is to realize the vfio device but not expose it
> to guest. Have a separate command to expose it to guest.
> Hotunplug would also hide it from guest but not unrealize it.

So, this would not be real hot(un)plug, but 'hide it from the guest',
right? The concept of "we have it realized in QEMU, but the guest can't
discover and use it" should be translatable to non-pci as well (at
least for ccw).

> 
> This will help ensure that e.g. on migration failure we can
> re-expose the device without risk of running out of resources.

Makes sense.

Should that 'hidden' state be visible/settable from outside as well
(e.g. via a property)? I guess yes, so that management software has a
chance to see whether a device is visible. Settable may be useful if we
find another use case for hiding realized devices.
Cornelia Huck June 15, 2018, 11:48 a.m. UTC | #34
On Thu, 14 Jun 2018 18:57:11 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> Thank you for sharing your thoughts, Cornelia. With questions below, I
> think you raised really good points, some of which I don't have answer
> yet and would also like to explore here.
> 
> First off, I don't want to push the discussion to the extreme at this
> point, or sell anything about having QEMU manage everything
> automatically. Don't get me wrong, it's not there yet. Let's don't
> assume we are tied to a specific or concerte solution. I think the key
> for our discussion might be to define or refine the boundary between
> VM and guest,  e.g. what each layer is expected to control and manage
> exactly.
> 
> In my view, there might be possibly 3 different options to represent
> the failover device conceipt to QEMU and libvirt (or any upper layer
> software):
> 
> a. Seperate device: in this model, virtio and passthough remains
> separate devices just as today. QEMU exposes the standby feature bit
> for virtio, and publish status/event around the negotiation process of
> this feature bit for libvirt to react upon. Since Libvirt has the
> pairing relationship itself, maybe through MAC address or something
> else, it can control the presence of primary by hot plugging or
> unplugging the passthrough device, although it has to work tightly
> with virtio's feature negotation process. Not just for migration but
> also various corner scenarios (driver/feature ok, device reset,
> reboot, legacy guest etc) along virtio's feature negotiation.

Yes, that one has obvious tie-ins to virtio's modus operandi.

> 
> b. Coupled device: in this model, virtio and passthough devices are
> weakly coupled using some group ID, i.e. QEMU match the passthough
> device for a standby virtio instance by comparing the group ID value
> present behind each device's bridge. Libvirt provides QEMU the group
> ID for both type of devices, and only deals with hot plug for
> migration, by checking some migration status exposed (e.g. the feature
> negotiation status on the virtio device) by QEMU. QEMU manages the
> visibility of the primary in guest along virtio's feature negotiation
> process.

I'm a bit confused here. What, exactly, ties the two devices together?
If libvirt already has the knowledge that it should manage the two as a
couple, why do we need the group id (or something else for other
architectures)? (Maybe I'm simply missing something because I'm not
that familiar with pci.)

> 
> c. Fully combined device: in this model, virtio and passthough devices
> are viewed as a single VM interface altogther. QEMU not just controls
> the visibility of the primary in guest, but can also manage the
> exposure of the passthrough for migratability. It can be like that
> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> to provide group ID for grouping the two devices, if just one single
> combined device is exposed by QEMU. In either case, QEMU manages all
> aspect of such internal construct, including virtio feature
> negotiation, presence of the primary, and live migration.

Same question as above.

> 
> It looks like to me that, in your opinion, you seem to prefer go with
> (a). While I'm actually okay with either (b) or (c). Do I understand
> your point correctly?

I'm not yet preferring anything, as I'm still trying to understand how
this works :) I hope we can arrive at a model that covers the use case
and that is also flexible enough to be extended to other platforms.

> 
> The reason that I feel that (a) might not be ideal, just as Michael
> alluded to (quoting below), is that as management stack, it really
> doesn't need to care about the detailed process of feature negotiation
> (if we view the guest presence of the primary as part of feature
> negotiation at an extended level not just virtio). All it needs to be
> done is to hand in the required devices to QEMU and that's all. Why do
> we need to addd various hooks, events for whichever happens internally
> within the guest?
> 
> ''
> Primary device is added with a special "primary-failover" flag.
> A virtual machine is then initialized with just a standby virtio
> device. Primary is not yet added.
> 
> Later QEMU detects that guest driver device set DRIVER_OK.
> It then exposes the primary device to the guest, and triggers
> a device addition event (hot-plug event) for it.
> 
> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> to remove the primary driver.  In particular, if QEMU detects guest
> re-initialization (e.g. by detecting guest reset) it immediately removes
> the primary device.
> ''
> 
> and,
> 
> ''
> management just wants to give the primary to guest and later take it back,
> it really does not care about the details of the process,
> so I don't see what does pushing it up the stack buy you.
> 
> So I don't think it *needs* to be done in libvirt. It probably can if you
> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> unbinds from device, and can check that backup flag was set.
> If you are pushing for a setup like that please get a buy-in
> from libvirt maintainers or better write a patch.
> ''

This actually seems to mean the opposite to me: We need to know what
the guest is doing and when, as it directly drives what we need to do
with the devices. If we switch to a visibility vs a hotplug model (see
the other mail), we might be able to handle that part within qemu.
However, I don't see how you get around needing libvirt to actually set
this up in the first place and to handle migration per se.
Michael S. Tsirkin June 15, 2018, 12:31 p.m. UTC | #35
On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> On Fri, 15 Jun 2018 05:34:24 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:
> 
> > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > do all the
> > > > > required management of the primary/standby devices within Qemu, that is
> > > > > definitely a better
> > > > > approach without upper layer involvement.    
> > > > 
> > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > The management tool can supply passthrough device and virtio with the
> > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > hot plug the device as needed before or after the migration.  
> > > 
> > > I do not really see how you can manage that kind of stuff in QEMU only.  
> > 
> > So right now failover is limited to pci passthrough devices only.
> > The idea is to realize the vfio device but not expose it
> > to guest. Have a separate command to expose it to guest.
> > Hotunplug would also hide it from guest but not unrealize it.
> 
> So, this would not be real hot(un)plug, but 'hide it from the guest',
> right? The concept of "we have it realized in QEMU, but the guest can't
> discover and use it" should be translatable to non-pci as well (at
> least for ccw).
> 
> > 
> > This will help ensure that e.g. on migration failure we can
> > re-expose the device without risk of running out of resources.
> 
> Makes sense.
> 
> Should that 'hidden' state be visible/settable from outside as well
> (e.g. via a property)? I guess yes, so that management software has a
> chance to see whether a device is visible.

Might be handy for debug, but note that since QEMU manages this
state it's transient: can change at any time, so it's kind
of hard for management to rely on it.

> Settable may be useful if we
> find another use case for hiding realized devices.
Siwei Liu June 15, 2018, 5:06 p.m. UTC | #36
On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Thu, 14 Jun 2018 18:57:11 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> think you raised really good points, some of which I don't have answer
>> yet and would also like to explore here.
>>
>> First off, I don't want to push the discussion to the extreme at this
>> point, or sell anything about having QEMU manage everything
>> automatically. Don't get me wrong, it's not there yet. Let's don't
>> assume we are tied to a specific or concerte solution. I think the key
>> for our discussion might be to define or refine the boundary between
>> VM and guest,  e.g. what each layer is expected to control and manage
>> exactly.
>>
>> In my view, there might be possibly 3 different options to represent
>> the failover device conceipt to QEMU and libvirt (or any upper layer
>> software):
>>
>> a. Seperate device: in this model, virtio and passthough remains
>> separate devices just as today. QEMU exposes the standby feature bit
>> for virtio, and publish status/event around the negotiation process of
>> this feature bit for libvirt to react upon. Since Libvirt has the
>> pairing relationship itself, maybe through MAC address or something
>> else, it can control the presence of primary by hot plugging or
>> unplugging the passthrough device, although it has to work tightly
>> with virtio's feature negotation process. Not just for migration but
>> also various corner scenarios (driver/feature ok, device reset,
>> reboot, legacy guest etc) along virtio's feature negotiation.
>
> Yes, that one has obvious tie-ins to virtio's modus operandi.
>
>>
>> b. Coupled device: in this model, virtio and passthough devices are
>> weakly coupled using some group ID, i.e. QEMU match the passthough
>> device for a standby virtio instance by comparing the group ID value
>> present behind each device's bridge. Libvirt provides QEMU the group
>> ID for both type of devices, and only deals with hot plug for
>> migration, by checking some migration status exposed (e.g. the feature
>> negotiation status on the virtio device) by QEMU. QEMU manages the
>> visibility of the primary in guest along virtio's feature negotiation
>> process.
>
> I'm a bit confused here. What, exactly, ties the two devices together?

The group UUID. Since QEMU VFIO dvice does not have insight of MAC
address (which it doesn't have to), the association between VFIO
passthrough and standby must be specificed for QEMU to understand the
relationship with this model. Note, standby feature is no longer
required to be exposed under this model.

> If libvirt already has the knowledge that it should manage the two as a
> couple, why do we need the group id (or something else for other
> architectures)? (Maybe I'm simply missing something because I'm not
> that familiar with pci.)

The idea is to have QEMU control the visibility and enumeration order
of the passthrough VFIO for the failover scenario. Hotplug can be one
way to achieve it, and perhaps there's other way around also. The
group ID is not just for QEMU to couple devices, it's also helpful to
guest too as grouping using MAC address is just not safe.

>
>>
>> c. Fully combined device: in this model, virtio and passthough devices
>> are viewed as a single VM interface altogther. QEMU not just controls
>> the visibility of the primary in guest, but can also manage the
>> exposure of the passthrough for migratability. It can be like that
>> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> to provide group ID for grouping the two devices, if just one single
>> combined device is exposed by QEMU. In either case, QEMU manages all
>> aspect of such internal construct, including virtio feature
>> negotiation, presence of the primary, and live migration.
>
> Same question as above.
>
>>
>> It looks like to me that, in your opinion, you seem to prefer go with
>> (a). While I'm actually okay with either (b) or (c). Do I understand
>> your point correctly?
>
> I'm not yet preferring anything, as I'm still trying to understand how
> this works :) I hope we can arrive at a model that covers the use case
> and that is also flexible enough to be extended to other platforms.
>
>>
>> The reason that I feel that (a) might not be ideal, just as Michael
>> alluded to (quoting below), is that as management stack, it really
>> doesn't need to care about the detailed process of feature negotiation
>> (if we view the guest presence of the primary as part of feature
>> negotiation at an extended level not just virtio). All it needs to be
>> done is to hand in the required devices to QEMU and that's all. Why do
>> we need to addd various hooks, events for whichever happens internally
>> within the guest?
>>
>> ''
>> Primary device is added with a special "primary-failover" flag.
>> A virtual machine is then initialized with just a standby virtio
>> device. Primary is not yet added.
>>
>> Later QEMU detects that guest driver device set DRIVER_OK.
>> It then exposes the primary device to the guest, and triggers
>> a device addition event (hot-plug event) for it.
>>
>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> to remove the primary driver.  In particular, if QEMU detects guest
>> re-initialization (e.g. by detecting guest reset) it immediately removes
>> the primary device.
>> ''
>>
>> and,
>>
>> ''
>> management just wants to give the primary to guest and later take it back,
>> it really does not care about the details of the process,
>> so I don't see what does pushing it up the stack buy you.
>>
>> So I don't think it *needs* to be done in libvirt. It probably can if you
>> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> unbinds from device, and can check that backup flag was set.
>> If you are pushing for a setup like that please get a buy-in
>> from libvirt maintainers or better write a patch.
>> ''
>
> This actually seems to mean the opposite to me: We need to know what
> the guest is doing and when, as it directly drives what we need to do
> with the devices. If we switch to a visibility vs a hotplug model (see
> the other mail), we might be able to handle that part within qemu.

In the model of (b), I think it essentially turns hotplug to one of
mechanisms for QEMU to control the visibility. The libvirt can still
manage the hotplug of individual devices during live migration or in
normal situation to hot add/remove devices. Though the visibility of
the VFIO is under the controll of QEMU, and it's possible that the hot
add/remove request does not involve actual hot plug activity in guest
at all.

In the model of (c), the hotplug semantics of the combined device
would mean differently - it would end up with devices plugged in or
out altogther. To make this work, we either have to build a brand new
bond-like QEMU device consist of virtio and VFIO internally, or need
to have some abstraction in place for libvirt to manipulate the
combined device (and prohibit libvirt from operating on individual
internal device directly). Note with this model the group ID doesn't
even need to get exposed to libvirt, just imagine libvirt to supply
all options required to configure two regular virtio-net and VFIO
devices for a single device object, and QEMU will deal with the
device's visibility and enumeration, such when to hot plug VFIO device
in to or out from the guest.

It might be complicated to implement (c) though.

Regards,
-Siwei


> However, I don't see how you get around needing libvirt to actually set
> this up in the first place and to handle migration per se.
Cornelia Huck June 18, 2018, 1:27 p.m. UTC | #37
On Fri, 15 Jun 2018 15:31:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 15, 2018 at 11:32:42AM +0200, Cornelia Huck wrote:
> > On Fri, 15 Jun 2018 05:34:24 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Thu, Jun 14, 2018 at 12:02:31PM +0200, Cornelia Huck wrote:  
> >   
> > > > > > I am not all that familiar with how Qemu manages network devices. If we can
> > > > > > do all the
> > > > > > required management of the primary/standby devices within Qemu, that is
> > > > > > definitely a better
> > > > > > approach without upper layer involvement.      
> > > > > 
> > > > > Right. I would imagine in the extreme case the upper layer doesn't
> > > > > have to be involved at all if QEMU manages all hot plug/unplug logic.
> > > > > The management tool can supply passthrough device and virtio with the
> > > > > same group UUID, QEMU auto-manages the presence of the primary, and
> > > > > hot plug the device as needed before or after the migration.    
> > > > 
> > > > I do not really see how you can manage that kind of stuff in QEMU only.    
> > > 
> > > So right now failover is limited to pci passthrough devices only.
> > > The idea is to realize the vfio device but not expose it
> > > to guest. Have a separate command to expose it to guest.
> > > Hotunplug would also hide it from guest but not unrealize it.  
> > 
> > So, this would not be real hot(un)plug, but 'hide it from the guest',
> > right? The concept of "we have it realized in QEMU, but the guest can't
> > discover and use it" should be translatable to non-pci as well (at
> > least for ccw).
> >   
> > > 
> > > This will help ensure that e.g. on migration failure we can
> > > re-expose the device without risk of running out of resources.  
> > 
> > Makes sense.
> > 
> > Should that 'hidden' state be visible/settable from outside as well
> > (e.g. via a property)? I guess yes, so that management software has a
> > chance to see whether a device is visible.  
> 
> Might be handy for debug, but note that since QEMU manages this
> state it's transient: can change at any time, so it's kind
> of hard for management to rely on it.

Might be another reason to have this controlled by management software;
being able to find out easily why a device is not visible to the guest
seems to be a useful thing.

Anyway, let's defer this discussion until it is clear how we actually
want to handle the whole setup.

> 
> > Settable may be useful if we
> > find another use case for hiding realized devices.
Cornelia Huck June 19, 2018, 10:54 a.m. UTC | #38
On Fri, 15 Jun 2018 10:06:07 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Thu, 14 Jun 2018 18:57:11 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
> >> think you raised really good points, some of which I don't have answer
> >> yet and would also like to explore here.
> >>
> >> First off, I don't want to push the discussion to the extreme at this
> >> point, or sell anything about having QEMU manage everything
> >> automatically. Don't get me wrong, it's not there yet. Let's don't
> >> assume we are tied to a specific or concerte solution. I think the key
> >> for our discussion might be to define or refine the boundary between
> >> VM and guest,  e.g. what each layer is expected to control and manage
> >> exactly.
> >>
> >> In my view, there might be possibly 3 different options to represent
> >> the failover device conceipt to QEMU and libvirt (or any upper layer
> >> software):
> >>
> >> a. Seperate device: in this model, virtio and passthough remains
> >> separate devices just as today. QEMU exposes the standby feature bit
> >> for virtio, and publish status/event around the negotiation process of
> >> this feature bit for libvirt to react upon. Since Libvirt has the
> >> pairing relationship itself, maybe through MAC address or something
> >> else, it can control the presence of primary by hot plugging or
> >> unplugging the passthrough device, although it has to work tightly
> >> with virtio's feature negotation process. Not just for migration but
> >> also various corner scenarios (driver/feature ok, device reset,
> >> reboot, legacy guest etc) along virtio's feature negotiation.  
> >
> > Yes, that one has obvious tie-ins to virtio's modus operandi.
> >  
> >>
> >> b. Coupled device: in this model, virtio and passthough devices are
> >> weakly coupled using some group ID, i.e. QEMU match the passthough
> >> device for a standby virtio instance by comparing the group ID value
> >> present behind each device's bridge. Libvirt provides QEMU the group
> >> ID for both type of devices, and only deals with hot plug for
> >> migration, by checking some migration status exposed (e.g. the feature
> >> negotiation status on the virtio device) by QEMU. QEMU manages the
> >> visibility of the primary in guest along virtio's feature negotiation
> >> process.  
> >
> > I'm a bit confused here. What, exactly, ties the two devices together?  
> 
> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> address (which it doesn't have to), the association between VFIO
> passthrough and standby must be specificed for QEMU to understand the
> relationship with this model. Note, standby feature is no longer
> required to be exposed under this model.

Isn't that a bit limiting, though?

With this model, you can probably tie a vfio-pci device and a
virtio-net-pci device together. But this will fail if you have
different transports: Consider tying together a vfio-pci device and a
virtio-net-ccw device on s390, for example. The standby feature bit is
on the virtio-net level and should not have any dependency on the
transport used.

> 
> > If libvirt already has the knowledge that it should manage the two as a
> > couple, why do we need the group id (or something else for other
> > architectures)? (Maybe I'm simply missing something because I'm not
> > that familiar with pci.)  
> 
> The idea is to have QEMU control the visibility and enumeration order
> of the passthrough VFIO for the failover scenario. Hotplug can be one
> way to achieve it, and perhaps there's other way around also. The
> group ID is not just for QEMU to couple devices, it's also helpful to
> guest too as grouping using MAC address is just not safe.

Sorry about dragging mainframes into this, but this will only work for
homogenous device coupling, not for heterogenous. Consider my vfio-pci
+ virtio-net-ccw example again: The guest cannot find out that the two
belong together by checking some group ID, it has to either use the MAC
or some needs-to-be-architectured property.

Alternatively, we could propose that mechanism as pci-only, which means
we can rely on mechanisms that won't necessarily work on non-pci
transports. (FWIW, I don't see a use case for using vfio-ccw to pass
through a network card anytime in the near future, due to the nature of
network cards currently in use on s390.)

> 
> >  
> >>
> >> c. Fully combined device: in this model, virtio and passthough devices
> >> are viewed as a single VM interface altogther. QEMU not just controls
> >> the visibility of the primary in guest, but can also manage the
> >> exposure of the passthrough for migratability. It can be like that
> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
> >> to provide group ID for grouping the two devices, if just one single
> >> combined device is exposed by QEMU. In either case, QEMU manages all
> >> aspect of such internal construct, including virtio feature
> >> negotiation, presence of the primary, and live migration.  
> >
> > Same question as above.
> >  
> >>
> >> It looks like to me that, in your opinion, you seem to prefer go with
> >> (a). While I'm actually okay with either (b) or (c). Do I understand
> >> your point correctly?  
> >
> > I'm not yet preferring anything, as I'm still trying to understand how
> > this works :) I hope we can arrive at a model that covers the use case
> > and that is also flexible enough to be extended to other platforms.
> >  
> >>
> >> The reason that I feel that (a) might not be ideal, just as Michael
> >> alluded to (quoting below), is that as management stack, it really
> >> doesn't need to care about the detailed process of feature negotiation
> >> (if we view the guest presence of the primary as part of feature
> >> negotiation at an extended level not just virtio). All it needs to be
> >> done is to hand in the required devices to QEMU and that's all. Why do
> >> we need to addd various hooks, events for whichever happens internally
> >> within the guest?
> >>
> >> ''
> >> Primary device is added with a special "primary-failover" flag.
> >> A virtual machine is then initialized with just a standby virtio
> >> device. Primary is not yet added.
> >>
> >> Later QEMU detects that guest driver device set DRIVER_OK.
> >> It then exposes the primary device to the guest, and triggers
> >> a device addition event (hot-plug event) for it.
> >>
> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
> >> to remove the primary driver.  In particular, if QEMU detects guest
> >> re-initialization (e.g. by detecting guest reset) it immediately removes
> >> the primary device.
> >> ''
> >>
> >> and,
> >>
> >> ''
> >> management just wants to give the primary to guest and later take it back,
> >> it really does not care about the details of the process,
> >> so I don't see what does pushing it up the stack buy you.
> >>
> >> So I don't think it *needs* to be done in libvirt. It probably can if you
> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
> >> unbinds from device, and can check that backup flag was set.
> >> If you are pushing for a setup like that please get a buy-in
> >> from libvirt maintainers or better write a patch.
> >> ''  
> >
> > This actually seems to mean the opposite to me: We need to know what
> > the guest is doing and when, as it directly drives what we need to do
> > with the devices. If we switch to a visibility vs a hotplug model (see
> > the other mail), we might be able to handle that part within qemu.  
> 
> In the model of (b), I think it essentially turns hotplug to one of
> mechanisms for QEMU to control the visibility. The libvirt can still
> manage the hotplug of individual devices during live migration or in
> normal situation to hot add/remove devices. Though the visibility of
> the VFIO is under the controll of QEMU, and it's possible that the hot
> add/remove request does not involve actual hot plug activity in guest
> at all.

That depends on how you model visibility, I guess. You'll probably want
to stop traffic flowing through one or the other of the cards; would
link down or similar be enough for the virtio device?

> 
> In the model of (c), the hotplug semantics of the combined device
> would mean differently - it would end up with devices plugged in or
> out altogther. To make this work, we either have to build a brand new
> bond-like QEMU device consist of virtio and VFIO internally, or need
> to have some abstraction in place for libvirt to manipulate the
> combined device (and prohibit libvirt from operating on individual
> internal device directly). Note with this model the group ID doesn't
> even need to get exposed to libvirt, just imagine libvirt to supply
> all options required to configure two regular virtio-net and VFIO
> devices for a single device object, and QEMU will deal with the
> device's visibility and enumeration, such when to hot plug VFIO device
> in to or out from the guest.
> 
> It might be complicated to implement (c) though.

I think (c) would be even more complicated for heterogenous setups.
Siwei Liu June 19, 2018, 8:09 p.m. UTC | #39
On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Fri, 15 Jun 2018 10:06:07 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Thu, 14 Jun 2018 18:57:11 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> Thank you for sharing your thoughts, Cornelia. With questions below, I
>> >> think you raised really good points, some of which I don't have answer
>> >> yet and would also like to explore here.
>> >>
>> >> First off, I don't want to push the discussion to the extreme at this
>> >> point, or sell anything about having QEMU manage everything
>> >> automatically. Don't get me wrong, it's not there yet. Let's don't
>> >> assume we are tied to a specific or concerte solution. I think the key
>> >> for our discussion might be to define or refine the boundary between
>> >> VM and guest,  e.g. what each layer is expected to control and manage
>> >> exactly.
>> >>
>> >> In my view, there might be possibly 3 different options to represent
>> >> the failover device conceipt to QEMU and libvirt (or any upper layer
>> >> software):
>> >>
>> >> a. Seperate device: in this model, virtio and passthough remains
>> >> separate devices just as today. QEMU exposes the standby feature bit
>> >> for virtio, and publish status/event around the negotiation process of
>> >> this feature bit for libvirt to react upon. Since Libvirt has the
>> >> pairing relationship itself, maybe through MAC address or something
>> >> else, it can control the presence of primary by hot plugging or
>> >> unplugging the passthrough device, although it has to work tightly
>> >> with virtio's feature negotation process. Not just for migration but
>> >> also various corner scenarios (driver/feature ok, device reset,
>> >> reboot, legacy guest etc) along virtio's feature negotiation.
>> >
>> > Yes, that one has obvious tie-ins to virtio's modus operandi.
>> >
>> >>
>> >> b. Coupled device: in this model, virtio and passthough devices are
>> >> weakly coupled using some group ID, i.e. QEMU match the passthough
>> >> device for a standby virtio instance by comparing the group ID value
>> >> present behind each device's bridge. Libvirt provides QEMU the group
>> >> ID for both type of devices, and only deals with hot plug for
>> >> migration, by checking some migration status exposed (e.g. the feature
>> >> negotiation status on the virtio device) by QEMU. QEMU manages the
>> >> visibility of the primary in guest along virtio's feature negotiation
>> >> process.
>> >
>> > I'm a bit confused here. What, exactly, ties the two devices together?
>>
>> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> address (which it doesn't have to), the association between VFIO
>> passthrough and standby must be specificed for QEMU to understand the
>> relationship with this model. Note, standby feature is no longer
>> required to be exposed under this model.
>
> Isn't that a bit limiting, though?
>
> With this model, you can probably tie a vfio-pci device and a
> virtio-net-pci device together. But this will fail if you have
> different transports: Consider tying together a vfio-pci device and a
> virtio-net-ccw device on s390, for example. The standby feature bit is
> on the virtio-net level and should not have any dependency on the
> transport used.

Probably we'd limit the support for grouping to virtio-net-pci device
and vfio-pci device only. For virtio-net-pci, as you might see with
Venu's patch, we store the group UUID on the config space of
virtio-pci, which is only applicable to PCI transport.

If virtio-net-ccw needs to support the same, I think similar grouping
interface should be defined on the VirtIO CCW transport. I think the
current implementation of the Linux failover driver assumes that it's
SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
with, and that the PV path is on same PF without needing to update
network of the port-MAC association change. If we need to extend the
grouping mechanism to virtio-net-ccw, it has to pass such failover
mode to virtio driver specifically through some other option I guess.

>
>>
>> > If libvirt already has the knowledge that it should manage the two as a
>> > couple, why do we need the group id (or something else for other
>> > architectures)? (Maybe I'm simply missing something because I'm not
>> > that familiar with pci.)
>>
>> The idea is to have QEMU control the visibility and enumeration order
>> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> way to achieve it, and perhaps there's other way around also. The
>> group ID is not just for QEMU to couple devices, it's also helpful to
>> guest too as grouping using MAC address is just not safe.
>
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
>
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

Yes, let's do this just for PCI transport (homogenous) for now.

>
>>
>> >
>> >>
>> >> c. Fully combined device: in this model, virtio and passthough devices
>> >> are viewed as a single VM interface altogther. QEMU not just controls
>> >> the visibility of the primary in guest, but can also manage the
>> >> exposure of the passthrough for migratability. It can be like that
>> >> libvirt supplies the group ID to QEMU. Or libvirt does not even have
>> >> to provide group ID for grouping the two devices, if just one single
>> >> combined device is exposed by QEMU. In either case, QEMU manages all
>> >> aspect of such internal construct, including virtio feature
>> >> negotiation, presence of the primary, and live migration.
>> >
>> > Same question as above.
>> >
>> >>
>> >> It looks like to me that, in your opinion, you seem to prefer go with
>> >> (a). While I'm actually okay with either (b) or (c). Do I understand
>> >> your point correctly?
>> >
>> > I'm not yet preferring anything, as I'm still trying to understand how
>> > this works :) I hope we can arrive at a model that covers the use case
>> > and that is also flexible enough to be extended to other platforms.
>> >
>> >>
>> >> The reason that I feel that (a) might not be ideal, just as Michael
>> >> alluded to (quoting below), is that as management stack, it really
>> >> doesn't need to care about the detailed process of feature negotiation
>> >> (if we view the guest presence of the primary as part of feature
>> >> negotiation at an extended level not just virtio). All it needs to be
>> >> done is to hand in the required devices to QEMU and that's all. Why do
>> >> we need to addd various hooks, events for whichever happens internally
>> >> within the guest?
>> >>
>> >> ''
>> >> Primary device is added with a special "primary-failover" flag.
>> >> A virtual machine is then initialized with just a standby virtio
>> >> device. Primary is not yet added.
>> >>
>> >> Later QEMU detects that guest driver device set DRIVER_OK.
>> >> It then exposes the primary device to the guest, and triggers
>> >> a device addition event (hot-plug event) for it.
>> >>
>> >> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>> >> to remove the primary driver.  In particular, if QEMU detects guest
>> >> re-initialization (e.g. by detecting guest reset) it immediately removes
>> >> the primary device.
>> >> ''
>> >>
>> >> and,
>> >>
>> >> ''
>> >> management just wants to give the primary to guest and later take it back,
>> >> it really does not care about the details of the process,
>> >> so I don't see what does pushing it up the stack buy you.
>> >>
>> >> So I don't think it *needs* to be done in libvirt. It probably can if you
>> >> add a bunch of hooks so it knows whenever vm reboots, driver binds and
>> >> unbinds from device, and can check that backup flag was set.
>> >> If you are pushing for a setup like that please get a buy-in
>> >> from libvirt maintainers or better write a patch.
>> >> ''
>> >
>> > This actually seems to mean the opposite to me: We need to know what
>> > the guest is doing and when, as it directly drives what we need to do
>> > with the devices. If we switch to a visibility vs a hotplug model (see
>> > the other mail), we might be able to handle that part within qemu.
>>
>> In the model of (b), I think it essentially turns hotplug to one of
>> mechanisms for QEMU to control the visibility. The libvirt can still
>> manage the hotplug of individual devices during live migration or in
>> normal situation to hot add/remove devices. Though the visibility of
>> the VFIO is under the controll of QEMU, and it's possible that the hot
>> add/remove request does not involve actual hot plug activity in guest
>> at all.
>
> That depends on how you model visibility, I guess. You'll probably want
> to stop traffic flowing through one or the other of the cards; would
> link down or similar be enough for the virtio device?

I'm not sure if it is a good idea. The guest user will see two devices
with same MAC but one of them is down. Do you expect user to use it or
not? And since the guest is going to be migrated, we need to unplug a
broken VF from guest before migrating, why do we bother plugging in
this useless VF at the first place?


Thanks,
-Siwei
>
>>
>> In the model of (c), the hotplug semantics of the combined device
>> would mean differently - it would end up with devices plugged in or
>> out altogther. To make this work, we either have to build a brand new
>> bond-like QEMU device consist of virtio and VFIO internally, or need
>> to have some abstraction in place for libvirt to manipulate the
>> combined device (and prohibit libvirt from operating on individual
>> internal device directly). Note with this model the group ID doesn't
>> even need to get exposed to libvirt, just imagine libvirt to supply
>> all options required to configure two regular virtio-net and VFIO
>> devices for a single device object, and QEMU will deal with the
>> device's visibility and enumeration, such when to hot plug VFIO device
>> in to or out from the guest.
>>
>> It might be complicated to implement (c) though.
>
> I think (c) would be even more complicated for heterogenous setups.
Michael S. Tsirkin June 19, 2018, 8:32 p.m. UTC | #40
On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> Sorry about dragging mainframes into this, but this will only work for
> homogenous device coupling, not for heterogenous. Consider my vfio-pci
> + virtio-net-ccw example again: The guest cannot find out that the two
> belong together by checking some group ID, it has to either use the MAC
> or some needs-to-be-architectured property.
> 
> Alternatively, we could propose that mechanism as pci-only, which means
> we can rely on mechanisms that won't necessarily work on non-pci
> transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> through a network card anytime in the near future, due to the nature of
> network cards currently in use on s390.)

That's what it boils down to, yes.  If there's need to have this for
non-pci devices, then we should put it in config space.
Cornelia, what do you think?
Cornelia Huck June 20, 2018, 9:53 a.m. UTC | #41
On Tue, 19 Jun 2018 23:32:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> > 
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> That's what it boils down to, yes.  If there's need to have this for
> non-pci devices, then we should put it in config space.
> Cornelia, what do you think?
> 

I think the only really useful config on s390 is the vfio-pci network
card coupled with a virtio-net-ccw device: Using an s390 network card
via vfio-ccw is out due to the nature of the s390 network cards, and
virtio-ccw is the default transport (virtio-pci is not supported on any
enterprise distro AFAIK).

For this, having a uuid in the config space could work (vfio-pci
devices have a config space by virtue of being pci devices, and
virtio-net-ccw devices have a config space by virtue of being virtio
devices -- ccw devices usually don't have that concept).
Michael S. Tsirkin June 20, 2018, 2:11 p.m. UTC | #42
On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> On Tue, 19 Jun 2018 23:32:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > > Sorry about dragging mainframes into this, but this will only work for
> > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > belong together by checking some group ID, it has to either use the MAC
> > > or some needs-to-be-architectured property.
> > > 
> > > Alternatively, we could propose that mechanism as pci-only, which means
> > > we can rely on mechanisms that won't necessarily work on non-pci
> > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > through a network card anytime in the near future, due to the nature of
> > > network cards currently in use on s390.)  
> > 
> > That's what it boils down to, yes.  If there's need to have this for
> > non-pci devices, then we should put it in config space.
> > Cornelia, what do you think?
> > 
> 
> I think the only really useful config on s390 is the vfio-pci network
> card coupled with a virtio-net-ccw device: Using an s390 network card
> via vfio-ccw is out due to the nature of the s390 network cards, and
> virtio-ccw is the default transport (virtio-pci is not supported on any
> enterprise distro AFAIK).
> 
> For this, having a uuid in the config space could work (vfio-pci
> devices have a config space by virtue of being pci devices, and
> virtio-net-ccw devices have a config space by virtue of being virtio
> devices -- ccw devices usually don't have that concept).

OK so this calls for adding such a field generally (it's
device agnostic right now).

How would you suggest doing that?
Cornelia Huck June 20, 2018, 2:34 p.m. UTC | #43
On Tue, 19 Jun 2018 13:09:14 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Fri, 15 Jun 2018 10:06:07 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:  
> >> > On Thu, 14 Jun 2018 18:57:11 -0700
> >> > Siwei Liu <loseweigh@gmail.com> wrote:

> >> > I'm a bit confused here. What, exactly, ties the two devices together?  
> >>
> >> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> >> address (which it doesn't have to), the association between VFIO
> >> passthrough and standby must be specificed for QEMU to understand the
> >> relationship with this model. Note, standby feature is no longer
> >> required to be exposed under this model.  
> >
> > Isn't that a bit limiting, though?
> >
> > With this model, you can probably tie a vfio-pci device and a
> > virtio-net-pci device together. But this will fail if you have
> > different transports: Consider tying together a vfio-pci device and a
> > virtio-net-ccw device on s390, for example. The standby feature bit is
> > on the virtio-net level and should not have any dependency on the
> > transport used.  
> 
> Probably we'd limit the support for grouping to virtio-net-pci device
> and vfio-pci device only. For virtio-net-pci, as you might see with
> Venu's patch, we store the group UUID on the config space of
> virtio-pci, which is only applicable to PCI transport.
> 
> If virtio-net-ccw needs to support the same, I think similar grouping
> interface should be defined on the VirtIO CCW transport. I think the
> current implementation of the Linux failover driver assumes that it's
> SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
> with, and that the PV path is on same PF without needing to update
> network of the port-MAC association change. If we need to extend the
> grouping mechanism to virtio-net-ccw, it has to pass such failover
> mode to virtio driver specifically through some other option I guess.

Hm, I've just spent some time reading the Linux failover code and I did
not really find much pci-related magic in there (other than checking
for a pci device in net_failover_slave_pre_register). We also seem to
look for a matching device by MAC only. What magic am I missing?

Is the look-for-uuid handling supposed to happen in the host only?

> >> > If libvirt already has the knowledge that it should manage the two as a
> >> > couple, why do we need the group id (or something else for other
> >> > architectures)? (Maybe I'm simply missing something because I'm not
> >> > that familiar with pci.)  
> >>
> >> The idea is to have QEMU control the visibility and enumeration order
> >> of the passthrough VFIO for the failover scenario. Hotplug can be one
> >> way to achieve it, and perhaps there's other way around also. The
> >> group ID is not just for QEMU to couple devices, it's also helpful to
> >> guest too as grouping using MAC address is just not safe.  
> >
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> >
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> Yes, let's do this just for PCI transport (homogenous) for now.

But why? Using pci for passthrough to make things easier (and because
there's not really a use case), sure. But I really don't want to
restrict this to virtio-pci only.

> >> In the model of (b), I think it essentially turns hotplug to one of
> >> mechanisms for QEMU to control the visibility. The libvirt can still
> >> manage the hotplug of individual devices during live migration or in
> >> normal situation to hot add/remove devices. Though the visibility of
> >> the VFIO is under the controll of QEMU, and it's possible that the hot
> >> add/remove request does not involve actual hot plug activity in guest
> >> at all.  
> >
> > That depends on how you model visibility, I guess. You'll probably want
> > to stop traffic flowing through one or the other of the cards; would
> > link down or similar be enough for the virtio device?  
> 
> I'm not sure if it is a good idea. The guest user will see two devices
> with same MAC but one of them is down. Do you expect user to use it or
> not? And since the guest is going to be migrated, we need to unplug a
> broken VF from guest before migrating, why do we bother plugging in
> this useless VF at the first place?

I was thinking about using hotunplugging only over migration and doing
the link up only after feature negotiation has finished, but that is
probably too complicated. Let's stick to hotplug for simplicity's sake.
Cornelia Huck June 20, 2018, 4:06 p.m. UTC | #44
On Wed, 20 Jun 2018 17:11:59 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> > On Tue, 19 Jun 2018 23:32:06 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:  
> > > > Sorry about dragging mainframes into this, but this will only work for
> > > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > > belong together by checking some group ID, it has to either use the MAC
> > > > or some needs-to-be-architectured property.
> > > > 
> > > > Alternatively, we could propose that mechanism as pci-only, which means
> > > > we can rely on mechanisms that won't necessarily work on non-pci
> > > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > > through a network card anytime in the near future, due to the nature of
> > > > network cards currently in use on s390.)    
> > > 
> > > That's what it boils down to, yes.  If there's need to have this for
> > > non-pci devices, then we should put it in config space.
> > > Cornelia, what do you think?
> > >   
> > 
> > I think the only really useful config on s390 is the vfio-pci network
> > card coupled with a virtio-net-ccw device: Using an s390 network card
> > via vfio-ccw is out due to the nature of the s390 network cards, and
> > virtio-ccw is the default transport (virtio-pci is not supported on any
> > enterprise distro AFAIK).
> > 
> > For this, having a uuid in the config space could work (vfio-pci
> > devices have a config space by virtue of being pci devices, and
> > virtio-net-ccw devices have a config space by virtue of being virtio
> > devices -- ccw devices usually don't have that concept).  
> 
> OK so this calls for adding such a field generally (it's
> device agnostic right now).
> 
> How would you suggest doing that?

I hope that I'm not thoroughly confused at this point in time, so I'll
summarize my current understanding (also keep in mind that I haven't
looked at Venu's patches yet):

- The Linux guest initiates coupling from the virtio-net driver.
  Matching the other device is done via the MAC, and only pci devices
  are allowed for the failover device. (There does not seem to be any
  restriction on the transport of the virtio-net device.)
- The Linux guest virtio-net driver does not allow changing the MAC if
  standby has been negotiated (implying that the hypervisor needs to
  configure the correct MAC).
- In QEMU, we need to know which two devices (vfio-pci and virtio-net)
  go together, so that the virtio-net device gets the correct MAC. We
  also need the pairing so that we can make the vfio-pci device
  available once the guest has negotiated the standby feature.

We can tack the two devices together in QEMU by introducing new,
optional properties pointing from the virtio-net device to the vfio-pci
device (only offer standby if this is set) and the other way around
(don't make the device visible at the start if this is set). Problems:

- The admin needs to figure out the MAC by themselves and set it
  correctly. If this is incorrect, the vfio-pci device cannot be found
  in the guest. (Not sure how much of a problem this is in practice --
  and QEMU cannot figure out the MAC without poking at the vfio-pci
  device, and we probably want to avoid that.)
- This two-way pointing makes for interesting handing of the command
  line and when both devices are plugged later.

In any case, I'm not sure anymore why we'd want the extra uuid. Is
there any way QEMU (or libvirt) can figure it out without actually
looking at the vfio-pci device?
Michael S. Tsirkin June 20, 2018, 7:48 p.m. UTC | #45
On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> In any case, I'm not sure anymore why we'd want the extra uuid.

It's mostly so we can have e.g. multiple devices with same MAC
(which some people seem to want in order to then use
then with different containers).

But it is also handy for when you assign a PF, since then you
can't set the MAC.
Siwei Liu June 20, 2018, 7:59 p.m. UTC | #46
On Wed, Jun 20, 2018 at 7:34 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Tue, 19 Jun 2018 13:09:14 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Fri, 15 Jun 2018 10:06:07 -0700
>> > Siwei Liu <loseweigh@gmail.com> wrote:
>> >
>> >> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> >> > On Thu, 14 Jun 2018 18:57:11 -0700
>> >> > Siwei Liu <loseweigh@gmail.com> wrote:
>
>> >> > I'm a bit confused here. What, exactly, ties the two devices together?
>> >>
>> >> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
>> >> address (which it doesn't have to), the association between VFIO
>> >> passthrough and standby must be specificed for QEMU to understand the
>> >> relationship with this model. Note, standby feature is no longer
>> >> required to be exposed under this model.
>> >
>> > Isn't that a bit limiting, though?
>> >
>> > With this model, you can probably tie a vfio-pci device and a
>> > virtio-net-pci device together. But this will fail if you have
>> > different transports: Consider tying together a vfio-pci device and a
>> > virtio-net-ccw device on s390, for example. The standby feature bit is
>> > on the virtio-net level and should not have any dependency on the
>> > transport used.
>>
>> Probably we'd limit the support for grouping to virtio-net-pci device
>> and vfio-pci device only. For virtio-net-pci, as you might see with
>> Venu's patch, we store the group UUID on the config space of
>> virtio-pci, which is only applicable to PCI transport.
>>
>> If virtio-net-ccw needs to support the same, I think similar grouping
>> interface should be defined on the VirtIO CCW transport. I think the
>> current implementation of the Linux failover driver assumes that it's
>> SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
>> with, and that the PV path is on same PF without needing to update
>> network of the port-MAC association change. If we need to extend the
>> grouping mechanism to virtio-net-ccw, it has to pass such failover
>> mode to virtio driver specifically through some other option I guess.
>
> Hm, I've just spent some time reading the Linux failover code and I did
> not really find much pci-related magic in there (other than checking
> for a pci device in net_failover_slave_pre_register). We also seem to
> look for a matching device by MAC only. What magic am I missing?

The existing assumptions around SR-IOV VF and thus PCI is implicit. A
lot of simplications are built on the fact that the passthrough device
is a SR-IOV Virtual Function specifically than others: MAC addresses
for couple devices must be the same, changing MAC address is
prohibited, programming VLAN filter is challenged, the datapath of
virtio-net has to share the same physical function where VF belongs
to. There's no hankshake during datapath switching at all to support a
normal passthrough device at this point. I'd imagine some work around
that ahead, which might be a bit involved than just to support a
simplified model for VF migration.

>
> Is the look-for-uuid handling supposed to happen in the host only?

The look-for-MAC matching scheme is not ideal in many aspects. I don't
want to repeat those again, but once the group UUID is added to QEMU,
the failover driver is supposed to switch to the UUID based matching
scheme in the guest.

>
>> >> > If libvirt already has the knowledge that it should manage the two as a
>> >> > couple, why do we need the group id (or something else for other
>> >> > architectures)? (Maybe I'm simply missing something because I'm not
>> >> > that familiar with pci.)
>> >>
>> >> The idea is to have QEMU control the visibility and enumeration order
>> >> of the passthrough VFIO for the failover scenario. Hotplug can be one
>> >> way to achieve it, and perhaps there's other way around also. The
>> >> group ID is not just for QEMU to couple devices, it's also helpful to
>> >> guest too as grouping using MAC address is just not safe.
>> >
>> > Sorry about dragging mainframes into this, but this will only work for
>> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
>> > + virtio-net-ccw example again: The guest cannot find out that the two
>> > belong together by checking some group ID, it has to either use the MAC
>> > or some needs-to-be-architectured property.
>> >
>> > Alternatively, we could propose that mechanism as pci-only, which means
>> > we can rely on mechanisms that won't necessarily work on non-pci
>> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
>> > through a network card anytime in the near future, due to the nature of
>> > network cards currently in use on s390.)
>>
>> Yes, let's do this just for PCI transport (homogenous) for now.
>
> But why? Using pci for passthrough to make things easier (and because
> there's not really a use case), sure. But I really don't want to
> restrict this to virtio-pci only.

Of course, technically it doesn't have to be virtio-pci only. The
group UUID can even extend it further to non-pci transport. However,
with the current focus of the driver support on SR-IOV VF and limited
use case on non-pci, I'd feel no immediate effort will be needed on
that front.

>
>> >> In the model of (b), I think it essentially turns hotplug to one of
>> >> mechanisms for QEMU to control the visibility. The libvirt can still
>> >> manage the hotplug of individual devices during live migration or in
>> >> normal situation to hot add/remove devices. Though the visibility of
>> >> the VFIO is under the controll of QEMU, and it's possible that the hot
>> >> add/remove request does not involve actual hot plug activity in guest
>> >> at all.
>> >
>> > That depends on how you model visibility, I guess. You'll probably want
>> > to stop traffic flowing through one or the other of the cards; would
>> > link down or similar be enough for the virtio device?
>>
>> I'm not sure if it is a good idea. The guest user will see two devices
>> with same MAC but one of them is down. Do you expect user to use it or
>> not? And since the guest is going to be migrated, we need to unplug a
>> broken VF from guest before migrating, why do we bother plugging in
>> this useless VF at the first place?
>
> I was thinking about using hotunplugging only over migration and doing
> the link up only after feature negotiation has finished, but that is
> probably too complicated. Let's stick to hotplug for simplicity's sake.

OK. Thanks for the discussion, it's really useful.

Regards,
-Siwei
Cornelia Huck June 21, 2018, 2:59 p.m. UTC | #47
On Wed, 20 Jun 2018 22:48:58 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> > In any case, I'm not sure anymore why we'd want the extra uuid.  
> 
> It's mostly so we can have e.g. multiple devices with same MAC
> (which some people seem to want in order to then use
> then with different containers).
> 
> But it is also handy for when you assign a PF, since then you
> can't set the MAC.
> 

OK, so what about the following:

- introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
  that we have a new uuid field in the virtio-net config space
- in QEMU, add a property for virtio-net that allows to specify a uuid,
  offer VIRTIO_NET_F_STANDBY_UUID if set
- when configuring, set the property to the group UUID of the vfio-pci
  device
- in the guest, use the uuid from the virtio-net device's config space
  if applicable; else, fall back to matching by MAC as done today

That should work for all virtio transports.
Michael S. Tsirkin June 21, 2018, 6:14 p.m. UTC | #48
On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
> > On 6/12/2018 7:38 PM, Jason Wang wrote:
> > > 
> > > 
> > > On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
> > > > > 
> > > > > On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
> > > > > > I don't think this is sufficient.
> > > > > > 
> > > > > > If both primary and standby devices are present, a
> > > > > > legacy guest without
> > > > > > support for the feature might see two devices with same mac and get
> > > > > > confused.
> > > > > > 
> > > > > > I think that we should only make primary visible after
> > > > > > guest acked the
> > > > > > backup feature bit.
> > > > > I think we want exactly the reverse? E.g fail the
> > > > > negotiation when guest
> > > > > does not ack backup feature.
> > > > > 
> > > > > Otherwise legacy guest won't even have the chance to see
> > > > > primary device in
> > > > > the guest.
> > > > That's by design.
> > > 
> > > So management needs to know the capability of guest to set the
> > > backup feature. This looks a chicken or egg problem to me.
> > 
> > I don't think so. If the tenant requests 'accelerated datapath feature',
> > the management
> > will set 'standby' feature bit on virtio-net interface and if the guest
> > virtio-net driver
> > supports this feature, then the tenant VM will get that capability via a
> > hot-plugged
> > primary device.
> 
> Ok, I thought exactly the reverse because of the commit title is "enable
> virtio_net to act as a standby for a passthru device". But re-read the
> commit log content, I understand the case a little bit. Btw, VF is not
> necessarily faster than virtio-net, especially consider virtio-net may have
> a lot of queues.

Don't do that then, right?

> > 
> > 
> > > 
> > > > 
> > > > > > And on reset or when backup is cleared in some other way, unplug the
> > > > > > primary.
> > > > > What if guest does not support hotplug?
> > > > It shouldn't ack the backup feature then and it's a good point.
> > > > We should both document this and check kernel config has
> > > > hotplug support. Sridhar could you take a look pls?
> > > > 
> > > > > > Something like the below will do the job:
> > > > > > 
> > > > > > Primary device is added with a special "primary-failover" flag.
> > > > > > A virtual machine is then initialized with just a standby virtio
> > > > > > device. Primary is not yet added.
> > > > > A question is how to do the matching? Qemu knows nothing about e.g mac
> > > > > address of a pass-through device I believe?
> > > > Supply a flag to the VFIO when it's added, this way QEMU will know.
> > > > 
> > > > > > Later QEMU detects that guest driver device set DRIVER_OK.
> > > > > > It then exposes the primary device to the guest, and triggers
> > > > > > a device addition event (hot-plug event) for it.
> > > > > Do you mean we won't have primary device in the initial qemu cli?
> > > > No, that's not what I mean.
> > > > 
> > > > I mean management will supply a flag to VFIO and then
> > > > 
> > > > 
> > > > - VFIO defers exposing
> > > > primary to guest until guest acks the feature bit.
> > > > - When we see guest ack, initiate hotplug.
> > > > - On reboot, hide it again.
> > > > - On reset without reboot, request hot-unplug and on eject hide
> > > > it again.
> > > 
> > > This sounds much like a kind of bonding in qemu.
> > > 
> > > > 
> > > > > > If QEMU detects guest driver removal, it initiates a
> > > > > > hot-unplug sequence
> > > > > > to remove the primary driver.  In particular, if QEMU detects guest
> > > > > > re-initialization (e.g. by detecting guest reset) it
> > > > > > immediately removes
> > > > > > the primary device.
> > > > > I believe guest failover module should handle this gracefully?
> > > > It can't control enumeration order, if primary is enumerated before
> > > > standby then guest will load its driver and it's too late
> > > > when failover driver is loaded.
> > > 
> > > Well, even if we can delay the visibility of primary until
> > > DRIVER_OK, there still be a race I think? And it looks to me it's
> > > still a bug of guest:
> > > 
> > > E.g primary could be probed before failover_register() in guest.
> > > Then we will miss the enslaving of primary forever.
> > 
> > That is not an issue. Even if the primary is probed before failover
> > driver, it will
> > enslave the primary via the call to failover_existing_slave_register()
> > as part of
> > failover_register() routine.
> 
> Aha I get it. So the enumeration order is not an issue.
> 
> Consider primary may still be seen by guest kernel even if we delay its
> visibility, I wonder whether we can control the lifecycle of primary through
> driver but not qemu. This can simplify a lot of things.
> 
> Thanks
> 
> > 
> > > 
> > > Thanks
> > > 
> > > > 
> > > > > Thanks
> > > > > 
> > > > > > We can move some of this code to management as well,
> > > > > > architecturally it
> > > > > > does not make too much sense but it might be easier
> > > > > > implementation-wise.
> > > > > > 
> > > > > > HTH
> > > > > > 
> > > > > > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
> > > > > > > Ping on this patch now that the kernel patches are
> > > > > > > accepted into davem's net-next tree.
> > > > > > > https://patchwork.ozlabs.org/cover/920005/
> > > > > > > 
> > > > > > > 
> > > > > > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
> > > > > > > > This feature bit can be used by hypervisor to
> > > > > > > > indicate virtio_net device to
> > > > > > > > act as a standby for another device with the same MAC address.
> > > > > > > > 
> > > > > > > > I tested this with a small change to the patch
> > > > > > > > to mark the STANDBY feature 'true'
> > > > > > > > by default as i am using libvirt to start the VMs.
> > > > > > > > Is there a way to pass the newly added feature
> > > > > > > > bit 'standby' to qemu via libvirt
> > > > > > > > XML file?
> > > > > > > > 
> > > > > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > > > > > > > ---
> > > > > > > >     hw/net/virtio-net.c                         | 2 ++
> > > > > > > >     include/standard-headers/linux/virtio_net.h | 3 +++
> > > > > > > >     2 files changed, 5 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > > > > > > > index 90502fca7c..38b3140670 100644
> > > > > > > > --- a/hw/net/virtio-net.c
> > > > > > > > +++ b/hw/net/virtio-net.c
> > > > > > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
> > > > > > > >                          true),
> > > > > > > >         DEFINE_PROP_INT32("speed", VirtIONet,
> > > > > > > > net_conf.speed, SPEED_UNKNOWN),
> > > > > > > >         DEFINE_PROP_STRING("duplex", VirtIONet,
> > > > > > > > net_conf.duplex_str),
> > > > > > > > +    DEFINE_PROP_BIT64("standby", VirtIONet,
> > > > > > > > host_features, VIRTIO_NET_F_STANDBY,
> > > > > > > > +                      false),
> > > > > > > >         DEFINE_PROP_END_OF_LIST(),
> > > > > > > >     };
> > > > > > > > diff --git
> > > > > > > > a/include/standard-headers/linux/virtio_net.h
> > > > > > > > b/include/standard-headers/linux/virtio_net.h
> > > > > > > > index e9f255ea3f..01ec09684c 100644
> > > > > > > > --- a/include/standard-headers/linux/virtio_net.h
> > > > > > > > +++ b/include/standard-headers/linux/virtio_net.h
> > > > > > > > @@ -57,6 +57,9 @@
> > > > > > > >                          * Steering */
> > > > > > > >     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
> > > > > > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act
> > > > > > > > as standby for another device
> > > > > > > > +                                         * with the same MAC.
> > > > > > > > +                                         */
> > > > > > > >     #define VIRTIO_NET_F_SPEED_DUPLEX 63    /*
> > > > > > > > Device set linkspeed and duplex */
> > > > > > > >     #ifndef VIRTIO_NET_NO_LEGACY
> > > 
> >
Michael S. Tsirkin June 21, 2018, 6:20 p.m. UTC | #49
On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> On Wed, 20 Jun 2018 22:48:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> > > In any case, I'm not sure anymore why we'd want the extra uuid.  
> > 
> > It's mostly so we can have e.g. multiple devices with same MAC
> > (which some people seem to want in order to then use
> > then with different containers).
> > 
> > But it is also handy for when you assign a PF, since then you
> > can't set the MAC.
> > 
> 
> OK, so what about the following:
> 
> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>   that we have a new uuid field in the virtio-net config space
> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>   offer VIRTIO_NET_F_STANDBY_UUID if set
> - when configuring, set the property to the group UUID of the vfio-pci
>   device
> - in the guest, use the uuid from the virtio-net device's config space
>   if applicable; else, fall back to matching by MAC as done today
> 
> That should work for all virtio transports.

True. I'm a bit unhappy that it's virtio net specific though
since down the road I expect we'll have a very similar feature
for scsi (and maybe others).

But we do not have a way to have fields that are portable
both across devices and transports, and I think it would
be a useful addition. How would this work though? Any idea?
Siwei Liu June 22, 2018, 1:07 a.m. UTC | #50
On Thu, Jun 21, 2018 at 11:14 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
>>
>>
>> On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
>> > On 6/12/2018 7:38 PM, Jason Wang wrote:
>> > >
>> > >
>> > > On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>> > > > >
>> > > > > On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>> > > > > > I don't think this is sufficient.
>> > > > > >
>> > > > > > If both primary and standby devices are present, a
>> > > > > > legacy guest without
>> > > > > > support for the feature might see two devices with same mac and get
>> > > > > > confused.
>> > > > > >
>> > > > > > I think that we should only make primary visible after
>> > > > > > guest acked the
>> > > > > > backup feature bit.
>> > > > > I think we want exactly the reverse? E.g fail the
>> > > > > negotiation when guest
>> > > > > does not ack backup feature.
>> > > > >
>> > > > > Otherwise legacy guest won't even have the chance to see
>> > > > > primary device in
>> > > > > the guest.
>> > > > That's by design.
>> > >
>> > > So management needs to know the capability of guest to set the
>> > > backup feature. This looks a chicken or egg problem to me.
>> >
>> > I don't think so. If the tenant requests 'accelerated datapath feature',
>> > the management
>> > will set 'standby' feature bit on virtio-net interface and if the guest
>> > virtio-net driver
>> > supports this feature, then the tenant VM will get that capability via a
>> > hot-plugged
>> > primary device.
>>
>> Ok, I thought exactly the reverse because of the commit title is "enable
>> virtio_net to act as a standby for a passthru device". But re-read the
>> commit log content, I understand the case a little bit. Btw, VF is not
>> necessarily faster than virtio-net, especially consider virtio-net may have
>> a lot of queues.
>
> Don't do that then, right?

I don't understand. Where did the standby feature come to imply the
"accelerated datapath" thing? Isn't failover/standby a generic high
availblity term, rather than marry it to the concept of device model
specifics? Do we expect scsi to work exactly the same way with
"accelerated datapath"?

-Siwei


>
>> >
>> >
>> > >
>> > > >
>> > > > > > And on reset or when backup is cleared in some other way, unplug the
>> > > > > > primary.
>> > > > > What if guest does not support hotplug?
>> > > > It shouldn't ack the backup feature then and it's a good point.
>> > > > We should both document this and check kernel config has
>> > > > hotplug support. Sridhar could you take a look pls?
>> > > >
>> > > > > > Something like the below will do the job:
>> > > > > >
>> > > > > > Primary device is added with a special "primary-failover" flag.
>> > > > > > A virtual machine is then initialized with just a standby virtio
>> > > > > > device. Primary is not yet added.
>> > > > > A question is how to do the matching? Qemu knows nothing about e.g mac
>> > > > > address of a pass-through device I believe?
>> > > > Supply a flag to the VFIO when it's added, this way QEMU will know.
>> > > >
>> > > > > > Later QEMU detects that guest driver device set DRIVER_OK.
>> > > > > > It then exposes the primary device to the guest, and triggers
>> > > > > > a device addition event (hot-plug event) for it.
>> > > > > Do you mean we won't have primary device in the initial qemu cli?
>> > > > No, that's not what I mean.
>> > > >
>> > > > I mean management will supply a flag to VFIO and then
>> > > >
>> > > >
>> > > > - VFIO defers exposing
>> > > > primary to guest until guest acks the feature bit.
>> > > > - When we see guest ack, initiate hotplug.
>> > > > - On reboot, hide it again.
>> > > > - On reset without reboot, request hot-unplug and on eject hide
>> > > > it again.
>> > >
>> > > This sounds much like a kind of bonding in qemu.
>> > >
>> > > >
>> > > > > > If QEMU detects guest driver removal, it initiates a
>> > > > > > hot-unplug sequence
>> > > > > > to remove the primary driver.  In particular, if QEMU detects guest
>> > > > > > re-initialization (e.g. by detecting guest reset) it
>> > > > > > immediately removes
>> > > > > > the primary device.
>> > > > > I believe guest failover module should handle this gracefully?
>> > > > It can't control enumeration order, if primary is enumerated before
>> > > > standby then guest will load its driver and it's too late
>> > > > when failover driver is loaded.
>> > >
>> > > Well, even if we can delay the visibility of primary until
>> > > DRIVER_OK, there still be a race I think? And it looks to me it's
>> > > still a bug of guest:
>> > >
>> > > E.g primary could be probed before failover_register() in guest.
>> > > Then we will miss the enslaving of primary forever.
>> >
>> > That is not an issue. Even if the primary is probed before failover
>> > driver, it will
>> > enslave the primary via the call to failover_existing_slave_register()
>> > as part of
>> > failover_register() routine.
>>
>> Aha I get it. So the enumeration order is not an issue.
>>
>> Consider primary may still be seen by guest kernel even if we delay its
>> visibility, I wonder whether we can control the lifecycle of primary through
>> driver but not qemu. This can simplify a lot of things.
>>
>> Thanks
>>
>> >
>> > >
>> > > Thanks
>> > >
>> > > >
>> > > > > Thanks
>> > > > >
>> > > > > > We can move some of this code to management as well,
>> > > > > > architecturally it
>> > > > > > does not make too much sense but it might be easier
>> > > > > > implementation-wise.
>> > > > > >
>> > > > > > HTH
>> > > > > >
>> > > > > > On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>> > > > > > > Ping on this patch now that the kernel patches are
>> > > > > > > accepted into davem's net-next tree.
>> > > > > > > https://patchwork.ozlabs.org/cover/920005/
>> > > > > > >
>> > > > > > >
>> > > > > > > On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>> > > > > > > > This feature bit can be used by hypervisor to
>> > > > > > > > indicate virtio_net device to
>> > > > > > > > act as a standby for another device with the same MAC address.
>> > > > > > > >
>> > > > > > > > I tested this with a small change to the patch
>> > > > > > > > to mark the STANDBY feature 'true'
>> > > > > > > > by default as i am using libvirt to start the VMs.
>> > > > > > > > Is there a way to pass the newly added feature
>> > > > > > > > bit 'standby' to qemu via libvirt
>> > > > > > > > XML file?
>> > > > > > > >
>> > > > > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > > > > > ---
>> > > > > > > >     hw/net/virtio-net.c                         | 2 ++
>> > > > > > > >     include/standard-headers/linux/virtio_net.h | 3 +++
>> > > > > > > >     2 files changed, 5 insertions(+)
>> > > > > > > >
>> > > > > > > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> > > > > > > > index 90502fca7c..38b3140670 100644
>> > > > > > > > --- a/hw/net/virtio-net.c
>> > > > > > > > +++ b/hw/net/virtio-net.c
>> > > > > > > > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>> > > > > > > >                          true),
>> > > > > > > >         DEFINE_PROP_INT32("speed", VirtIONet,
>> > > > > > > > net_conf.speed, SPEED_UNKNOWN),
>> > > > > > > >         DEFINE_PROP_STRING("duplex", VirtIONet,
>> > > > > > > > net_conf.duplex_str),
>> > > > > > > > +    DEFINE_PROP_BIT64("standby", VirtIONet,
>> > > > > > > > host_features, VIRTIO_NET_F_STANDBY,
>> > > > > > > > +                      false),
>> > > > > > > >         DEFINE_PROP_END_OF_LIST(),
>> > > > > > > >     };
>> > > > > > > > diff --git
>> > > > > > > > a/include/standard-headers/linux/virtio_net.h
>> > > > > > > > b/include/standard-headers/linux/virtio_net.h
>> > > > > > > > index e9f255ea3f..01ec09684c 100644
>> > > > > > > > --- a/include/standard-headers/linux/virtio_net.h
>> > > > > > > > +++ b/include/standard-headers/linux/virtio_net.h
>> > > > > > > > @@ -57,6 +57,9 @@
>> > > > > > > >                          * Steering */
>> > > > > > > >     #define VIRTIO_NET_F_CTRL_MAC_ADDR 23    /* Set MAC address */
>> > > > > > > > +#define VIRTIO_NET_F_STANDBY      62    /* Act
>> > > > > > > > as standby for another device
>> > > > > > > > +                                         * with the same MAC.
>> > > > > > > > +                                         */
>> > > > > > > >     #define VIRTIO_NET_F_SPEED_DUPLEX 63    /*
>> > > > > > > > Device set linkspeed and duplex */
>> > > > > > > >     #ifndef VIRTIO_NET_NO_LEGACY
>> > >
>> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
Siwei Liu June 22, 2018, 1:21 a.m. UTC | #51
On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> On Wed, 20 Jun 2018 22:48:58 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>
>> It's mostly so we can have e.g. multiple devices with same MAC
>> (which some people seem to want in order to then use
>> then with different containers).
>>
>> But it is also handy for when you assign a PF, since then you
>> can't set the MAC.
>>
>
> OK, so what about the following:
>
> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>   that we have a new uuid field in the virtio-net config space
> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>   offer VIRTIO_NET_F_STANDBY_UUID if set
> - when configuring, set the property to the group UUID of the vfio-pci
>   device

If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
to still expose UUID in the config space on virtio-pci?

I'm not even sure if it's sane to expose group UUID on the PCI bridge
where the corresponding vfio-pci device attached to for a guest which
doesn't support the feature (legacy).

-Siwei


> - in the guest, use the uuid from the virtio-net device's config space
>   if applicable; else, fall back to matching by MAC as done today
>
> That should work for all virtio transports.
Venu Busireddy June 22, 2018, 2:25 a.m. UTC | #52
On 2018-06-21 18:21:55 -0700, Siwei Liu wrote:
> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Wed, 20 Jun 2018 22:48:58 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
> >>
> >> It's mostly so we can have e.g. multiple devices with same MAC
> >> (which some people seem to want in order to then use
> >> then with different containers).
> >>
> >> But it is also handy for when you assign a PF, since then you
> >> can't set the MAC.
> >>
> >
> > OK, so what about the following:
> >
> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >   that we have a new uuid field in the virtio-net config space
> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > - when configuring, set the property to the group UUID of the vfio-pci
> >   device
> 
> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
> to still expose UUID in the config space on virtio-pci?

Why is it not safe? When we expose VIRTIO_NET_F_STANDBY_UUID, it is up
to the driver to decide if it wants to use it or not. If it does not
want to use the feature, it would also imply that the driver will not
be interested in the UUID value in the config space. So, the UUID will
be some piece of data that simply sits around; nobody cares for it.

> I'm not even sure if it's sane to expose group UUID on the PCI bridge
> where the corresponding vfio-pci device attached to for a guest which
> doesn't support the feature (legacy).

Unfortunately, you don't know beforehand if the guest will be a legacy
guest that doesn't support the feature. As is the case with the UUID in
the virtio-pci device's config space, the UUID in the bridge's config
space will/should be ignored by the legacy guest.

Venu

> > - in the guest, use the uuid from the virtio-net device's config space
> >   if applicable; else, fall back to matching by MAC as done today
> >
> > That should work for all virtio transports.
Michael S. Tsirkin June 22, 2018, 2:30 a.m. UTC | #53
On Thu, Jun 21, 2018 at 06:07:18PM -0700, Siwei Liu wrote:
> On Thu, Jun 21, 2018 at 11:14 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
> >>
> >>
> >> On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
> >> > On 6/12/2018 7:38 PM, Jason Wang wrote:
> >> > >
> >> > >
> >> > > On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
> >> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
> >> > > > >
> >> > > > > On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
> >> > > > > > I don't think this is sufficient.
> >> > > > > >
> >> > > > > > If both primary and standby devices are present, a
> >> > > > > > legacy guest without
> >> > > > > > support for the feature might see two devices with same mac and get
> >> > > > > > confused.
> >> > > > > >
> >> > > > > > I think that we should only make primary visible after
> >> > > > > > guest acked the
> >> > > > > > backup feature bit.
> >> > > > > I think we want exactly the reverse? E.g fail the
> >> > > > > negotiation when guest
> >> > > > > does not ack backup feature.
> >> > > > >
> >> > > > > Otherwise legacy guest won't even have the chance to see
> >> > > > > primary device in
> >> > > > > the guest.
> >> > > > That's by design.
> >> > >
> >> > > So management needs to know the capability of guest to set the
> >> > > backup feature. This looks a chicken or egg problem to me.
> >> >
> >> > I don't think so. If the tenant requests 'accelerated datapath feature',
> >> > the management
> >> > will set 'standby' feature bit on virtio-net interface and if the guest
> >> > virtio-net driver
> >> > supports this feature, then the tenant VM will get that capability via a
> >> > hot-plugged
> >> > primary device.
> >>
> >> Ok, I thought exactly the reverse because of the commit title is "enable
> >> virtio_net to act as a standby for a passthru device". But re-read the
> >> commit log content, I understand the case a little bit. Btw, VF is not
> >> necessarily faster than virtio-net, especially consider virtio-net may have
> >> a lot of queues.
> >
> > Don't do that then, right?
> 
> I don't understand. Where did the standby feature come to imply the
> "accelerated datapath" thing?
> Isn't failover/standby a generic high
> availblity term, rather than marry it to the concept of device model
> specifics? Do we expect scsi to work exactly the same way with
> "accelerated datapath"?

That's not what I said.
The semantics are that the primary is always used if present in
preference to standby.
Jason said virtio net is sometimes preferable.
If that's the case don't make it a standby.

More advanced use-cases do exist and e.g. Alexander Duyck
suggested using a switch-dev. failover isn't it though.
Michael S. Tsirkin June 22, 2018, 2:32 a.m. UTC | #54
On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Wed, 20 Jun 2018 22:48:58 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
> >>
> >> It's mostly so we can have e.g. multiple devices with same MAC
> >> (which some people seem to want in order to then use
> >> then with different containers).
> >>
> >> But it is also handy for when you assign a PF, since then you
> >> can't set the MAC.
> >>
> >
> > OK, so what about the following:
> >
> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >   that we have a new uuid field in the virtio-net config space
> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > - when configuring, set the property to the group UUID of the vfio-pci
> >   device
> 
> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
> to still expose UUID in the config space on virtio-pci?


Yes but guest is not supposed to read it.

> I'm not even sure if it's sane to expose group UUID on the PCI bridge
> where the corresponding vfio-pci device attached to for a guest which
> doesn't support the feature (legacy).
> 
> -Siwei

Yes but you won't add the primary behind such a bridge.

> 
> > - in the guest, use the uuid from the virtio-net device's config space
> >   if applicable; else, fall back to matching by MAC as done today
> >
> > That should work for all virtio transports.
Cornelia Huck June 22, 2018, 3:09 p.m. UTC | #55
On Thu, 21 Jun 2018 21:20:13 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> > OK, so what about the following:
> > 
> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >   that we have a new uuid field in the virtio-net config space
> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > - when configuring, set the property to the group UUID of the vfio-pci
> >   device
> > - in the guest, use the uuid from the virtio-net device's config space
> >   if applicable; else, fall back to matching by MAC as done today
> > 
> > That should work for all virtio transports.  
> 
> True. I'm a bit unhappy that it's virtio net specific though
> since down the road I expect we'll have a very similar feature
> for scsi (and maybe others).
> 
> But we do not have a way to have fields that are portable
> both across devices and transports, and I think it would
> be a useful addition. How would this work though? Any idea?

Can we introduce some kind of device-independent config space area?
Pushing back the device-specific config space by a certain value if the
appropriate feature is negotiated and use that for things like the uuid?

But regardless of that, I'm not sure whether extending this approach to
other device types is the way to go. Tying together two different
devices is creating complicated situations at least in the hypervisor
(even if it's fairly straightforward in the guest). [I have not come
around again to look at the "how to handle visibility in QEMU"
questions due to lack of cycles, sorry about that.]

So, what's the goal of this approach? Only to allow migration with
vfio-pci, or also to plug in a faster device and use it instead of an
already attached paravirtualized device?

What about migration of vfio devices that are not easily replaced by a
paravirtualized device? I'm thinking of vfio-ccw, where our main (and
currently only) supported device is dasd (disks) -- which can do a lot
of specialized things that virtio-blk does not support (and should not
or even cannot support). Would it be more helpful to focus on generic
migration support for vfio instead of going about it device by device?

This network device approach already seems far along, so it makes sense
to continue with it. But I'm not sure whether we want to spend time and
energy on that for other device types rather than working on a general
solution for vfio migration.
Michael S. Tsirkin June 22, 2018, 7:05 p.m. UTC | #56
On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> On Thu, 21 Jun 2018 21:20:13 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> > > OK, so what about the following:
> > > 
> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > >   that we have a new uuid field in the virtio-net config space
> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > - when configuring, set the property to the group UUID of the vfio-pci
> > >   device
> > > - in the guest, use the uuid from the virtio-net device's config space
> > >   if applicable; else, fall back to matching by MAC as done today
> > > 
> > > That should work for all virtio transports.  
> > 
> > True. I'm a bit unhappy that it's virtio net specific though
> > since down the road I expect we'll have a very similar feature
> > for scsi (and maybe others).
> > 
> > But we do not have a way to have fields that are portable
> > both across devices and transports, and I think it would
> > be a useful addition. How would this work though? Any idea?
> 
> Can we introduce some kind of device-independent config space area?
> Pushing back the device-specific config space by a certain value if the
> appropriate feature is negotiated and use that for things like the uuid?

So config moves back and forth?
Reminds me of the msi vector mess we had with pci.
I'd rather have every transport add a new config.

> But regardless of that, I'm not sure whether extending this approach to
> other device types is the way to go. Tying together two different
> devices is creating complicated situations at least in the hypervisor
> (even if it's fairly straightforward in the guest). [I have not come
> around again to look at the "how to handle visibility in QEMU"
> questions due to lack of cycles, sorry about that.]
> 
> So, what's the goal of this approach? Only to allow migration with
> vfio-pci, or also to plug in a faster device and use it instead of an
> already attached paravirtualized device?

These are two sides of the same coin, I think the second approach
is closer to what we are doing here.

> What about migration of vfio devices that are not easily replaced by a
> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> currently only) supported device is dasd (disks) -- which can do a lot
> of specialized things that virtio-blk does not support (and should not
> or even cannot support).

But maybe virtio-scsi can?

> Would it be more helpful to focus on generic
> migration support for vfio instead of going about it device by device?
> 
> This network device approach already seems far along, so it makes sense
> to continue with it. But I'm not sure whether we want to spend time and
> energy on that for other device types rather than working on a general
> solution for vfio migration.

I'm inclined to say finalizing this feature would be a good first step
and will teach us how we can move forward.
Siwei Liu June 22, 2018, 7:43 p.m. UTC | #57
On Thu, Jun 21, 2018 at 7:30 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Jun 21, 2018 at 06:07:18PM -0700, Siwei Liu wrote:
>> On Thu, Jun 21, 2018 at 11:14 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Wed, Jun 13, 2018 at 01:40:59PM +0800, Jason Wang wrote:
>> >>
>> >>
>> >> On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
>> >> > On 6/12/2018 7:38 PM, Jason Wang wrote:
>> >> > >
>> >> > >
>> >> > > On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>> >> > > > On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>> >> > > > >
>> >> > > > > On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>> >> > > > > > I don't think this is sufficient.
>> >> > > > > >
>> >> > > > > > If both primary and standby devices are present, a
>> >> > > > > > legacy guest without
>> >> > > > > > support for the feature might see two devices with same mac and get
>> >> > > > > > confused.
>> >> > > > > >
>> >> > > > > > I think that we should only make primary visible after
>> >> > > > > > guest acked the
>> >> > > > > > backup feature bit.
>> >> > > > > I think we want exactly the reverse? E.g fail the
>> >> > > > > negotiation when guest
>> >> > > > > does not ack backup feature.
>> >> > > > >
>> >> > > > > Otherwise legacy guest won't even have the chance to see
>> >> > > > > primary device in
>> >> > > > > the guest.
>> >> > > > That's by design.
>> >> > >
>> >> > > So management needs to know the capability of guest to set the
>> >> > > backup feature. This looks a chicken or egg problem to me.
>> >> >
>> >> > I don't think so. If the tenant requests 'accelerated datapath feature',
>> >> > the management
>> >> > will set 'standby' feature bit on virtio-net interface and if the guest
>> >> > virtio-net driver
>> >> > supports this feature, then the tenant VM will get that capability via a
>> >> > hot-plugged
>> >> > primary device.
>> >>
>> >> Ok, I thought exactly the reverse because of the commit title is "enable
>> >> virtio_net to act as a standby for a passthru device". But re-read the
>> >> commit log content, I understand the case a little bit. Btw, VF is not
>> >> necessarily faster than virtio-net, especially consider virtio-net may have
>> >> a lot of queues.
>> >
>> > Don't do that then, right?
>>
>> I don't understand. Where did the standby feature come to imply the
>> "accelerated datapath" thing?
>> Isn't failover/standby a generic high
>> availblity term, rather than marry it to the concept of device model
>> specifics? Do we expect scsi to work exactly the same way with
>> "accelerated datapath"?
>
> That's not what I said.
> The semantics are that the primary is always used if present in
> preference to standby.
OK. If this is the only semantics of what "standby" refers to in
general, that is fine.

I just don't want to limit the failover/standby semantics to the
device model specifics, the "accelerated datapath" thing or whatever.
I really don't know where the requirements of the "accelerated
datapath" came from, as the originial problem is migrating vfio
devices which is in match of QEMU's live migration model. Hyper-V has
it's limitation to do 1-netdev should not impact how KVM/QEMU should
be doing it.

> Jason said virtio net is sometimes preferable.
> If that's the case don't make it a standby.
>
> More advanced use-cases do exist and e.g. Alexander Duyck
> suggested using a switch-dev.

The switchdev case would need a new feature bit, right?

-Siwei

> failover isn't it though.
>
> --
> MST
Siwei Liu June 22, 2018, 8 p.m. UTC | #58
On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >
>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >>
>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >> (which some people seem to want in order to then use
>> >> then with different containers).
>> >>
>> >> But it is also handy for when you assign a PF, since then you
>> >> can't set the MAC.
>> >>
>> >
>> > OK, so what about the following:
>> >
>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >   that we have a new uuid field in the virtio-net config space
>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >   device
>>
>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> to still expose UUID in the config space on virtio-pci?
>
>
> Yes but guest is not supposed to read it.
>
>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> where the corresponding vfio-pci device attached to for a guest which
>> doesn't support the feature (legacy).
>>
>> -Siwei
>
> Yes but you won't add the primary behind such a bridge.

I assume the UUID feature is a new one besides the exiting
VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
if UUID feature is present and supported by guest, we'll do pairing
based on UUID; if UUID feature present or not supported by guest,
we'll still plug in the VF (if guest supports the _F_STANDBY feature)
but the pairing will be fallback to using MAC address. Are you saying
you don't even want to plug in the primary when the UUID feature is
not supported by guest? Does the feature negotiation UUID have to
depend on STANDBY being set, or the other way around? I thought that
just the absence of STANDBY will prevent primary to get exposed to the
guest.

-Siwei

>
>>
>> > - in the guest, use the uuid from the virtio-net device's config space
>> >   if applicable; else, fall back to matching by MAC as done today
>> >
>> > That should work for all virtio transports.
Siwei Liu June 22, 2018, 8:03 p.m. UTC | #59
On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>> > On Wed, 20 Jun 2018 22:48:58 +0300
>>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>> >
>>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>> >>
>>> >> It's mostly so we can have e.g. multiple devices with same MAC
>>> >> (which some people seem to want in order to then use
>>> >> then with different containers).
>>> >>
>>> >> But it is also handy for when you assign a PF, since then you
>>> >> can't set the MAC.
>>> >>
>>> >
>>> > OK, so what about the following:
>>> >
>>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>> >   that we have a new uuid field in the virtio-net config space
>>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>>> > - when configuring, set the property to the group UUID of the vfio-pci
>>> >   device
>>>
>>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>> to still expose UUID in the config space on virtio-pci?
>>
>>
>> Yes but guest is not supposed to read it.
>>
>>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>> where the corresponding vfio-pci device attached to for a guest which
>>> doesn't support the feature (legacy).
>>>
>>> -Siwei
>>
>> Yes but you won't add the primary behind such a bridge.
>
> I assume the UUID feature is a new one besides the exiting
> VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
> if UUID feature is present and supported by guest, we'll do pairing
> based on UUID; if UUID feature present
                                 ^^^^^^^  is NOT present

> or not supported by guest,
> we'll still plug in the VF (if guest supports the _F_STANDBY feature)
> but the pairing will be fallback to using MAC address. Are you saying
> you don't even want to plug in the primary when the UUID feature is
> not supported by guest? Does the feature negotiation UUID have to
> depend on STANDBY being set, or the other way around? I thought that
> just the absence of STANDBY will prevent primary to get exposed to the
> guest.
>
> -Siwei
>
>>
>>>
>>> > - in the guest, use the uuid from the virtio-net device's config space
>>> >   if applicable; else, fall back to matching by MAC as done today
>>> >
>>> > That should work for all virtio transports.
Siwei Liu June 22, 2018, 8:21 p.m. UTC | #60
On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> On Thu, 21 Jun 2018 21:20:13 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> > > OK, so what about the following:
>> > >
>> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> > >   that we have a new uuid field in the virtio-net config space
>> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> > > - when configuring, set the property to the group UUID of the vfio-pci
>> > >   device
>> > > - in the guest, use the uuid from the virtio-net device's config space
>> > >   if applicable; else, fall back to matching by MAC as done today
>> > >
>> > > That should work for all virtio transports.
>> >
>> > True. I'm a bit unhappy that it's virtio net specific though
>> > since down the road I expect we'll have a very similar feature
>> > for scsi (and maybe others).
>> >
>> > But we do not have a way to have fields that are portable
>> > both across devices and transports, and I think it would
>> > be a useful addition. How would this work though? Any idea?
>>
>> Can we introduce some kind of device-independent config space area?
>> Pushing back the device-specific config space by a certain value if the
>> appropriate feature is negotiated and use that for things like the uuid?
>
> So config moves back and forth?
> Reminds me of the msi vector mess we had with pci.
> I'd rather have every transport add a new config.
>
>> But regardless of that, I'm not sure whether extending this approach to
>> other device types is the way to go. Tying together two different
>> devices is creating complicated situations at least in the hypervisor
>> (even if it's fairly straightforward in the guest). [I have not come
>> around again to look at the "how to handle visibility in QEMU"
>> questions due to lack of cycles, sorry about that.]
>>
>> So, what's the goal of this approach? Only to allow migration with
>> vfio-pci, or also to plug in a faster device and use it instead of an
>> already attached paravirtualized device?
>
> These are two sides of the same coin, I think the second approach
> is closer to what we are doing here.

I'm leaning towards being conservative to keep the potential of doing
both. It's the vfio migration itself that has to be addessed, not to
make every virtio device to have an accelerated datapath, right?

-Siwei


>
>> What about migration of vfio devices that are not easily replaced by a
>> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> currently only) supported device is dasd (disks) -- which can do a lot
>> of specialized things that virtio-blk does not support (and should not
>> or even cannot support).
>
> But maybe virtio-scsi can?
>
>> Would it be more helpful to focus on generic
>> migration support for vfio instead of going about it device by device?
>>
>> This network device approach already seems far along, so it makes sense
>> to continue with it. But I'm not sure whether we want to spend time and
>> energy on that for other device types rather than working on a general
>> solution for vfio migration.
>
> I'm inclined to say finalizing this feature would be a good first step
> and will teach us how we can move forward.
>
> --
> MST
Michael S. Tsirkin June 22, 2018, 9:29 p.m. UTC | #61
On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>> >
> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
> >>> >>
> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
> >>> >> (which some people seem to want in order to then use
> >>> >> then with different containers).
> >>> >>
> >>> >> But it is also handy for when you assign a PF, since then you
> >>> >> can't set the MAC.
> >>> >>
> >>> >
> >>> > OK, so what about the following:
> >>> >
> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >>> >   that we have a new uuid field in the virtio-net config space
> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >>> > - when configuring, set the property to the group UUID of the vfio-pci
> >>> >   device
> >>>
> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
> >>> to still expose UUID in the config space on virtio-pci?
> >>
> >>
> >> Yes but guest is not supposed to read it.
> >>
> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
> >>> where the corresponding vfio-pci device attached to for a guest which
> >>> doesn't support the feature (legacy).
> >>>
> >>> -Siwei
> >>
> >> Yes but you won't add the primary behind such a bridge.
> >
> > I assume the UUID feature is a new one besides the exiting
> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
> > if UUID feature is present and supported by guest, we'll do pairing
> > based on UUID; if UUID feature present
>                                  ^^^^^^^  is NOT present

Let's go back for a bit.

What is wrong with matching by mac?

1. Does not allow multiple NICs with same mac
2. Requires MAC to be programmed by host in the PT device
   (which is often possible with VFs but isn't possible with all PT
   devices)

Both issues are up to host: if host needs them it needs the UUID
feature, if not MAC matching is sufficient.


> > or not supported by guest,
> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
> > but the pairing will be fallback to using MAC address. Are you saying
> > you don't even want to plug in the primary when the UUID feature is
> > not supported by guest? Does the feature negotiation UUID have to
> > depend on STANDBY being set, or the other way around? I thought that
> > just the absence of STANDBY will prevent primary to get exposed to the
> > guest.
> >
> > -Siwei
> >
> >>
> >>>
> >>> > - in the guest, use the uuid from the virtio-net device's config space
> >>> >   if applicable; else, fall back to matching by MAC as done today
Michael S. Tsirkin June 22, 2018, 9:32 p.m. UTC | #62
On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> >> On Thu, 21 Jun 2018 21:20:13 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> >> > > OK, so what about the following:
> >> > >
> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >> > >   that we have a new uuid field in the virtio-net config space
> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >> > > - when configuring, set the property to the group UUID of the vfio-pci
> >> > >   device
> >> > > - in the guest, use the uuid from the virtio-net device's config space
> >> > >   if applicable; else, fall back to matching by MAC as done today
> >> > >
> >> > > That should work for all virtio transports.
> >> >
> >> > True. I'm a bit unhappy that it's virtio net specific though
> >> > since down the road I expect we'll have a very similar feature
> >> > for scsi (and maybe others).
> >> >
> >> > But we do not have a way to have fields that are portable
> >> > both across devices and transports, and I think it would
> >> > be a useful addition. How would this work though? Any idea?
> >>
> >> Can we introduce some kind of device-independent config space area?
> >> Pushing back the device-specific config space by a certain value if the
> >> appropriate feature is negotiated and use that for things like the uuid?
> >
> > So config moves back and forth?
> > Reminds me of the msi vector mess we had with pci.
> > I'd rather have every transport add a new config.
> >
> >> But regardless of that, I'm not sure whether extending this approach to
> >> other device types is the way to go. Tying together two different
> >> devices is creating complicated situations at least in the hypervisor
> >> (even if it's fairly straightforward in the guest). [I have not come
> >> around again to look at the "how to handle visibility in QEMU"
> >> questions due to lack of cycles, sorry about that.]
> >>
> >> So, what's the goal of this approach? Only to allow migration with
> >> vfio-pci, or also to plug in a faster device and use it instead of an
> >> already attached paravirtualized device?
> >
> > These are two sides of the same coin, I think the second approach
> > is closer to what we are doing here.
> 
> I'm leaning towards being conservative to keep the potential of doing
> both. It's the vfio migration itself that has to be addessed, not to
> make every virtio device to have an accelerated datapath, right?
> 
> -Siwei

I think that the approach we took (signal configuration
through standby) assumes standby always present and
primary appearing and disappearing.

Anything else isn't well supported and I'm not sure we
should complicate code too much to support it.

> 
> >
> >> What about migration of vfio devices that are not easily replaced by a
> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> >> currently only) supported device is dasd (disks) -- which can do a lot
> >> of specialized things that virtio-blk does not support (and should not
> >> or even cannot support).
> >
> > But maybe virtio-scsi can?
> >
> >> Would it be more helpful to focus on generic
> >> migration support for vfio instead of going about it device by device?
> >>
> >> This network device approach already seems far along, so it makes sense
> >> to continue with it. But I'm not sure whether we want to spend time and
> >> energy on that for other device types rather than working on a general
> >> solution for vfio migration.
> >
> > I'm inclined to say finalizing this feature would be a good first step
> > and will teach us how we can move forward.
> >
> > --
> > MST
Michael S. Tsirkin June 22, 2018, 9:43 p.m. UTC | #63
On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> Would it be more helpful to focus on generic
> migration support for vfio instead of going about it device by device?

Just to note this approach is actually device by device *type*.  It's
mostly device agnostic for a given device type so you can migrate
between hosts with very different hardware.

And support for more PV device types has other advantages
such as security and forward compatibility to future hosts.

Finally, it all can happen mostly within QEMU. User is currently
required to enable it but it's pretty lightweight.

OTOH vfio migration generally requires actual device-specific work, and
only works when hosts are mostly identical. When they aren't it's easy
to blame the user, but tools for checking host compatiblity are
currently non-existent. Upper layer management will also have to learn
about host and device compatibility wrt migration. At the moment they
can't even figure it out wrt software versions of vhost in kernel and
dpdk so I won't hold my breath for all of this happening quickly.
Michael S. Tsirkin June 22, 2018, 9:47 p.m. UTC | #64
On Fri, Jun 22, 2018 at 12:43:26PM -0700, Siwei Liu wrote:
> > The semantics are that the primary is always used if present in
> > preference to standby.
> OK. If this is the only semantics of what "standby" refers to in
> general, that is fine.
> 
> I just don't want to limit the failover/standby semantics to the
> device model specifics, the "accelerated datapath" thing or whatever.
> I really don't know where the requirements of the "accelerated
> datapath" came from,

It's a way to put it that matches what failover actually provides.

> as the originial problem is migrating vfio
> devices which is in match of QEMU's live migration model.

If you put it this way then it's natural to require that we have a
config with a working vfio device, and that we somehow add virtio for
duration of migration.

> Hyper-V has
> it's limitation to do 1-netdev should not impact how KVM/QEMU should
> be doing it.

That's a linux thing and pretty orthogonal to host/guest interface.

> > Jason said virtio net is sometimes preferable.
> > If that's the case don't make it a standby.
> >
> > More advanced use-cases do exist and e.g. Alexander Duyck
> > suggested using a switch-dev.
> 
> The switchdev case would need a new feature bit, right?
> 
> -Siwei

I think it would need to be a completely new device.

> > failover isn't it though.
> >
> > --
> > MST
Siwei Liu June 22, 2018, 9:51 p.m. UTC | #65
On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >>> >
>> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >>> >>
>> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >>> >> (which some people seem to want in order to then use
>> >>> >> then with different containers).
>> >>> >>
>> >>> >> But it is also handy for when you assign a PF, since then you
>> >>> >> can't set the MAC.
>> >>> >>
>> >>> >
>> >>> > OK, so what about the following:
>> >>> >
>> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >>> >   that we have a new uuid field in the virtio-net config space
>> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >>> >   device
>> >>>
>> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >>> to still expose UUID in the config space on virtio-pci?
>> >>
>> >>
>> >> Yes but guest is not supposed to read it.
>> >>
>> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >>> where the corresponding vfio-pci device attached to for a guest which
>> >>> doesn't support the feature (legacy).
>> >>>
>> >>> -Siwei
>> >>
>> >> Yes but you won't add the primary behind such a bridge.
>> >
>> > I assume the UUID feature is a new one besides the exiting
>> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> > if UUID feature is present and supported by guest, we'll do pairing
>> > based on UUID; if UUID feature present
>>                                  ^^^^^^^  is NOT present
>
> Let's go back for a bit.
>
> What is wrong with matching by mac?
>
> 1. Does not allow multiple NICs with same mac
> 2. Requires MAC to be programmed by host in the PT device
>    (which is often possible with VFs but isn't possible with all PT
>    devices)

Might not neccessarily be something wrong, but it's very limited to
prohibit the MAC of VF from changing when enslaved by failover. The
same as you indicated on the PT device. I suspect the same is
prohibited on even virtio-net and failover is because of the same
limitation.

>
> Both issues are up to host: if host needs them it needs the UUID
> feature, if not MAC matching is sufficient.

I know. What I'm saying is that we rely on STANDBY feature to control
the visibility of the primary, not the UUID feature.

-Siwei

>
>
>> > or not supported by guest,
>> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>> > but the pairing will be fallback to using MAC address. Are you saying
>> > you don't even want to plug in the primary when the UUID feature is
>> > not supported by guest? Does the feature negotiation UUID have to
>> > depend on STANDBY being set, or the other way around? I thought that
>> > just the absence of STANDBY will prevent primary to get exposed to the
>> > guest.
>> >
>> > -Siwei
>> >
>> >>
>> >>>
>> >>> > - in the guest, use the uuid from the virtio-net device's config space
>> >>> >   if applicable; else, fall back to matching by MAC as done today
>
Siwei Liu June 22, 2018, 9:57 p.m. UTC | #66
On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >>
>> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> > > OK, so what about the following:
>> >> > >
>> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> > >   that we have a new uuid field in the virtio-net config space
>> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> > > - when configuring, set the property to the group UUID of the vfio-pci
>> >> > >   device
>> >> > > - in the guest, use the uuid from the virtio-net device's config space
>> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> > >
>> >> > > That should work for all virtio transports.
>> >> >
>> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> > since down the road I expect we'll have a very similar feature
>> >> > for scsi (and maybe others).
>> >> >
>> >> > But we do not have a way to have fields that are portable
>> >> > both across devices and transports, and I think it would
>> >> > be a useful addition. How would this work though? Any idea?
>> >>
>> >> Can we introduce some kind of device-independent config space area?
>> >> Pushing back the device-specific config space by a certain value if the
>> >> appropriate feature is negotiated and use that for things like the uuid?
>> >
>> > So config moves back and forth?
>> > Reminds me of the msi vector mess we had with pci.
>> > I'd rather have every transport add a new config.
>> >
>> >> But regardless of that, I'm not sure whether extending this approach to
>> >> other device types is the way to go. Tying together two different
>> >> devices is creating complicated situations at least in the hypervisor
>> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> around again to look at the "how to handle visibility in QEMU"
>> >> questions due to lack of cycles, sorry about that.]
>> >>
>> >> So, what's the goal of this approach? Only to allow migration with
>> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> already attached paravirtualized device?
>> >
>> > These are two sides of the same coin, I think the second approach
>> > is closer to what we are doing here.
>>
>> I'm leaning towards being conservative to keep the potential of doing
>> both. It's the vfio migration itself that has to be addessed, not to
>> make every virtio device to have an accelerated datapath, right?
>>
>> -Siwei
>
> I think that the approach we took (signal configuration
> through standby) assumes standby always present and
> primary appearing and disappearing.

I can imagine that's still true even for 1-netdev model. As long as we
can hide the lower devices.

That's what I said "to keep the potential to support both" models. But
we should not go further along to assume the other aspect ie. to get
PV accelerated whenever possible, but not to get VF migrated whenever
possible. That's essetially a big diveregence of the priority for the
goal we'd want to pursue.

-Siwei

>
> Anything else isn't well supported and I'm not sure we
> should complicate code too much to support it.
>
>>
>> >
>> >> What about migration of vfio devices that are not easily replaced by a
>> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> >> currently only) supported device is dasd (disks) -- which can do a lot
>> >> of specialized things that virtio-blk does not support (and should not
>> >> or even cannot support).
>> >
>> > But maybe virtio-scsi can?
>> >
>> >> Would it be more helpful to focus on generic
>> >> migration support for vfio instead of going about it device by device?
>> >>
>> >> This network device approach already seems far along, so it makes sense
>> >> to continue with it. But I'm not sure whether we want to spend time and
>> >> energy on that for other device types rather than working on a general
>> >> solution for vfio migration.
>> >
>> > I'm inclined to say finalizing this feature would be a good first step
>> > and will teach us how we can move forward.
>> >
>> > --
>> > MST
Siwei Liu June 22, 2018, 10:25 p.m. UTC | #67
On Fri, Jun 22, 2018 at 2:47 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 12:43:26PM -0700, Siwei Liu wrote:
>> > The semantics are that the primary is always used if present in
>> > preference to standby.
>> OK. If this is the only semantics of what "standby" refers to in
>> general, that is fine.
>>
>> I just don't want to limit the failover/standby semantics to the
>> device model specifics, the "accelerated datapath" thing or whatever.
>> I really don't know where the requirements of the "accelerated
>> datapath" came from,
>
> It's a way to put it that matches what failover actually provides.
>
>> as the originial problem is migrating vfio
>> devices which is in match of QEMU's live migration model.
>
> If you put it this way then it's natural to require that we have a
> config with a working vfio device, and that we somehow add virtio for
> duration of migration.

OK. That's the STANDBY feature sematics as I expect. Not something
like "we have a working virtio, but we need an accelerated datapath
via VFIO when the VM is not migrating". The VFIO is the what needs to
be concerned with, not the virtio. The way I view it, the STANDBY
feature, although present in the virtio device, is served as a
communication channel for the paired VFIO device, and the main purpose
of its feature negotiation process has to be around for migrating the
corresponding VFIO. While it's possible to re-use it as a side way to
achieve "accelerated datapath".

There could be other alternatives for vfio migration though, which do
not need the virtio helper, so don't need to get a STANDBY virtio for
that VFIO device.

>
>> Hyper-V has
>> it's limitation to do 1-netdev should not impact how KVM/QEMU should
>> be doing it.
>
> That's a linux thing and pretty orthogonal to host/guest interface.

I agree. So don't assume any device model specifics in the host/guest interface.

>
>> > Jason said virtio net is sometimes preferable.
>> > If that's the case don't make it a standby.
>> >
>> > More advanced use-cases do exist and e.g. Alexander Duyck
>> > suggested using a switch-dev.
>>
>> The switchdev case would need a new feature bit, right?
>>
>> -Siwei
>
> I think it would need to be a completely new device.

So how it relates to virtio or failover?

Regards,
-Siwei
>
>> > failover isn't it though.
>> >
>> > --
>> > MST
Michael S. Tsirkin June 22, 2018, 10:25 p.m. UTC | #68
On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >>> >
> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
> >> >>> >>
> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
> >> >>> >> (which some people seem to want in order to then use
> >> >>> >> then with different containers).
> >> >>> >>
> >> >>> >> But it is also handy for when you assign a PF, since then you
> >> >>> >> can't set the MAC.
> >> >>> >>
> >> >>> >
> >> >>> > OK, so what about the following:
> >> >>> >
> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >> >>> >   that we have a new uuid field in the virtio-net config space
> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
> >> >>> >   device
> >> >>>
> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
> >> >>> to still expose UUID in the config space on virtio-pci?
> >> >>
> >> >>
> >> >> Yes but guest is not supposed to read it.
> >> >>
> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
> >> >>> where the corresponding vfio-pci device attached to for a guest which
> >> >>> doesn't support the feature (legacy).
> >> >>>
> >> >>> -Siwei
> >> >>
> >> >> Yes but you won't add the primary behind such a bridge.
> >> >
> >> > I assume the UUID feature is a new one besides the exiting
> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
> >> > if UUID feature is present and supported by guest, we'll do pairing
> >> > based on UUID; if UUID feature present
> >>                                  ^^^^^^^  is NOT present
> >
> > Let's go back for a bit.
> >
> > What is wrong with matching by mac?
> >
> > 1. Does not allow multiple NICs with same mac
> > 2. Requires MAC to be programmed by host in the PT device
> >    (which is often possible with VFs but isn't possible with all PT
> >    devices)
> 
> Might not neccessarily be something wrong, but it's very limited to
> prohibit the MAC of VF from changing when enslaved by failover.

You mean guest changing MAC? I'm not sure why we prohibit that.


> The
> same as you indicated on the PT device. I suspect the same is
> prohibited on even virtio-net and failover is because of the same
> limitation.
> 
> >
> > Both issues are up to host: if host needs them it needs the UUID
> > feature, if not MAC matching is sufficient.
> 
> I know. What I'm saying is that we rely on STANDBY feature to control
> the visibility of the primary, not the UUID feature.
> 
> -Siwei

And what I'm saying is that it's up to a host policy.

> >
> >
> >> > or not supported by guest,
> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
> >> > but the pairing will be fallback to using MAC address. Are you saying
> >> > you don't even want to plug in the primary when the UUID feature is
> >> > not supported by guest? Does the feature negotiation UUID have to
> >> > depend on STANDBY being set, or the other way around? I thought that
> >> > just the absence of STANDBY will prevent primary to get exposed to the
> >> > guest.
> >> >
> >> > -Siwei
> >> >
> >> >>
> >> >>>
> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
> >> >>> >   if applicable; else, fall back to matching by MAC as done today
> >
Michael S. Tsirkin June 22, 2018, 10:28 p.m. UTC | #69
On Fri, Jun 22, 2018 at 03:25:19PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 2:47 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 12:43:26PM -0700, Siwei Liu wrote:
> >> > The semantics are that the primary is always used if present in
> >> > preference to standby.
> >> OK. If this is the only semantics of what "standby" refers to in
> >> general, that is fine.
> >>
> >> I just don't want to limit the failover/standby semantics to the
> >> device model specifics, the "accelerated datapath" thing or whatever.
> >> I really don't know where the requirements of the "accelerated
> >> datapath" came from,
> >
> > It's a way to put it that matches what failover actually provides.
> >
> >> as the originial problem is migrating vfio
> >> devices which is in match of QEMU's live migration model.
> >
> > If you put it this way then it's natural to require that we have a
> > config with a working vfio device, and that we somehow add virtio for
> > duration of migration.
> 
> OK. That's the STANDBY feature sematics as I expect.

Maybe but that isn't what virtio or hyperv implement.
If someone tells you otherwise, they are mistaken IMHO.

> Not something like "we have a working virtio, but we need an
> accelerated datapath via VFIO when the VM is not migrating". The VFIO
> is the what needs to be concerned with, not the virtio.
> The way I view it, the STANDBY feature, although present in the virtio
> device, is served as a communication channel for the paired VFIO
> device, and the main purpose of its feature negotiation process has to
> be around for migrating the corresponding VFIO. While it's possible to
> re-use it as a side way to achieve "accelerated datapath".
> 
> There could be other alternatives for vfio migration though, which do
> not need the virtio helper, so don't need to get a STANDBY virtio for
> that VFIO device.
> 
> >
> >> Hyper-V has
> >> it's limitation to do 1-netdev should not impact how KVM/QEMU should
> >> be doing it.
> >
> > That's a linux thing and pretty orthogonal to host/guest interface.
> 
> I agree. So don't assume any device model specifics in the host/guest interface.
> 
> >
> >> > Jason said virtio net is sometimes preferable.
> >> > If that's the case don't make it a standby.
> >> >
> >> > More advanced use-cases do exist and e.g. Alexander Duyck
> >> > suggested using a switch-dev.
> >>
> >> The switchdev case would need a new feature bit, right?
> >>
> >> -Siwei
> >
> > I think it would need to be a completely new device.
> 
> So how it relates to virtio or failover?
> 
> Regards,
> -Siwei

It might with time offer a super-set of the features.

> >
> >> > failover isn't it though.
> >> >
> >> > --
> >> > MST
Michael S. Tsirkin June 22, 2018, 10:33 p.m. UTC | #70
On Fri, Jun 22, 2018 at 02:57:39PM -0700, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
> >> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> >> >> On Thu, 21 Jun 2018 21:20:13 +0300
> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >>
> >> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
> >> >> > > OK, so what about the following:
> >> >> > >
> >> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> >> >> > >   that we have a new uuid field in the virtio-net config space
> >> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> >> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> >> >> > > - when configuring, set the property to the group UUID of the vfio-pci
> >> >> > >   device
> >> >> > > - in the guest, use the uuid from the virtio-net device's config space
> >> >> > >   if applicable; else, fall back to matching by MAC as done today
> >> >> > >
> >> >> > > That should work for all virtio transports.
> >> >> >
> >> >> > True. I'm a bit unhappy that it's virtio net specific though
> >> >> > since down the road I expect we'll have a very similar feature
> >> >> > for scsi (and maybe others).
> >> >> >
> >> >> > But we do not have a way to have fields that are portable
> >> >> > both across devices and transports, and I think it would
> >> >> > be a useful addition. How would this work though? Any idea?
> >> >>
> >> >> Can we introduce some kind of device-independent config space area?
> >> >> Pushing back the device-specific config space by a certain value if the
> >> >> appropriate feature is negotiated and use that for things like the uuid?
> >> >
> >> > So config moves back and forth?
> >> > Reminds me of the msi vector mess we had with pci.
> >> > I'd rather have every transport add a new config.
> >> >
> >> >> But regardless of that, I'm not sure whether extending this approach to
> >> >> other device types is the way to go. Tying together two different
> >> >> devices is creating complicated situations at least in the hypervisor
> >> >> (even if it's fairly straightforward in the guest). [I have not come
> >> >> around again to look at the "how to handle visibility in QEMU"
> >> >> questions due to lack of cycles, sorry about that.]
> >> >>
> >> >> So, what's the goal of this approach? Only to allow migration with
> >> >> vfio-pci, or also to plug in a faster device and use it instead of an
> >> >> already attached paravirtualized device?
> >> >
> >> > These are two sides of the same coin, I think the second approach
> >> > is closer to what we are doing here.
> >>
> >> I'm leaning towards being conservative to keep the potential of doing
> >> both. It's the vfio migration itself that has to be addessed, not to
> >> make every virtio device to have an accelerated datapath, right?
> >>
> >> -Siwei
> >
> > I think that the approach we took (signal configuration
> > through standby) assumes standby always present and
> > primary appearing and disappearing.
> 
> I can imagine that's still true even for 1-netdev model. As long as we
> can hide the lower devices.
> 
> That's what I said "to keep the potential to support both" models. But
> we should not go further along to assume the other aspect ie. to get
> PV accelerated whenever possible, but not to get VF migrated whenever
> possible. That's essetially a big diveregence of the priority for the
> goal we'd want to pursue.
> 
> -Siwei

I suspect the diveregence will be lost on most users though
simply because they don't even care about vfio. They just
want things to go fast.

Rephrasing requirements in terms of acceleration actually
made things implementable. So I'm not in a hurry to try
and go back to asking for a migrateable vfio - very
easy to get stuck there.

> >
> > Anything else isn't well supported and I'm not sure we
> > should complicate code too much to support it.
> >
> >>
> >> >
> >> >> What about migration of vfio devices that are not easily replaced by a
> >> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> >> >> currently only) supported device is dasd (disks) -- which can do a lot
> >> >> of specialized things that virtio-blk does not support (and should not
> >> >> or even cannot support).
> >> >
> >> > But maybe virtio-scsi can?
> >> >
> >> >> Would it be more helpful to focus on generic
> >> >> migration support for vfio instead of going about it device by device?
> >> >>
> >> >> This network device approach already seems far along, so it makes sense
> >> >> to continue with it. But I'm not sure whether we want to spend time and
> >> >> energy on that for other device types rather than working on a general
> >> >> solution for vfio migration.
> >> >
> >> > I'm inclined to say finalizing this feature would be a good first step
> >> > and will teach us how we can move forward.
> >> >
> >> > --
> >> > MST
Siwei Liu June 22, 2018, 11:40 p.m. UTC | #71
On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >>> >
>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>> >> >>> >>
>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>> >> >>> >> (which some people seem to want in order to then use
>> >> >>> >> then with different containers).
>> >> >>> >>
>> >> >>> >> But it is also handy for when you assign a PF, since then you
>> >> >>> >> can't set the MAC.
>> >> >>> >>
>> >> >>> >
>> >> >>> > OK, so what about the following:
>> >> >>> >
>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> >>> >   that we have a new uuid field in the virtio-net config space
>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>> >> >>> >   device
>> >> >>>
>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>> >> >>> to still expose UUID in the config space on virtio-pci?
>> >> >>
>> >> >>
>> >> >> Yes but guest is not supposed to read it.
>> >> >>
>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>> >> >>> doesn't support the feature (legacy).
>> >> >>>
>> >> >>> -Siwei
>> >> >>
>> >> >> Yes but you won't add the primary behind such a bridge.
>> >> >
>> >> > I assume the UUID feature is a new one besides the exiting
>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>> >> > if UUID feature is present and supported by guest, we'll do pairing
>> >> > based on UUID; if UUID feature present
>> >>                                  ^^^^^^^  is NOT present
>> >
>> > Let's go back for a bit.
>> >
>> > What is wrong with matching by mac?
>> >
>> > 1. Does not allow multiple NICs with same mac
>> > 2. Requires MAC to be programmed by host in the PT device
>> >    (which is often possible with VFs but isn't possible with all PT
>> >    devices)
>>
>> Might not neccessarily be something wrong, but it's very limited to
>> prohibit the MAC of VF from changing when enslaved by failover.
>
> You mean guest changing MAC? I'm not sure why we prohibit that.

I think Sridhar and Jiri might be better person to answer it. My
impression was that sync'ing the MAC address change between all 3
devices is challenging, as the failover driver uses MAC address to
match net_device internally.

>
>
>> The
>> same as you indicated on the PT device. I suspect the same is
>> prohibited on even virtio-net and failover is because of the same
>> limitation.
>>
>> >
>> > Both issues are up to host: if host needs them it needs the UUID
>> > feature, if not MAC matching is sufficient.
>>
>> I know. What I'm saying is that we rely on STANDBY feature to control
>> the visibility of the primary, not the UUID feature.
>>
>> -Siwei
>
> And what I'm saying is that it's up to a host policy.

So lets say host policy explicitly requires UUID but the guest does
not support it. The guest could be a legacy guest with no STANDBY
support, or a relevately recent guest with STANDBY support but without
the UUID support. In either case, since host asked for UUID matching
explicitly, maybe because it requires different NIC using same MAC, so
it has to override the negotiation result of STANDBY, even if STANDBY
is set and supported? That will end up with inter-feature dependency
over STANDBY, as you see.

-Siwei

>
>> >
>> >
>> >> > or not supported by guest,
>> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>> >> > but the pairing will be fallback to using MAC address. Are you saying
>> >> > you don't even want to plug in the primary when the UUID feature is
>> >> > not supported by guest? Does the feature negotiation UUID have to
>> >> > depend on STANDBY being set, or the other way around? I thought that
>> >> > just the absence of STANDBY will prevent primary to get exposed to the
>> >> > guest.
>> >> >
>> >> > -Siwei
>> >> >
>> >> >>
>> >> >>>
>> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
>> >> >>> >   if applicable; else, fall back to matching by MAC as done today
>> >
Siwei Liu June 23, 2018, 12:05 a.m. UTC | #72
On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Jun 22, 2018 at 02:57:39PM -0700, Siwei Liu wrote:
>> On Fri, Jun 22, 2018 at 2:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Fri, Jun 22, 2018 at 01:21:45PM -0700, Siwei Liu wrote:
>> >> On Fri, Jun 22, 2018 at 12:05 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
>> >> >> On Thu, 21 Jun 2018 21:20:13 +0300
>> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> >>
>> >> >> > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:
>> >> >> > > OK, so what about the following:
>> >> >> > >
>> >> >> > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>> >> >> > >   that we have a new uuid field in the virtio-net config space
>> >> >> > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>> >> >> > >   offer VIRTIO_NET_F_STANDBY_UUID if set
>> >> >> > > - when configuring, set the property to the group UUID of the vfio-pci
>> >> >> > >   device
>> >> >> > > - in the guest, use the uuid from the virtio-net device's config space
>> >> >> > >   if applicable; else, fall back to matching by MAC as done today
>> >> >> > >
>> >> >> > > That should work for all virtio transports.
>> >> >> >
>> >> >> > True. I'm a bit unhappy that it's virtio net specific though
>> >> >> > since down the road I expect we'll have a very similar feature
>> >> >> > for scsi (and maybe others).
>> >> >> >
>> >> >> > But we do not have a way to have fields that are portable
>> >> >> > both across devices and transports, and I think it would
>> >> >> > be a useful addition. How would this work though? Any idea?
>> >> >>
>> >> >> Can we introduce some kind of device-independent config space area?
>> >> >> Pushing back the device-specific config space by a certain value if the
>> >> >> appropriate feature is negotiated and use that for things like the uuid?
>> >> >
>> >> > So config moves back and forth?
>> >> > Reminds me of the msi vector mess we had with pci.
>> >> > I'd rather have every transport add a new config.
>> >> >
>> >> >> But regardless of that, I'm not sure whether extending this approach to
>> >> >> other device types is the way to go. Tying together two different
>> >> >> devices is creating complicated situations at least in the hypervisor
>> >> >> (even if it's fairly straightforward in the guest). [I have not come
>> >> >> around again to look at the "how to handle visibility in QEMU"
>> >> >> questions due to lack of cycles, sorry about that.]
>> >> >>
>> >> >> So, what's the goal of this approach? Only to allow migration with
>> >> >> vfio-pci, or also to plug in a faster device and use it instead of an
>> >> >> already attached paravirtualized device?
>> >> >
>> >> > These are two sides of the same coin, I think the second approach
>> >> > is closer to what we are doing here.
>> >>
>> >> I'm leaning towards being conservative to keep the potential of doing
>> >> both. It's the vfio migration itself that has to be addessed, not to
>> >> make every virtio device to have an accelerated datapath, right?
>> >>
>> >> -Siwei
>> >
>> > I think that the approach we took (signal configuration
>> > through standby) assumes standby always present and
>> > primary appearing and disappearing.
>>
>> I can imagine that's still true even for 1-netdev model. As long as we
>> can hide the lower devices.
>>
>> That's what I said "to keep the potential to support both" models. But
>> we should not go further along to assume the other aspect ie. to get
>> PV accelerated whenever possible, but not to get VF migrated whenever
>> possible. That's essetially a big diveregence of the priority for the
>> goal we'd want to pursue.
>>
>> -Siwei
>
> I suspect the diveregence will be lost on most users though
> simply because they don't even care about vfio. They just
> want things to go fast.

Like Jason said, VF isn't faster than virtio-net in all cases. It
depends on the workload and performance metrics: throughput, latency,
or packet per second.

>
> Rephrasing requirements in terms of acceleration actually
> made things implementable. So I'm not in a hurry to try
> and go back to asking for a migrateable vfio - very
> easy to get stuck there.

Understood. When we get all ethtool_ops exposed on the failover
interface, we'll get back to attack migrateable vfio with the 1-netdev
model.

-Siwei


>
>> >
>> > Anything else isn't well supported and I'm not sure we
>> > should complicate code too much to support it.
>> >
>> >>
>> >> >
>> >> >> What about migration of vfio devices that are not easily replaced by a
>> >> >> paravirtualized device? I'm thinking of vfio-ccw, where our main (and
>> >> >> currently only) supported device is dasd (disks) -- which can do a lot
>> >> >> of specialized things that virtio-blk does not support (and should not
>> >> >> or even cannot support).
>> >> >
>> >> > But maybe virtio-scsi can?
>> >> >
>> >> >> Would it be more helpful to focus on generic
>> >> >> migration support for vfio instead of going about it device by device?
>> >> >>
>> >> >> This network device approach already seems far along, so it makes sense
>> >> >> to continue with it. But I'm not sure whether we want to spend time and
>> >> >> energy on that for other device types rather than working on a general
>> >> >> solution for vfio migration.
>> >> >
>> >> > I'm inclined to say finalizing this feature would be a good first step
>> >> > and will teach us how we can move forward.
>> >> >
>> >> > --
>> >> > MST
Siwei Liu June 23, 2018, 12:17 a.m. UTC | #73
On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> > On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>> >> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>> >> > On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> >> >> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>> >> >>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>> >> >>> > On Wed, 20 Jun 2018 22:48:58 +0300
>>> >> >>> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>> >> >>> >
>>> >> >>> >> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>> >> >>> >> > In any case, I'm not sure anymore why we'd want the extra uuid.
>>> >> >>> >>
>>> >> >>> >> It's mostly so we can have e.g. multiple devices with same MAC
>>> >> >>> >> (which some people seem to want in order to then use
>>> >> >>> >> then with different containers).
>>> >> >>> >>
>>> >> >>> >> But it is also handy for when you assign a PF, since then you
>>> >> >>> >> can't set the MAC.
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> > OK, so what about the following:
>>> >> >>> >
>>> >> >>> > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>> >> >>> >   that we have a new uuid field in the virtio-net config space
>>> >> >>> > - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>> >> >>> >   offer VIRTIO_NET_F_STANDBY_UUID if set
>>> >> >>> > - when configuring, set the property to the group UUID of the vfio-pci
>>> >> >>> >   device
>>> >> >>>
>>> >> >>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>> >> >>> to still expose UUID in the config space on virtio-pci?
>>> >> >>
>>> >> >>
>>> >> >> Yes but guest is not supposed to read it.
>>> >> >>
>>> >> >>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>> >> >>> where the corresponding vfio-pci device attached to for a guest which
>>> >> >>> doesn't support the feature (legacy).
>>> >> >>>
>>> >> >>> -Siwei
>>> >> >>
>>> >> >> Yes but you won't add the primary behind such a bridge.
>>> >> >
>>> >> > I assume the UUID feature is a new one besides the exiting
>>> >> > VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>> >> > if UUID feature is present and supported by guest, we'll do pairing
>>> >> > based on UUID; if UUID feature present
>>> >>                                  ^^^^^^^  is NOT present
>>> >
>>> > Let's go back for a bit.
>>> >
>>> > What is wrong with matching by mac?
>>> >
>>> > 1. Does not allow multiple NICs with same mac
>>> > 2. Requires MAC to be programmed by host in the PT device
>>> >    (which is often possible with VFs but isn't possible with all PT
>>> >    devices)
>>>
>>> Might not neccessarily be something wrong, but it's very limited to
>>> prohibit the MAC of VF from changing when enslaved by failover.
>>
>> You mean guest changing MAC? I'm not sure why we prohibit that.
>
> I think Sridhar and Jiri might be better person to answer it. My
> impression was that sync'ing the MAC address change between all 3
> devices is challenging, as the failover driver uses MAC address to
> match net_device internally.
>
>>
>>
>>> The
>>> same as you indicated on the PT device. I suspect the same is
>>> prohibited on even virtio-net and failover is because of the same
>>> limitation.
>>>
>>> >
>>> > Both issues are up to host: if host needs them it needs the UUID
>>> > feature, if not MAC matching is sufficient.
>>>
>>> I know. What I'm saying is that we rely on STANDBY feature to control
>>> the visibility of the primary, not the UUID feature.
>>>
>>> -Siwei
>>
>> And what I'm saying is that it's up to a host policy.
>
> So lets say host policy explicitly requires UUID but the guest does
> not support it. The guest could be a legacy guest with no STANDBY
> support, or a relevately recent guest with STANDBY support but without
> the UUID support. In either case, since host asked for UUID matching
> explicitly, maybe because it requires different NIC using same MAC, so
> it has to override the negotiation result of STANDBY, even if STANDBY
> is set and supported? That will end up with inter-feature dependency
> over STANDBY, as you see.

I forgot to mention the above has the assumption that we expose both
STANDBY and UUID feature to QEMU user. In fact, as we're going towards
not exposing the STANDBY feature directly to user, UUID may be always
required to enable STANDBY. If not, how do we make sure QEMU can
control the visibility of primary device? Something to be confirmed
before implementing the code.

-Siwei

>
> -Siwei
>
>>
>>> >
>>> >
>>> >> > or not supported by guest,
>>> >> > we'll still plug in the VF (if guest supports the _F_STANDBY feature)
>>> >> > but the pairing will be fallback to using MAC address. Are you saying
>>> >> > you don't even want to plug in the primary when the UUID feature is
>>> >> > not supported by guest? Does the feature negotiation UUID have to
>>> >> > depend on STANDBY being set, or the other way around? I thought that
>>> >> > just the absence of STANDBY will prevent primary to get exposed to the
>>> >> > guest.
>>> >> >
>>> >> > -Siwei
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> > - in the guest, use the uuid from the virtio-net device's config space
>>> >> >>> >   if applicable; else, fall back to matching by MAC as done today
>>> >
Michael S. Tsirkin June 24, 2018, 1:45 a.m. UTC | #74
On Fri, Jun 22, 2018 at 05:17:10PM -0700, Siwei Liu wrote:
> I forgot to mention the above has the assumption that we expose both
> STANDBY and UUID feature to QEMU user. In fact, as we're going towards
> not exposing the STANDBY feature directly to user, UUID may be always
> required to enable STANDBY.

Sounds good.

> If not, how do we make sure QEMU can
> control the visibility of primary device?

Hypervisors fundamentally always can control visibility of all
virtual devices.

> Something to be confirmed
> before implementing the code.
>
Cornelia Huck June 25, 2018, 9:55 a.m. UTC | #75
On Fri, 22 Jun 2018 22:05:50 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > On Thu, 21 Jun 2018 21:20:13 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:  
> > > > OK, so what about the following:
> > > > 
> > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > >   that we have a new uuid field in the virtio-net config space
> > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > >   device
> > > > - in the guest, use the uuid from the virtio-net device's config space
> > > >   if applicable; else, fall back to matching by MAC as done today
> > > > 
> > > > That should work for all virtio transports.    
> > > 
> > > True. I'm a bit unhappy that it's virtio net specific though
> > > since down the road I expect we'll have a very similar feature
> > > for scsi (and maybe others).
> > > 
> > > But we do not have a way to have fields that are portable
> > > both across devices and transports, and I think it would
> > > be a useful addition. How would this work though? Any idea?  
> > 
> > Can we introduce some kind of device-independent config space area?
> > Pushing back the device-specific config space by a certain value if the
> > appropriate feature is negotiated and use that for things like the uuid?  
> 
> So config moves back and forth?
> Reminds me of the msi vector mess we had with pci.

Yes, that would be a bit unfortunate.

> I'd rather have every transport add a new config.

You mean via different mechanisms?

> 
> > But regardless of that, I'm not sure whether extending this approach to
> > other device types is the way to go. Tying together two different
> > devices is creating complicated situations at least in the hypervisor
> > (even if it's fairly straightforward in the guest). [I have not come
> > around again to look at the "how to handle visibility in QEMU"
> > questions due to lack of cycles, sorry about that.]
> > 
> > So, what's the goal of this approach? Only to allow migration with
> > vfio-pci, or also to plug in a faster device and use it instead of an
> > already attached paravirtualized device?  
> 
> These are two sides of the same coin, I think the second approach
> is closer to what we are doing here.

Thinking about it, do we need any knob to keep the vfio device
invisible if the virtio device is not present? IOW, how does the
hypervisor know that the vfio device is supposed to be paired with a
virtio device? It seems we need an explicit tie-in.

> 
> > What about migration of vfio devices that are not easily replaced by a
> > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > currently only) supported device is dasd (disks) -- which can do a lot
> > of specialized things that virtio-blk does not support (and should not
> > or even cannot support).  
> 
> But maybe virtio-scsi can?

I don't think so. Dasds have some channel commands that don't map
easily to scsi commands.

> 
> > Would it be more helpful to focus on generic
> > migration support for vfio instead of going about it device by device?
> > 
> > This network device approach already seems far along, so it makes sense
> > to continue with it. But I'm not sure whether we want to spend time and
> > energy on that for other device types rather than working on a general
> > solution for vfio migration.  
> 
> I'm inclined to say finalizing this feature would be a good first step
> and will teach us how we can move forward.

I'm not opposed to figuring out this one, but I'm not sure whether we
want to extend it to more device types.

Are people looking into generic migration support? I have it on my
things-to-look-at list (figuring out what needs to be device specific
and what can be generic, figuring out how we can support vfio-ccw,
etc.).
Samudrala, Sridhar June 25, 2018, 5:54 p.m. UTC | #76
On 6/22/2018 5:17 PM, Siwei Liu wrote:
> On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>> On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:
>>>> On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>> On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:
>>>>>> On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>>>>> On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>> On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:
>>>>>>>>> On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>>>>>> On Wed, 20 Jun 2018 22:48:58 +0300
>>>>>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:
>>>>>>>>>>>> In any case, I'm not sure anymore why we'd want the extra uuid.
>>>>>>>>>>> It's mostly so we can have e.g. multiple devices with same MAC
>>>>>>>>>>> (which some people seem to want in order to then use
>>>>>>>>>>> then with different containers).
>>>>>>>>>>>
>>>>>>>>>>> But it is also handy for when you assign a PF, since then you
>>>>>>>>>>> can't set the MAC.
>>>>>>>>>>>
>>>>>>>>>> OK, so what about the following:
>>>>>>>>>>
>>>>>>>>>> - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
>>>>>>>>>>    that we have a new uuid field in the virtio-net config space
>>>>>>>>>> - in QEMU, add a property for virtio-net that allows to specify a uuid,
>>>>>>>>>>    offer VIRTIO_NET_F_STANDBY_UUID if set
>>>>>>>>>> - when configuring, set the property to the group UUID of the vfio-pci
>>>>>>>>>>    device
>>>>>>>>> If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
>>>>>>>>> to still expose UUID in the config space on virtio-pci?
>>>>>>>>
>>>>>>>> Yes but guest is not supposed to read it.
>>>>>>>>
>>>>>>>>> I'm not even sure if it's sane to expose group UUID on the PCI bridge
>>>>>>>>> where the corresponding vfio-pci device attached to for a guest which
>>>>>>>>> doesn't support the feature (legacy).
>>>>>>>>>
>>>>>>>>> -Siwei
>>>>>>>> Yes but you won't add the primary behind such a bridge.
>>>>>>> I assume the UUID feature is a new one besides the exiting
>>>>>>> VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
>>>>>>> if UUID feature is present and supported by guest, we'll do pairing
>>>>>>> based on UUID; if UUID feature present
>>>>>>                                   ^^^^^^^  is NOT present
>>>>> Let's go back for a bit.
>>>>>
>>>>> What is wrong with matching by mac?
>>>>>
>>>>> 1. Does not allow multiple NICs with same mac
>>>>> 2. Requires MAC to be programmed by host in the PT device
>>>>>     (which is often possible with VFs but isn't possible with all PT
>>>>>     devices)
>>>> Might not neccessarily be something wrong, but it's very limited to
>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>> I think Sridhar and Jiri might be better person to answer it. My
>> impression was that sync'ing the MAC address change between all 3
>> devices is challenging, as the failover driver uses MAC address to
>> match net_device internally.

Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
don't allow changing guest MAC unless it is a trusted VF.
Michael S. Tsirkin June 26, 2018, 1:46 a.m. UTC | #77
On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> On Fri, 22 Jun 2018 22:05:50 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:  
> > > > > OK, so what about the following:
> > > > > 
> > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > >   that we have a new uuid field in the virtio-net config space
> > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > >   device
> > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > 
> > > > > That should work for all virtio transports.    
> > > > 
> > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > since down the road I expect we'll have a very similar feature
> > > > for scsi (and maybe others).
> > > > 
> > > > But we do not have a way to have fields that are portable
> > > > both across devices and transports, and I think it would
> > > > be a useful addition. How would this work though? Any idea?  
> > > 
> > > Can we introduce some kind of device-independent config space area?
> > > Pushing back the device-specific config space by a certain value if the
> > > appropriate feature is negotiated and use that for things like the uuid?  
> > 
> > So config moves back and forth?
> > Reminds me of the msi vector mess we had with pci.
> 
> Yes, that would be a bit unfortunate.
> 
> > I'd rather have every transport add a new config.
> 
> You mean via different mechanisms?

I guess so.

> > 
> > > But regardless of that, I'm not sure whether extending this approach to
> > > other device types is the way to go. Tying together two different
> > > devices is creating complicated situations at least in the hypervisor
> > > (even if it's fairly straightforward in the guest). [I have not come
> > > around again to look at the "how to handle visibility in QEMU"
> > > questions due to lack of cycles, sorry about that.]
> > > 
> > > So, what's the goal of this approach? Only to allow migration with
> > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > already attached paravirtualized device?  
> > 
> > These are two sides of the same coin, I think the second approach
> > is closer to what we are doing here.
> 
> Thinking about it, do we need any knob to keep the vfio device
> invisible if the virtio device is not present? IOW, how does the
> hypervisor know that the vfio device is supposed to be paired with a
> virtio device? It seems we need an explicit tie-in.

If we are going the way of the bridge, both bridge and
virtio would have some kind of id.

When pairing using mac, I'm less sure. PAss vfio device mac to qemu
as a property?


> > 
> > > What about migration of vfio devices that are not easily replaced by a
> > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > currently only) supported device is dasd (disks) -- which can do a lot
> > > of specialized things that virtio-blk does not support (and should not
> > > or even cannot support).  
> > 
> > But maybe virtio-scsi can?
> 
> I don't think so. Dasds have some channel commands that don't map
> easily to scsi commands.

There's always a choice of adding these to the spec.
E.g. FC extensions were proposed, I don't remember why they
are still stuck.

> > 
> > > Would it be more helpful to focus on generic
> > > migration support for vfio instead of going about it device by device?
> > > 
> > > This network device approach already seems far along, so it makes sense
> > > to continue with it. But I'm not sure whether we want to spend time and
> > > energy on that for other device types rather than working on a general
> > > solution for vfio migration.  
> > 
> > I'm inclined to say finalizing this feature would be a good first step
> > and will teach us how we can move forward.
> 
> I'm not opposed to figuring out this one, but I'm not sure whether we
> want to extend it to more device types.
> 
> Are people looking into generic migration support? I have it on my
> things-to-look-at list (figuring out what needs to be device specific
> and what can be generic, figuring out how we can support vfio-ccw,
> etc.).

I expect to see more of it if SPDK makes progress.
Michael S. Tsirkin June 26, 2018, 1:50 a.m. UTC | #78
On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > prohibit the MAC of VF from changing when enslaved by failover.
> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
> > > I think Sridhar and Jiri might be better person to answer it. My
> > > impression was that sync'ing the MAC address change between all 3
> > > devices is challenging, as the failover driver uses MAC address to
> > > match net_device internally.
> 
> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> don't allow changing guest MAC unless it is a trusted VF.

OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
For example I can see host just
failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
Cornelia Huck June 26, 2018, 11:55 a.m. UTC | #79
On Tue, 26 Jun 2018 04:46:03 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> > On Fri, 22 Jun 2018 22:05:50 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:  
> > > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >     
> > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:    
> > > > > > OK, so what about the following:
> > > > > > 
> > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > > >   that we have a new uuid field in the virtio-net config space
> > > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > > >   device
> > > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > > 
> > > > > > That should work for all virtio transports.      
> > > > > 
> > > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > > since down the road I expect we'll have a very similar feature
> > > > > for scsi (and maybe others).
> > > > > 
> > > > > But we do not have a way to have fields that are portable
> > > > > both across devices and transports, and I think it would
> > > > > be a useful addition. How would this work though? Any idea?    
> > > > 
> > > > Can we introduce some kind of device-independent config space area?
> > > > Pushing back the device-specific config space by a certain value if the
> > > > appropriate feature is negotiated and use that for things like the uuid?    
> > > 
> > > So config moves back and forth?
> > > Reminds me of the msi vector mess we had with pci.  
> > 
> > Yes, that would be a bit unfortunate.
> >   
> > > I'd rather have every transport add a new config.  
> > 
> > You mean via different mechanisms?  
> 
> I guess so.

Is there an alternate mechanism for pci to use? (Not so familiar with
it.)

For ccw, this needs more thought. We already introduced two commands
for reading/writing the config space (a concept that does not really
exist on s390). There's the generic read configuration data command,
but the data returned by it is not really generic enough. So we would
need one new command (or two, if we need to write as well). I'm not
sure about that yet.

> 
> > >   
> > > > But regardless of that, I'm not sure whether extending this approach to
> > > > other device types is the way to go. Tying together two different
> > > > devices is creating complicated situations at least in the hypervisor
> > > > (even if it's fairly straightforward in the guest). [I have not come
> > > > around again to look at the "how to handle visibility in QEMU"
> > > > questions due to lack of cycles, sorry about that.]
> > > > 
> > > > So, what's the goal of this approach? Only to allow migration with
> > > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > > already attached paravirtualized device?    
> > > 
> > > These are two sides of the same coin, I think the second approach
> > > is closer to what we are doing here.  
> > 
> > Thinking about it, do we need any knob to keep the vfio device
> > invisible if the virtio device is not present? IOW, how does the
> > hypervisor know that the vfio device is supposed to be paired with a
> > virtio device? It seems we need an explicit tie-in.  
> 
> If we are going the way of the bridge, both bridge and
> virtio would have some kind of id.

So the presence of the id would indicate "this is one part of a pair"?

> 
> When pairing using mac, I'm less sure. PAss vfio device mac to qemu
> as a property?

That feels a bit odd. "This is the vfio device's mac, use this instead
of your usual mac property"? As we have not designed the QEMU interface
yet, just go with the id in any case? The guest can still match by mac.

> > > > What about migration of vfio devices that are not easily replaced by a
> > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > > currently only) supported device is dasd (disks) -- which can do a lot
> > > > of specialized things that virtio-blk does not support (and should not
> > > > or even cannot support).    
> > > 
> > > But maybe virtio-scsi can?  
> > 
> > I don't think so. Dasds have some channel commands that don't map
> > easily to scsi commands.  
> 
> There's always a choice of adding these to the spec.
> E.g. FC extensions were proposed, I don't remember why they
> are still stuck.

FC extensions are a completely different kind of enhancements, though.
For a start, they are not unique to a certain transport.

Also, we have a whole list of special dasd issues. Weird disk layout
for eckd, low-level disk formatting, etc. (See the list of commands in
drivers/s390/block/dasd_eckd.h for an idea. There's also no public
documentation AFAICS; https://en.wikipedia.org/wiki/ECKD does not link
to anything interesting.) I don't think we want to cram stuff like this
into a completely different framework.
Michael S. Tsirkin June 26, 2018, 1:54 p.m. UTC | #80
On Tue, Jun 26, 2018 at 01:55:09PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2018 04:46:03 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jun 25, 2018 at 11:55:12AM +0200, Cornelia Huck wrote:
> > > On Fri, 22 Jun 2018 22:05:50 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:  
> > > > > On Thu, 21 Jun 2018 21:20:13 +0300
> > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >     
> > > > > > On Thu, Jun 21, 2018 at 04:59:13PM +0200, Cornelia Huck wrote:    
> > > > > > > OK, so what about the following:
> > > > > > > 
> > > > > > > - introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
> > > > > > >   that we have a new uuid field in the virtio-net config space
> > > > > > > - in QEMU, add a property for virtio-net that allows to specify a uuid,
> > > > > > >   offer VIRTIO_NET_F_STANDBY_UUID if set
> > > > > > > - when configuring, set the property to the group UUID of the vfio-pci
> > > > > > >   device
> > > > > > > - in the guest, use the uuid from the virtio-net device's config space
> > > > > > >   if applicable; else, fall back to matching by MAC as done today
> > > > > > > 
> > > > > > > That should work for all virtio transports.      
> > > > > > 
> > > > > > True. I'm a bit unhappy that it's virtio net specific though
> > > > > > since down the road I expect we'll have a very similar feature
> > > > > > for scsi (and maybe others).
> > > > > > 
> > > > > > But we do not have a way to have fields that are portable
> > > > > > both across devices and transports, and I think it would
> > > > > > be a useful addition. How would this work though? Any idea?    
> > > > > 
> > > > > Can we introduce some kind of device-independent config space area?
> > > > > Pushing back the device-specific config space by a certain value if the
> > > > > appropriate feature is negotiated and use that for things like the uuid?    
> > > > 
> > > > So config moves back and forth?
> > > > Reminds me of the msi vector mess we had with pci.  
> > > 
> > > Yes, that would be a bit unfortunate.
> > >   
> > > > I'd rather have every transport add a new config.  
> > > 
> > > You mean via different mechanisms?  
> > 
> > I guess so.
> 
> Is there an alternate mechanism for pci to use? (Not so familiar with
> it.)

We have a device and transport config capability.
We could add a generic config capability too.

> For ccw, this needs more thought. We already introduced two commands
> for reading/writing the config space (a concept that does not really
> exist on s390). There's the generic read configuration data command,
> but the data returned by it is not really generic enough. So we would
> need one new command (or two, if we need to write as well). I'm not
> sure about that yet.
> 
> > 
> > > >   
> > > > > But regardless of that, I'm not sure whether extending this approach to
> > > > > other device types is the way to go. Tying together two different
> > > > > devices is creating complicated situations at least in the hypervisor
> > > > > (even if it's fairly straightforward in the guest). [I have not come
> > > > > around again to look at the "how to handle visibility in QEMU"
> > > > > questions due to lack of cycles, sorry about that.]
> > > > > 
> > > > > So, what's the goal of this approach? Only to allow migration with
> > > > > vfio-pci, or also to plug in a faster device and use it instead of an
> > > > > already attached paravirtualized device?    
> > > > 
> > > > These are two sides of the same coin, I think the second approach
> > > > is closer to what we are doing here.  
> > > 
> > > Thinking about it, do we need any knob to keep the vfio device
> > > invisible if the virtio device is not present? IOW, how does the
> > > hypervisor know that the vfio device is supposed to be paired with a
> > > virtio device? It seems we need an explicit tie-in.  
> > 
> > If we are going the way of the bridge, both bridge and
> > virtio would have some kind of id.
> 
> So the presence of the id would indicate "this is one part of a pair"?

I guess so, yes.

> > 
> > When pairing using mac, I'm less sure. PAss vfio device mac to qemu
> > as a property?
> 
> That feels a bit odd. "This is the vfio device's mac, use this instead
> of your usual mac property"? As we have not designed the QEMU interface
> yet, just go with the id in any case? The guest can still match by mac.

OK

> > > > > What about migration of vfio devices that are not easily replaced by a
> > > > > paravirtualized device? I'm thinking of vfio-ccw, where our main (and
> > > > > currently only) supported device is dasd (disks) -- which can do a lot
> > > > > of specialized things that virtio-blk does not support (and should not
> > > > > or even cannot support).    
> > > > 
> > > > But maybe virtio-scsi can?  
> > > 
> > > I don't think so. Dasds have some channel commands that don't map
> > > easily to scsi commands.  
> > 
> > There's always a choice of adding these to the spec.
> > E.g. FC extensions were proposed, I don't remember why they
> > are still stuck.
> 
> FC extensions are a completely different kind of enhancements, though.
> For a start, they are not unique to a certain transport.
> 
> Also, we have a whole list of special dasd issues. Weird disk layout
> for eckd, low-level disk formatting, etc. (See the list of commands in
> drivers/s390/block/dasd_eckd.h for an idea. There's also no public
> documentation AFAICS; https://en.wikipedia.org/wiki/ECKD does not link
> to anything interesting.) I don't think we want to cram stuff like this
> into a completely different framework.
Cornelia Huck June 26, 2018, 3:08 p.m. UTC | #81
On Fri, 22 Jun 2018 17:05:04 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > I suspect the diveregence will be lost on most users though
> > simply because they don't even care about vfio. They just
> > want things to go fast.  
> 
> Like Jason said, VF isn't faster than virtio-net in all cases. It
> depends on the workload and performance metrics: throughput, latency,
> or packet per second.

So, will it be guest/admin-controllable then where the traffic flows
through? Just because we do have a vf available after negotiation of
the feature bit, it does not necessarily mean we want to use it? Do we
(the guest) even want to make it visible in that case?
Cornelia Huck June 26, 2018, 3:17 p.m. UTC | #82
On Tue, 26 Jun 2018 04:50:25 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > prohibit the MAC of VF from changing when enslaved by failover.  
> > > > > You mean guest changing MAC? I'm not sure why we prohibit that.  
> > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > impression was that sync'ing the MAC address change between all 3
> > > > devices is challenging, as the failover driver uses MAC address to
> > > > match net_device internally.  
> > 
> > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > don't allow changing guest MAC unless it is a trusted VF.  
> 
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> 

So, what I get from this is that QEMU needs to be able to control all
of standby, uuid, and mac to accommodate the different setups
(respectively have libvirt/management software set it up). Is the host
able to find out respectively define whether a VF is trusted?
Michael S. Tsirkin June 26, 2018, 3:38 p.m. UTC | #83
On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2018 04:50:25 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> > > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > > prohibit the MAC of VF from changing when enslaved by failover.  
> > > > > > You mean guest changing MAC? I'm not sure why we prohibit that.  
> > > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > > impression was that sync'ing the MAC address change between all 3
> > > > > devices is challenging, as the failover driver uses MAC address to
> > > > > match net_device internally.  
> > > 
> > > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > > don't allow changing guest MAC unless it is a trusted VF.  
> > 
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see host just
> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> > 
> 
> So, what I get from this is that QEMU needs to be able to control all
> of standby, uuid, and mac to accommodate the different setups
> (respectively have libvirt/management software set it up). Is the host
> able to find out respectively define whether a VF is trusted?

You do it with ip link I think but QEMU doesn't normally do this,
it relies on libvirt to poke at host kernel and supply the info.
Cornelia Huck June 26, 2018, 4:03 p.m. UTC | #84
On Tue, 26 Jun 2018 18:38:51 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 26, 2018 at 05:17:32PM +0200, Cornelia Huck wrote:
> > On Tue, 26 Jun 2018 04:50:25 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:  
> > > > > > > > Might not neccessarily be something wrong, but it's very limited to
> > > > > > > > prohibit the MAC of VF from changing when enslaved by failover.    
> > > > > > > You mean guest changing MAC? I'm not sure why we prohibit that.    
> > > > > > I think Sridhar and Jiri might be better person to answer it. My
> > > > > > impression was that sync'ing the MAC address change between all 3
> > > > > > devices is challenging, as the failover driver uses MAC address to
> > > > > > match net_device internally.    
> > > > 
> > > > Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> > > > of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> > > > synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> > > > don't allow changing guest MAC unless it is a trusted VF.    
> > > 
> > > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > > For example I can see host just
> > > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> > >   
> > 
> > So, what I get from this is that QEMU needs to be able to control all
> > of standby, uuid, and mac to accommodate the different setups
> > (respectively have libvirt/management software set it up). Is the host
> > able to find out respectively define whether a VF is trusted?  
> 
> You do it with ip link I think but QEMU doesn't normally do this,
> it relies on libvirt to poke at host kernel and supply the info.
> 

Ok, that makes me conclude that we definitely need to involve the
libvirt folks before we proceed further with defining QEMU interfaces.
Michael S. Tsirkin June 26, 2018, 5:42 p.m. UTC | #85
On Tue, Jun 26, 2018 at 06:03:16PM +0200, Cornelia Huck wrote:
> Ok, that makes me conclude that we definitely need to involve the
> libvirt folks before we proceed further with defining QEMU interfaces.

That's always a wise thing to do.
Michael S. Tsirkin June 26, 2018, 5:50 p.m. UTC | #86
On Tue, Jun 26, 2018 at 05:08:13PM +0200, Cornelia Huck wrote:
> On Fri, 22 Jun 2018 17:05:04 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
> 
> > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > I suspect the diveregence will be lost on most users though
> > > simply because they don't even care about vfio. They just
> > > want things to go fast.  
> > 
> > Like Jason said, VF isn't faster than virtio-net in all cases. It
> > depends on the workload and performance metrics: throughput, latency,
> > or packet per second.
> 
> So, will it be guest/admin-controllable then where the traffic flows
> through? Just because we do have a vf available after negotiation of
> the feature bit, it does not necessarily mean we want to use it? Do we
> (the guest) even want to make it visible in that case?

I think these ideas belong to what Alex Duyck wanted to do:
some kind of advanced device that isn't tied to
any network interfaces and allows workload and performance
specific tuning.

Way out of scope for a simple failover, and more importantly,
no one is looking at even enumerating the problems involved,
much less solving them.
Siwei Liu June 26, 2018, 11:38 p.m. UTC | #87
On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> > > > > Might not neccessarily be something wrong, but it's very limited to
>> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> > > I think Sridhar and Jiri might be better person to answer it. My
>> > > impression was that sync'ing the MAC address change between all 3
>> > > devices is challenging, as the failover driver uses MAC address to
>> > > match net_device internally.
>>
>> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>> don't allow changing guest MAC unless it is a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

That's why I think pairing using MAC is fragile IMHO. When VF's MAC
got changed before virtio attempts to match and pair the device, it
ends up with no pairing found out at all. UUID is better.

-Siwei

>
> --
> MST
Michael S. Tsirkin June 27, 2018, 12:29 a.m. UTC | #88
On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
> >> > > > > Might not neccessarily be something wrong, but it's very limited to
> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
> >> > > I think Sridhar and Jiri might be better person to answer it. My
> >> > > impression was that sync'ing the MAC address change between all 3
> >> > > devices is challenging, as the failover driver uses MAC address to
> >> > > match net_device internally.
> >>
> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
> >> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
> >> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
> >> don't allow changing guest MAC unless it is a trusted VF.
> >
> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> > For example I can see host just
> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
> 
> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
> got changed before virtio attempts to match and pair the device, it
> ends up with no pairing found out at all.

Guest seems to match on the hardware mac and ignore whatever
is set by user. Makes sense to me and should not be fragile.


> UUID is better.
> 
> -Siwei
> 
> >
> > --
> > MST
Siwei Liu June 27, 2018, 6:21 a.m. UTC | #89
On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> >> > > > > Might not neccessarily be something wrong, but it's very limited to
>> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> >> > > I think Sridhar and Jiri might be better person to answer it. My
>> >> > > impression was that sync'ing the MAC address change between all 3
>> >> > > devices is challenging, as the failover driver uses MAC address to
>> >> > > match net_device internally.
>> >>
>> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>> >> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>> >> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>> >> don't allow changing guest MAC unless it is a trusted VF.
>> >
>> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>> > For example I can see host just
>> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>
>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>> got changed before virtio attempts to match and pair the device, it
>> ends up with no pairing found out at all.
>
> Guest seems to match on the hardware mac and ignore whatever
> is set by user. Makes sense to me and should not be fragile.
Host can change the hardware mac for VF any time.

-Siwei
>
>
>> UUID is better.
>>
>> -Siwei
>>
>> >
>> > --
>> > MST
Samudrala, Sridhar June 27, 2018, 6:49 a.m. UTC | #90
On 6/26/2018 11:21 PM, Siwei Liu wrote:
> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>> Might not neccessarily be something wrong, but it's very limited to
>>>>>>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>>>>>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>>>>>>> I think Sridhar and Jiri might be better person to answer it. My
>>>>>>> impression was that sync'ing the MAC address change between all 3
>>>>>>> devices is challenging, as the failover driver uses MAC address to
>>>>>>> match net_device internally.
>>>>> Yes. The MAC address is assigned by the hypervisor and it needs to manage the movement
>>>>> of the MAC between the PF and VF.  Allowing the guest to change the MAC will require
>>>>> synchronization between the hypervisor and the PF/VF drivers. Most of the VF drivers
>>>>> don't allow changing guest MAC unless it is a trusted VF.
>>>> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>>>> For example I can see host just
>>>> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>>>> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>>> got changed before virtio attempts to match and pair the device, it
>>> ends up with no pairing found out at all.
>> Guest seems to match on the hardware mac and ignore whatever
>> is set by user. Makes sense to me and should not be fragile.
> Host can change the hardware mac for VF any time.

Live migration is initiated and controlled by the Host,  So the Source Host will
reset the MAC during live migration after unplugging the VF. This is to redirect the
VMs frames towards PF so that they can be received via virtio-net standby interface.
The destination host will set the VF MAC and plug the VF after live migration is
completed.

Allowing the guest to change the MAC will require the qemu/libvirt/mgmt layers to
track the MAC changes and replay that change after live migration.


>
> -Siwei
>>
>>> UUID is better.
>>>
>>> -Siwei
>>>
>>>> --
>>>> MST
Siwei Liu June 27, 2018, 7:03 a.m. UTC | #91
On Tue, Jun 26, 2018 at 11:49 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/26/2018 11:21 PM, Siwei Liu wrote:
>>
>> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin <mst@redhat.com>
>> wrote:
>>>
>>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>>>>
>>>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin <mst@redhat.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>>>>>>>>>
>>>>>>>>>> Might not neccessarily be something wrong, but it's very limited
>>>>>>>>>> to
>>>>>>>>>> prohibit the MAC of VF from changing when enslaved by failover.
>>>>>>>>>
>>>>>>>>> You mean guest changing MAC? I'm not sure why we prohibit that.
>>>>>>>>
>>>>>>>> I think Sridhar and Jiri might be better person to answer it. My
>>>>>>>> impression was that sync'ing the MAC address change between all 3
>>>>>>>> devices is challenging, as the failover driver uses MAC address to
>>>>>>>> match net_device internally.
>>>>>>
>>>>>> Yes. The MAC address is assigned by the hypervisor and it needs to
>>>>>> manage the movement
>>>>>> of the MAC between the PF and VF.  Allowing the guest to change the
>>>>>> MAC will require
>>>>>> synchronization between the hypervisor and the PF/VF drivers. Most of
>>>>>> the VF drivers
>>>>>> don't allow changing guest MAC unless it is a trusted VF.
>>>>>
>>>>> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>>>>> For example I can see host just
>>>>> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>>>>> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>>>
>>>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>>>> got changed before virtio attempts to match and pair the device, it
>>>> ends up with no pairing found out at all.
>>>
>>> Guest seems to match on the hardware mac and ignore whatever
>>> is set by user. Makes sense to me and should not be fragile.
>>
>> Host can change the hardware mac for VF any time.
>
>
> Live migration is initiated and controlled by the Host,  So the Source Host
> will
> reset the MAC during live migration after unplugging the VF. This is to
> redirect the
> VMs frames towards PF so that they can be received via virtio-net standby
> interface.
> The destination host will set the VF MAC and plug the VF after live
> migration is
> completed.
>
> Allowing the guest to change the MAC will require the qemu/libvirt/mgmt
> layers to
> track the MAC changes and replay that change after live migration.
>
If the failover's MAC is kept in sync with VF's MAC address change,
the VF on destination host can be paired using the permanent address
after plugging in, while failover interface will resync the MAC to the
current one in use when enslaving the VF. I think similar is done for
multicast and unicast address list on VF's registration, right? No
need of QEMU or mgmt software keep track of MAC changes.

-Siwei

>
>
>>
>> -Siwei
>>>
>>>
>>>> UUID is better.
>>>>
>>>> -Siwei
>>>>
>>>>> --
>>>>> MST
>
>
Cornelia Huck June 27, 2018, 9:11 a.m. UTC | #92
On Tue, 26 Jun 2018 20:50:20 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 26, 2018 at 05:08:13PM +0200, Cornelia Huck wrote:
> > On Fri, 22 Jun 2018 17:05:04 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >   
> > > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin <mst@redhat.com> wrote:  
> > > > I suspect the diveregence will be lost on most users though
> > > > simply because they don't even care about vfio. They just
> > > > want things to go fast.    
> > > 
> > > Like Jason said, VF isn't faster than virtio-net in all cases. It
> > > depends on the workload and performance metrics: throughput, latency,
> > > or packet per second.  
> > 
> > So, will it be guest/admin-controllable then where the traffic flows
> > through? Just because we do have a vf available after negotiation of
> > the feature bit, it does not necessarily mean we want to use it? Do we
> > (the guest) even want to make it visible in that case?  
> 
> I think these ideas belong to what Alex Duyck wanted to do:
> some kind of advanced device that isn't tied to
> any network interfaces and allows workload and performance
> specific tuning.
> 
> Way out of scope for a simple failover, and more importantly,
> no one is looking at even enumerating the problems involved,
> much less solving them.

So, for simplicity's sake, we need to rely on the host admin
configuring the vm for its guest's intended use case. Sounds fair, but
probably needs a note somewhere.
Cornelia Huck June 27, 2018, 10:10 a.m. UTC | #93
On Sat, 23 Jun 2018 00:43:24 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > Would it be more helpful to focus on generic
> > migration support for vfio instead of going about it device by device?  
> 
> Just to note this approach is actually device by device *type*.  It's
> mostly device agnostic for a given device type so you can migrate
> between hosts with very different hardware.

This enables heterogeneous environments, yes.

But one drawback of that is that you cannot exploit any hardware
specialities - it seems you're limited to what the paravirtual device
supports. This is limiting for more homogeneous environments.

> 
> And support for more PV device types has other advantages
> such as security and forward compatibility to future hosts.

But again the drawback is that we can't exploit new capabilities
easily, can we?

> 
> Finally, it all can happen mostly within QEMU. User is currently
> required to enable it but it's pretty lightweight.
> 
> OTOH vfio migration generally requires actual device-specific work, and
> only works when hosts are mostly identical. When they aren't it's easy
> to blame the user, but tools for checking host compatiblity are
> currently non-existent. Upper layer management will also have to learn
> about host and device compatibility wrt migration. At the moment they
> can't even figure it out wrt software versions of vhost in kernel and
> dpdk so I won't hold my breath for all of this happening quickly.

Yes, that's a real problem.

I think one issue here is that we want to support really different
environments. For the case here, we have a lot of different networking
adapters, but the guests are basically interested in one thing: doing
network traffic. On the other hand, I'm thinking of the mainframe
environment, where we have a very limited set of devices to support,
but at the same time want to exploit their specialities, so the pv
approach is limiting. For that use case, generic migration looks more
useful.

Patch
diff mbox

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 90502fca7c..38b3140670 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2198,6 +2198,8 @@  static Property virtio_net_properties[] = {
                      true),
     DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
     DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
+    DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
+                      false),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index e9f255ea3f..01ec09684c 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,9 @@ 
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY      62    /* Act as standby for another device
+                                         * with the same MAC.
+                                         */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY