[QEMU,v2,0/2] KVM: i386: Add support for save and restore nested state

Message ID 20180916124631.39016-1-liran.alon@oracle.com (mailing list archive)
Series: KVM: i386: Add support for save and restore nested state

Message

Liran Alon Sept. 16, 2018, 12:46 p.m. UTC
Hi,

This series aims to add support for QEMU to be able to migrate VMs that
are running nested hypervisors. In order to do so, it utilizes the new
IOCTLs introduced in KVM commit 8fcc4b5923af ("kvm: nVMX: Introduce
KVM_CAP_NESTED_STATE") which were created for this purpose.

The 1st patch is not really related to the goal of the series. It just
makes CPUX86State->xsave_buf be compiled only when needed (when
compiling with the KVM or HVF CPU accelerator).

The 2nd patch adds support for migrating VMs that are running nested
hypervisors.

Regards,
-Liran

v1->v2 Changes:
* Renamed kvm_nested_state_length() to kvm_max_nested_state_length()
to better indicate that it represents the max nested state size that can
be returned from the kernel.
* Added error_report() calls to nested_state_post_load() to make
failures in migration easier to diagnose.
* Fixed support for migrating with various nested_state buffer sizes.
The following scenarios were tested:
(a) src and dest have the same nested state size.
	==> Migration succeeds.
(b) src doesn't have nested state while dest does.
	==> Migration succeeds and src doesn't send its nested state.
(c) src has nested state while dest doesn't.
	==> Migration fails as it cannot restore nested state.
(d) dest has a bigger max nested state size than src.
	==> Migration succeeds.
(e) dest has a smaller max nested state size than src but enough to store its saved nested state.
	==> Migration succeeds.
(f) dest has a smaller max nested state size than src but not enough to store its saved nested state.
	==> Migration fails.

Comments

Liran Alon Oct. 8, 2018, 5:21 p.m. UTC | #1
Gentle ping on v2 of this series.
(I noticed the 1st patch of the series was already applied.)

Thanks,
-Liran

> On 16 Sep 2018, at 15:46, Liran Alon <liran.alon@oracle.com> wrote:
> 
> Hi,
> 
> This series aims to add support for QEMU to be able to migrate VMs that
> are running nested hypervisors. In order to do so, it utilizes the new
> IOCTLs introduced in KVM commit 8fcc4b5923af ("kvm: nVMX: Introduce
> KVM_CAP_NESTED_STATE") which were created for this purpose.
> 
> 1st patch is not really related to the goal of the patch series. It just
> makes CPUX86State->xsave_buf to be compiled only when needed (When
> compiling with KVM or HVF CPU accelerator).
> 
> 2nd patch adds the support to migrate VMs that are running nested
> hypervisors.
> 
> Regards,
> -Liran
> 
> v1->v2 Changes:
> * Renamed kvm_nested_state_length() to kvm_max_nested_state_length()
> to better indicate it represents the max nested state size that can
> be returned from kernel.
> * Added error_report() calls to nested_state_post_load() to make
> failures in migration easier to diagnose.
> * Fixed support of migrating with various nested_state buffer sizes.
> The following scenarios were tested:
> (a) src and dest have same nested state size.
> 	==> Migration succeeds.
> (b) src don't have nested state while dest do.
> 	==> Migration succeed and src don't send it's nested state.
> (c) src have nested state while dest don't.
> 	==> Migration fails as it cannot restore nested state.
> (d) dest have bigger max nested state size than src
> 	==> Migration succeeds.
> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
> 	==> Migration succeeds
> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
> 	==> Migration fails
>
Liran Alon Oct. 15, 2018, 6:10 p.m. UTC | #2
Another gentle ping on v2 of this series.
The patch series was submitted a month ago.

Thanks,
-Liran

> On 8 Oct 2018, at 20:21, Liran Alon <liran.alon@oracle.com> wrote:
> 
> Gentle ping on v2 of this series.
> (I noticed 1st patch of series was already applied)
> 
> Thanks,
> -Liran
> 
>> On 16 Sep 2018, at 15:46, Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> Hi,
>> 
>> This series aims to add support for QEMU to be able to migrate VMs that
>> are running nested hypervisors. In order to do so, it utilizes the new
>> IOCTLs introduced in KVM commit 8fcc4b5923af ("kvm: nVMX: Introduce
>> KVM_CAP_NESTED_STATE") which were created for this purpose.
>> 
>> 1st patch is not really related to the goal of the patch series. It just
>> makes CPUX86State->xsave_buf to be compiled only when needed (When
>> compiling with KVM or HVF CPU accelerator).
>> 
>> 2nd patch adds the support to migrate VMs that are running nested
>> hypervisors.
>> 
>> Regards,
>> -Liran
>> 
>> v1->v2 Changes:
>> * Renamed kvm_nested_state_length() to kvm_max_nested_state_length()
>> to better indicate it represents the max nested state size that can
>> be returned from kernel.
>> * Added error_report() calls to nested_state_post_load() to make
>> failures in migration easier to diagnose.
>> * Fixed support of migrating with various nested_state buffer sizes.
>> The following scenarios were tested:
>> (a) src and dest have same nested state size.
>> 	==> Migration succeeds.
>> (b) src don't have nested state while dest do.
>> 	==> Migration succeed and src don't send it's nested state.
>> (c) src have nested state while dest don't.
>> 	==> Migration fails as it cannot restore nested state.
>> (d) dest have bigger max nested state size than src
>> 	==> Migration succeeds.
>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
>> 	==> Migration succeeds
>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
>> 	==> Migration fails
>> 
>
Liran Alon Oct. 31, 2018, 1:03 a.m. UTC | #3
Ping.
The patch series was submitted almost two months ago and I haven't seen any response to v2 of this series.

Thanks,
-Liran

> On 15 Oct 2018, at 21:10, Liran Alon <liran.alon@oracle.com> wrote:
> 
> Another gentle ping on v2 of this series?
> Patch was submitted a month ago.
> 
> Thanks,
> -Liran
> 
>> On 8 Oct 2018, at 20:21, Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> Gentle ping on v2 of this series.
>> (I noticed 1st patch of series was already applied)
>> 
>> Thanks,
>> -Liran
>> 
>>> On 16 Sep 2018, at 15:46, Liran Alon <liran.alon@oracle.com> wrote:
>>> 
>>> Hi,
>>> 
>>> This series aims to add support for QEMU to be able to migrate VMs that
>>> are running nested hypervisors. In order to do so, it utilizes the new
>>> IOCTLs introduced in KVM commit 8fcc4b5923af ("kvm: nVMX: Introduce
>>> KVM_CAP_NESTED_STATE") which were created for this purpose.
>>> 
>>> 1st patch is not really related to the goal of the patch series. It just
>>> makes CPUX86State->xsave_buf to be compiled only when needed (When
>>> compiling with KVM or HVF CPU accelerator).
>>> 
>>> 2nd patch adds the support to migrate VMs that are running nested
>>> hypervisors.
>>> 
>>> Regards,
>>> -Liran
>>> 
>>> v1->v2 Changes:
>>> * Renamed kvm_nested_state_length() to kvm_max_nested_state_length()
>>> to better indicate it represents the max nested state size that can
>>> be returned from kernel.
>>> * Added error_report() calls to nested_state_post_load() to make
>>> failures in migration easier to diagnose.
>>> * Fixed support of migrating with various nested_state buffer sizes.
>>> The following scenarios were tested:
>>> (a) src and dest have same nested state size.
>>> 	==> Migration succeeds.
>>> (b) src don't have nested state while dest do.
>>> 	==> Migration succeed and src don't send it's nested state.
>>> (c) src have nested state while dest don't.
>>> 	==> Migration fails as it cannot restore nested state.
>>> (d) dest have bigger max nested state size than src
>>> 	==> Migration succeeds.
>>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
>>> 	==> Migration succeeds
>>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
>>> 	==> Migration fails
>>> 
>> 
>
Eduardo Habkost Oct. 31, 2018, 6:17 p.m. UTC | #4
On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
> Ping.
> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.

Sorry for the long delay.  This was on my queue of patches to be
reviewed, but I'm failing to keep up with the rate of incoming
patches.  I will try to review the series next week.
Paolo Bonzini Oct. 31, 2018, 6:19 p.m. UTC | #5
On 31/10/2018 19:17, Eduardo Habkost wrote:
> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
>> Ping.
>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
> Sorry for the long delay.  This was on my queue of patches to be
> reviewed, but I'm failing to keep up to the rate of incoming
> patches.  I will try to review the series next week.

I have already reviewed it; unfortunately I have missed the soft freeze
for posting the version I had also been working on when Liran posted
these patches.

Paolo
Liran Alon Oct. 31, 2018, 6:50 p.m. UTC | #6
> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 31/10/2018 19:17, Eduardo Habkost wrote:
>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
>>> Ping.
>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
>> Sorry for the long delay.  This was on my queue of patches to be
>> reviewed, but I'm failing to keep up to the rate of incoming
>> patches.  I will try to review the series next week.
> 
> I have already reviewed it; unfortunately I have missed the soft freeze
> for posting the version I had also been working on when Liran posted
> these patches.
> 
> Paolo

Paolo, note that this is v2 of this patch series. It's not the one you have already reviewed.
It now correctly handles the case you mentioned in your review of v1: migrating with various nested_state buffer sizes.
The following scenarios were tested:
(a) src and dest have the same nested state size.
	==> Migration succeeds.
(b) src doesn't have nested state while dest does.
	==> Migration succeeds and src doesn't send its nested state.
(c) src has nested state while dest doesn't.
	==> Migration fails as it cannot restore nested state.
(d) dest has a bigger max nested state size than src.
	==> Migration succeeds.
(e) dest has a smaller max nested state size than src but enough to store its saved nested state.
	==> Migration succeeds.
(f) dest has a smaller max nested state size than src but not enough to store its saved nested state.
	==> Migration fails.

-Liran
Dr. David Alan Gilbert Oct. 31, 2018, 6:59 p.m. UTC | #7
* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > On 31/10/2018 19:17, Eduardo Habkost wrote:
> >> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
> >>> Ping.
> >>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
> >> Sorry for the long delay.  This was on my queue of patches to be
> >> reviewed, but I'm failing to keep up to the rate of incoming
> >> patches.  I will try to review the series next week.
> > 
> > I have already reviewed it; unfortunately I have missed the soft freeze
> > for posting the version I had also been working on when Liran posted
> > these patches.
> > 
> > Paolo
> 
> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
> The following scenarios were tested:
> (a) src and dest have same nested state size.
> 	==> Migration succeeds.
> (b) src don't have nested state while dest do.
> 	==> Migration succeed and src don't send it's nested state.
> (c) src have nested state while dest don't.
> 	==> Migration fails as it cannot restore nested state.
> (d) dest have bigger max nested state size than src
> 	==> Migration succeeds.
> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
> 	==> Migration succeeds
> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
> 	==> Migration fails

Is it possible to tell these limits before the start of the migration,
or do we only find out that a nested migration won't work by trying it?

Dave

> -Liran
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Liran Alon Oct. 31, 2018, 11:17 p.m. UTC | #8
> On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> 
>> 
>>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> 
>>> On 31/10/2018 19:17, Eduardo Habkost wrote:
>>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
>>>>> Ping.
>>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
>>>> Sorry for the long delay.  This was on my queue of patches to be
>>>> reviewed, but I'm failing to keep up to the rate of incoming
>>>> patches.  I will try to review the series next week.
>>> 
>>> I have already reviewed it; unfortunately I have missed the soft freeze
>>> for posting the version I had also been working on when Liran posted
>>> these patches.
>>> 
>>> Paolo
>> 
>> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
>> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
>> The following scenarios were tested:
>> (a) src and dest have same nested state size.
>> 	==> Migration succeeds.
>> (b) src don't have nested state while dest do.
>> 	==> Migration succeed and src don't send it's nested state.
>> (c) src have nested state while dest don't.
>> 	==> Migration fails as it cannot restore nested state.
>> (d) dest have bigger max nested state size than src
>> 	==> Migration succeeds.
>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
>> 	==> Migration succeeds
>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
>> 	==> Migration fails
> 
> Is it possible to tell these limits before the start of the migration,
> or do we only find out that a nested migration won't work by trying it?
> 
> Dave

It is possible for the destination host to query its max nested state size.
(This is what is returned from kvm_check_extension(s, KVM_CAP_NESTED_STATE); see the kvm_init() code.)
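
For reference, a minimal standalone sketch (not the QEMU code path) of that query at the raw ioctl level; it assumes kernel headers new enough to define KVM_CAP_NESTED_STATE:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    int max_size;

    if (kvm_fd < 0) {
        perror("open /dev/kvm");
        return 1;
    }
    /* Returns 0 if the capability is absent, otherwise the maximum
     * size in bytes that KVM_GET_NESTED_STATE may need. */
    max_size = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NESTED_STATE);
    printf("max nested state size: %d bytes\n", max_size > 0 ? max_size : 0);
    close(kvm_fd);
    return 0;
}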

However, I didn't find any suitable mechanism in the QEMU live-migration code to negotiate
this destination host information with the source prior to migration. That kind of makes sense, as
to my understanding this code is also used in the standard suspend/resume flow. In that scenario,
there is no other side to negotiate with.

So currently what happens in my patch is that the source prepares the nested state buffer and sends it to the destination as part of the VMState,
and the destination attempts to load this nested state in its nested_state_post_load() function.
If the destination kernel cannot handle loading the nested state it has received from the source, it fails the migration by returning
failure from nested_state_post_load().
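
As a rough sketch of that destination-side flow (QEMU-style, not the patch itself; the helper name kvm_put_nested_state() is assumed for illustration):

static int nested_state_post_load(void *opaque, int version_id)
{
    X86CPU *cpu = opaque;

    /*
     * Hand the nested state received in the migration stream to the
     * kernel via KVM_SET_NESTED_STATE. If the destination kernel
     * cannot restore it, fail the incoming migration; the guest keeps
     * running at the source.
     */
    if (kvm_put_nested_state(cpu) < 0) {
        error_report("Failed to restore nested state, failing migration");
        return -EINVAL;
    }
    return 0;
}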

Do you have a better suggestion?

-Liran

> 
>> -Liran
>> 
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 1, 2018, 1:10 p.m. UTC | #9
* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Liran Alon (liran.alon@oracle.com) wrote:
> >> 
> >> 
> >>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> 
> >>> On 31/10/2018 19:17, Eduardo Habkost wrote:
> >>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
> >>>>> Ping.
> >>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
> >>>> Sorry for the long delay.  This was on my queue of patches to be
> >>>> reviewed, but I'm failing to keep up to the rate of incoming
> >>>> patches.  I will try to review the series next week.
> >>> 
> >>> I have already reviewed it; unfortunately I have missed the soft freeze
> >>> for posting the version I had also been working on when Liran posted
> >>> these patches.
> >>> 
> >>> Paolo
> >> 
> >> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
> >> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
> >> The following scenarios were tested:
> >> (a) src and dest have same nested state size.
> >> 	==> Migration succeeds.
> >> (b) src don't have nested state while dest do.
> >> 	==> Migration succeed and src don't send it's nested state.
> >> (c) src have nested state while dest don't.
> >> 	==> Migration fails as it cannot restore nested state.
> >> (d) dest have bigger max nested state size than src
> >> 	==> Migration succeeds.
> >> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
> >> 	==> Migration succeeds
> >> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
> >> 	==> Migration fails
> > 
> > Is it possible to tell these limits before the start of the migration,
> > or do we only find out that a nested migration won't work by trying it?
> > 
> > Dave
> 
> It is possible for the destination host to query what is it’s max nested state size.
> (This is what is returned from "kvm_check_extension(s, KVM_CAP_NESTED_STATE);” See kvm_init() code)

Is this max size a function of:
  a) The host CPU
  b) The host kernel
  c) Some configuration

or all of those?

What about the maximum size that will be sent?

> However, I didn’t find any suitable mechanism in QEMU Live-Migration code to negotiate
> this destination host information with source prior to migration. Which kinda makes sense as
> this code is also used to my understanding in standard suspend/resume flow. In that scenario,
> there is no other side to negotiate with.

At the moment, if the user has appropriately configured their QEMU and
there are no IO errors, the migration should not fail; and the
management layer can be responsible for configuring the QEMU and checking
compatibilities -

> So currently what happens in my patch is that source prepares the nested state buffer and sends it to destination as part of VMState,
> and destination attempts to load this nested state in it’s nested_state_post_load() function.
> If destination kernel cannot handle loading the nested state it has received from source, it fails the migration by returning
> failure from nested_state_post_load().

That's minimally OK, but ideally we'd have a way of knowing before we
start the migration whether it is going to work; that way the management
layer can refuse it before it spends ages transferring data and then
trying to switch over - recovery from a failed migration can be a bit
tricky/risky.

I can't see it being much use in a cloud environment if the migration
will sometimes fail without the cloud admin being able to do something
about it.

> Do you have a better suggestion?

I'll need to understand a bit more about the nested state.

Dave

> -Liran
> 
> > 
> >> -Liran
> >> 
> >> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Liran Alon Nov. 1, 2018, 3:23 p.m. UTC | #10
> On 1 Nov 2018, at 15:10, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> 
>> 
>>> On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>> 
>>> * Liran Alon (liran.alon@oracle.com) wrote:
>>>> 
>>>> 
>>>>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> 
>>>>> On 31/10/2018 19:17, Eduardo Habkost wrote:
>>>>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
>>>>>>> Ping.
>>>>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
>>>>>> Sorry for the long delay.  This was on my queue of patches to be
>>>>>> reviewed, but I'm failing to keep up to the rate of incoming
>>>>>> patches.  I will try to review the series next week.
>>>>> 
>>>>> I have already reviewed it; unfortunately I have missed the soft freeze
>>>>> for posting the version I had also been working on when Liran posted
>>>>> these patches.
>>>>> 
>>>>> Paolo
>>>> 
>>>> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
>>>> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
>>>> The following scenarios were tested:
>>>> (a) src and dest have same nested state size.
>>>> 	==> Migration succeeds.
>>>> (b) src don't have nested state while dest do.
>>>> 	==> Migration succeed and src don't send it's nested state.
>>>> (c) src have nested state while dest don't.
>>>> 	==> Migration fails as it cannot restore nested state.
>>>> (d) dest have bigger max nested state size than src
>>>> 	==> Migration succeeds.
>>>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
>>>> 	==> Migration succeeds
>>>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
>>>> 	==> Migration fails
>>> 
>>> Is it possible to tell these limits before the start of the migration,
>>> or do we only find out that a nested migration won't work by trying it?
>>> 
>>> Dave
>> 
>> It is possible for the destination host to query what is it’s max nested state size.
>> (This is what is returned from "kvm_check_extension(s, KVM_CAP_NESTED_STATE);” See kvm_init() code)
> 
> Is this max size a function of:
>  a) The host CPU
>  b) The host kernel
>  c) Some configuration
> 
> or all of those?
> 
> What about the maximum size that will be sent?

The max size is a function of (b). It depends on your KVM capabilities.
The size that will be sent is also the max size at the source.

> 
>> However, I didn’t find any suitable mechanism in QEMU Live-Migration code to negotiate
>> this destination host information with source prior to migration. Which kinda makes sense as
>> this code is also used to my understanding in standard suspend/resume flow. In that scenario,
>> there is no other side to negotiate with.
> 
> At the moment, if the user has appropriately configured their QEMU and
> their are no IO errors, the migration should not fail; and the
> management layer can responsible for configuring the QEMU and checking
> compatibilities - 

I agree with this. The post_load() checks I have done are extra measures to prevent a failed migration.
The management layer can perform extra work before the migration to make sure that the source can actually migrate to the destination,
taking the max size of the nested state into account.

> 
>> So currently what happens in my patch is that source prepares the nested state buffer and sends it to destination as part of VMState,
>> and destination attempts to load this nested state in it’s nested_state_post_load() function.
>> If destination kernel cannot handle loading the nested state it has received from source, it fails the migration by returning
>> failure from nested_state_post_load().
> 
> That's minimally OK, but ideally we'd have a way of knowing before we
> start the migration if it was going to work, that way the management
> layer can refuse it before it spends ages transferring data and then
> trying to switch over - recovery from a failed migration can be a bit
> tricky/risky.
> 
> I can't see it being much use in a cloud environment if the migration
> will sometimes fail without the cloud admin being able to do something
> about it.

With the current QEMU Live-Migration code, I believe this is the responsibility of the Control-Plane.
It's no different from the fact that the Control-Plane is also responsible for launching the destination QEMU with the same vHW configuration
as the source QEMU. If there were some kind of negotiation mechanism in QEMU that verified these vHW configurations match *before*
starting the migration (and not as part of the Control-Plane), I would have also added the max size of the nested state to that negotiation.
But with the current mechanisms, this is solely the responsibility of the Control-Plane.

> 
>> Do you have a better suggestion?
> 
> I'll need to understand a bit more about the nested state.

Feel free to ask more questions about it and I will try to answer.

Thanks,
-Liran

> 
> Dave
> 
>> -Liran
>> 
>>> 
>>>> -Liran
>>>> 
>>>> 
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 1, 2018, 3:56 p.m. UTC | #11
* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 1 Nov 2018, at 15:10, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Liran Alon (liran.alon@oracle.com) wrote:
> >> 
> >> 
> >>> On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>> 
> >>> * Liran Alon (liran.alon@oracle.com) wrote:
> >>>> 
> >>>> 
> >>>>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>> 
> >>>>> On 31/10/2018 19:17, Eduardo Habkost wrote:
> >>>>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
> >>>>>>> Ping.
> >>>>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
> >>>>>> Sorry for the long delay.  This was on my queue of patches to be
> >>>>>> reviewed, but I'm failing to keep up to the rate of incoming
> >>>>>> patches.  I will try to review the series next week.
> >>>>> 
> >>>>> I have already reviewed it; unfortunately I have missed the soft freeze
> >>>>> for posting the version I had also been working on when Liran posted
> >>>>> these patches.
> >>>>> 
> >>>>> Paolo
> >>>> 
> >>>> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
> >>>> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
> >>>> The following scenarios were tested:
> >>>> (a) src and dest have same nested state size.
> >>>> 	==> Migration succeeds.
> >>>> (b) src don't have nested state while dest do.
> >>>> 	==> Migration succeed and src don't send it's nested state.
> >>>> (c) src have nested state while dest don't.
> >>>> 	==> Migration fails as it cannot restore nested state.
> >>>> (d) dest have bigger max nested state size than src
> >>>> 	==> Migration succeeds.
> >>>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
> >>>> 	==> Migration succeeds
> >>>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
> >>>> 	==> Migration fails
> >>> 
> >>> Is it possible to tell these limits before the start of the migration,
> >>> or do we only find out that a nested migration won't work by trying it?
> >>> 
> >>> Dave
> >> 
> >> It is possible for the destination host to query what is it’s max nested state size.
> >> (This is what is returned from "kvm_check_extension(s, KVM_CAP_NESTED_STATE);” See kvm_init() code)
> > 
> > Is this max size a function of:
> >  a) The host CPU
> >  b) The host kernel
> >  c) Some configuration
> > 
> > or all of those?
> > 
> > What about the maximum size that will be sent?
> 
> The max size is a function of (b). It depends on your KVM capabilities.
> This size that will be sent is also the max size at source.

So if I have matching host kernels it should always work?
What happens if I upgrade the source kernel to increase its maximum
nested size? Can I force it to keep things small for some VMs?

> > 
> >> However, I didn’t find any suitable mechanism in QEMU Live-Migration code to negotiate
> >> this destination host information with source prior to migration. Which kinda makes sense as
> >> this code is also used to my understanding in standard suspend/resume flow. In that scenario,
> >> there is no other side to negotiate with.
> > 
> > At the moment, if the user has appropriately configured their QEMU and
> > their are no IO errors, the migration should not fail; and the
> > management layer can responsible for configuring the QEMU and checking
> > compatibilities - 
> 
> I agree with this. The post_load() checks I have done are extra measures to prevent a failing migration.
> Management layer can perform extra work before the migration to make sure that source can actually migrate to destination.
> Taking the max size of the nested state into account.
> 
> > 
> >> So currently what happens in my patch is that source prepares the nested state buffer and sends it to destination as part of VMState,
> >> and destination attempts to load this nested state in it’s nested_state_post_load() function.
> >> If destination kernel cannot handle loading the nested state it has received from source, it fails the migration by returning
> >> failure from nested_state_post_load().
> > 
> > That's minimally OK, but ideally we'd have a way of knowing before we
> > start the migration if it was going to work, that way the management
> > layer can refuse it before it spends ages transferring data and then
> > trying to switch over - recovery from a failed migration can be a bit
> > tricky/risky.
> > 
> > I can't see it being much use in a cloud environment if the migration
> > will sometimes fail without the cloud admin being able to do something
> > about it.
> 
> With current QEMU Live-Migration code, I believe this is in the responsibility of the Control-Plane.
> It’s not different from the fact that Control-Plane is also responsible for launching the destination QEMU with same vHW configuration
> as the source QEMU. If there was some kind of negotiation mechanism in QEMU that verifies these vHW configuration matches *before*
> starting the migration (and not part of Control-Plane), I would have also added to that negotiation to consider the max size of nested state.
> But with current mechanisms, this is solely in the responsibility of the Control-Plane.

Yep, that's fine - all we have to make sure of is that:
  a) The control plane can somehow find out the maximums (via qemu or
something else on the host) so it can make that decision.
  b) We think about how upgrades will work when one host is newer.

Dave

> > 
> >> Do you have a better suggestion?
> > 
> > I'll need to understand a bit more about the nested state.
> 
> Feel free to ask more questions about it and I will try to answer.
> 
> Thanks,
> -Liran
> 
> > 
> > Dave
> > 
> >> -Liran
> >> 
> >>> 
> >>>> -Liran
> >>>> 
> >>>> 
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
"Denis V. Lunev" via Nov. 1, 2018, 4:45 p.m. UTC | #12
On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:

> So if I have matching host kernels it should always work?
> What happens if I upgrade the source kernel to increase it's maximum
> nested size, can I force it to keep things small for some VMs?

Any change to the format of the nested state should be gated by a
KVM_CAP set by userspace. (Unlike, say, how the
KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
in commit f077825a8758d.) KVM has traditionally been quite bad about
maintaining backwards compatibility, but I hope the community is more
cognizant of the issues now.

As a cloud provider, one would only enable the new capability from
userspace once all hosts in the pool have a kernel that supports it.
During the transition, the capability would not be enabled on the
hosts with a new kernel, and these hosts would continue to provide
nested state that could be consumed by hosts running the older kernel.
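
For context, opting in to a capability from userspace goes through the KVM_ENABLE_CAP ioctl. A minimal sketch (whether a given capability is enabled on the VM or the vCPU file descriptor depends on the capability):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int enable_cap(int fd, unsigned int cap_nr)
{
    struct kvm_enable_cap cap = {
        .cap = cap_nr,
    };

    /* Returns 0 on success; the new state format stays off unless
     * userspace explicitly asks for it. */
    return ioctl(fd, KVM_ENABLE_CAP, &cap);
}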
Liran Alon Nov. 1, 2018, 7:03 p.m. UTC | #13
> On 1 Nov 2018, at 17:56, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Liran Alon (liran.alon@oracle.com) wrote:
>> 
>> 
>>> On 1 Nov 2018, at 15:10, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>> 
>>> * Liran Alon (liran.alon@oracle.com) wrote:
>>>> 
>>>> 
>>>>> On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>> 
>>>>> * Liran Alon (liran.alon@oracle.com) wrote:
>>>>>> 
>>>>>> 
>>>>>>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>>>> 
>>>>>>> On 31/10/2018 19:17, Eduardo Habkost wrote:
>>>>>>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
>>>>>>>>> Ping.
>>>>>>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
>>>>>>>> Sorry for the long delay.  This was on my queue of patches to be
>>>>>>>> reviewed, but I'm failing to keep up to the rate of incoming
>>>>>>>> patches.  I will try to review the series next week.
>>>>>>> 
>>>>>>> I have already reviewed it; unfortunately I have missed the soft freeze
>>>>>>> for posting the version I had also been working on when Liran posted
>>>>>>> these patches.
>>>>>>> 
>>>>>>> Paolo
>>>>>> 
>>>>>> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
>>>>>> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
>>>>>> The following scenarios were tested:
>>>>>> (a) src and dest have same nested state size.
>>>>>> 	==> Migration succeeds.
>>>>>> (b) src don't have nested state while dest do.
>>>>>> 	==> Migration succeed and src don't send it's nested state.
>>>>>> (c) src have nested state while dest don't.
>>>>>> 	==> Migration fails as it cannot restore nested state.
>>>>>> (d) dest have bigger max nested state size than src
>>>>>> 	==> Migration succeeds.
>>>>>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
>>>>>> 	==> Migration succeeds
>>>>>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
>>>>>> 	==> Migration fails
>>>>> 
>>>>> Is it possible to tell these limits before the start of the migration,
>>>>> or do we only find out that a nested migration won't work by trying it?
>>>>> 
>>>>> Dave
>>>> 
>>>> It is possible for the destination host to query what is it’s max nested state size.
>>>> (This is what is returned from "kvm_check_extension(s, KVM_CAP_NESTED_STATE);” See kvm_init() code)
>>> 
>>> Is this max size a function of:
>>> a) The host CPU
>>> b) The host kernel
>>> c) Some configuration
>>> 
>>> or all of those?
>>> 
>>> What about the maximum size that will be sent?
>> 
>> The max size is a function of (b). It depends on your KVM capabilities.
>> This size that will be sent is also the max size at source.
> 
> So if I have matching host kernels it should always work?

Yes.

> What happens if I upgrade the source kernel to increase it's maximum
> nested size, can I force it to keep things small for some VMs?

Currently, the IOCTL which saves the nested state has only a single version, which could potentially fill the entire size (depending on the current guest state).
Saving the nested state obviously always attempts to save all the relevant information it has, because otherwise you deliberately don't transfer to the destination part of the
state it needs to continue running the migrated guest correctly at the destination.

It is true that I do expect future versions of this IOCTL to be able to “set nested state” of older versions (smaller ones which lack some information),
but I do expect migration to fail gracefully (and continue running the guest at the source) if the destination is not capable of restoring all the state sent from the source
(when the destination's max nested state size is smaller than the source's nested state size).

> 
>>> 
>>>> However, I didn’t find any suitable mechanism in QEMU Live-Migration code to negotiate
>>>> this destination host information with source prior to migration. Which kinda makes sense as
>>>> this code is also used to my understanding in standard suspend/resume flow. In that scenario,
>>>> there is no other side to negotiate with.
>>> 
>>> At the moment, if the user has appropriately configured their QEMU and
>>> their are no IO errors, the migration should not fail; and the
>>> management layer can responsible for configuring the QEMU and checking
>>> compatibilities - 
>> 
>> I agree with this. The post_load() checks I have done are extra measures to prevent a failing migration.
>> Management layer can perform extra work before the migration to make sure that source can actually migrate to destination.
>> Taking the max size of the nested state into account.
>> 
>>> 
>>>> So currently what happens in my patch is that source prepares the nested state buffer and sends it to destination as part of VMState,
>>>> and destination attempts to load this nested state in it’s nested_state_post_load() function.
>>>> If destination kernel cannot handle loading the nested state it has received from source, it fails the migration by returning
>>>> failure from nested_state_post_load().
>>> 
>>> That's minimally OK, but ideally we'd have a way of knowing before we
>>> start the migration if it was going to work, that way the management
>>> layer can refuse it before it spends ages transferring data and then
>>> trying to switch over - recovery from a failed migration can be a bit
>>> tricky/risky.
>>> 
>>> I can't see it being much use in a cloud environment if the migration
>>> will sometimes fail without the cloud admin being able to do something
>>> about it.
>> 
>> With current QEMU Live-Migration code, I believe this is in the responsibility of the Control-Plane.
>> It’s not different from the fact that Control-Plane is also responsible for launching the destination QEMU with same vHW configuration
>> as the source QEMU. If there was some kind of negotiation mechanism in QEMU that verifies these vHW configuration matches *before*
>> starting the migration (and not part of Control-Plane), I would have also added to that negotiation to consider the max size of nested state.
>> But with current mechanisms, this is solely in the responsibility of the Control-Plane.
> 
> Yep, that's fine - all we have to make sure of is that:
>  a) The control plane can somehow find out the maximums (via qemu or
> something else on the host) so it can make that decision.

This is easy for QEMU or something else on the host to verify. It is just a matter of sending an IOCTL to KVM.

>  b) Thing about how upgrades will work when one host is newer.

I made sure that it is possible for the destination to accept a nested state of a size smaller than its max nested state size.
Therefore, it should allow upgrades from one host to a newer host
(if, of course, the IOCTL which “sets nested state” is written correctly, in a compatible way).
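
To illustrate that rule (the names below are assumptions, not taken from the patch): the destination only has to reject blobs larger than its own maximum, so state saved by an older kernel with a smaller maximum still loads:

static bool nested_state_size_ok(uint32_t saved_size, uint32_t dest_max_size)
{
    if (saved_size > dest_max_size) {
        error_report("Nested state is %u bytes but this host supports "
                     "at most %u bytes", saved_size, dest_max_size);
        return false;
    }
    return true;
}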

-Liran

> 
> Dave
> 
>>> 
>>>> Do you have a better suggestion?
>>> 
>>> I'll need to understand a bit more about the nested state.
>> 
>> Feel free to ask more questions about it and I will try to answer.
>> 
>> Thanks,
>> -Liran
>> 
>>> 
>>> Dave
>>> 
>>>> -Liran
>>>> 
>>>>> 
>>>>>> -Liran
>>>>>> 
>>>>>> 
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>> 
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Dr. David Alan Gilbert Nov. 1, 2018, 7:07 p.m. UTC | #14
* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 1 Nov 2018, at 17:56, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * Liran Alon (liran.alon@oracle.com) wrote:
> >> 
> >> 
> >>> On 1 Nov 2018, at 15:10, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>> 
> >>> * Liran Alon (liran.alon@oracle.com) wrote:
> >>>> 
> >>>> 
> >>>>> On 31 Oct 2018, at 20:59, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>> 
> >>>>> * Liran Alon (liran.alon@oracle.com) wrote:
> >>>>>> 
> >>>>>> 
> >>>>>>> On 31 Oct 2018, at 20:19, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> On 31/10/2018 19:17, Eduardo Habkost wrote:
> >>>>>>>> On Wed, Oct 31, 2018 at 03:03:34AM +0200, Liran Alon wrote:
> >>>>>>>>> Ping.
> >>>>>>>>> Patch was submitted almost two months ago and I haven’t seen any respond for the v2 of this series.
> >>>>>>>> Sorry for the long delay.  This was on my queue of patches to be
> >>>>>>>> reviewed, but I'm failing to keep up to the rate of incoming
> >>>>>>>> patches.  I will try to review the series next week.
> >>>>>>> 
> >>>>>>> I have already reviewed it; unfortunately I have missed the soft freeze
> >>>>>>> for posting the version I had also been working on when Liran posted
> >>>>>>> these patches.
> >>>>>>> 
> >>>>>>> Paolo
> >>>>>> 
> >>>>>> Paolo, note that this is v2 of this patch series. It’s not the one you have already reviewed.
> >>>>>> It now correctly handles the case you mentioned in review of v1 of migrating with various nested_state buffer sizes.
> >>>>>> The following scenarios were tested:
> >>>>>> (a) src and dest have same nested state size.
> >>>>>> 	==> Migration succeeds.
> >>>>>> (b) src don't have nested state while dest do.
> >>>>>> 	==> Migration succeed and src don't send it's nested state.
> >>>>>> (c) src have nested state while dest don't.
> >>>>>> 	==> Migration fails as it cannot restore nested state.
> >>>>>> (d) dest have bigger max nested state size than src
> >>>>>> 	==> Migration succeeds.
> >>>>>> (e) dest have smaller max nested state size than src but enough to store it's saved nested state
> >>>>>> 	==> Migration succeeds
> >>>>>> (f) dest have smaller max nested state size than src but not enough to store it's saved nested state
> >>>>>> 	==> Migration fails
> >>>>> 
> >>>>> Is it possible to tell these limits before the start of the migration,
> >>>>> or do we only find out that a nested migration won't work by trying it?
> >>>>> 
> >>>>> Dave
> >>>> 
> >>>> It is possible for the destination host to query what is it’s max nested state size.
> >>>> (This is what is returned from "kvm_check_extension(s, KVM_CAP_NESTED_STATE);” See kvm_init() code)
> >>> 
> >>> Is this max size a function of:
> >>> a) The host CPU
> >>> b) The host kernel
> >>> c) Some configuration
> >>> 
> >>> or all of those?
> >>> 
> >>> What about the maximum size that will be sent?
> >> 
> >> The max size is a function of (b). It depends on your KVM capabilities.
> >> This size that will be sent is also the max size at source.
> > 
> > So if I have matching host kernels it should always work?
> 
> Yes.

OK, that's a good start.

> > What happens if I upgrade the source kernel to increase it's maximum
> > nested size, can I force it to keep things small for some VMs?
> 
> Currently, the IOCTL which saves the nested state have only a single version which could potentially fill entire size (depending on current guest state).
> Saving the nested state obviously always attempts to save all relevant information it has because otherwise, you deliberately don't transfer to destination part of the
> state it needs to continue running the migrated guest correctly at destination.
> 
> It is true that I do expect future versions of this IOCTL to be able to “set nested state” of older versions (smaller one which lack some information)
> but I do expect migration to fail gracefully (and continue running guest at source) if destination is not capable of restoring all state sent from source
> (When destination max nested state size is smaller than source nested state size).

That older version thing would be good; then we would tie that to the
versioned machine types and/or CPU models somehow; that way every
migration of a 3.2 qemu with a given CPU would work irrespective of host
version.

> > 
> >>> 
> >>>> However, I didn’t find any suitable mechanism in QEMU Live-Migration code to negotiate
> >>>> this destination host information with source prior to migration. Which kinda makes sense as
> >>>> this code is also used to my understanding in standard suspend/resume flow. In that scenario,
> >>>> there is no other side to negotiate with.
> >>> 
> >>> At the moment, if the user has appropriately configured their QEMU and
> >>> their are no IO errors, the migration should not fail; and the
> >>> management layer can responsible for configuring the QEMU and checking
> >>> compatibilities - 
> >> 
> >> I agree with this. The post_load() checks I have done are extra measures to prevent a failing migration.
> >> Management layer can perform extra work before the migration to make sure that source can actually migrate to destination.
> >> Taking the max size of the nested state into account.
> >> 
> >>> 
> >>>> So currently what happens in my patch is that source prepares the nested state buffer and sends it to destination as part of VMState,
> >>>> and destination attempts to load this nested state in it’s nested_state_post_load() function.
> >>>> If destination kernel cannot handle loading the nested state it has received from source, it fails the migration by returning
> >>>> failure from nested_state_post_load().
> >>> 
> >>> That's minimally OK, but ideally we'd have a way of knowing before we
> >>> start the migration if it was going to work, that way the management
> >>> layer can refuse it before it spends ages transferring data and then
> >>> trying to switch over - recovery from a failed migration can be a bit
> >>> tricky/risky.
> >>> 
> >>> I can't see it being much use in a cloud environment if the migration
> >>> will sometimes fail without the cloud admin being able to do something
> >>> about it.
> >> 
> >> With current QEMU Live-Migration code, I believe this is in the responsibility of the Control-Plane.
> >> It’s not different from the fact that Control-Plane is also responsible for launching the destination QEMU with same vHW configuration
> >> as the source QEMU. If there was some kind of negotiation mechanism in QEMU that verifies these vHW configuration matches *before*
> >> starting the migration (and not part of Control-Plane), I would have also added to that negotiation to consider the max size of nested state.
> >> But with current mechanisms, this is solely in the responsibility of the Control-Plane.
> > 
> > Yep, that's fine - all we have to make sure of is that:
> >  a) The control plane can somehow find out the maximums (via qemu or
> > something else on the host) so it can make that decision.
> 
> This is easy for QEMU or something else on the host to verify. Just a matter of sending IOCTL to KVM.

OK, that's probably fine; we should check with the libvirt people what they would like.

> >  b) Thing about how upgrades will work when one host is newer.
> 
> I made sure that it is possible for destination to accept nested state of size smaller than it’s max nested state size.
> Therefore, it should allow upgrades from one host to a newer host.

OK; the tricky thing is when you upgrade one host in a small cluster as
you start doing an upgrade, and then once it's got its first VM you
can't migrate away from it until others are updated; that gets messy.

Dave

> (If of course IOCTL which “set nested state” is written correctly in a compatible way).
> 
> -Liran
> 
> > 
> > Dave
> > 
> >>> 
> >>>> Do you have a better suggestion?
> >>> 
> >>> I'll need to understand a bit more about the nested state.
> >> 
> >> Feel free to ask more questions about it and I will try to answer.
> >> 
> >> Thanks,
> >> -Liran
> >> 
> >>> 
> >>> Dave
> >>> 
> >>>> -Liran
> >>>> 
> >>>>> 
> >>>>>> -Liran
> >>>>>> 
> >>>>>> 
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>> 
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >> 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
"Denis V. Lunev" via Nov. 1, 2018, 7:41 p.m. UTC | #15
On Thu, Nov 1, 2018 at 12:07 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:

> OK; the tricky thing is when you upgrade one host in a small cluster as
> you start doing an upgrade, and then once it's got it's first VM you
> can't migrate away from it until others are updated; that gets messy.

One must always be able to migrate off of kernel N+1 back to kernel N.
We depend on that whenever we have to roll back a kernel upgrade.
Liran Alon Nov. 2, 2018, 3:46 a.m. UTC | #16
> On Thu, Nov 1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:

>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:

>> So if I have matching host kernels it should always work?
>> What happens if I upgrade the source kernel to increase it's maximum
>> nested size, can I force it to keep things small for some VMs?

> Any change to the format of the nested state should be gated by a
> KVM_CAP set by userspace. (Unlike, say, how the
> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
> in commit f077825a8758d.) KVM has traditionally been quite bad about
> maintaining backwards compatibility, but I hope the community is more
> cognizant of the issues now.

> As a cloud provider, one would only enable the new capability from
> userspace once all hosts in the pool have a kernel that supports it.
> During the transition, the capability would not be enabled on the
> hosts with a new kernel, and these hosts would continue to provide
> nested state that could be consumed by hosts running the older kernel.

Hmm this makes sense.

This means, though, that the patch I have submitted here isn't good enough.
My patch currently assumes that when it attempts to get nested state from KVM,
QEMU should always set nested_state->size to the max size supported by KVM, as received
from kvm_check_extension(s, KVM_CAP_NESTED_STATE)
(see kvm_get_nested_state() introduced in my patch).
This indeed won't allow migration from a host with a new KVM to a host with an old KVM if
the nested_state size was enlarged between these KVM versions,
which is obviously an issue.
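
For concreteness, a QEMU-style sketch of the save path being described (not the code from the patch; the function name is illustrative). The buffer is sized to the host maximum and the kernel reports how much of it was actually used:

static struct kvm_nested_state *get_nested_state(CPUState *cs, uint32_t max_size)
{
    struct kvm_nested_state *state = g_malloc0(max_size);

    /* Tell the kernel how big the buffer is; on success it sets
     * state->size to the number of bytes it actually filled in. */
    state->size = max_size;
    if (kvm_vcpu_ioctl(cs, KVM_GET_NESTED_STATE, state) < 0) {
        g_free(state);
        return NULL;
    }
    return state;
}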

Jim, I think that my confusion came from the fact that there is no clear documentation
on how KVM_{GET,SET}_NESTED_STATE should be changed once we need to add more state to
nested_state in future KVM versions. I think it's worth adding that to the IOCTLs' documentation.

For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still return the
size of nested_state v1, while kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) returns the
size of nested_state v2?

Also note that the approach suggested by Jim requires the mgmt-layer at dest
to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
When we know we are migrating from a host which supports v1 to a host which supports v2,
we should make sure that the dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
However, when we are just launching a new machine on the host which supports v2, we do want
QEMU to enable KVM_CAP_NESTED_STATE_V2 for that VM.

But on second thought, I'm not sure that this is the right approach as well.
We don't really want the used version of nested_state to be determined at kvm_init().
* On the source QEMU, we actually want to determine it when preparing for migration, based
on the support given by our destination host. If it's an old host, we would like to
save an old-version nested_state, and if it's a new host, we would like to save our newest
supported nested_state.
* On the dest QEMU, we just want to be able to set the received nested_state in KVM.

Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
It seems that we would want the process to behave as follows:
1) The mgmt-layer at dest queries the dest host's max supported nested_state size.
   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE).)
2) The mgmt-layer at source initiates migration to dest, requesting QEMU to send a nested_state
   matching dest's max supported nested_state size.
   When saving nested state using the KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
   the *requested* size to be saved, and KVM should be able to save only the information which matches
   the version that worked with that size.
3) After some sanity checks on the received migration stream, the dest host uses the KVM_SET_NESTED_STATE IOCTL.
   This IOCTL should deduce which information it should deploy based on the given nested_state->size.

This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
information is actually present in nested_state, instead of managing versioning with nested_state->size.
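
A sketch of that flags idea (struct kvm_nested_state does have a flags field, but the specific bits below are hypothetical, for illustration only): the blob advertises which pieces of state it carries, and the destination rejects anything it does not know how to restore.

#define NESTED_STATE_VMCS12        (1u << 0)  /* hypothetical flag */
#define NESTED_STATE_SHADOW_VMCS12 (1u << 1)  /* hypothetical flag */

static bool can_restore_nested_state(uint32_t flags, uint32_t supported)
{
    /* Any advertised bit the destination does not support means state
     * it would silently drop, so reject the migration instead. */
    return (flags & ~supported) == 0;
}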

What are your opinions on this?

-Liran
Paolo Bonzini Nov. 2, 2018, 9:40 a.m. UTC | #17
On 02/11/2018 04:46, Liran Alon wrote:
>> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
> 
>>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
>>> So if I have matching host kernels it should always work?
>>> What happens if I upgrade the source kernel to increase it's maximum
>>> nested size, can I force it to keep things small for some VMs?
> 
>> Any change to the format of the nested state should be gated by a
>> KVM_CAP set by userspace. (Unlike, say, how the
>> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
>> in commit f077825a8758d.) KVM has traditionally been quite bad about
>> maintaining backwards compatibility, but I hope the community is more
>> cognizant of the issues now.
> 
>> As a cloud provider, one would only enable the new capability from
>> userspace once all hosts in the pool have a kernel that supports it.
>> During the transition, the capability would not be enabled on the
>> hosts with a new kernel, and these hosts would continue to provide
>> nested state that could be consumed by hosts running the older kernel.
> 
> Hmm this makes sense.
> 
> This means though that the patch I have submitted here isn't good enough.
> My patch currently assumes that when it attempts to get nested state from KVM,
> QEMU should always set nested_state->size to max size supported by KVM as received
> from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> (See kvm_get_nested_state() introduced on my patch).
> This indeed won't allow migration from host with new KVM to host with old KVM if
> nested_state size was enlarged between these KVM versions.
> Which is obviously an issue.

Actually I think this is okay, because unless the "new" capability was
enabled, KVM would always reduce nested_state->size to a value that is
compatible with current kernels.

> But on second thought, I'm not sure that this is the right approach as-well.
> We don't really want the used version of nested_state to be determined on kvm_init().
> * On source QEMU, we actually want to determine it when preparing for migration based
> on to the support given by our destination host. If it's an old host, we would like to
> save an old version nested_state and if it's a new host, we will like to save our newest
> supported nested_state.

No, that's wrong because it would lead to losing state.  If the source
QEMU supports more state than the destination QEMU, and the current VM
state needs to transmit it for migration to be _correct_, then migration
to that destination QEMU must fail.

In particular, enabling the new KVM capability needs to be gated by a
new machine type and/or -cpu flag, if migration compatibility is needed.
 (In particular, this is one reason why I haven't considered this series
for 3.1.  Right now, migration of nested hypervisors is completely
busted but if we make it "almost" work, pre-3.1 machine types would not
ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
it's better for users if we wait for one release more, and add support
for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).

Personally, I would like to say that, starting from QEMU 3.2, enabling
nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
good way to keep some sanity.  Any opinions on that?

Paolo

> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
> It seems that we would want the process to behave as follows:
> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>    (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state 
>    matching dest max supported nested_state size.
>    When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>    the *requested* size to be saved and KVM should be able to save only the information which matches
>    the version that worked with that size.
> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>    This IOCTL should deduce which information it should deploy based on given nested_state->size.
> 
> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
> information is actually present on nested_state instead of managing versioning with nested_state->size.
> 
> What are your opinions on this?
> 
> -Liran
>
Dr. David Alan Gilbert Nov. 2, 2018, 12:35 p.m. UTC | #18
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 02/11/2018 04:46, Liran Alon wrote:
> >> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
> > 
> >>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> >>> So if I have matching host kernels it should always work?
> >>> What happens if I upgrade the source kernel to increase it's maximum
> >>> nested size, can I force it to keep things small for some VMs?
> > 
> >> Any change to the format of the nested state should be gated by a
> >> KVM_CAP set by userspace. (Unlike, say, how the
> >> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
> >> in commit f077825a8758d.) KVM has traditionally been quite bad about
> >> maintaining backwards compatibility, but I hope the community is more
> >> cognizant of the issues now.
> > 
> >> As a cloud provider, one would only enable the new capability from
> >> userspace once all hosts in the pool have a kernel that supports it.
> >> During the transition, the capability would not be enabled on the
> >> hosts with a new kernel, and these hosts would continue to provide
> >> nested state that could be consumed by hosts running the older kernel.
> > 
> > Hmm this makes sense.
> > 
> > This means though that the patch I have submitted here isn't good enough.
> > My patch currently assumes that when it attempts to get nested state from KVM,
> > QEMU should always set nested_state->size to max size supported by KVM as received
> > from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> > (See kvm_get_nested_state() introduced on my patch).
> > This indeed won't allow migration from host with new KVM to host with old KVM if
> > nested_state size was enlarged between these KVM versions.
> > Which is obviously an issue.
> 
> Actually I think this is okay, because unlike the "new" capability was
> enabled, KVM would always reduce nested_state->size to a value that is
> compatible with current kernels.
> 
> > But on second thought, I'm not sure that this is the right approach as-well.
> > We don't really want the used version of nested_state to be determined on kvm_init().
> > * On source QEMU, we actually want to determine it when preparing for migration based
> > on to the support given by our destination host. If it's an old host, we would like to
> > save an old version nested_state and if it's a new host, we will like to save our newest
> > supported nested_state.
> 
> No, that's wrong because it would lead to losing state.  If the source
> QEMU supports more state than the destination QEMU, and the current VM
> state needs to transmit it for migration to be _correct_, then migration
> to that destination QEMU must fail.
> 
> In particular, enabling the new KVM capability needs to be gated by a
> new machine type and/or -cpu flag, if migration compatibility is needed.
>  (In particular, this is one reason why I haven't considered this series
> for 3.1.  Right now, migration of nested hypervisors is completely
> busted but if we make it "almost" work, pre-3.1 machine types would not
> ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
> it's better for users if we wait for one release more, and add support
> for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).
> 
> Personally, I would like to say that, starting from QEMU 3.2, enabling
> nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
> good way to keep some sanity.  Any opinions on that?

That seems a bit mean; there's a lot of people already using nested.

Dave

> Paolo
> 
> > Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
> > It seems that we would want the process to behave as follows:
> > 1) Mgmt-layer at dest queries dest host max supported nested_state size.
> >    (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
> > 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state 
> >    matching dest max supported nested_state size.
> >    When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
> >    the *requested* size to be saved and KVM should be able to save only the information which matches
> >    the version that worked with that size.
> > 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
> >    This IOCTL should deduce which information it should deploy based on given nested_state->size.
> > 
> > This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
> > information is actually present on nested_state instead of managing versioning with nested_state->size.
> > 
> > What are your opinions on this?
> > 
> > -Liran
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Daniel P. Berrangé Nov. 2, 2018, 12:40 p.m. UTC | #19
On Fri, Nov 02, 2018 at 12:35:05PM +0000, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > On 02/11/2018 04:46, Liran Alon wrote:
> > >> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
> > > 
> > >>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > 
> > >>> So if I have matching host kernels it should always work?
> > >>> What happens if I upgrade the source kernel to increase it's maximum
> > >>> nested size, can I force it to keep things small for some VMs?
> > > 
> > >> Any change to the format of the nested state should be gated by a
> > >> KVM_CAP set by userspace. (Unlike, say, how the
> > >> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
> > >> in commit f077825a8758d.) KVM has traditionally been quite bad about
> > >> maintaining backwards compatibility, but I hope the community is more
> > >> cognizant of the issues now.
> > > 
> > >> As a cloud provider, one would only enable the new capability from
> > >> userspace once all hosts in the pool have a kernel that supports it.
> > >> During the transition, the capability would not be enabled on the
> > >> hosts with a new kernel, and these hosts would continue to provide
> > >> nested state that could be consumed by hosts running the older kernel.
> > > 
> > > Hmm this makes sense.
> > > 
> > > This means though that the patch I have submitted here isn't good enough.
> > > My patch currently assumes that when it attempts to get nested state from KVM,
> > > QEMU should always set nested_state->size to max size supported by KVM as received
> > > from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> > > (See kvm_get_nested_state() introduced on my patch).
> > > This indeed won't allow migration from host with new KVM to host with old KVM if
> > > nested_state size was enlarged between these KVM versions.
> > > Which is obviously an issue.
> > 
> > Actually I think this is okay, because unlike the "new" capability was
> > enabled, KVM would always reduce nested_state->size to a value that is
> > compatible with current kernels.
> > 
> > > But on second thought, I'm not sure that this is the right approach as-well.
> > > We don't really want the used version of nested_state to be determined on kvm_init().
> > > * On source QEMU, we actually want to determine it when preparing for migration based
> > > on to the support given by our destination host. If it's an old host, we would like to
> > > save an old version nested_state and if it's a new host, we will like to save our newest
> > > supported nested_state.
> > 
> > No, that's wrong because it would lead to losing state.  If the source
> > QEMU supports more state than the destination QEMU, and the current VM
> > state needs to transmit it for migration to be _correct_, then migration
> > to that destination QEMU must fail.
> > 
> > In particular, enabling the new KVM capability needs to be gated by a
> > new machine type and/or -cpu flag, if migration compatibility is needed.
> >  (In particular, this is one reason why I haven't considered this series
> > for 3.1.  Right now, migration of nested hypervisors is completely
> > busted but if we make it "almost" work, pre-3.1 machine types would not
> > ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
> > it's better for users if we wait for one release more, and add support
> > for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).
> > 
> > Personally, I would like to say that, starting from QEMU 3.2, enabling
> > nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
> > good way to keep some sanity.  Any opinions on that?
> 
> That seems a bit mean; there's a lot of people already using nested.

Agreed, it would be a significant regression for people. They may not
even care about migration, so we should not block its use with old
kernels just for the sake of working migration that they won't use.


Regards,
Daniel
Liran Alon Nov. 2, 2018, 12:59 p.m. UTC | #20
> On 2 Nov 2018, at 11:40, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 02/11/2018 04:46, Liran Alon wrote:
>>> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
>> 
>>>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>> 
>>>> So if I have matching host kernels it should always work?
>>>> What happens if I upgrade the source kernel to increase it's maximum
>>>> nested size, can I force it to keep things small for some VMs?
>> 
>>> Any change to the format of the nested state should be gated by a
>>> KVM_CAP set by userspace. (Unlike, say, how the
>>> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
>>> in commit f077825a8758d.) KVM has traditionally been quite bad about
>>> maintaining backwards compatibility, but I hope the community is more
>>> cognizant of the issues now.
>> 
>>> As a cloud provider, one would only enable the new capability from
>>> userspace once all hosts in the pool have a kernel that supports it.
>>> During the transition, the capability would not be enabled on the
>>> hosts with a new kernel, and these hosts would continue to provide
>>> nested state that could be consumed by hosts running the older kernel.
>> 
>> Hmm this makes sense.
>> 
>> This means though that the patch I have submitted here isn't good enough.
>> My patch currently assumes that when it attempts to get nested state from KVM,
>> QEMU should always set nested_state->size to max size supported by KVM as received
>> from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
>> (See kvm_get_nested_state() introduced on my patch).
>> This indeed won't allow migration from host with new KVM to host with old KVM if
>> nested_state size was enlarged between these KVM versions.
>> Which is obviously an issue.
> 
> Actually I think this is okay, because unlike the "new" capability was
> enabled, KVM would always reduce nested_state->size to a value that is
> compatible with current kernels.
> 
>> But on second thought, I'm not sure that this is the right approach as-well.
>> We don't really want the used version of nested_state to be determined on kvm_init().
>> * On source QEMU, we actually want to determine it when preparing for migration based
>> on to the support given by our destination host. If it's an old host, we would like to
>> save an old version nested_state and if it's a new host, we will like to save our newest
>> supported nested_state.
> 
> No, that's wrong because it would lead to losing state.  If the source
> QEMU supports more state than the destination QEMU, and the current VM
> state needs to transmit it for migration to be _correct_, then migration
> to that destination QEMU must fail.
> 
> In particular, enabling the new KVM capability needs to be gated by a
> new machine type and/or -cpu flag, if migration compatibility is needed.
> (In particular, this is one reason why I haven't considered this series
> for 3.1.  Right now, migration of nested hypervisors is completely
> busted but if we make it "almost" work, pre-3.1 machine types would not
> ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
> it's better for users if we wait for one release more, and add support
> for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).
> 
> Personally, I would like to say that, starting from QEMU 3.2, enabling
> nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
> good way to keep some sanity.  Any opinions on that?
> 
> Paolo

If I understand you correctly, you wish the nested_state version used to be tied to the machine-type used to launch the guest.
The reason I am not fond of this approach is that once a VM is launched with some machine-type, its nested_state
will forever be saved with a specific version, even if this VM has already migrated to a host with a newer kernel which knows how to
save more accurate state for the next migration.

The scheme I have described below should avoid this kind of issue while still preserving the ability to migrate to older hosts.
Note that I believe it’s the responsibility of the mgmt-layer to weigh the risk of migrating from a new host to an old host
in case the old host cannot receive the full nested_state that the new host is capable of generating. I don’t think migration
should fail in this case, as it does currently with the patch I have submitted.

So in general I still agree with Jim's approach, but I’m not convinced we should use KVM_CAP for this.
(See details on proposed scheme below).

What disadvantages do you see in the proposed scheme below?
Why is using KVM_CAP better?

BTW, I agree with the rest of the group here that it’s too aggressive to make QEMU 3.2 require kernel 4.20 for using nVMX.
This will hurt common nVMX workloads that don’t care about the ability to migrate.

-Liran

> 
>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>> It seems that we would want the process to behave as follows:
>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state 
>>   matching dest max supported nested_state size.
>>   When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>>   the *requested* size to be saved and KVM should be able to save only the information which matches
>>   the version that worked with that size.
>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>>   This IOCTL should deduce which information it should deploy based on given nested_state->size.
>> 
>> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
>> information is actually present on nested_state instead of managing versioning with nested_state->size.
>> 
>> What are your opinions on this?
>> 
>> -Liran
>> 
>
Denis V. Lunev" via Nov. 2, 2018, 4:39 p.m. UTC | #21
On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@oracle.com> wrote:

> Hmm this makes sense.
>
> This means though that the patch I have submitted here isn't good enough.
> My patch currently assumes that when it attempts to get nested state from KVM,
> QEMU should always set nested_state->size to max size supported by KVM as received
> from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> (See kvm_get_nested_state() introduced on my patch).
> This indeed won't allow migration from host with new KVM to host with old KVM if
> nested_state size was enlarged between these KVM versions.
> Which is obviously an issue.
>
> Jim, I think that my confusion was created from the fact that there is no clear documentation
> on how KVM_{GET,SET}_NESTED_STATE should be changed once we will need to add more state to
> nested_state in future KVM versions. I think it's worth adding that to IOCTLs documentation.

The nested state IOCTLs aren't unique in this respect. Any changes to
the state saved by any of this whole family of state-saving ioctls
require opt-in from userspace.

> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still returns the
> size of nested_state v1 and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) returns the
> size of the nested_state v2?

Hmm...I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
being part of my original design. The way I had envisioned it,
the set of capabilities enabled by userspace would be sufficient to
infer the maximum data size.

If, for example, we add a field to stash the time remaining for the
VMCS12 VMX preemption timer, then presumably, userspace will enable it
by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
that), and then userspace will know that the maximum nested state data
is 4 bytes larger.
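
To make that concrete, userspace sizing could look something like the sketch
below (the preemption-timer component and its 4-byte payload follow the
hypothetical example above; the base size is illustrative, not a statement
about the current ABI):

#include <linux/kvm.h>     /* struct kvm_nested_state */
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: derive the nested state buffer size from the set of
 * capabilities userspace has opted into, instead of asking the kernel.
 * The preemption-timer component and its 4-byte payload are hypothetical,
 * per the example above. */
static size_t nested_state_max_size(bool save_vmx_preemption_timer)
{
    /* header plus the vmcs12 and shadow vmcs12 blobs (4K each) */
    size_t size = sizeof(struct kvm_nested_state) + 2 * 4096;

    if (save_vmx_preemption_timer)
        size += 4;          /* remaining VMX preemption timer value */

    return size;
}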

> Also note that the approach suggested by Jim requires mgmt-layer at dest
> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
> When we know we are migrating from a host which supports v1 to a host which supports v2,
> we should make sure that dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
> However, when we are just launching a new machine on the host which supports v2, we do want
> QEMU to enable KVM_CAP_NESTED_STATE_V2 enabled for that VM.

No, no, no. Even when launching a new VM on a host that supports v2,
you cannot enable v2 until you have passed rollback horizon. Should
you decide to roll back the kernel with v2 support, you must be able
to move that new VM to a host with an old kernel.

> But on second thought, I'm not sure that this is the right approach as-well.
> We don't really want the used version of nested_state to be determined on kvm_init().
> * On source QEMU, we actually want to determine it when preparing for migration based
> on to the support given by our destination host. If it's an old host, we would like to
> save an old version nested_state and if it's a new host, we will like to save our newest
> supported nested_state.
> * On dest QEMU, we will want to just be able to set received nested_state in KVM.
>
> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
> It seems that we would want the process to behave as follows:
> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>    (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
>    matching dest max supported nested_state size.
>    When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>    the *requested* size to be saved and KVM should be able to save only the information which matches
>    the version that worked with that size.
> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>    This IOCTL should deduce which information it should deploy based on given nested_state->size.
>
> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
> information is actually present on nested_state instead of managing versioning with nested_state->size.

Yes, you can use nested_state->flags to determine what the data
payload is, but you cannot enable new flags unless userspace opts in.
This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
events if userspace has opted-in with KVM_CAP_EXCEPTION_PAYLOAD. This
is because older kernels will reject kvm_vcpu_events that have the
KVM_VCPUEVENT_VALID_PAYLOAD flag set.

You don't need a new KVM_CAP_NESTED_STATE_V2 ioctl. You just need
buy-in from userspace for any new data payload. Explicitly enumerating
the payload components in the flags field makes perfect sense.
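
As a concrete sketch of that pattern (assuming a kernel new enough to have
KVM_CAP_EXCEPTION_PAYLOAD, which is the real opt-in discussed above; the two
nested-state flag bits shown already exist, and everything else here is
illustrative, not a proposal for new ABI):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Sketch: userspace opts in to a capability at VM setup time, and only
 * interprets payload components whose flag bits are set. */
static int enable_exception_payload(int vm_fd)
{
    struct kvm_enable_cap cap;

    memset(&cap, 0, sizeof(cap));
    cap.cap = KVM_CAP_EXCEPTION_PAYLOAD;
    cap.args[0] = 1;    /* enable */
    return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

static void inspect_nested_state(const struct kvm_nested_state *state)
{
    if (state->flags & KVM_STATE_NESTED_GUEST_MODE) {
        /* the vCPU was in guest (L2) mode when the state was saved */
    }
    if (state->flags & KVM_STATE_NESTED_RUN_PENDING) {
        /* a nested VM entry was pending */
    }
    /* a future payload component would get its own flag bit here */
}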
Denis V. Lunev" via Nov. 2, 2018, 4:44 p.m. UTC | #22
On Fri, Nov 2, 2018 at 5:59 AM, Liran Alon <liran.alon@oracle.com> wrote:
>

>>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>>> It seems that we would want the process to behave as follows:
>>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
>>>   matching dest max supported nested_state size.
>>>   When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>>>   the *requested* size to be saved and KVM should be able to save only the information which matches
>>>   the version that worked with that size.
>>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>>>   This IOCTL should deduce which information it should deploy based on given nested_state->size.

I have to object to any proposal which requires the management layer
to communicate with the source and the destination to determine what
should be done.
Daniel P. Berrangé Nov. 2, 2018, 4:54 p.m. UTC | #23
On Fri, Nov 02, 2018 at 10:40:35AM +0100, Paolo Bonzini wrote:
> On 02/11/2018 04:46, Liran Alon wrote:
> >> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
> > 
> >>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> >>> So if I have matching host kernels it should always work?
> >>> What happens if I upgrade the source kernel to increase it's maximum
> >>> nested size, can I force it to keep things small for some VMs?
> > 
> >> Any change to the format of the nested state should be gated by a
> >> KVM_CAP set by userspace. (Unlike, say, how the
> >> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
> >> in commit f077825a8758d.) KVM has traditionally been quite bad about
> >> maintaining backwards compatibility, but I hope the community is more
> >> cognizant of the issues now.
> > 
> >> As a cloud provider, one would only enable the new capability from
> >> userspace once all hosts in the pool have a kernel that supports it.
> >> During the transition, the capability would not be enabled on the
> >> hosts with a new kernel, and these hosts would continue to provide
> >> nested state that could be consumed by hosts running the older kernel.
> > 
> > Hmm this makes sense.
> > 
> > This means though that the patch I have submitted here isn't good enough.
> > My patch currently assumes that when it attempts to get nested state from KVM,
> > QEMU should always set nested_state->size to max size supported by KVM as received
> > from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> > (See kvm_get_nested_state() introduced on my patch).
> > This indeed won't allow migration from host with new KVM to host with old KVM if
> > nested_state size was enlarged between these KVM versions.
> > Which is obviously an issue.
> 
> Actually I think this is okay, because unlike the "new" capability was
> enabled, KVM would always reduce nested_state->size to a value that is
> compatible with current kernels.
> 
> > But on second thought, I'm not sure that this is the right approach as-well.
> > We don't really want the used version of nested_state to be determined on kvm_init().
> > * On source QEMU, we actually want to determine it when preparing for migration based
> > on to the support given by our destination host. If it's an old host, we would like to
> > save an old version nested_state and if it's a new host, we will like to save our newest
> > supported nested_state.
> 
> No, that's wrong because it would lead to losing state.  If the source
> QEMU supports more state than the destination QEMU, and the current VM
> state needs to transmit it for migration to be _correct_, then migration
> to that destination QEMU must fail.
> 
> In particular, enabling the new KVM capability needs to be gated by a
> new machine type and/or -cpu flag, if migration compatibility is needed.
>  (In particular, this is one reason why I haven't considered this series
> for 3.1.  Right now, migration of nested hypervisors is completely
> busted but if we make it "almost" work, pre-3.1 machine types would not
> ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
> it's better for users if we wait for one release more, and add support
> for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).
> 
> Personally, I would like to say that, starting from QEMU 3.2, enabling
> nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
> good way to keep some sanity.  Any opinions on that?

We have usually followed a rule that new machine types must not
affect runnability of a VM on a host. IOW new machine types should
not introduce dependencies on specific kernels, or hardware features
such as CPU flags.

Anything that requires a new kernel feature thus ought to be an
opt-in config tunable on the CLI, separate from machine type
choice.  Alternatively in this case, it could potentially be a
migration parameter settable via QMP. QEMU on each side could
advertise whether the migration parameter is available, and
the mgmt app (which can see both sides of the migration) can
then decide whether to enable it.

Regards,
Daniel
Daniel P. Berrangé Nov. 2, 2018, 4:58 p.m. UTC | #24
On Fri, Nov 02, 2018 at 09:44:54AM -0700, Jim Mattson via Qemu-devel wrote:
> On Fri, Nov 2, 2018 at 5:59 AM, Liran Alon <liran.alon@oracle.com> wrote:
> >
> 
> >>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
> >>> It seems that we would want the process to behave as follows:
> >>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
> >>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
> >>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
> >>>   matching dest max supported nested_state size.
> >>>   When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
> >>>   the *requested* size to be saved and KVM should be able to save only the information which matches
> >>>   the version that worked with that size.
> >>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
> >>>   This IOCTL should deduce which information it should deploy based on given nested_state->size.
> 
> I have to object to any proposal which requires the management later
> to communicate with the source and the destination to determine what
> should be done.

Can you elaborate on why you object?

There are a bunch of features in QEMU's migration code which require
the mgmt layer to look at source + dest to determine what should be
done. Admittedly the cases we have had so far are generic migration
features (compression, multifd, postcopy, TLS, etc), while this is
a host kernel feature. I don't think it is that far outside the
normal practice wrt migration feature usage decision making though.

Regards,
Daniel
Dr. David Alan Gilbert Nov. 2, 2018, 4:58 p.m. UTC | #25
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Nov 02, 2018 at 10:40:35AM +0100, Paolo Bonzini wrote:
> > On 02/11/2018 04:46, Liran Alon wrote:
> > >> On Thu, Nov1, 2018 at 09:45 AM, Jim Mattson <jmattson@google.com> wrote:
> > > 
> > >>> On Thu, Nov 1, 2018 at 8:56 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > > 
> > >>> So if I have matching host kernels it should always work?
> > >>> What happens if I upgrade the source kernel to increase it's maximum
> > >>> nested size, can I force it to keep things small for some VMs?
> > > 
> > >> Any change to the format of the nested state should be gated by a
> > >> KVM_CAP set by userspace. (Unlike, say, how the
> > >> KVM_VCPUEVENT_VALID_SMM flag was added to the saved VCPU events state
> > >> in commit f077825a8758d.) KVM has traditionally been quite bad about
> > >> maintaining backwards compatibility, but I hope the community is more
> > >> cognizant of the issues now.
> > > 
> > >> As a cloud provider, one would only enable the new capability from
> > >> userspace once all hosts in the pool have a kernel that supports it.
> > >> During the transition, the capability would not be enabled on the
> > >> hosts with a new kernel, and these hosts would continue to provide
> > >> nested state that could be consumed by hosts running the older kernel.
> > > 
> > > Hmm this makes sense.
> > > 
> > > This means though that the patch I have submitted here isn't good enough.
> > > My patch currently assumes that when it attempts to get nested state from KVM,
> > > QEMU should always set nested_state->size to max size supported by KVM as received
> > > from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
> > > (See kvm_get_nested_state() introduced on my patch).
> > > This indeed won't allow migration from host with new KVM to host with old KVM if
> > > nested_state size was enlarged between these KVM versions.
> > > Which is obviously an issue.
> > 
> > Actually I think this is okay, because unlike the "new" capability was
> > enabled, KVM would always reduce nested_state->size to a value that is
> > compatible with current kernels.
> > 
> > > But on second thought, I'm not sure that this is the right approach as-well.
> > > We don't really want the used version of nested_state to be determined on kvm_init().
> > > * On source QEMU, we actually want to determine it when preparing for migration based
> > > on to the support given by our destination host. If it's an old host, we would like to
> > > save an old version nested_state and if it's a new host, we will like to save our newest
> > > supported nested_state.
> > 
> > No, that's wrong because it would lead to losing state.  If the source
> > QEMU supports more state than the destination QEMU, and the current VM
> > state needs to transmit it for migration to be _correct_, then migration
> > to that destination QEMU must fail.
> > 
> > In particular, enabling the new KVM capability needs to be gated by a
> > new machine type and/or -cpu flag, if migration compatibility is needed.
> >  (In particular, this is one reason why I haven't considered this series
> > for 3.1.  Right now, migration of nested hypervisors is completely
> > busted but if we make it "almost" work, pre-3.1 machine types would not
> > ever be able to add support for KVM_CAP_EXCEPTION_PAYLOAD.  Therefore,
> > it's better for users if we wait for one release more, and add support
> > for KVM_CAP_NESTED_STATE and KVM_CAP_EXCEPTION_PAYLOAD at the same time).
> > 
> > Personally, I would like to say that, starting from QEMU 3.2, enabling
> > nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
> > good way to keep some sanity.  Any opinions on that?
> 
> We have usually followed a rule that new machine types must not
> affect runability of a VM on a host. IOW new machine types should
> not introduce dependancies on specific kernels, or hardware features
> such as CPU flags.
> 
> Anything that requires a new kernel feature thus ought to be an
> opt-in config tunable on the CLI, separate from machine type
> choice.  Alternatively in this case, it could potentially be a
> migration parameter settable via QMP. QEMU on each side could
> advertize whether the migration parameter is available, and
> the mgmt app (which can see both sides of the migration) can
> then decide whether to enable it.

This is a little odd though since it relates to the
contents/size/consistency of the guest state directly.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Denis V. Lunev" via Nov. 2, 2018, 5:01 p.m. UTC | #26
On Fri, Nov 2, 2018 at 9:58 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Fri, Nov 02, 2018 at 09:44:54AM -0700, Jim Mattson via Qemu-devel wrote:
>> On Fri, Nov 2, 2018 at 5:59 AM, Liran Alon <liran.alon@oracle.com> wrote:
>> >
>>
>> >>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>> >>> It seems that we would want the process to behave as follows:
>> >>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>> >>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>> >>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
>> >>>   matching dest max supported nested_state size.
>> >>>   When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>> >>>   the *requested* size to be saved and KVM should be able to save only the information which matches
>> >>>   the version that worked with that size.
>> >>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>> >>>   This IOCTL should deduce which information it should deploy based on given nested_state->size.
>>
>> I have to object to any proposal which requires the management later
>> to communicate with the source and the destination to determine what
>> should be done.
>
> Can you elaborate on why you object ?

We don't currently have this requirement, and I don't want to be
encumbered by it.
Liran Alon Nov. 3, 2018, 2:02 a.m. UTC | #27
> On 2 Nov 2018, at 18:39, Jim Mattson <jmattson@google.com> wrote:
> 
> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@oracle.com> wrote:
> 
>> Hmm this makes sense.
>> 
>> This means though that the patch I have submitted here isn't good enough.
>> My patch currently assumes that when it attempts to get nested state from KVM,
>> QEMU should always set nested_state->size to max size supported by KVM as received
>> from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
>> (See kvm_get_nested_state() introduced on my patch).
>> This indeed won't allow migration from host with new KVM to host with old KVM if
>> nested_state size was enlarged between these KVM versions.
>> Which is obviously an issue.
>> 
>> Jim, I think that my confusion was created from the fact that there is no clear documentation
>> on how KVM_{GET,SET}_NESTED_STATE should be changed once we will need to add more state to
>> nested_state in future KVM versions. I think it's worth adding that to IOCTLs documentation.
> 
> The nested state IOCTLs aren't unique in this respect. Any changes to
> the state saved by any of this whole family of state-saving ioctls
> require opt-in from userspace.
> 
>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still returns the
>> size of nested_state v1 and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) returns the
>> size of the nested_state v2?
> 
> Hmm...I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
> being part of my original design. The way I had envisioned it,
> the set of capabilities enabled by userspace would be sufficient to
> infer the maximum data size.

If the set of capabilities should be sufficient to infer the max size of nested_state,
why did we code kvm_vm_ioctl_check_extension() such that on KVM_CAP_NESTED_STATE
it returns the max size of nested_state?

> 
> If, for example, we add a field to stash the time remaining for the
> VMCS12 VMX preemption timer, then presumably, userspace will enable it
> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
> that), and then userspace will know that the maximum nested state data
> is 4 bytes larger.

In that case, why did we define struct kvm_nested_state to hold a blob of data[] instead
of separating the blob into well-defined blobs (e.g. currently one blob for vmcs12 and another one for shadow vmcs12)?
Then, when we add a new component which is opted into by a new KVM_CAP, we would add another well-defined blob
to struct kvm_nested_state.

I think this is important because it allows us to specify in nested_state->flags which components are saved
and create multiple VMState subsections with needed() methods for the various saved components.

This would allow us, for example, to still migrate from a new QEMU which does stash the time remaining for the VMCS12 VMX preemption timer
to an old QEMU which doesn’t stash it, as long as nested_state->flags specifies that this component is not saved (because L1 doesn’t use the VMX preemption timer, for example).

This fits more nicely with how the QEMU migration mechanism is defined and with the purpose of VMState subsections.

In addition, if we define struct kvm_nested_state like this, we also won’t need the “size” field, which has to be handled carefully to avoid buffer overflows.
(We would just define large enough buffers (with padding) for each opaque component, such as vmcs12 and shadow vmcs12.)
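
Just to illustrate the direction, the kind of layout I have in mind is
something like the sketch below (none of these names or flag values are
existing uapi; it is only meant to show the idea):

#include <linux/types.h>

/* Sketch of the idea only: each component is a fixed-size, padded blob
 * gated by its own flag bit.  The flag values and struct name below are
 * hypothetical, not existing uapi. */
#define KVM_STATE_NESTED_VMCS12_VALID          (1u << 8)
#define KVM_STATE_NESTED_SHADOW_VMCS12_VALID   (1u << 9)

struct kvm_vmx_nested_state_blobs {
    __u8 vmcs12[4096];         /* valid iff KVM_STATE_NESTED_VMCS12_VALID */
    __u8 shadow_vmcs12[4096];  /* valid iff KVM_STATE_NESTED_SHADOW_VMCS12_VALID */
    /* future components would be appended here, each with its own flag */
};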

> 
>> Also note that the approach suggested by Jim requires mgmt-layer at dest
>> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
>> When we know we are migrating from a host which supports v1 to a host which supports v2,
>> we should make sure that dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
>> However, when we are just launching a new machine on the host which supports v2, we do want
>> QEMU to enable KVM_CAP_NESTED_STATE_V2 enabled for that VM.
> 
> No, no, no. Even when launching a new VM on a host that supports v2,
> you cannot enable v2 until you have passed rollback horizon. Should
> you decide to roll back the kernel with v2 support, you must be able
> to move that new VM to a host with an old kernel.

If we use VMState subsections as I described above, QEMU should be able to know which components of nested_state are
actually saved by KVM and therefore are *required* to be restored on the dest host in order to migrate without guest issues after it is resumed on dest.
This still allows migration from new hosts to old hosts in case the guest didn’t enter a state which makes the new saved state required
for migration to succeed.

If the mechanism works like this, nested_state KVM_CAPs enabled on QEMU launch are only used to inform KVM which
struct kvm_nested_state layout is used by userspace, not what is actually sent as part of the migration stream.
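
On the QEMU side, I imagine subsections roughly like the sketch below (the
shadow_vmcs12 buffer on X86CPU, the env->nested_state pointer and the flag
name reuse the placeholders from the sketch above; this is not the current
patch):

#include "qemu/osdep.h"
#include "cpu.h"
#include "migration/vmstate.h"

/* Sketch: a flag-gated subsection that is only sent when KVM reported the
 * corresponding component as saved.  The env->nested_state pointer, the
 * shadow_vmcs12 buffer on X86CPU and the flag name are placeholders. */
static bool shadow_vmcs12_needed(void *opaque)
{
    X86CPU *cpu = opaque;
    CPUX86State *env = &cpu->env;

    return env->nested_state &&
           (env->nested_state->flags & KVM_STATE_NESTED_SHADOW_VMCS12_VALID);
}

static const VMStateDescription vmstate_shadow_vmcs12 = {
    .name = "cpu/nested_state/shadow_vmcs12",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = shadow_vmcs12_needed,
    .fields = (VMStateField[]) {
        /* 4K opaque blob, only sent when the flag above is set */
        VMSTATE_UINT8_ARRAY(shadow_vmcs12, X86CPU, 4096),
        VMSTATE_END_OF_LIST()
    }
};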

What are your thoughts on this?

-Liran

> 
>> But on second thought, I'm not sure that this is the right approach as-well.
>> We don't really want the used version of nested_state to be determined on kvm_init().
>> * On source QEMU, we actually want to determine it when preparing for migration based
>> on to the support given by our destination host. If it's an old host, we would like to
>> save an old version nested_state and if it's a new host, we will like to save our newest
>> supported nested_state.
>> * On dest QEMU, we will want to just be able to set received nested_state in KVM.
>> 
>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>> It seems that we would want the process to behave as follows:
>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>>   (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
>>   matching dest max supported nested_state size.
>>   When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>>   the *requested* size to be saved and KVM should be able to save only the information which matches
>>   the version that worked with that size.
>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>>   This IOCTL should deduce which information it should deploy based on given nested_state->size.
>> 
>> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
>> information is actually present on nested_state instead of managing versioning with nested_state->size.
> 
> Yes, you can use nested_state->flags to determine what the data
> payload is, but you cannot enable new flags unless userspace opts in.
> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
> events if userspace has opted-in with KVM_CAP_EXCEPTION_PAYLOAD. This
> is because older kernels will reject kvm_vcpu_events that have the
> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
> 
> You don't need a new KVM_CAP_NESTED_STATE_V2 ioctl. You just need
> buy-in from userspace for any new data payload. Explicitly enumerating
> the payload components in the flags field makes perfect sense.
Paolo Bonzini Nov. 4, 2018, 10:12 p.m. UTC | #28
On 02/11/2018 13:35, Dr. David Alan Gilbert wrote:
>>
>> Personally, I would like to say that, starting from QEMU 3.2, enabling
>> nested VMX requires a 4.20 kernel.  It's a bit bold, but I think it's a
>> good way to keep some sanity.  Any opinions on that?
> That seems a bit mean; there's a lot of people already using nested.

True.  However, a more complete version of the "plan" is that they could
continue using it, but only with a pre-3.2 machine type.

Paolo
Paolo Bonzini Nov. 4, 2018, 10:19 p.m. UTC | #29
On 02/11/2018 17:54, Daniel P. Berrangé wrote:
> We have usually followed a rule that new machine types must not
> affect runability of a VM on a host. IOW new machine types should
> not introduce dependancies on specific kernels, or hardware features
> such as CPU flags.

> Anything that requires a new kernel feature thus ought to be an
> opt-in config tunable on the CLI, separate from machine type
> choice.

Unless someone tinkered with the module parameters, they could not even
use nested virtualization before 4.20.  So for everyone else, "-cpu
...,+vmx" does count as an "opt-in config tunable on the CLI" that
requires 4.20.

For those that did tinker with module parameters, we can grandfather in
the old machine types, so that they can use nested virtualization with
no live migration support.  For those that did not, however, I don't
think it makes sense to say "oh by the way I really want to be able to
migrate this VM" on the command line, or even worse on the monitor.

Paolo

> Alternatively in this case, it could potentially be a
> migration parameter settable via QMP. QEMU on each side could
> advertize whether the migration parameter is available, and
> the mgmt app (which can see both sides of the migration) can
> then decide whether to enable it.
Liran Alon Nov. 8, 2018, 12:13 a.m. UTC | #30
Ping on my last reply.
I would like to reach an agreement on how v3 should look before just implementing what I think is right.

Thanks,
-Liran

> On 3 Nov 2018, at 4:02, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 2 Nov 2018, at 18:39, Jim Mattson <jmattson@google.com> wrote:
>> 
>> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>> Hmm this makes sense.
>>> 
>>> This means though that the patch I have submitted here isn't good enough.
>>> My patch currently assumes that when it attempts to get nested state from KVM,
>>> QEMU should always set nested_state->size to max size supported by KVM as received
>>> from kvm_check_extension(s, KVM_CAP_NESTED_STATE);
>>> (See kvm_get_nested_state() introduced on my patch).
>>> This indeed won't allow migration from host with new KVM to host with old KVM if
>>> nested_state size was enlarged between these KVM versions.
>>> Which is obviously an issue.
>>> 
>>> Jim, I think that my confusion was created from the fact that there is no clear documentation
>>> on how KVM_{GET,SET}_NESTED_STATE should be changed once we will need to add more state to
>>> nested_state in future KVM versions. I think it's worth adding that to IOCTLs documentation.
>> 
>> The nested state IOCTLs aren't unique in this respect. Any changes to
>> the state saved by any of this whole family of state-saving ioctls
>> require opt-in from userspace.
>> 
>>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still returns the
>>> size of nested_state v1 and kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) returns the
>>> size of the nested_state v2?
>> 
>> Hmm...I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>> being part of my original design. The way I had envisioned it,
>> the set of capabilities enabled by userspace would be sufficient to
>> infer the maximum data size.
> 
> If the set of capabilities should be sufficient to infer the max size of nested_state,
> why did we code kvm_vm_ioctl_check_extension() such that on KVM_CAP_NESTED_STATE
> it returns the max size of nested_state?
> 
>> 
>> If, for example, we add a field to stash the time remaining for the
>> VMCS12 VMX preemption timer, then presumably, userspace will enable it
>> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
>> that), and then userspace will know that the maximum nested state data
>> is 4 bytes larger.
> 
> In that case, why did we defined struct kvm_nested_state to hold a blob of data[] instead
> of separating the blob into well defined blobs? (e.g. Currently one blob for vmcs12 and another one for shadow vmcs12).
> Then when we add a new component which is opt-in by a new KVM_CAP, we will add another well defined blob
> to struct kvm_nested_state.
> 
> I think this is important because it allows us to specify in nested_state->flags which components are saved
> and create multiple VMState subsections with needed() methods for the various saved components.
> 
> Thus allowing for example to easily still migrate from a new QEMU which does stash the time remaining for the VMCS12 VMX preemption timer
> to an old QEMU which doesn’t stash it in case nested_state->flags specify that this component is not saved (Because L1 don’t use VMX preemption timer for example).
> 
> This seems to behave more nicely with how QEMU migration mechanism is defined and the purpose of VMState subsections.
> 
> In addition, if we will define struct kvm_nested_state like this, we will also not need the “size” field which needs to be carefully handled to avoid buffer overflows.
> (We will just define large enough buffers (with padding) for each opaque component such as vmcs12 and shadow vmcs12).
> 
>> 
>>> Also note that the approach suggested by Jim requires mgmt-layer at dest
>>> to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities it should enable on kvm_init().
>>> When we know we are migrating from a host which supports v1 to a host which supports v2,
>>> we should make sure that dest QEMU doesn't enable KVM_CAP_NESTED_STATE_V2.
>>> However, when we are just launching a new machine on the host which supports v2, we do want
>>> QEMU to enable KVM_CAP_NESTED_STATE_V2 enabled for that VM.
>> 
>> No, no, no. Even when launching a new VM on a host that supports v2,
>> you cannot enable v2 until you have passed rollback horizon. Should
>> you decide to roll back the kernel with v2 support, you must be able
>> to move that new VM to a host with an old kernel.
> 
> If we use VMState subsections as I described above, QEMU should be able to know which components of nested_state are
> actively saved by KVM and therefore are *required* to be restored on dest host in order to migrate without guest issues after it is resumed on dest.
> Therefore, still allowing migration from new hosts to old hosts in case guest didn’t enter a state which makes new saved state required in order
> for migration to succeed.
> 
> If the mechanism will work like this, nested_state KVM_CAPs enabled on QEMU launch are only used to inform KVM which
> struct kvm_nested_state is used by userspace. Not what is actually sent as part of migration stream.
> 
> What are your thoughts on this?
> 
> -Liran
> 
>> 
>>> But on second thought, I'm not sure that this is the right approach as-well.
>>> We don't really want the used version of nested_state to be determined on kvm_init().
>>> * On source QEMU, we actually want to determine it when preparing for migration based
>>> on to the support given by our destination host. If it's an old host, we would like to
>>> save an old version nested_state and if it's a new host, we will like to save our newest
>>> supported nested_state.
>>> * On dest QEMU, we will want to just be able to set received nested_state in KVM.
>>> 
>>> Therefore, I don't think that we want this versioning to be based on KVM_CAP at all.
>>> It seems that we would want the process to behave as follows:
>>> 1) Mgmt-layer at dest queries dest host max supported nested_state size.
>>>  (Which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE))
>>> 2) Mgmt-layer at source initiate migration to dest with requesting QEMU to send nested_state
>>>  matching dest max supported nested_state size.
>>>  When saving nested state using KVM_GET_NESTED_STATE IOCTL, QEMU will specify in nested_state->size
>>>  the *requested* size to be saved and KVM should be able to save only the information which matches
>>>  the version that worked with that size.
>>> 3) After some sanity checks on received migration stream, dest host use KVM_SET_NESTED_STATE IOCTL.
>>>  This IOCTL should deduce which information it should deploy based on given nested_state->size.
>>> 
>>> This also makes me wonder if it's not just nicer to use nested_state->flags to specify which
>>> information is actually present on nested_state instead of managing versioning with nested_state->size.
>> 
>> Yes, you can use nested_state->flags to determine what the data
>> payload is, but you cannot enable new flags unless userspace opts in.
>> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
>> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
>> events if userspace has opted-in with KVM_CAP_EXCEPTION_PAYLOAD. This
>> is because older kernels will reject kvm_vcpu_events that have the
>> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
>> 
>> You don't need a new KVM_CAP_NESTED_STATE_V2 ioctl. You just need
>> buy-in from userspace for any new data payload. Explicitly enumerating
>> the payload components in the flags field makes perfect sense.
Denis V. Lunev" via Nov. 8, 2018, 12:45 a.m. UTC | #31
On Wed, Nov 7, 2018 at 4:13 PM, Liran Alon <liran.alon@oracle.com> wrote:
> Ping on my last reply.
> I would like to reach to an agreement on how v3 should look like before just implementing what I think is right.
>
> Thanks,
> -Liran

I have no attachment to the current design. I had used a data[] blob,
because I didn't think userspace would have any need to know what was
in there. However, I am now seeing the error of my ways. For example,
the userspace instruction emulator needs to know the contents of the
vmcs12 to emulate instructions when in guest mode.

I had been in favor of KVM_CAPs, because they are one way to ensure
that the guest doesn't dynamically enter a state that isn't backwards
compatible. But other gates are also possible. If you have to support
destination kernels that don't know about the shadow vmcs12 component
of the nested state, userspace can just clear bit 46 of L1's
IA32_VMX_PROCBASED_CTLS2 MSR. For the VMX preemption timer, clear bit
38 of L1's IA32_VMX_PINBASED_CTLS MSR. KVM_CAPs can be reserved for
incompatible changes due to bug-fixes, like KVM_CAP_EXCEPTION_PAYLOAD,
where there isn't another gate available. If, for example, we hadn't
realized ahead of time that migration of the VMX preemption timer
isn't yet supported, some userspace implementation may not have
cleared bit 38 of L1's IA32_VMX_PINBASED_CTLS MSR, and then we would
have to gate the new nested state component behind a KVM_CAP.

As long as userspace can ensure that the kernel will only produce
backwards compatible save-state when it has to, my basic requirements
are met.
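
For the MSR-based gates, that amounts to masking the relevant allowed-1 bits
(the upper 32 bits) of the VMX capability MSRs userspace exposes to L1,
roughly like this (a sketch; how the masked values are plumbed into KVM is
left out):

#include <stdint.h>

/* Architectural MSR indices (Intel SDM). */
#define MSR_IA32_VMX_PINBASED_CTLS    0x481
#define MSR_IA32_VMX_PROCBASED_CTLS2  0x48b

/* Sketch: hide a feature from L1 by clearing its allowed-1 bit.  Bit 46 of
 * PROCBASED_CTLS2 is VMCS shadowing, bit 38 of PINBASED_CTLS is the VMX
 * preemption timer, as noted above. */
static uint64_t hide_shadow_vmcs12(uint64_t procbased_ctls2)
{
    return procbased_ctls2 & ~(1ULL << 46);
}

static uint64_t hide_vmx_preemption_timer(uint64_t pinbased_ctls)
{
    return pinbased_ctls & ~(1ULL << 38);
}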
Paolo Bonzini Nov. 8, 2018, 9:50 a.m. UTC | #32
On 08/11/2018 01:45, Jim Mattson wrote:
> I have no attachments to the current design. I had used a data[] blob,
> because I didn't think userspace would have any need to know what was
> in there. However, I am now seeing the error of my ways. For example,
> the userspace instruction emulator needs to know the contents of the
> vmcs12 to emulate instructions when in guest mode.

Yeah, we're probably going to have to document the KVM vmcs12 structure,
possibly moving it to uapi.  But that's a different thing from
save/restore state, which can use the 4K or 8K data[] blob.

Paolo
Liran Alon Nov. 8, 2018, 9:57 a.m. UTC | #33
> On 8 Nov 2018, at 11:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 08/11/2018 01:45, Jim Mattson wrote:
>> I have no attachments to the current design. I had used a data[] blob,
>> because I didn't think userspace would have any need to know what was
>> in there. However, I am now seeing the error of my ways. For example,
>> the userspace instruction emulator needs to know the contents of the
>> vmcs12 to emulate instructions when in guest mode.
> 
> Yeah, we're probably going to have to document the KVM vmcs12 structure,
> possibly moving it to uapi.  But that's a different thing from
> save/restore state, which can use the 4K or 8K data[] blob.
> 
> Paolo

But regardless of whether we document vmcs12 or not, the current blob we have today
should be separated into well-defined blobs/structs (cached_vmcs12 and cached_shadow_vmcs12),
and each blob should have a relevant flag that specifies whether it is valid (saved by the kernel, or requested to be restored by userspace).
Additional future nested state should be added as additional well-defined blobs/structs with appropriate flags.

Then, in QEMU, each such well-defined blob/struct should have its own subsection with a relevant .needed() method.
This will allow us to preserve required backwards compatibility.

Agreed?

-Liran
Paolo Bonzini Nov. 8, 2018, 5:02 p.m. UTC | #34
On 08/11/2018 10:57, Liran Alon wrote:
> 
> 
>> On 8 Nov 2018, at 11:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 08/11/2018 01:45, Jim Mattson wrote:
>>> I have no attachments to the current design. I had used a data[] blob,
>>> because I didn't think userspace would have any need to know what was
>>> in there. However, I am now seeing the error of my ways. For example,
>>> the userspace instruction emulator needs to know the contents of the
>>> vmcs12 to emulate instructions when in guest mode.
>>
>> Yeah, we're probably going to have to document the KVM vmcs12 structure,
>> possibly moving it to uapi.  But that's a different thing from
>> save/restore state, which can use the 4K or 8K data[] blob.
>>
>> Paolo
> 
> But regardless of if we document vmcs12 or not, the current blob we
> have today should be separated to well-defined blobs/structs (cached_vmcs12
> and cached_shadow_vmcs12) and each blob should have a relevant flag that specifies
> it is valid (saved by kernel or requested to be restored by userspace).
> Additional future nested-state should be added as additional
> well-defined blobs/structs with appropriate flags.

We are somewhat limited by the ABI of the released kernel versions, and
it's unlikely that the structure will change in the short term.  So I
think it's okay if we just define that the first 4K of data are the VMCS
and the second 8K are the shadow VMCS, whenever format=0 in the
kvm_nested_state struct.

Paolo

> Then, in QEMU, each such well-defined blob/struct should have it’s own subsection with a relevant .needed() method.
> This will allow us to preserve required backwards compatibility.
> 
> Agreed?
> 
> -Liran
>
Liran Alon Nov. 8, 2018, 6:41 p.m. UTC | #35
> On 8 Nov 2018, at 19:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 08/11/2018 10:57, Liran Alon wrote:
>> 
>> 
>>> On 8 Nov 2018, at 11:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> 
>>> On 08/11/2018 01:45, Jim Mattson wrote:
>>>> I have no attachments to the current design. I had used a data[] blob,
>>>> because I didn't think userspace would have any need to know what was
>>>> in there. However, I am now seeing the error of my ways. For example,
>>>> the userspace instruction emulator needs to know the contents of the
>>>> vmcs12 to emulate instructions when in guest mode.
>>> 
>>> Yeah, we're probably going to have to document the KVM vmcs12 structure,
>>> possibly moving it to uapi.  But that's a different thing from
>>> save/restore state, which can use the 4K or 8K data[] blob.
>>> 
>>> Paolo
>> 
>> But regardless of if we document vmcs12 or not, the current blob we
>> have today should be separated to well-defined blobs/structs (cached_vmcs12
>> and cached_shadow_vmcs12) and each blob should have a relevant flag that specifies
>> it is valid (saved by kernel or requested to be restored by userspace).
>> Additional future nested-state should be added as additional
>> well-defined blobs/structs with appropriate flags.
> 
> We are somewhat limited by the ABI of the released kernel versions, and
> it's unlikely that the structure will change in the short term.  So I
> think it's okay if we just define that the first 4K of data are the VMCS
> and the second 8K are the shadow VMCS, whenever format=0 in the
> kvm_nested_state struct.
> 
> Paolo

So what I plan to do is indeed to define the first 4K of data as vmcs12 and the next 4K as shadow_vmcs12.
I will also define each of them in a separate VMState subsection, each with its own .needed()
method that decides whether it is relevant to send based on kvm_state.size.
vmcs12 and shadow_vmcs12 will be put in a struct which is unioned with a struct
holding a 2-page buffer, to clearly indicate that one well-defined struct is used for VMX and the other for SVM
(based on kvm_state.format).
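
For illustration, a minimal sketch of one such subsection, assuming a hypothetical nested_state/nested_state_data cache in CPUX86State holding what KVM_GET_NESTED_STATE returned (all names here are illustrative, not the final patch):

static bool vmx_shadow_vmcs12_needed(void *opaque)
{
    X86CPU *cpu = opaque;
    struct kvm_nested_state *state = cpu->env.nested_state; /* hypothetical cache */

    /* Send the second data page only if the kernel actually returned it. */
    return state && state->format == 0 /* VMX */ &&
           state->size >= sizeof(*state) + 2 * 4096;
}

static const VMStateDescription vmstate_vmx_shadow_vmcs12 = {
    .name = "cpu/kvm_nested_state/vmx/shadow_vmcs12",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = vmx_shadow_vmcs12_needed,
    .fields = (VMStateField[]) {
        /* 4K page right after the vmcs12 page; field name is hypothetical. */
        VMSTATE_UINT8_ARRAY(env.nested_state_data.vmx.shadow_vmcs12, X86CPU, 4096),
        VMSTATE_END_OF_LIST()
    }
};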

In addition, I will change the KVM_{GET,SET}_NESTED_STATE documentation to indicate that
future nested state fields should be added at the end of the struct, and whether they are valid should
be determined by a flag in kvm_state.flags.
QEMU will handle these new states in the future by adding more VMState subsections that have a
.needed() method based on the appropriate flag in kvm_state.flags.
The struct layout that userspace uses against its local kernel will be determined based on KVM_CAPs.

If there are no objections, I will write the needed patches for QEMU (and the documentation patch for KVM).
(And throw away the current VMState technique I used in this version of the patch :P)

-Liran

> 
>> Then, in QEMU, each such well-defined blob/struct should have it’s own subsection with a relevant .needed() method.
>> This will allow us to preserve required backwards compatibility.
>> 
>> Agreed?
>> 
>> -Liran
>> 
>
Paolo Bonzini Nov. 8, 2018, 8:34 p.m. UTC | #36
On 08/11/2018 19:41, Liran Alon wrote:
> So what I plan to do is indeed to define first 4K of data as vmcs12 and next 4K as shadow_vmcs12.
> I will also define each of them in a separate VMState subsection that each will have it’s own .needed()
> method that will decide if it’s relevant to send it based on kvm_state.size.
> vmcs12 and shadow_vmcs12 will be put in a struct which is unioned with a struct
> of 2 pages buffer to clearly indicate that one well-defined struct is used for VMX and the other for SVM.
> (based on kvm_state.format)

And SVM will use other subsections.

> In addition, I will change KVM_{GET,SET}_NESTED_STATE documentation to indicate that
> future nested state fields should be added at the end of struct and weather they are valid should
> be determined by a flag in kvm_state.flags.
> QEMU will handle these new states in the future by adding more VMState subsections that have
> .needed() method based on appropriate flag in kvm_state.flags.
> The struct itself that userspace use against it’s local kernel will be determined based on KVM_CAPs.
> 
> If there are no objections, I will write the needed patches for QEMU (and the documentation patch for KVM).

Sure.  You can also place the flags and format in yet another subsection.

Paolo
Dr. David Alan Gilbert Nov. 12, 2018, 2:51 p.m. UTC | #37
* Liran Alon (liran.alon@oracle.com) wrote:
> 
> 
> > On 8 Nov 2018, at 19:02, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > On 08/11/2018 10:57, Liran Alon wrote:
> >> 
> >> 
> >>> On 8 Nov 2018, at 11:50, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> 
> >>> On 08/11/2018 01:45, Jim Mattson wrote:
> >>>> I have no attachments to the current design. I had used a data[] blob,
> >>>> because I didn't think userspace would have any need to know what was
> >>>> in there. However, I am now seeing the error of my ways. For example,
> >>>> the userspace instruction emulator needs to know the contents of the
> >>>> vmcs12 to emulate instructions when in guest mode.
> >>> 
> >>> Yeah, we're probably going to have to document the KVM vmcs12 structure,
> >>> possibly moving it to uapi.  But that's a different thing from
> >>> save/restore state, which can use the 4K or 8K data[] blob.
> >>> 
> >>> Paolo
> >> 
> >> But regardless of if we document vmcs12 or not, the current blob we
> >> have today should be separated to well-defined blobs/structs (cached_vmcs12
> >> and cached_shadow_vmcs12) and each blob should have a relevant flag that specifies
> >> it is valid (saved by kernel or requested to be restored by userspace).
> >> Additional future nested-state should be added as additional
> >> well-defined blobs/structs with appropriate flags.
> > 
> > We are somewhat limited by the ABI of the released kernel versions, and
> > it's unlikely that the structure will change in the short term.  So I
> > think it's okay if we just define that the first 4K of data are the VMCS
> > and the second 8K are the shadow VMCS, whenever format=0 in the
> > kvm_nested_state struct.
> > 

(Sorry, I was out last week).

> 
> So what I plan to do is indeed to define first 4K of data as vmcs12 and next 4K as shadow_vmcs12.
> I will also define each of them in a separate VMState subsection that each will have it’s own .needed()
> method that will decide if it’s relevant to send it based on kvm_state.size.
> vmcs12 and shadow_vmcs12 will be put in a struct which is unioned with a struct
> of 2 pages buffer to clearly indicate that one well-defined struct is used for VMX and the other for SVM.
> (based on kvm_state.format)
> 
> In addition, I will change KVM_{GET,SET}_NESTED_STATE documentation to indicate that
> future nested state fields should be added at the end of struct and weather they are valid should
> be determined by a flag in kvm_state.flags.
> QEMU will handle these new states in the future by adding more VMState subsections that have
> .needed() method based on appropriate flag in kvm_state.flags.
> The struct itself that userspace use against it’s local kernel will be determined based on KVM_CAPs.
> 
> If there are no objections, I will write the needed patches for QEMU (and the documentation patch for KVM).
> (And throw away the current VMState technique I used in this version of the patch :P)

Using subsections is good and makes it pretty clear on loading if
there's a problem; it also works well when you add new subsections
in new versions.  It can be tricky to get the logic of the .needed
functions right, so that you don't write out new subsections for
old configs and can thus still allow yourself to migrate to an older
qemu/kernel.
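
As an illustration of that pitfall, a .needed() for some future piece of nested state could be gated on its flag, so guests that never produced the new state keep emitting a stream an older qemu/kernel can still load (KVM_STATE_NESTED_FOO and the cached nested_state field are hypothetical names):

static bool nested_foo_needed(void *opaque)
{
    X86CPU *cpu = opaque;
    struct kvm_nested_state *state = cpu->env.nested_state; /* hypothetical cache */

    /* Stays false for old configs, so no new subsection is written out and
     * the stream remains migratable to an older qemu/kernel. */
    return state && (state->flags & KVM_STATE_NESTED_FOO);
}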

Dave

> -Liran
> 
> > 
> >> Then, in QEMU, each such well-defined blob/struct should have it’s own subsection with a relevant .needed() method.
> >> This will allow us to preserve required backwards compatibility.
> >> 
> >> Agreed?
> >> 
> >> -Liran
> >> 
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Daniel P. Berrangé Nov. 12, 2018, 4:18 p.m. UTC | #38
On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
> > We have usually followed a rule that new machine types must not
> > affect runability of a VM on a host. IOW new machine types should
> > not introduce dependancies on specific kernels, or hardware features
> > such as CPU flags.
> 
> > Anything that requires a new kernel feature thus ought to be an
> > opt-in config tunable on the CLI, separate from machine type
> > choice.
> 
> Unless someone tinkered with the module parameters, they could not even
> use nested virtualization before 4.20.  So for everyone else, "-cpu
> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
> requires 4.20.
>
> For those that did tinker with module parameters, we can grandfather in
> the old machine types, so that they can use nested virtualization with
> no live migration support.  For those that did not, however, I don't
> think it makes sense to say "oh by the way I really want to be able to
> migrate this VM" on the command line, or even worse on the monitor.

IIUC, 4.20 is only required from the POV of migration state. Is it thus
possible to just register a migration blocker if QEMU is launched
on a host with a kernel < 4.20?

Migration has always been busted historically, so those people using
nested VMX already won't be hurt by not having the ability to live migrate
their VMs, but could otherwise continue using them without being forced
to upgrade their kernel to fix a feature they're not even using.
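
A minimal sketch of that suggestion, assuming the check happens once during KVM accelerator setup; the function name, message text and exact placement are illustrative only:

static Error *nested_migration_blocker;

static void block_nested_migration_if_unsupported(KVMState *s)
{
    /* Kernels without KVM_CAP_NESTED_STATE (pre-4.20) cannot save/restore
     * nested state, so refuse migration rather than corrupt an L2 guest. */
    if (!kvm_check_extension(s, KVM_CAP_NESTED_STATE)) {
        error_setg(&nested_migration_blocker,
                   "Kernel lacks KVM_CAP_NESTED_STATE; "
                   "migration with nested VMX is unsafe");
        migrate_add_blocker(nested_migration_blocker, &error_abort);
    }
}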

Regards,
Daniel
Dr. David Alan Gilbert Nov. 12, 2018, 4:50 p.m. UTC | #39
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
> > On 02/11/2018 17:54, Daniel P. Berrangé wrote:
> > > We have usually followed a rule that new machine types must not
> > > affect runability of a VM on a host. IOW new machine types should
> > > not introduce dependancies on specific kernels, or hardware features
> > > such as CPU flags.
> > 
> > > Anything that requires a new kernel feature thus ought to be an
> > > opt-in config tunable on the CLI, separate from machine type
> > > choice.
> > 
> > Unless someone tinkered with the module parameters, they could not even
> > use nested virtualization before 4.20.  So for everyone else, "-cpu
> > ...,+vmx" does count as an "opt-in config tunable on the CLI" that
> > requires 4.20.
> >
> > For those that did tinker with module parameters, we can grandfather in
> > the old machine types, so that they can use nested virtualization with
> > no live migration support.  For those that did not, however, I don't
> > think it makes sense to say "oh by the way I really want to be able to
> > migrate this VM" on the command line, or even worse on the monitor.
> 
> IIUC, 4.20 is only required from POV of migration state. Is it thus
> possible to just register a migration blocker if QEMU is launched
> on a host with kernel < 4.20.
> 
> Migration has always been busted historically, so those people using
> nested VMX already won't be hurt by not having ability to live migrate
> their VM, but could otherwise continue using them without being forced
> to upgrade their kernel to fix a feature they're not even using.

Yes, although I am a bit worried we might have a population of users
that:
   a) Have enabled nesting
   b) Run VMs with vmx enabled
   c) Don't normally actually run nested guests
   d) Currently happily migrate.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini Nov. 12, 2018, 4:53 p.m. UTC | #40
On 12/11/2018 17:50, Dr. David Alan Gilbert wrote:
>> Migration has always been busted historically, so those people using
>> nested VMX already won't be hurt by not having ability to live migrate
>> their VM, but could otherwise continue using them without being forced
>> to upgrade their kernel to fix a feature they're not even using.
> Yes, although I am a bit worried we might have a population of users
> that:
>    a) Have enabled nesting
>    b) Run VMs with vmx enabled
>    c) Don't normally actually run nested guests
>    d) Currently happily migrate.

Hmm, let's add a migration blocker for nested virtualization.

Paolo
Daniel P. Berrangé Nov. 12, 2018, 4:54 p.m. UTC | #41
On Mon, Nov 12, 2018 at 04:50:54PM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
> > > On 02/11/2018 17:54, Daniel P. Berrangé wrote:
> > > > We have usually followed a rule that new machine types must not
> > > > affect runability of a VM on a host. IOW new machine types should
> > > > not introduce dependancies on specific kernels, or hardware features
> > > > such as CPU flags.
> > > 
> > > > Anything that requires a new kernel feature thus ought to be an
> > > > opt-in config tunable on the CLI, separate from machine type
> > > > choice.
> > > 
> > > Unless someone tinkered with the module parameters, they could not even
> > > use nested virtualization before 4.20.  So for everyone else, "-cpu
> > > ...,+vmx" does count as an "opt-in config tunable on the CLI" that
> > > requires 4.20.
> > >
> > > For those that did tinker with module parameters, we can grandfather in
> > > the old machine types, so that they can use nested virtualization with
> > > no live migration support.  For those that did not, however, I don't
> > > think it makes sense to say "oh by the way I really want to be able to
> > > migrate this VM" on the command line, or even worse on the monitor.
> > 
> > IIUC, 4.20 is only required from POV of migration state. Is it thus
> > possible to just register a migration blocker if QEMU is launched
> > on a host with kernel < 4.20.
> > 
> > Migration has always been busted historically, so those people using
> > nested VMX already won't be hurt by not having ability to live migrate
> > their VM, but could otherwise continue using them without being forced
> > to upgrade their kernel to fix a feature they're not even using.
> 
> Yes, although I am a bit worried we might have a population of users
> that:
>    a) Have enabled nesting
>    b) Run VMs with vmx enabled


>    c) Don't normally actually run nested guests
>    d) Currently happily migrate.

True, and (b) would include anyone using libvirt's host-model CPU. So if
you enabled nesting and have host-model for all guests, but only use nesting
in one of the guests, you'd be doomed.

Is it possible for QEMU to determine whether there are nested guests running
and conditionally block migration appropriately to ensure safety?


Regards,
Daniel
Liran Alon Nov. 12, 2018, 11:58 p.m. UTC | #42
> On 12 Nov 2018, at 18:50, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>> We have usually followed a rule that new machine types must not
>>>> affect runability of a VM on a host. IOW new machine types should
>>>> not introduce dependancies on specific kernels, or hardware features
>>>> such as CPU flags.
>>> 
>>>> Anything that requires a new kernel feature thus ought to be an
>>>> opt-in config tunable on the CLI, separate from machine type
>>>> choice.
>>> 
>>> Unless someone tinkered with the module parameters, they could not even
>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>> requires 4.20.
>>> 
>>> For those that did tinker with module parameters, we can grandfather in
>>> the old machine types, so that they can use nested virtualization with
>>> no live migration support.  For those that did not, however, I don't
>>> think it makes sense to say "oh by the way I really want to be able to
>>> migrate this VM" on the command line, or even worse on the monitor.
>> 
>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>> possible to just register a migration blocker if QEMU is launched
>> on a host with kernel < 4.20.
>> 
>> Migration has always been busted historically, so those people using
>> nested VMX already won't be hurt by not having ability to live migrate
>> their VM, but could otherwise continue using them without being forced
>> to upgrade their kernel to fix a feature they're not even using.
> 
> Yes, although I am a bit worried we might have a population of users
> that:
>   a) Have enabled nesting
>   b) Run VMs with vmx enabled
>   c) Don't normally actually run nested guests
>   d) Currently happily migrate.
> 
> Dave

First of all, we should put the entire kvm_nested_state in a VMState subsection that has a .needed() method
which checks ((format == VMX) && (vmx.vmxon_pa != -1ull)).
This will allow a VM that has VMX exposed but never actually entered VMX operation
to still migrate successfully to old hosts without any issues.
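
For illustration, a minimal sketch of that top-level check (the env.nested_state cache is a hypothetical name, not the final patch):

static bool nested_state_needed(void *opaque)
{
    X86CPU *cpu = opaque;
    struct kvm_nested_state *state = cpu->env.nested_state; /* hypothetical cache */

    /* Skip the whole subsection when the guest never entered VMX operation,
     * so such VMs can still migrate to old hosts. */
    return state && state->format == 0 /* VMX */ &&
           state->vmx.vmxon_pa != -1ull;
}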

The only problem remaining to be solved, which you discuss here, is what happens if the user runs a modern QEMU
(including my to-be-written v3 patches discussed here) on a host whose kernel lacks KVM_CAP_NESTED_STATE.
On such a kernel, it is impossible for userspace (e.g. QEMU) to know whether a guest that was exposed to VMX actually used it.
So for those cases, I see no option but to modify QEMU to also treat a guest exposed to VMX
running on a kernel without KVM_CAP_NESTED_STATE as a migration blocker.

-Liran

> 
>> Regards,
>> Daniel
>> -- 
>> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Liran Alon Nov. 13, 2018, midnight UTC | #43
> On 12 Nov 2018, at 18:54, Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> On Mon, Nov 12, 2018 at 04:50:54PM +0000, Dr. David Alan Gilbert wrote:
>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>>> We have usually followed a rule that new machine types must not
>>>>> affect runability of a VM on a host. IOW new machine types should
>>>>> not introduce dependancies on specific kernels, or hardware features
>>>>> such as CPU flags.
>>>> 
>>>>> Anything that requires a new kernel feature thus ought to be an
>>>>> opt-in config tunable on the CLI, separate from machine type
>>>>> choice.
>>>> 
>>>> Unless someone tinkered with the module parameters, they could not even
>>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>>> requires 4.20.
>>>> 
>>>> For those that did tinker with module parameters, we can grandfather in
>>>> the old machine types, so that they can use nested virtualization with
>>>> no live migration support.  For those that did not, however, I don't
>>>> think it makes sense to say "oh by the way I really want to be able to
>>>> migrate this VM" on the command line, or even worse on the monitor.
>>> 
>>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>>> possible to just register a migration blocker if QEMU is launched
>>> on a host with kernel < 4.20.
>>> 
>>> Migration has always been busted historically, so those people using
>>> nested VMX already won't be hurt by not having ability to live migrate
>>> their VM, but could otherwise continue using them without being forced
>>> to upgrade their kernel to fix a feature they're not even using.
>> 
>> Yes, although I am a bit worried we might have a population of users
>> that:
>>   a) Have enabled nesting
>>   b) Run VMs with vmx enabled
> 
> 
>>   c) Don't normally actually run nested guests
>>   d) Currently happily migrate.
> 
> True, and (b) would include anyone using libvirt's  host-model CPU. So if
> you enabled nesting, have host-model for all guests, but only use nesting
> in one of the guests, you'd be doomed.
> 
> Is it possible for QEMU to determine if there are nested guests running or
> not and conditionally block migration appropriately to ensure safety ?


Only if the kernel supports KVM_CAP_NESTED_STATE.
See my reply to Dave in this thread.

-Liran

> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Denis V. Lunev" via Nov. 13, 2018, 12:07 a.m. UTC | #44
On Mon, Nov 12, 2018 at 4:00 PM, Liran Alon <liran.alon@oracle.com> wrote:
>
>
>> On 12 Nov 2018, at 18:54, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Mon, Nov 12, 2018 at 04:50:54PM +0000, Dr. David Alan Gilbert wrote:
>>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>>>> We have usually followed a rule that new machine types must not
>>>>>> affect runability of a VM on a host. IOW new machine types should
>>>>>> not introduce dependancies on specific kernels, or hardware features
>>>>>> such as CPU flags.
>>>>>
>>>>>> Anything that requires a new kernel feature thus ought to be an
>>>>>> opt-in config tunable on the CLI, separate from machine type
>>>>>> choice.
>>>>>
>>>>> Unless someone tinkered with the module parameters, they could not even
>>>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>>>> requires 4.20.
>>>>>
>>>>> For those that did tinker with module parameters, we can grandfather in
>>>>> the old machine types, so that they can use nested virtualization with
>>>>> no live migration support.  For those that did not, however, I don't
>>>>> think it makes sense to say "oh by the way I really want to be able to
>>>>> migrate this VM" on the command line, or even worse on the monitor.
>>>>
>>>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>>>> possible to just register a migration blocker if QEMU is launched
>>>> on a host with kernel < 4.20.
>>>>
>>>> Migration has always been busted historically, so those people using
>>>> nested VMX already won't be hurt by not having ability to live migrate
>>>> their VM, but could otherwise continue using them without being forced
>>>> to upgrade their kernel to fix a feature they're not even using.
>>>
>>> Yes, although I am a bit worried we might have a population of users
>>> that:
>>>   a) Have enabled nesting
>>>   b) Run VMs with vmx enabled
>>
>>
>>>   c) Don't normally actually run nested guests
>>>   d) Currently happily migrate.
>>
>> True, and (b) would include anyone using libvirt's  host-model CPU. So if
>> you enabled nesting, have host-model for all guests, but only use nesting
>> in one of the guests, you'd be doomed.
>>
>> Is it possible for QEMU to determine if there are nested guests running or
>> not and conditionally block migration appropriately to ensure safety ?
>
>
> Only if kernel supports KVM_CAP_NESTED_STATE.
> See my reply to Dave in this thread.

You could still allow migration if CR4.VMXE is clear.
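
A rough sketch of that refinement, assuming QEMU's existing CR4_VMXE_MASK and CPUID_EXT_VMX definitions; where exactly such a check would be hooked in is left open:

static bool vmx_actually_enabled_by_guest(CPUX86State *env)
{
    /* A blocker is only needed once the guest has both been offered VMX and
     * turned it on via CR4.VMXE; while CR4.VMXE is clear, no nested state
     * can exist yet, so migration can still be allowed. */
    return (env->features[FEAT_1_ECX] & CPUID_EXT_VMX) &&
           (env->cr[4] & CR4_VMXE_MASK);
}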
Liran Alon Nov. 13, 2018, 12:09 a.m. UTC | #45
> On 13 Nov 2018, at 2:07, Jim Mattson <jmattson@google.com> wrote:
> 
> On Mon, Nov 12, 2018 at 4:00 PM, Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>>> On 12 Nov 2018, at 18:54, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>> 
>>> On Mon, Nov 12, 2018 at 04:50:54PM +0000, Dr. David Alan Gilbert wrote:
>>>> * Daniel P. Berrangé (berrange@redhat.com) wrote:
>>>>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>>>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>>>>> We have usually followed a rule that new machine types must not
>>>>>>> affect runability of a VM on a host. IOW new machine types should
>>>>>>> not introduce dependancies on specific kernels, or hardware features
>>>>>>> such as CPU flags.
>>>>>> 
>>>>>>> Anything that requires a new kernel feature thus ought to be an
>>>>>>> opt-in config tunable on the CLI, separate from machine type
>>>>>>> choice.
>>>>>> 
>>>>>> Unless someone tinkered with the module parameters, they could not even
>>>>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>>>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>>>>> requires 4.20.
>>>>>> 
>>>>>> For those that did tinker with module parameters, we can grandfather in
>>>>>> the old machine types, so that they can use nested virtualization with
>>>>>> no live migration support.  For those that did not, however, I don't
>>>>>> think it makes sense to say "oh by the way I really want to be able to
>>>>>> migrate this VM" on the command line, or even worse on the monitor.
>>>>> 
>>>>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>>>>> possible to just register a migration blocker if QEMU is launched
>>>>> on a host with kernel < 4.20.
>>>>> 
>>>>> Migration has always been busted historically, so those people using
>>>>> nested VMX already won't be hurt by not having ability to live migrate
>>>>> their VM, but could otherwise continue using them without being forced
>>>>> to upgrade their kernel to fix a feature they're not even using.
>>>> 
>>>> Yes, although I am a bit worried we might have a population of users
>>>> that:
>>>>  a) Have enabled nesting
>>>>  b) Run VMs with vmx enabled
>>> 
>>> 
>>>>  c) Don't normally actually run nested guests
>>>>  d) Currently happily migrate.
>>> 
>>> True, and (b) would include anyone using libvirt's  host-model CPU. So if
>>> you enabled nesting, have host-model for all guests, but only use nesting
>>> in one of the guests, you'd be doomed.
>>> 
>>> Is it possible for QEMU to determine if there are nested guests running or
>>> not and conditionally block migration appropriately to ensure safety ?
>> 
>> 
>> Only if kernel supports KVM_CAP_NESTED_STATE.
>> See my reply to Dave in this thread.
> 
> You could still allow migration if CR4.VMXE is clear.

Agreed. Nice addition :)

Thanks,
-Liran