[0/5] KVM: x86: Fix breakage in KVM_SET_XSAVE's ABI

Message ID 20230928001956.924301-1-seanjc@google.com (mailing list archive)

Message

Sean Christopherson Sept. 28, 2023, 12:19 a.m. UTC
Rework how KVM limits guest-unsupported xfeatures so that they are hidden
only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace
load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of
what features have been exposed to the guest.

The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea
("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):

    As a bonus, it will also fail if userspace tries to set fpu features
    (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
    configuration.  Such features will never be returned by KVM_GET_XSAVE
    or KVM_GET_XSAVE2.

Preventing userspace from doing stupid things is usually a good idea, but in
this case restricting KVM_SET_XSAVE actually exacerbated the problem that
commit ad856280ddea was fixing.  As reported by Tyler, rejecting KVM_SET_XSAVE
for guest-unsupported xfeatures breaks live migration from a kernel without
commit ad856280ddea, to a kernel with ad856280ddea.  I.e. from a kernel that
saves guest-unsupported xfeatures to a kernel that doesn't allow loading
guest-unsupported xfeatures.

To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails,
and so the end result is that the live migration results in (possibly silent)
guest data corruption instead of a failed migration.

Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures
to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of
modifying user_xfeatures directly.
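
As a rough illustration of the resulting behavior (not the actual kernel
code), the two paths can be modeled with plain bitmasks.  The struct and
function names below are hypothetical; only the ioctl names and
guest_supported_xcr0 come from the series.  Bit 9 is used as an example of a
host-supported xfeature (PKRU) that is not exposed to the guest.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a vCPU's xfeature masks; not a kernel structure. */
struct vcpu_model {
	uint64_t host_supported;   /* xfeatures the host kernel can save/restore */
	uint64_t guest_supported;  /* xfeatures exposed to the guest (cf. guest_supported_xcr0) */
};

/* Save path (KVM_GET_XSAVE analogue): report only guest-supported components. */
static uint64_t model_get_xsave(const struct vcpu_model *v, uint64_t live_xfeatures)
{
	return live_xfeatures & v->guest_supported;
}

/* Load path (KVM_SET_XSAVE analogue): accept anything the host supports. */
static int model_set_xsave(const struct vcpu_model *v, uint64_t snapshot_xfeatures)
{
	if (snapshot_xfeatures & ~v->host_supported)
		return -EINVAL;
	return 0;
}
```

With host_supported = 0x207 and guest_supported = 0x7, a snapshot carrying
0x207 from an older kernel loads successfully, while a later save reports
only 0x7.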

Patches 3-5 are regression tests.

I have no objection if anyone wants patches 1 and 2 squashed together; I
split them purely to make review easier.

Note, this doesn't fix the scenario where a guest is migrated from a "bad"
to a "good" kernel and the target host doesn't support the over-saved set
of xfeatures.  I don't see a way to safely handle that in the kernel without
an opt-in, which more or less defeats the purpose of handling it in KVM.

Sean Christopherson (5):
  x86/fpu: Allow caller to constrain xfeatures when copying to uabi
    buffer
  KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
  KVM: selftests: Touch relevant XSAVE state in guest for state test
  KVM: selftests: Load XSAVE state into untouched vCPU during state test
  KVM: selftests: Force load all supported XSAVE state in state test

 arch/x86/include/asm/fpu/api.h                |   3 +-
 arch/x86/kernel/fpu/core.c                    |   5 +-
 arch/x86/kernel/fpu/xstate.c                  |  12 +-
 arch/x86/kernel/fpu/xstate.h                  |   3 +-
 arch/x86/kvm/cpuid.c                          |   8 --
 arch/x86/kvm/x86.c                            |  37 +++---
 .../selftests/kvm/include/x86_64/processor.h  |  23 ++++
 .../testing/selftests/kvm/x86_64/state_test.c | 110 +++++++++++++++++-
 8 files changed, 168 insertions(+), 33 deletions(-)


base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3

Comments

Leonardo Bras Oct. 4, 2023, 7:11 a.m. UTC | #1
On Wed, Sep 27, 2023 at 05:19:51PM -0700, Sean Christopherson wrote:
> Rework how KVM limits guest-unsupported xfeatures to effectively hide
> only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace
> load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of
> what features have been exposed to the guest.

Ok, IIUC your changes provide:
- KVM_GET_XSAVE will return only guest-supported xfeatures
- KVM_SET_XSAVE will allow user to set any xfeatures supported by host
Is that correct?

> 
> The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea
> ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
> 
>     As a bonus, it will also fail if userspace tries to set fpu features
>     (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
>     configuration.  Such features will never be returned by KVM_GET_XSAVE
>     or KVM_GET_XSAVE2.
> 
> Preventing userspace from doing stupid things is usually a good idea, but in
> this case restricting KVM_SET_XSAVE actually exacerbated the problem that
> commit ad856280ddea was fixing.  As reported by Tyler, rejecting KVM_SET_XSAVE
> for guest-unsupported xfeatures breaks live migration from a kernel without
> commit ad856280ddea, to a kernel with ad856280ddea.  I.e. from a kernel that
> saves guest-unsupported xfeatures to a kernel that doesn't allow loading
> guest-unsupported xfeatures.

So this patch is supposed to fix migration of VM from a host with
pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set (NEW).
Right?

Let's get the scenario here, where all machines are the same:
1 - VM created on OLD kernel with a host-supported xfeature F, which is not
    guest supported.
2 - VM is migrated to a NEW kernel/host, and KVM_SET_XSAVE xfeature F.
3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which
    returns only guest-supported xfeatures, and this is passed to next host
4 - VM will be started on 3rd host with guest-supported xfeatures, meaning
    xfeature F is filtered-out, which is not good, because the VM will have
    less features compared to boot.

In fact, I notice something would possibly happen between 2 and 3, since
qemu will run KVM_GET_XSAVE at kvm_cpu_synchronize_state() and
KVM_SET_XSAVE at kvm_cpu_exec(), which happens quite often (when vcpu stops
/ resumes for some reason).


Also, even if I got something wrong, and for some reason qemu will be able
to store the original VM xfeatures between migrations, we have the original
issue ad856280ddea was dealing with: newer machines -> older machines
migration:

1 - User gets a VM from an OLD kernel, with a newer host (more xfeatures).
2 - User migrates VM to NEW kernel, and we suppose qemu stores the original
    xfeatures (it works). Migration can occur to newer or same gen hosts.
3 - At some point, if migration is attempted to an older host (less
    xfeatures), qemu will abort the VM.

> 
> To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails,
> and so the end result is that the live migration results in (possibly silent)
> guest data corruption instead of a failed migration.

And this is something that really needs to be fixed in QEMU side.

> 
> Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures
> to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of
> modifying user_xfeatures directly.

With my current understanding of this patchset, I would not recommend merging
it, as it would introduce a lot of undesired behaviors.

Please let me know if I got something wrong, so I can review it again.

Thanks!
Leo

Tyler Stachecki Oct. 4, 2023, 12:21 p.m. UTC | #2
On Wed, Oct 04, 2023 at 04:11:52AM -0300, Leonardo Bras wrote:
> So this patch is supposed to fix migration of VM from a host with
> pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set(NEW).
> Right?
> 
> Let's get the scenario here, where all machines are the same:
> 1 - VM created on OLD kernel with a host-supported xfeature F, which is not
>     guest supported.
> 2 - VM is migrated to a NEW kernel/host, and KVM_SET_XSAVE xfeature F.
> 3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which
>     returns only guest-supported xfeatures, and this is passed to next host
> 4 - VM will be started on 3rd host with guest-supported xfeatures, meaning
>     xfeature F is filtered-out, which is not good, because the VM will have
>     less features compared to boot.

This is what I was (trying) to convey earlier...

See Sean's response here:
https://lore.kernel.org/all/ZRMHY83W%2FVPjYyhy@google.com/

I'll copy the pertinent part of his very detailed response inline:
> KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been
> KVM's ABI for a very long time, and userspace absolutely relies on that
> functionality to ensure that a VM can be migrated within a pool of heterogenous
> systems so long as the features that are *exposed* to the guest are supported
> on all platforms.

My 2 cents: as an outsider with less familiarity of the KVM code, it is hard
to understand the contract here with the guest/userspace. It seems there is a
fundamental question of whether or not "superfluous" features, those being
host-supported features which extend that which the guest is actually capable
of, can be removed between the time that the guest boots and when it
terminates, through however many live-migrations that may be.

Ultimately, this problem is not really fixable if said features cannot be
removed.

Is there an RFC or document which captures expectations of this form?
Sean Christopherson Oct. 4, 2023, 2:51 p.m. UTC | #3
On Wed, Oct 04, 2023, Tyler Stachecki wrote:
> On Wed, Oct 04, 2023 at 04:11:52AM -0300, Leonardo Bras wrote:
> > So this patch is supposed to fix migration of VM from a host with
> > pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set(NEW).
> > Right?
> > 
> > Let's get the scenario here, where all machines are the same:
> > 1 - VM created on OLD kernel with a host-supported xfeature F, which is not
> >     guest supported.
> > 2 - VM is migrated to a NEW kernel/host, and KVM_SET_XSAVE xfeature F.
> > 3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which
> >     returns only guest-supported xfeatures, and this is passed to next host
> > 4 - VM will be started on 3rd host with guest-supported xfeatures, meaning
> >     xfeature F is filtered-out, which is not good, because the VM will have
> >     less features compared to boot.

No, the VM will not have less features, because KVM_SET_XSAVE loads *data*, not
features.  On a host that supports xfeature F, the VM is running with garbage data
no matter what, which is perfectly fine because from the guest's perspective, that
xfeature and its associated data do not exist.

And in all likelihood, unless QEMU is doing something bizarre, the data that is
loaded via KVM_SET_XSAVE will be the exact same data that is already present in
the guest FPU state, as both will be in the init state.

On top of that, the data that is loaded via KVM_SET_XSAVE may not actually be
loaded into hardware, i.e. may never be exposed to the guest.  E.g. IIRC, the
original issue was with PKRU.  If PKU is supported by the host, but not exposed
to the guest, KVM will run the guest with the *host's* PKRU value.
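
A minimal sketch of the data-vs-features point, under the simplifying
assumption that a snapshot is a bitmap plus one payload word per component.
The types and helpers are hypothetical and only model the argument: trimming
a component the guest cannot see does not change anything the guest can
observe.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_XF 16

/* Hypothetical snapshot: bitmap of components plus one word of data each. */
struct snapshot {
	uint64_t xfeatures;
	uint32_t data[NR_XF];
};

/* Drop components that are not exposed to the guest (save-side trimming). */
static struct snapshot trim(struct snapshot s, uint64_t guest_supported)
{
	for (int i = 0; i < NR_XF; i++) {
		if (!(guest_supported & (1ULL << i))) {
			s.xfeatures &= ~(1ULL << i);
			s.data[i] = 0;
		}
	}
	return s;
}

/* Compare two snapshots the way the guest would: exposed components only. */
static bool guest_view_equal(const struct snapshot *a, const struct snapshot *b,
			     uint64_t guest_supported)
{
	for (int i = 0; i < NR_XF; i++) {
		uint64_t bit = 1ULL << i;

		if (!(guest_supported & bit))
			continue;
		if (((a->xfeatures ^ b->xfeatures) & bit) || a->data[i] != b->data[i])
			return false;
	}
	return true;
}
```

In this model, a snapshot that carries data for xfeature F and its trimmed
counterpart compare equal from the guest's point of view whenever F is not
in guest_supported.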

> This is what I was (trying) to convey earlier...
> 
> See Sean's response here:
> https://lore.kernel.org/all/ZRMHY83W%2FVPjYyhy@google.com/
> 
> I'll copy the pertinent part of his very detailed response inline:
> > KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been
> > KVM's ABI for a very long time, and userspace absolutely relies on that
> > functionality to ensure that a VM can be migrated within a pool of heterogenous
> > systems so long as the features that are *exposed* to the guest are supported
> > on all platforms.
> 
> My 2 cents: as an outsider with less familiarity of the KVM code, it is hard
> to understand the contract here with the guest/userspace. It seems there is a
> fundamental question of whether or not "superfluous" features, those being
> host-supported features which extend that which the guest is actually capable
> of, can be removed between the time that the guest boots and when it
> terminates, through however many live-migrations that may be.

KVM's ABI has no formal notion of guest boot=>shutdown or live migration.  The
myriad KVM_GET_* APIs allow taking a snapshot of guest state, and the KVM_SET_*
APIs allow loading a snapshot of guest state.  Live migration is probably the most
common use of those APIs, but there are other use cases.

That matters because KVM's contract with userspace for KVM_SET_XSAVE (or any other
state save/load ioctl()) doesn't have a holistic view of the guest, e.g. KVM can't
know that userspace is live migrating a VM, and that userspace's attempt to load
data for an unsupported xfeature is ok because the xfeature isn't exposed to the
guest.

In other words, at the time of KVM_SET_XSAVE, KVM has no way of knowing that an
xfeature is superfluous.  Normally, that's a complete non-issue because there is
no superfluous xfeature data, as KVM's contract for KVM_GET_XSAVE{2} is that only
necessary data is saved in the snapshot.

Unfortunately, the original bug that led to this mess broke the contract for
KVM_GET_XSAVE{2}, and I don't see a safe way to work around that bug in KVM without
an opt-in from userspace.

> Ultimately, this problem is not really fixable if said features cannot be
> removed.

It's not about removing features.  The change you're asking for is to have KVM
*silently* drop data.  Aside from the fact that such a change would break KVM's
ABI, silently ignoring data that userspace has explicitly requested be loaded for
a vCPU is incredibly dangerous.

E.g. a not too far fetched scenario would be:

   1. xfeature X is supported on Host A and exposed to a guest 
   2. Host B is upgraded to a new kernel that has a bug that causes the kernel
      to disable support for X, even though X is supported in hardware
   3. The guest is live migrated from Host A to Host B

At step #3, what will currently happen is that KVM_SET_XSAVE will fail with -EINVAL
because userspace is attempting to load data that Host B is incapable of loading.

The change you're suggesting would result in KVM dropping the data for X and
letting KVM_SET_XSAVE succeed, *for an xfeature that is exposed to the guest*.
I.e. for all intents and purposes, KVM would deliberately corrupt guest data.
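
To make the difference concrete, here is a small sketch contrasting the two
policies.  Both functions are hypothetical models over plain bitmasks, not
KVM code; "strict" mirrors the -EINVAL behavior described above, "lossy" is
the silently-dropping variant being argued against.

```c
#include <errno.h>
#include <stdint.h>

/* Strict: refuse to load a snapshot containing anything the host can't load. */
static int set_xsave_strict(uint64_t host_supported, uint64_t snapshot,
			    uint64_t *loaded)
{
	if (snapshot & ~host_supported)
		return -EINVAL;
	*loaded = snapshot;
	return 0;
}

/* Lossy (hypothetical): silently drop what the host can't load and succeed. */
static int set_xsave_lossy(uint64_t host_supported, uint64_t snapshot,
			   uint64_t *loaded)
{
	*loaded = snapshot & host_supported;
	return 0;
}
```

If Host B wrongly reports 0x3 as supported and the guest's snapshot is 0x203,
the strict variant fails the load and the migration can be aborted, whereas
the lossy variant reports success after discarding the data for bit 9, which
in the scenario above is exposed to the guest.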

> Is there an RFC or document which captures expectations of this form?

Not AFAIK.  :-/
Tyler Stachecki Oct. 4, 2023, 3:29 p.m. UTC | #4
On Wed, Oct 04, 2023 at 07:51:17AM -0700, Sean Christopherson wrote:
> KVM's ABI has no formal notion of guest boot=>shutdown or live migration.  The
> myriad KVM_GET_* APIs allow taking a snapshot of guest state, and the KVM_SET_*
> APIs allow loading a snapshot of guest state.  Live migration is probably the most
> common use of those APIs, but there are other use cases.

I think the lightbulb just clicked, it is really this:

> No, the VM will not have less features, because KVM_SET_XSAVE loads *data*, not
> features [...]

I think I'm conflating the data vs. features aspect here and will have to
revisit my understanding of the code...

> > Ultimately, this problem is not really fixable if said features cannot be
> > removed.

> It's not about removing features.  The change you're asking for is to have KVM
> *silently* drop data.  Aside from the fact that such a change would break KVM's
> ABI, silently ignoring data that userspace has explicitly requested be loaded for
> a vCPU is incredibly dangerous.

Sorry if it came off that way - I fully understand and am resigned to the "you
break it, you keep both halves" nature of what I had initially proposed and
that it is not a generally tractable solution.

That being said, I genuinely appreciate your jump to action on this problem!

Thanks,
Tyler
Sean Christopherson Oct. 4, 2023, 4:54 p.m. UTC | #5
On Wed, Oct 04, 2023, Tyler Stachecki wrote:
> On Wed, Oct 04, 2023 at 07:51:17AM -0700, Sean Christopherson wrote:
 
> > It's not about removing features.  The change you're asking for is to have KVM
> > *silently* drop data.  Aside from the fact that such a change would break KVM's
> > ABI, silently ignoring data that userspace has explicitly requested be loaded for
> > a vCPU is incredibly dangerous.
> 
> Sorry if it came off that way

No need to apologise, you got bit by a nasty kernel bug and are trying to find a
solution.  There's nothing wrong with that.

> I fully understand and am resigned to the "you
> break it, you keep both halves" nature of what I had initially proposed and
> that it is not a generally tractable solution.

Yeah, the crux of the matter is that we have no control or even knowledge of who
all is using KVM, with what userspace VMM, on what hardware, etc.  E.g. if this
bug were affecting our fleet and for some reason we couldn't address the problem
in userspace, carrying a hack in KVM in our internal kernel would probably be a
viable option because we can do a proper risk assessment.  E.g. we know and control
exactly what userspace we're running, the underlying hardware in affected pools,
what features are exposed to the guest, etc.  And we could revert the hack once
all affected VMs had been sanitized.
Sean Christopherson Oct. 5, 2023, 1:29 a.m. UTC | #6
On Wed, 27 Sep 2023 17:19:51 -0700, Sean Christopherson wrote:
> Rework how KVM limits guest-unsupported xfeatures to effectively hide
> only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace
> load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of
> what features have been exposed to the guest.
> 
> The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea
> ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
> 
> [...]

Applied to kvm-x86 fpu, even though there is still ongoing discussion.  I want
to get this exposure in -next sooner rather than later.  I'll keep this in its own
branch so it'll be easier to rewrite/discard if necessary.

[1/5] x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
      https://github.com/kvm-x86/linux/commit/2d287ec65e79
[2/5] KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
      https://github.com/kvm-x86/linux/commit/27526efb5cff
[3/5] KVM: selftests: Touch relevant XSAVE state in guest for state test
      https://github.com/kvm-x86/linux/commit/ff0654c71fb6
[4/5] KVM: selftests: Load XSAVE state into untouched vCPU during state test
      https://github.com/kvm-x86/linux/commit/d7b8762ec4a3
[5/5] KVM: selftests: Force load all supported XSAVE state in state test
      https://github.com/kvm-x86/linux/commit/afb2c7e27a7f

--
https://github.com/kvm-x86/linux/tree/next
Paolo Bonzini Oct. 12, 2023, 2:45 p.m. UTC | #7
Queued, thanks.

Paolo