
[v3] KVM: VMX: Enable Notify VM exit

Message ID 20220223062412.22334-1-chenyi.qiang@intel.com (mailing list archive)
State New, archived
Series [v3] KVM: VMX: Enable Notify VM exit

Commit Message

Chenyi Qiang Feb. 23, 2022, 6:24 a.m. UTC
From: Tao Xu <tao3.xu@intel.com>

There are cases where a malicious virtual machine can cause the CPU to get
stuck (because no event window ever opens up), e.g., an infinite loop in
microcode when a #AC is delivered in nested mode (CVE-2015-5307). With no
event window, no event (NMI, SMI or IRQ) can be delivered, leaving the CPU
unavailable to the host and to other VMs.

A VMM can enable notify VM exits, which cause a VM exit to be generated if
no event window occurs in VMX non-root operation for a specified amount of
time (the notify window).

Feature enabling:
- A new secondary processor-based VM-execution control,
  SECONDARY_EXEC_NOTIFY_VM_EXITING, is introduced to enable this feature.
  The VMM can set the NOTIFY_WINDOW vmcs field to adjust the expected
  notify window.
- Expose a module parameter so that the admin can configure the notify
  window, in units of crystal clock cycles (see the enabling sketch after
  this list):
  - if notify_window < 0, the feature is disabled;
  - if notify_window >= 0, the feature is enabled.
- There's a possibility, however small, that a notify VM exit happens
  with VM_CONTEXT_INVALID set in the exit qualification. In this case,
  the vcpu can no longer run. To avoid killing a well-behaved guest, the
  notify window defaults to -1, i.e. the feature is disabled by default.
- It's safe to set the notify window even to zero, since an internal
  hardware threshold is added to vmcs.notify_window.
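
A minimal sketch of the enabling path described above; the capability
helper name is illustrative and not necessarily what the patch uses:

        /*
         * Module parameter, read-only: < 0 disables the feature, >= 0
         * programs the notify window (in crystal clock units).
         */
        static int __read_mostly notify_window = -1;
        module_param(notify_window, int, 0444);

        /*
         * In vmcs setup, e.g. from init_vmcs().  cpu_has_notify_vmexit()
         * stands in for whatever capability helper the patch adds.
         */
        if (cpu_has_notify_vmexit() && notify_window >= 0) {
                secondary_exec_controls_setbit(vmx,
                                SECONDARY_EXEC_NOTIFY_VM_EXITING);
                vmcs_write32(NOTIFY_WINDOW, notify_window);
        }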

VM exit handling:
- Introduce a vcpu stat notify_window_exits to record the count of
  notify VM exits and expose it through debugfs.
- Warn about notify VM exits in the kernel log, so that the host can
  a) get an indication that a guest is potentially malicious and
  b) rule out (or confirm) notify VM exits as the source of degraded
  guest performance.
- A notify VM exit can happen incident to delivery of a vectored event.
  Allow that in KVM (see the handler sketch after this list).
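
A minimal sketch of what the exit handler could look like; function and
field names follow the description above but are illustrative, and the
reporting to userspace is simplified:

        static int handle_notify(struct kvm_vcpu *vcpu)
        {
                unsigned long exit_qual = vmx_get_exit_qual(vcpu);

                ++vcpu->stat.notify_window_exits;
                pr_warn_ratelimited("Notify window exceeded on vCPU %d\n",
                                    vcpu->vcpu_id);

                /* Context still valid: nothing is lost, just resume. */
                if (!(exit_qual & NOTIFY_VM_CONTEXT_INVALID))
                        return 1;

                /*
                 * Context invalid: the vCPU can no longer run.  If this
                 * happened in L2, kill only L2 by synthesizing a nested
                 * TRIPLE_FAULT (see "Nested handling" below).
                 */
                if (is_guest_mode(vcpu)) {
                        kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                        return 1;
                }

                vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
                return 0;
        }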

Nested handling:
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as in vmcs01, so that L1 can't escape the
  restriction of notify VM exits by launching an L2 VM.
- When the L2 VM context is invalid, synthesize a nested
  EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
  VM_CONTEXT_INVALID notify VM exit.

Notify VM exit is defined in the latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.

TODO: Allow changing the window size (and thus enabling the feature) at
runtime, which would make management more flexible.

---
Change logs:
v2 -> v3
- add a vcpu stat notify_window_exits to record the number of
  occurrences as well as a pr_warn output. (Sean)
- Add handling in nested VMs to prevent L1 from bypassing the
  restriction by launching an L2. (Sean)
- Only kill L2 when the L2 VM context is invalid, by synthesizing an
  EXIT_REASON_TRIPLE_FAULT to L1 (Sean)
- To ease the current implementation, make module parameter
  notify_window read-only. (Sean)
- Disable notify window exit by default.
- v2: https://lore.kernel.org/lkml/20210525051204.1480610-1-tao3.xu@intel.com/

v1 -> v2
- Default the notify window to 0; a value less than 0 disables the feature.
- Add more description in commit message.
---

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

---
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/include/asm/vmx.h         |  7 ++++
 arch/x86/include/asm/vmxfeatures.h |  1 +
 arch/x86/include/uapi/asm/vmx.h    |  4 +-
 arch/x86/kvm/vmx/capabilities.h    |  7 ++++
 arch/x86/kvm/vmx/nested.c          | 16 +++++++-
 arch/x86/kvm/vmx/vmx.c             | 59 +++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c                 |  3 +-
 include/uapi/linux/kvm.h           |  2 +
 9 files changed, 95 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini Feb. 25, 2022, 11:54 a.m. UTC | #1
On 2/23/22 07:24, Chenyi Qiang wrote:
> Nested handling
> - Nested notify VM exits are not supported yet. Keep the same notify
>    window control in vmcs02 as vmcs01, so that L1 can't escape the
>    restriction of notify VM exits through launching L2 VM.
> - When L2 VM is context invalid, synthesize a nested
>    EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
>    VM_CONTEXT_INVALID happens.
> 
> Notify VM exit is defined in latest Intel Architecture Instruction Set
> Extensions Programming Reference, chapter 9.2.
> 
> TODO: Allow to change the window size (to enable the feature) at runtime,
> which can make it more flexible to do management.

I only have a couple of questions; any changes in response to them
I can do myself.

> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 1dfe23963a9e..f306b642c3e1 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2177,6 +2177,9 @@ static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
>   	if (cpu_has_vmx_encls_vmexit())
>   		vmcs_write64(ENCLS_EXITING_BITMAP, INVALID_GPA);
>   
> +	if (notify_window >= 0)
> +		vmcs_write32(NOTIFY_WINDOW, notify_window);

Is a value of 0 valid?  Should it be changed to the recommended value of
128000 in hardware_setup()?

> +	case EXIT_REASON_NOTIFY:
> +		return nested_cpu_has2(vmcs12,
> +			SECONDARY_EXEC_NOTIFY_VM_EXITING);

This should be "return false" since you don't expose the secondary
control to L1 (meaning, it will never be set).

> +		 * L0 will synthensize a nested TRIPLE_FAULT to kill L2 when
> +		 * notify VM exit occurred in L2 and NOTIFY_VM_CONTEXT_INVALID is
> +		 * set in exit qualification. In this case, if notify VM exit
> +		 * occurred incident to delivery of a vectored event, the IDT
> +		 * vectoring info are recorded in VMCS. Drop the pending event
> +		 * in vmcs12, otherwise L1 VMM will exit to userspace with
> +		 * internal error due to delivery event.
>  		 */
> -		vmcs12_save_pending_event(vcpu, vmcs12);
> +		if (to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_NOTIFY)
> +			vmcs12_save_pending_event(vcpu, vmcs12);

I would prefer to call out the triple fault here:

                 /*
                  * Transfer the event that L0 or L1 may have wanted to inject into
                  * L2 to IDT_VECTORING_INFO_FIELD.
                  *
                  * Skip this if the exit is due to a NOTIFY_VM_CONTEXT_INVALID
                  * exit; in that case, L0 will synthesize a nested TRIPLE_FAULT
                  * vmexit to kill L2.  No IDT vectoring info is recorded for
                  * triple faults, and __vmx_handle_exit does not expect it.
                  */
                 if (!(to_vmx(vcpu)->exit_reason.basic == EXIT_REASON_NOTIFY)
                       && kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
                         vmcs12_save_pending_event(vcpu, vmcs12);

What do you think?

Paolo
Xiaoyao Li Feb. 25, 2022, 12:46 p.m. UTC | #2
On 2/25/2022 7:54 PM, Paolo Bonzini wrote:
> On 2/23/22 07:24, Chenyi Qiang wrote:
>> Nested handling
>> - Nested notify VM exits are not supported yet. Keep the same notify
>>    window control in vmcs02 as vmcs01, so that L1 can't escape the
>>    restriction of notify VM exits through launching L2 VM.
>> - When L2 VM is context invalid, synthesize a nested
>>    EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
>>    VM_CONTEXT_INVALID happens.
>>
>> Notify VM exit is defined in latest Intel Architecture Instruction Set
>> Extensions Programming Reference, chapter 9.2.
>>
>> TODO: Allow to change the window size (to enable the feature) at runtime,
>> which can make it more flexible to do management.
> 
> I only have a couple questions, any changes in response to the question
> I can do myself.
> 
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 1dfe23963a9e..f306b642c3e1 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -2177,6 +2177,9 @@ static void prepare_vmcs02_constant_state(struct 
>> vcpu_vmx *vmx)
>>       if (cpu_has_vmx_encls_vmexit())
>>           vmcs_write64(ENCLS_EXITING_BITMAP, INVALID_GPA);
>> +    if (notify_window >= 0)
>> +        vmcs_write32(NOTIFY_WINDOW, notify_window);
> 
> Is a value of 0 valid?  

Yes, 0 is valid. That's why there is an internal threshold to ensure that 
even 0 won't cause false positives.

> Should it be changed to the recommended value of
> 128000 in hardware_setup()?
> 
>> +    case EXIT_REASON_NOTIFY:
>> +        return nested_cpu_has2(vmcs12,
>> +            SECONDARY_EXEC_NOTIFY_VM_EXITING);
> 
> This should be "return false" since you don't expose the secondary
> control to L1 (meaning, it will never be set).

Fine with either.

>> +         * L0 will synthensize a nested TRIPLE_FAULT to kill L2 when
>> +         * notify VM exit occurred in L2 and 
>> NOTIFY_VM_CONTEXT_INVALID is
>> +         * set in exit qualification. In this case, if notify VM exit
>> +         * occurred incident to delivery of a vectored event, the IDT
>> +         * vectoring info are recorded in VMCS. Drop the pending event
>> +         * in vmcs12, otherwise L1 VMM will exit to userspace with
>> +         * internal error due to delivery event.
>>           */
>> -        vmcs12_save_pending_event(vcpu, vmcs12);
>> +        if (to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_NOTIFY)
>> +            vmcs12_save_pending_event(vcpu, vmcs12);
> 
> I would prefer to call out the triple fault here:
> 
>                  /*
>                   * Transfer the event that L0 or L1 may have wanted to 
> inject into
>                   * L2 to IDT_VECTORING_INFO_FIELD.
>                   *
>                   * Skip this if the exit is due to a 
> NOTIFY_VM_CONTEXT_INVALID
>                   * exit; in that case, L0 will synthesize a nested 
> TRIPLE_FAULT
>                   * vmexit to kill L2.  No IDT vectoring info is 
> recorded for
>                   * triple faults, and __vmx_handle_exit does not expect 
> it.
>                   */
>                  if (!(to_vmx(vcpu)->exit_reason.basic == 
> EXIT_REASON_NOTIFY)
>                        && kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
>                          vmcs12_save_pending_event(vcpu, vmcs12);

looks good to me.

> What do you think?
> 
> Paolo
>
Jim Mattson Feb. 25, 2022, 2:54 p.m. UTC | #3
On Tue, Feb 22, 2022 at 10:19 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>
> From: Tao Xu <tao3.xu@intel.com>
>
> There are cases that malicious virtual machines can cause CPU stuck (due
> to event windows don't open up), e.g., infinite loop in microcode when
> nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
> IRQ) can be delivered. It leads the CPU to be unavailable to host or
> other VMs.
>
> VMM can enable notify VM exit that a VM exit generated if no event
> window occurs in VM non-root mode for a specified amount of time (notify
> window).
>
> Feature enabling:
> - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
>   enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
>   the expected notify window.
> - Expose a module param to configure notify window by admin, which is in
>   unit of crystal clock.
>   - if notify_window < 0, feature disabled;
>   - if notify_window >= 0, feature enabled;
> - There's a possibility, however small, that a notify VM exit happens
>   with VM_CONTEXT_INVALID set in exit qualification. In this case, the
>   vcpu can no longer run. To avoid killing a well-behaved guest, set
>   notify window as -1 to disable this feature by default.
> - It's safe to even set notify window to zero since an internal
>   hardware threshold is added to vmcs.notifiy_window.

What causes a VM_CONTEXT_INVALID VM-exit? How small is this possibility?

> Nested handling
> - Nested notify VM exits are not supported yet. Keep the same notify
>   window control in vmcs02 as vmcs01, so that L1 can't escape the
>   restriction of notify VM exits through launching L2 VM.
> - When L2 VM is context invalid, synthesize a nested
>   EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
>   VM_CONTEXT_INVALID happens.

I don't like the idea of making things up without notifying userspace
that this is fictional. How is my customer running nested VMs supposed
to know that L2 didn't actually shutdown, but L0 killed it because the
notify window was exceeded? If this information isn't reported to
userspace, I have no way of getting the information to the customer.
Xiaoyao Li Feb. 25, 2022, 3:04 p.m. UTC | #4
On 2/25/2022 10:54 PM, Jim Mattson wrote:
> On Tue, Feb 22, 2022 at 10:19 PM Chenyi Qiang <chenyi.qiang@intel.com> wrote:
>>
>> From: Tao Xu <tao3.xu@intel.com>
>>
>> There are cases that malicious virtual machines can cause CPU stuck (due
>> to event windows don't open up), e.g., infinite loop in microcode when
>> nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
>> IRQ) can be delivered. It leads the CPU to be unavailable to host or
>> other VMs.
>>
>> VMM can enable notify VM exit that a VM exit generated if no event
>> window occurs in VM non-root mode for a specified amount of time (notify
>> window).
>>
>> Feature enabling:
>> - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
>>    enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
>>    the expected notify window.
>> - Expose a module param to configure notify window by admin, which is in
>>    unit of crystal clock.
>>    - if notify_window < 0, feature disabled;
>>    - if notify_window >= 0, feature enabled;
>> - There's a possibility, however small, that a notify VM exit happens
>>    with VM_CONTEXT_INVALID set in exit qualification. In this case, the
>>    vcpu can no longer run. To avoid killing a well-behaved guest, set
>>    notify window as -1 to disable this feature by default.
>> - It's safe to even set notify window to zero since an internal
>>    hardware threshold is added to vmcs.notifiy_window.
> 
> What causes a VM_CONTEXT_INVALID VM-exit? How small is this possibility?

For now, no case will set the VM_CONTEXT_INVALID bit.

In the future, it would have to be some fatal case in which the VMCS is corrupted.

>> Nested handling
>> - Nested notify VM exits are not supported yet. Keep the same notify
>>    window control in vmcs02 as vmcs01, so that L1 can't escape the
>>    restriction of notify VM exits through launching L2 VM.
>> - When L2 VM is context invalid, synthesize a nested
>>    EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
>>    VM_CONTEXT_INVALID happens.
> 
> I don't like the idea of making things up without notifying userspace
> that this is fictional. How is my customer running nested VMs supposed
> to know that L2 didn't actually shutdown, but L0 killed it because the
> notify window was exceeded? If this information isn't reported to
> userspace, I have no way of getting the information to the customer.

Then, maybe a dedicated software-defined VM exit for it instead of 
reusing triple fault?
Xiaoyao Li Feb. 25, 2022, 3:12 p.m. UTC | #5
On 2/25/2022 11:04 PM, Xiaoyao Li wrote:
> On 2/25/2022 10:54 PM, Jim Mattson wrote:
>> On Tue, Feb 22, 2022 at 10:19 PM Chenyi Qiang <chenyi.qiang@intel.com> 
>> wrote:
>>> Nested handling
>>> - Nested notify VM exits are not supported yet. Keep the same notify
>>>    window control in vmcs02 as vmcs01, so that L1 can't escape the
>>>    restriction of notify VM exits through launching L2 VM.
>>> - When L2 VM is context invalid, synthesize a nested
>>>    EXIT_REASON_TRIPLE_FAULT to L1 so that L1 won't be killed due to L2's
>>>    VM_CONTEXT_INVALID happens.
>>
>> I don't like the idea of making things up without notifying userspace
>> that this is fictional. How is my customer running nested VMs supposed
>> to know that L2 didn't actually shutdown, but L0 killed it because the
>> notify window was exceeded? If this information isn't reported to
>> userspace, I have no way of getting the information to the customer.
> 
> Then, maybe a dedicated software define VM exit for it instead of 
> reusing triple fault?
> 

On second thought, we can even just return the Notify VM exit to L1 to tell 
it that L2 caused a Notify VM exit, even though Notify VM exit is not 
exposed to L1.
Paolo Bonzini Feb. 25, 2022, 3:13 p.m. UTC | #6
On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>
>>>
>>> I don't like the idea of making things up without notifying userspace
>>> that this is fictional. How is my customer running nested VMs supposed
>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>> notify window was exceeded? If this information isn't reported to
>>> userspace, I have no way of getting the information to the customer.
>>
>> Then, maybe a dedicated software define VM exit for it instead of 
>> reusing triple fault?
>>
> 
> Second thought, we can even just return Notify VM exit to L1 to tell L2 
> causes Notify VM exit, even thought Notify VM exit is not exposed to L1.

That might cause NULL pointer dereferences or other nasty occurrences.

Paolo
Jim Mattson Feb. 25, 2022, 6:06 p.m. UTC | #7
On Fri, Feb 25, 2022 at 7:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>
> >>>
> >>> I don't like the idea of making things up without notifying userspace
> >>> that this is fictional. How is my customer running nested VMs supposed
> >>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>> notify window was exceeded? If this information isn't reported to
> >>> userspace, I have no way of getting the information to the customer.
> >>
> >> Then, maybe a dedicated software define VM exit for it instead of
> >> reusing triple fault?
> >>
> >
> > Second thought, we can even just return Notify VM exit to L1 to tell L2
> > causes Notify VM exit, even thought Notify VM exit is not exposed to L1.
>
> That might cause NULL pointer dereferences or other nasty occurrences.

Could we synthesize a machine check? I haven't looked in detail at the
MCE MSRs, but surely there must be room in there for Intel to reserve
some encodings for synthesized machine checks.
Sean Christopherson Feb. 25, 2022, 6:29 p.m. UTC | #8
On Fri, Feb 25, 2022, Jim Mattson wrote:
> On Fri, Feb 25, 2022 at 7:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 2/25/22 16:12, Xiaoyao Li wrote:
> > >>>>
> > >>>
> > >>> I don't like the idea of making things up without notifying userspace
> > >>> that this is fictional. How is my customer running nested VMs supposed
> > >>> to know that L2 didn't actually shutdown, but L0 killed it because the
> > >>> notify window was exceeded? If this information isn't reported to
> > >>> userspace, I have no way of getting the information to the customer.
> > >>
> > >> Then, maybe a dedicated software define VM exit for it instead of
> > >> reusing triple fault?
> > >>
> > >
> > > Second thought, we can even just return Notify VM exit to L1 to tell L2
> > > causes Notify VM exit, even thought Notify VM exit is not exposed to L1.
> >
> > That might cause NULL pointer dereferences or other nasty occurrences.
> 
> Could we synthesize a machine check? I haven't looked in detail at the
> MCE MSRs, but surely there must be room in there for Intel to reserve
> some encodings for synthesized machine checks.

I don't think we have any choice but to synthesize SHUTDOWN until we get more
details on the exact semantics of VM_CONTEXT_INVALID.  E.g. if GUEST_EFER or any
other critical guest field is corrupted, attempting to re-enter the guest, even
to (attempt to) inject a machine check, is risking undefined behavior in the guest.
Jim Mattson Feb. 25, 2022, 7:15 p.m. UTC | #9
On Fri, Feb 25, 2022 at 10:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Feb 25, 2022, Jim Mattson wrote:
> > On Fri, Feb 25, 2022 at 7:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 2/25/22 16:12, Xiaoyao Li wrote:
> > > >>>>
> > > >>>
> > > >>> I don't like the idea of making things up without notifying userspace
> > > >>> that this is fictional. How is my customer running nested VMs supposed
> > > >>> to know that L2 didn't actually shutdown, but L0 killed it because the
> > > >>> notify window was exceeded? If this information isn't reported to
> > > >>> userspace, I have no way of getting the information to the customer.
> > > >>
> > > >> Then, maybe a dedicated software define VM exit for it instead of
> > > >> reusing triple fault?
> > > >>
> > > >
> > > > Second thought, we can even just return Notify VM exit to L1 to tell L2
> > > > causes Notify VM exit, even thought Notify VM exit is not exposed to L1.
> > >
> > > That might cause NULL pointer dereferences or other nasty occurrences.
> >
> > Could we synthesize a machine check? I haven't looked in detail at the
> > MCE MSRs, but surely there must be room in there for Intel to reserve
> > some encodings for synthesized machine checks.
>
> I don't think we have any choice but to synthesize SHUTDOWN until we get more
> details on the exact semantics of VM_CONTEXT_INVALID.  E.g. if GUEST_EFER or any
> other critical guest field is corrupted, attempting to re-enter the guest, even
> to (attempt to) inject a machine check, is risking undefined behavior in the guest.

Synthesizing shutdown is fine, as long as userspace is notified.
Xiaoyao Li Feb. 26, 2022, 4:07 a.m. UTC | #10
On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>
>>>>
>>>> I don't like the idea of making things up without notifying userspace
>>>> that this is fictional. How is my customer running nested VMs supposed
>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>> notify window was exceeded? If this information isn't reported to
>>>> userspace, I have no way of getting the information to the customer.
>>>
>>> Then, maybe a dedicated software define VM exit for it instead of 
>>> reusing triple fault?
>>>
>>
>> Second thought, we can even just return Notify VM exit to L1 to tell 
>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed 
>> to L1.
> 
> That might cause NULL pointer dereferences or other nasty occurrences.

IMO, a well-written VMM (in L1) should handle it correctly.

L0 KVM reports no Notify VM Exit support to L1, so L1 runs without 
setting Notify VM exit. If an L2 causes a notify_vm_exit with 
invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no 
support for Notify VM Exit in the VMX MSR capabilities. The following L1 
handlers are possible:

a)	if (notify_vm_exit available & notify_vm_exit enabled) {
		handle in b)	
	} else {
		report unexpected vm exit reason to userspace;
	}

b) 	similar handler like we implement in KVM:
	if (!vm_context_invalid)
		re-enter guest;
	else
		report to userspace;

c)	no Notify VM Exit related code (e.g. old KVM), it's treated as 
unsupported exit reason

As long as it falls into one of the cases above, I think L1 can handle it 
correctly. Any nasty occurrence should be caused by an incorrect handler 
in the L1 VMM, in my opinion.

> Paolo
>
Jim Mattson Feb. 26, 2022, 4:25 a.m. UTC | #11
On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> > On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>>
> >>>>
> >>>> I don't like the idea of making things up without notifying userspace
> >>>> that this is fictional. How is my customer running nested VMs supposed
> >>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>>> notify window was exceeded? If this information isn't reported to
> >>>> userspace, I have no way of getting the information to the customer.
> >>>
> >>> Then, maybe a dedicated software define VM exit for it instead of
> >>> reusing triple fault?
> >>>
> >>
> >> Second thought, we can even just return Notify VM exit to L1 to tell
> >> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> >> to L1.
> >
> > That might cause NULL pointer dereferences or other nasty occurrences.
>
> IMO, a well written VMM (in L1) should handle it correctly.
>
> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> setting Notify VM exit. If a L2 causes notify_vm_exit with
> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> support of Notify VM Exit from VMX MSR capability. Following L1 handler
> is possible:
>
> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>                 handle in b)
>         } else {
>                 report unexpected vm exit reason to userspace;
>         }
>
> b)      similar handler like we implement in KVM:
>         if (!vm_context_invalid)
>                 re-enter guest;
>         else
>                 report to userspace;
>
> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> unsupported exit reason
>
> As long as it belongs to any case above, I think L1 can handle it
> correctly. Any nasty occurrence should be caused by incorrect handler in
> L1 VMM, in my opinion.

Please test some common hypervisors (e.g. ESXi and Hyper-V).
Jim Mattson Feb. 26, 2022, 4:53 a.m. UTC | #12
On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >
> > On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> > > On 2/25/22 16:12, Xiaoyao Li wrote:
> > >>>>>
> > >>>>
> > >>>> I don't like the idea of making things up without notifying userspace
> > >>>> that this is fictional. How is my customer running nested VMs supposed
> > >>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> > >>>> notify window was exceeded? If this information isn't reported to
> > >>>> userspace, I have no way of getting the information to the customer.
> > >>>
> > >>> Then, maybe a dedicated software define VM exit for it instead of
> > >>> reusing triple fault?
> > >>>
> > >>
> > >> Second thought, we can even just return Notify VM exit to L1 to tell
> > >> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> > >> to L1.
> > >
> > > That might cause NULL pointer dereferences or other nasty occurrences.
> >
> > IMO, a well written VMM (in L1) should handle it correctly.
> >
> > L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> > setting Notify VM exit. If a L2 causes notify_vm_exit with
> > invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> > support of Notify VM Exit from VMX MSR capability. Following L1 handler
> > is possible:
> >
> > a)      if (notify_vm_exit available & notify_vm_exit enabled) {
> >                 handle in b)
> >         } else {
> >                 report unexpected vm exit reason to userspace;
> >         }
> >
> > b)      similar handler like we implement in KVM:
> >         if (!vm_context_invalid)
> >                 re-enter guest;
> >         else
> >                 report to userspace;
> >
> > c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> > unsupported exit reason
> >
> > As long as it belongs to any case above, I think L1 can handle it
> > correctly. Any nasty occurrence should be caused by incorrect handler in
> > L1 VMM, in my opinion.
>
> Please test some common hypervisors (e.g. ESXi and Hyper-V).

I took a look at KVM in Linux v4.9 (one of our more popular guests),
and it will not handle this case well:

        if (exit_reason < kvm_vmx_max_exit_handlers
            && kvm_vmx_exit_handlers[exit_reason])
                return kvm_vmx_exit_handlers[exit_reason](vcpu);
        else {
                WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
                kvm_queue_exception(vcpu, UD_VECTOR);
                return 1;
        }

At least there's an L1 kernel log message for the first unexpected
NOTIFY VM-exit, but after that, there is silence. Just a completely
inexplicable #UD in L2, assuming that L2 is resumable at this point.
Xiaoyao Li Feb. 26, 2022, 6:24 a.m. UTC | #13
On 2/26/2022 12:53 PM, Jim Mattson wrote:
> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>>
>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>
>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>>>>
>>>>>>>
>>>>>>> I don't like the idea of making things up without notifying userspace
>>>>>>> that this is fictional. How is my customer running nested VMs supposed
>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>>>>> notify window was exceeded? If this information isn't reported to
>>>>>>> userspace, I have no way of getting the information to the customer.
>>>>>>
>>>>>> Then, maybe a dedicated software define VM exit for it instead of
>>>>>> reusing triple fault?
>>>>>>
>>>>>
>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
>>>>> to L1.
>>>>
>>>> That might cause NULL pointer dereferences or other nasty occurrences.
>>>
>>> IMO, a well written VMM (in L1) should handle it correctly.
>>>
>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
>>> is possible:
>>>
>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>>>                  handle in b)
>>>          } else {
>>>                  report unexpected vm exit reason to userspace;
>>>          }
>>>
>>> b)      similar handler like we implement in KVM:
>>>          if (!vm_context_invalid)
>>>                  re-enter guest;
>>>          else
>>>                  report to userspace;
>>>
>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
>>> unsupported exit reason
>>>
>>> As long as it belongs to any case above, I think L1 can handle it
>>> correctly. Any nasty occurrence should be caused by incorrect handler in
>>> L1 VMM, in my opinion.
>>
>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
> 
> I took a look at KVM in Linux v4.9 (one of our more popular guests),
> and it will not handle this case well:
> 
>          if (exit_reason < kvm_vmx_max_exit_handlers
>              && kvm_vmx_exit_handlers[exit_reason])
>                  return kvm_vmx_exit_handlers[exit_reason](vcpu);
>          else {
>                  WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
>                  kvm_queue_exception(vcpu, UD_VECTOR);
>                  return 1;
>          }
> 
> At least there's an L1 kernel log message for the first unexpected
> NOTIFY VM-exit, but after that, there is silence. Just a completely
> inexplicable #UD in L2, assuming that L2 is resumable at this point.

At least there is a message to tell L1 that a notify VM exit was triggered 
in L2. Yes, the inexplicable #UD won't be hit unless L2 triggers a Notify 
VM exit with an invalid context, which is malicious to L0 and L1.

If we use triple_fault (i.e., shutdown), then there is no info to tell L1 
that it was caused by a Notify VM exit with an invalid context. Triple 
fault would need to be extended and the L1 kernel enlightened, which 
doesn't help old guest kernels.

If we use a Machine Check, it's similarly inexplicable to L2 unless L2 is 
enlightened, and again it doesn't help old guest kernels.

Anyway, for a Notify VM exit with an invalid context from L2, I don't see 
a good solution that tells the L1 VMM it's a "Notify VM exit with invalid 
context from L2" and keeps all kinds of L1 VMMs happy, especially those 
with old kernel versions.
Jim Mattson Feb. 26, 2022, 2:24 p.m. UTC | #14
On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 2/26/2022 12:53 PM, Jim Mattson wrote:
> > On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
> >>
> >> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>
> >>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> >>>> On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>>>>>
> >>>>>>>
> >>>>>>> I don't like the idea of making things up without notifying userspace
> >>>>>>> that this is fictional. How is my customer running nested VMs supposed
> >>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>>>>>> notify window was exceeded? If this information isn't reported to
> >>>>>>> userspace, I have no way of getting the information to the customer.
> >>>>>>
> >>>>>> Then, maybe a dedicated software define VM exit for it instead of
> >>>>>> reusing triple fault?
> >>>>>>
> >>>>>
> >>>>> Second thought, we can even just return Notify VM exit to L1 to tell
> >>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> >>>>> to L1.
> >>>>
> >>>> That might cause NULL pointer dereferences or other nasty occurrences.
> >>>
> >>> IMO, a well written VMM (in L1) should handle it correctly.
> >>>
> >>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> >>> setting Notify VM exit. If a L2 causes notify_vm_exit with
> >>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> >>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
> >>> is possible:
> >>>
> >>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
> >>>                  handle in b)
> >>>          } else {
> >>>                  report unexpected vm exit reason to userspace;
> >>>          }
> >>>
> >>> b)      similar handler like we implement in KVM:
> >>>          if (!vm_context_invalid)
> >>>                  re-enter guest;
> >>>          else
> >>>                  report to userspace;
> >>>
> >>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> >>> unsupported exit reason
> >>>
> >>> As long as it belongs to any case above, I think L1 can handle it
> >>> correctly. Any nasty occurrence should be caused by incorrect handler in
> >>> L1 VMM, in my opinion.
> >>
> >> Please test some common hypervisors (e.g. ESXi and Hyper-V).
> >
> > I took a look at KVM in Linux v4.9 (one of our more popular guests),
> > and it will not handle this case well:
> >
> >          if (exit_reason < kvm_vmx_max_exit_handlers
> >              && kvm_vmx_exit_handlers[exit_reason])
> >                  return kvm_vmx_exit_handlers[exit_reason](vcpu);
> >          else {
> >                  WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
> >                  kvm_queue_exception(vcpu, UD_VECTOR);
> >                  return 1;
> >          }
> >
> > At least there's an L1 kernel log message for the first unexpected
> > NOTIFY VM-exit, but after that, there is silence. Just a completely
> > inexplicable #UD in L2, assuming that L2 is resumable at this point.
>
> At least there is a message to tell L1 a notify VM exit is triggered in
> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
> exit with invalid_context, which is malicious to L0 and L1.

There is only an L1 kernel log message *the first time*. That's not
good enough. And this is just one of the myriad of possible L1
hypervisors.

> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
> it's caused by Notify VM exit with invalid context. Triple fault needs
> to be extended and L1 kernel needs to be enlightened. It doesn't help
> old guest kernel.
>
> If we use Machine Check, it's somewhat same inexplicable to L2 unless
> it's enlightened. But it doesn't help old guest kernel.
>
> Anyway, for Notify VM exit with invalid context from L2, I don't see a
> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
> from L2" and keep all kinds of L1 VMM happy, especially for those with
> old kernel versions.

I agree that there is no way to make every conceivable L1 happy.
That's why the information needs to be surfaced to the L0 userspace. I
contend that any time L0 kvm violates the architectural specification
in its emulation of L1 or L2, the L0 userspace *must* be informed.
Xiaoyao Li Feb. 28, 2022, 7:10 a.m. UTC | #15
On 2/26/2022 10:24 PM, Jim Mattson wrote:
> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>
>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>>>>
>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>
>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't like the idea of making things up without notifying userspace
>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>>>>>>> notify window was exceeded? If this information isn't reported to
>>>>>>>>> userspace, I have no way of getting the information to the customer.
>>>>>>>>
>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
>>>>>>>> reusing triple fault?
>>>>>>>>
>>>>>>>
>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
>>>>>>> to L1.
>>>>>>
>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
>>>>>
>>>>> IMO, a well written VMM (in L1) should handle it correctly.
>>>>>
>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
>>>>> is possible:
>>>>>
>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>>>>>                   handle in b)
>>>>>           } else {
>>>>>                   report unexpected vm exit reason to userspace;
>>>>>           }
>>>>>
>>>>> b)      similar handler like we implement in KVM:
>>>>>           if (!vm_context_invalid)
>>>>>                   re-enter guest;
>>>>>           else
>>>>>                   report to userspace;
>>>>>
>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
>>>>> unsupported exit reason
>>>>>
>>>>> As long as it belongs to any case above, I think L1 can handle it
>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
>>>>> L1 VMM, in my opinion.
>>>>
>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
>>>
>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
>>> and it will not handle this case well:
>>>
>>>           if (exit_reason < kvm_vmx_max_exit_handlers
>>>               && kvm_vmx_exit_handlers[exit_reason])
>>>                   return kvm_vmx_exit_handlers[exit_reason](vcpu);
>>>           else {
>>>                   WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
>>>                   kvm_queue_exception(vcpu, UD_VECTOR);
>>>                   return 1;
>>>           }
>>>
>>> At least there's an L1 kernel log message for the first unexpected
>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
>>
>> At least there is a message to tell L1 a notify VM exit is triggered in
>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
>> exit with invalid_context, which is malicious to L0 and L1.
> 
> There is only an L1 kernel log message *the first time*. That's not
> good enough. And this is just one of the myriad of possible L1
> hypervisors.
> 
>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
>> it's caused by Notify VM exit with invalid context. Triple fault needs
>> to be extended and L1 kernel needs to be enlightened. It doesn't help
>> old guest kernel.
>>
>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
>> it's enlightened. But it doesn't help old guest kernel.
>>
>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
>> from L2" and keep all kinds of L1 VMM happy, especially for those with
>> old kernel versions.
> 
> I agree that there is no way to make every conceivable L1 happy.
> That's why the information needs to be surfaced to the L0 userspace. I
> contend that any time L0 kvm violates the architectural specification
> in its emulation of L1 or L2, the L0 userspace *must* be informed.

We can change the design to exit to userspace unconditionally on notify 
VM exit, with the exit qualification passed along; userspace can then take 
the same action this patch takes in KVM:

  - re-enter the guest when context_invalid is false;
  - stop running the guest if context_invalid is true; (userspace can 
definitely re-enter the guest in this case, but it takes the fall for 
doing so)

Then, for the nested case, L0 needs to enable it transparently for L2 if 
this feature is enabled for the L1 guest (for the reason we all agreed on: 
L1 cannot be allowed to escape just by creating an L2). Then what should 
KVM do on a notify VM exit from L2?

  - Exit to L0 userspace on L2's notify VM exit. L0 userspace takes the 
same action:
	- re-enter if context_invalid is false;
	- kill L1 if context_invalid is true; (I don't know if there is any 
interface for L0 userspace to kill L2). This opens a potential door for a 
malicious user to kill L1 by creating an L2 that triggers a fatal notify 
vm exit. If you guys accept that, we can implement it this way.
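
For illustration, a userspace run loop under that design could look 
roughly like this. KVM_EXIT_NOTIFY and KVM_NOTIFY_CONTEXT_INVALID are 
placeholders for whatever the final uAPI would define, and vcpu_fd / run 
are assumed to come from the usual KVM_CREATE_VCPU and kvm_run mmap setup:

        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        err(1, "KVM_RUN");

                switch (run->exit_reason) {
                case KVM_EXIT_NOTIFY:   /* hypothetical exit reason */
                        if (!(run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID))
                                continue;       /* benign: just re-enter */
                        /* vCPU state may be corrupted: stop running it. */
                        errx(1, "notify VM exit with invalid context");
                /* ... handle the other exit reasons as usual ... */
                }
        }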


In conclusion, we have the following options:

1. Take this patch as is. The drawback is that the L1 VMM receives a 
triple_fault from L2 when L2 triggers a notify VM exit with an invalid 
context. Neither the L1 VMM, L1 userspace, nor the L2 kernel knows it was 
caused by a notify VM exit. There is only a kernel log in L0, which is not 
accessible to the L1 user or the L2 guest.

2. a) Inject the notify VM exit back into L1 if L2 triggers a notify VM 
exit with an invalid context. The drawback is that an old L1 hypervisor is 
not enlightened about it and may misbehave.

   b) Inject a synthesized SHUTDOWN exit to L1, with additional info to 
tell it was caused by a fatal notify VM exit from L2. It has the same 
drawback that an old hypervisor has no idea about it and may misbehave.

3. Exit to L0 userspace unconditionally, no matter whether it was caused 
by L1 or L2. That may open the door for an L1 user to kill L1.

Do you have any better solution other than the above? If not, we need to 
pick one of them, even though it cannot make everyone happy.

thanks,
-Xiaoyao
Jim Mattson Feb. 28, 2022, 2:30 p.m. UTC | #16
On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 2/26/2022 10:24 PM, Jim Mattson wrote:
> > On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>
> >> On 2/26/2022 12:53 PM, Jim Mattson wrote:
> >>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
> >>>>
> >>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>>
> >>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> >>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I don't like the idea of making things up without notifying userspace
> >>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
> >>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>>>>>>>> notify window was exceeded? If this information isn't reported to
> >>>>>>>>> userspace, I have no way of getting the information to the customer.
> >>>>>>>>
> >>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
> >>>>>>>> reusing triple fault?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
> >>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> >>>>>>> to L1.
> >>>>>>
> >>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
> >>>>>
> >>>>> IMO, a well written VMM (in L1) should handle it correctly.
> >>>>>
> >>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> >>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
> >>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> >>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
> >>>>> is possible:
> >>>>>
> >>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
> >>>>>                   handle in b)
> >>>>>           } else {
> >>>>>                   report unexpected vm exit reason to userspace;
> >>>>>           }
> >>>>>
> >>>>> b)      similar handler like we implement in KVM:
> >>>>>           if (!vm_context_invalid)
> >>>>>                   re-enter guest;
> >>>>>           else
> >>>>>                   report to userspace;
> >>>>>
> >>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> >>>>> unsupported exit reason
> >>>>>
> >>>>> As long as it belongs to any case above, I think L1 can handle it
> >>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
> >>>>> L1 VMM, in my opinion.
> >>>>
> >>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
> >>>
> >>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
> >>> and it will not handle this case well:
> >>>
> >>>           if (exit_reason < kvm_vmx_max_exit_handlers
> >>>               && kvm_vmx_exit_handlers[exit_reason])
> >>>                   return kvm_vmx_exit_handlers[exit_reason](vcpu);
> >>>           else {
> >>>                   WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
> >>>                   kvm_queue_exception(vcpu, UD_VECTOR);
> >>>                   return 1;
> >>>           }
> >>>
> >>> At least there's an L1 kernel log message for the first unexpected
> >>> NOTIFY VM-exit, but after that, there is silence. Just a completely
> >>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
> >>
> >> At least there is a message to tell L1 a notify VM exit is triggered in
> >> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
> >> exit with invalid_context, which is malicious to L0 and L1.
> >
> > There is only an L1 kernel log message *the first time*. That's not
> > good enough. And this is just one of the myriad of possible L1
> > hypervisors.
> >
> >> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
> >> it's caused by Notify VM exit with invalid context. Triple fault needs
> >> to be extended and L1 kernel needs to be enlightened. It doesn't help
> >> old guest kernel.
> >>
> >> If we use Machine Check, it's somewhat same inexplicable to L2 unless
> >> it's enlightened. But it doesn't help old guest kernel.
> >>
> >> Anyway, for Notify VM exit with invalid context from L2, I don't see a
> >> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
> >> from L2" and keep all kinds of L1 VMM happy, especially for those with
> >> old kernel versions.
> >
> > I agree that there is no way to make every conceivable L1 happy.
> > That's why the information needs to be surfaced to the L0 userspace. I
> > contend that any time L0 kvm violates the architectural specification
> > in its emulation of L1 or L2, the L0 userspace *must* be informed.
>
> We can make the design to exit to userspace on notify vm exit
> unconditionally with exit_qualification passed, then userspace can take
> the same action like what this patch does in KVM that
>
>   - re-enter guest when context_invalid is false;
>   - stop running the guest if context_invalid is true; (userspace can
> definitely re-enter the guest in this case, but it needs to take the
> fall on this)
>
> Then, for nested case, L0 needs to enable it transparently for L2 if
> this feature is enabled for L1 guest (the reason as we all agreed that
> cannot allow L1 to escape just by creating a L2). Then what should KVM
> do when notify vm exit from L2?
>
>   - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
> same action:
>         - re-enter if context-invalid is false;
>         - kill L1 if context-invalid is true; (I don't know if there is any
> interface for L0 userspace to kill L2). Then it opens the potential door
> for malicious user to kill L1 by creating a L2 to trigger fatal notify
> vm exit. If you guys accept it, we can implement in this way.
>
>
> in conclusion, we have below solution:
>
> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
> from L2 when L2 triggers notify vm exit with invalid context. Neither of
> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
> exit. There is only kernel log in L0, which seems not accessible for L1
> user or L2 guest.

You are correct on that last point, and I feel that I cannot stress it
enough. In a typical environment, the L0 kernel log is only available
to the administrator of the L0 host.

> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
> with invalid context. The drawback is, old L1 hypervisor is not
> enlightened of it and maybe misbehave on it.
>
>     b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
> tell it's caused by fatal notify vm exit from L2. It has the same
> drawback that old hypervisor has no idea of it and maybe misbehave on it.
>
> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
> L2. Then it may open the door for L1 user to kill L1.
>
> Do you have any better solution other than above? If no, we need to pick
> one from above though it cannot make everyone happy.

Yes, I believe I have a better solution. We obviously need an API for
userspace to synthesize a SHUTDOWN event for a vCPU. In addition, to
avoid breaking legacy userspace, the NOTIFY VM-exit should be opt-in.
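
Something along these lines, purely as a sketch: the SHUTDOWN-related flag 
and field below are made up, only the KVM_GET_VCPU_EVENTS / 
KVM_SET_VCPU_EVENTS ioctls exist today:

        struct kvm_vcpu_events events = {};

        if (ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events) < 0)
                err(1, "KVM_GET_VCPU_EVENTS");

        /*
         * Hypothetical extension: ask KVM to queue a SHUTDOWN (triple
         * fault) on this vCPU before it next runs.
         */
        events.flags |= KVM_VCPUEVENT_VALID_SHUTDOWN;
        events.shutdown.pending = 1;

        if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events) < 0)
                err(1, "KVM_SET_VCPU_EVENTS");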
Xiaoyao Li March 1, 2022, 1:40 a.m. UTC | #17
On 2/28/2022 10:30 PM, Jim Mattson wrote:
> On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>
>> On 2/26/2022 10:24 PM, Jim Mattson wrote:
>>> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>
>>>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
>>>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>>>>>>
>>>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>>
>>>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
>>>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I don't like the idea of making things up without notifying userspace
>>>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
>>>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>>>>>>>>> notify window was exceeded? If this information isn't reported to
>>>>>>>>>>> userspace, I have no way of getting the information to the customer.
>>>>>>>>>>
>>>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
>>>>>>>>>> reusing triple fault?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
>>>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
>>>>>>>>> to L1.
>>>>>>>>
>>>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
>>>>>>>
>>>>>>> IMO, a well written VMM (in L1) should handle it correctly.
>>>>>>>
>>>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
>>>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
>>>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
>>>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
>>>>>>> is possible:
>>>>>>>
>>>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>>>>>>>                    handle in b)
>>>>>>>            } else {
>>>>>>>                    report unexpected vm exit reason to userspace;
>>>>>>>            }
>>>>>>>
>>>>>>> b)      similar handler like we implement in KVM:
>>>>>>>            if (!vm_context_invalid)
>>>>>>>                    re-enter guest;
>>>>>>>            else
>>>>>>>                    report to userspace;
>>>>>>>
>>>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
>>>>>>> unsupported exit reason
>>>>>>>
>>>>>>> As long as it belongs to any case above, I think L1 can handle it
>>>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
>>>>>>> L1 VMM, in my opinion.
>>>>>>
>>>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
>>>>>
>>>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
>>>>> and it will not handle this case well:
>>>>>
>>>>>            if (exit_reason < kvm_vmx_max_exit_handlers
>>>>>                && kvm_vmx_exit_handlers[exit_reason])
>>>>>                    return kvm_vmx_exit_handlers[exit_reason](vcpu);
>>>>>            else {
>>>>>                    WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
>>>>>                    kvm_queue_exception(vcpu, UD_VECTOR);
>>>>>                    return 1;
>>>>>            }
>>>>>
>>>>> At least there's an L1 kernel log message for the first unexpected
>>>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
>>>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
>>>>
>>>> At least there is a message to tell L1 a notify VM exit is triggered in
>>>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
>>>> exit with invalid_context, which is malicious to L0 and L1.
>>>
>>> There is only an L1 kernel log message *the first time*. That's not
>>> good enough. And this is just one of the myriad of possible L1
>>> hypervisors.
>>>
>>>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
>>>> it's caused by Notify VM exit with invalid context. Triple fault needs
>>>> to be extended and L1 kernel needs to be enlightened. It doesn't help
>>>> old guest kernel.
>>>>
>>>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
>>>> it's enlightened. But it doesn't help old guest kernel.
>>>>
>>>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
>>>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
>>>> from L2" and keep all kinds of L1 VMM happy, especially for those with
>>>> old kernel versions.
>>>
>>> I agree that there is no way to make every conceivable L1 happy.
>>> That's why the information needs to be surfaced to the L0 userspace. I
>>> contend that any time L0 kvm violates the architectural specification
>>> in its emulation of L1 or L2, the L0 userspace *must* be informed.
>>
>> We can make the design to exit to userspace on notify vm exit
>> unconditionally with exit_qualification passed, then userspace can take
>> the same action like what this patch does in KVM that
>>
>>    - re-enter guest when context_invalid is false;
>>    - stop running the guest if context_invalid is true; (userspace can
>> definitely re-enter the guest in this case, but it needs to take the
>> fall on this)
>>
>> Then, for nested case, L0 needs to enable it transparently for L2 if
>> this feature is enabled for L1 guest (the reason as we all agreed that
>> cannot allow L1 to escape just by creating a L2). Then what should KVM
>> do when notify vm exit from L2?
>>
>>    - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
>> same action:
>>          - re-enter if context-invalid is false;
>>          - kill L1 if context-invalid is true; (I don't know if there is any
>> interface for L0 userspace to kill L2). Then it opens the potential door
>> for malicious user to kill L1 by creating a L2 to trigger fatal notify
>> vm exit. If you guys accept it, we can implement in this way.
>>
>>
>> in conclusion, we have below solution:
>>
>> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
>> from L2 when L2 triggers notify vm exit with invalid context. Neither of
>> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
>> exit. There is only kernel log in L0, which seems not accessible for L1
>> user or L2 guest.
> 
> You are correct on that last point, and I feel that I cannot stress it
> enough. In a typical environment, the L0 kernel log is only available
> to the administrator of the L0 host.
> 
>> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
>> with invalid context. The drawback is, old L1 hypervisor is not
>> enlightened of it and maybe misbehave on it.
>>
>>      b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
>> tell it's caused by fatal notify vm exit from L2. It has the same
>> drawback that old hypervisor has no idea of it and maybe misbehave on it.
>>
>> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
>> L2. Then it may open the door for L1 user to kill L1.
>>
>> Do you have any better solution other than above? If no, we need to pick
>> one from above though it cannot make everyone happy.
> 
> Yes, I believe I have a better solution. We obviously need an API for
> userspace to synthesize a SHUTDOWN event for a vCPU. 

Can you elaborate on it? Do you mean that userspace injects a synthesized 
SHUTDOWN into the guest? If so, I have no idea how that would work.

> In addition, to
> avoid breaking legacy userspace, the NOTIFY VM-exit should be opt-in.

Yes, it's already designed as opt-in; the feature is off by default.
Jim Mattson March 1, 2022, 4:32 a.m. UTC | #18
On Mon, Feb 28, 2022 at 5:41 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 2/28/2022 10:30 PM, Jim Mattson wrote:
> > On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>
> >> On 2/26/2022 10:24 PM, Jim Mattson wrote:
> >>> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>
> >>>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
> >>>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
> >>>>>>
> >>>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>>>>
> >>>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> >>>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I don't like the idea of making things up without notifying userspace
> >>>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
> >>>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>>>>>>>>>> notify window was exceeded? If this information isn't reported to
> >>>>>>>>>>> userspace, I have no way of getting the information to the customer.
> >>>>>>>>>>
> >>>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
> >>>>>>>>>> reusing triple fault?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
> >>>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> >>>>>>>>> to L1.
> >>>>>>>>
> >>>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
> >>>>>>>
> >>>>>>> IMO, a well written VMM (in L1) should handle it correctly.
> >>>>>>>
> >>>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> >>>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
> >>>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> >>>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
> >>>>>>> is possible:
> >>>>>>>
> >>>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
> >>>>>>>                    handle in b)
> >>>>>>>            } else {
> >>>>>>>                    report unexpected vm exit reason to userspace;
> >>>>>>>            }
> >>>>>>>
> >>>>>>> b)      similar handler like we implement in KVM:
> >>>>>>>            if (!vm_context_invalid)
> >>>>>>>                    re-enter guest;
> >>>>>>>            else
> >>>>>>>                    report to userspace;
> >>>>>>>
> >>>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> >>>>>>> unsupported exit reason
> >>>>>>>
> >>>>>>> As long as it belongs to any case above, I think L1 can handle it
> >>>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
> >>>>>>> L1 VMM, in my opinion.
> >>>>>>
> >>>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
> >>>>>
> >>>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
> >>>>> and it will not handle this case well:
> >>>>>
> >>>>>            if (exit_reason < kvm_vmx_max_exit_handlers
> >>>>>                && kvm_vmx_exit_handlers[exit_reason])
> >>>>>                    return kvm_vmx_exit_handlers[exit_reason](vcpu);
> >>>>>            else {
> >>>>>                    WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
> >>>>>                    kvm_queue_exception(vcpu, UD_VECTOR);
> >>>>>                    return 1;
> >>>>>            }
> >>>>>
> >>>>> At least there's an L1 kernel log message for the first unexpected
> >>>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
> >>>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
> >>>>
> >>>> At least there is a message to tell L1 a notify VM exit is triggered in
> >>>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
> >>>> exit with invalid_context, which is malicious to L0 and L1.
> >>>
> >>> There is only an L1 kernel log message *the first time*. That's not
> >>> good enough. And this is just one of the myriad of possible L1
> >>> hypervisors.
> >>>
> >>>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
> >>>> it's caused by Notify VM exit with invalid context. Triple fault needs
> >>>> to be extended and L1 kernel needs to be enlightened. It doesn't help
> >>>> old guest kernel.
> >>>>
> >>>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
> >>>> it's enlightened. But it doesn't help old guest kernel.
> >>>>
> >>>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
> >>>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
> >>>> from L2" and keep all kinds of L1 VMM happy, especially for those with
> >>>> old kernel versions.
> >>>
> >>> I agree that there is no way to make every conceivable L1 happy.
> >>> That's why the information needs to be surfaced to the L0 userspace. I
> >>> contend that any time L0 kvm violates the architectural specification
> >>> in its emulation of L1 or L2, the L0 userspace *must* be informed.
> >>
> >> We can make the design to exit to userspace on notify vm exit
> >> unconditionally with exit_qualification passed, then userspace can take
> >> the same action like what this patch does in KVM that
> >>
> >>    - re-enter guest when context_invalid is false;
> >>    - stop running the guest if context_invalid is true; (userspace can
> >> definitely re-enter the guest in this case, but it needs to take the
> >> fall on this)
> >>
> >> Then, for nested case, L0 needs to enable it transparently for L2 if
> >> this feature is enabled for L1 guest (the reason as we all agreed that
> >> cannot allow L1 to escape just by creating a L2). Then what should KVM
> >> do when notify vm exit from L2?
> >>
> >>    - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
> >> same action:
> >>          - re-enter if context-invalid is false;
> >>          - kill L1 if context-invalid is true; (I don't know if there is any
> >> interface for L0 userspace to kill L2). Then it opens the potential door
> >> for malicious user to kill L1 by creating a L2 to trigger fatal notify
> >> vm exit. If you guys accept it, we can implement in this way.
> >>
> >>
> >> in conclusion, we have below solution:
> >>
> >> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
> >> from L2 when L2 triggers notify vm exit with invalid context. Neither of
> >> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
> >> exit. There is only kernel log in L0, which seems not accessible for L1
> >> user or L2 guest.
> >
> > You are correct on that last point, and I feel that I cannot stress it
> > enough. In a typical environment, the L0 kernel log is only available
> > to the administrator of the L0 host.
> >
> >> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
> >> with invalid context. The drawback is, old L1 hypervisor is not
> >> enlightened of it and maybe misbehave on it.
> >>
> >>      b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
> >> tell it's caused by fatal notify vm exit from L2. It has the same
> >> drawback that old hypervisor has no idea of it and maybe misbehave on it.
> >>
> >> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
> >> L2. Then it may open the door for L1 user to kill L1.
> >>
> >> Do you have any better solution other than above? If no, we need to pick
> >> one from above though it cannot make everyone happy.
> >
> > Yes, I believe I have a better solution. We obviously need an API for
> > userspace to synthesize a SHUTDOWN event for a vCPU.
>
> Can you elaborate on it? Do you mean userspace to inject a synthesized
> SHUTDOWN to guest? If so, I have no idea how it will work.

It can probably be implemented as an extension of KVM_SET_VCPU_EVENTS
that invokes kvm_make_request(KVM_REQ_TRIPLE_FAULT).
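
A minimal sketch of what that extension could do, assuming a hypothetical
KVM_VCPUEVENT_VALID_TRIPLE_FAULT flag and triple_fault field in struct
kvm_vcpu_events (neither is defined today), is:

        /*
         * Sketch only: the flag and field names are placeholders.  Only
         * kvm_make_request()/kvm_clear_request() and KVM_REQ_TRIPLE_FAULT
         * are existing KVM infrastructure.
         */
        if (events->flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT) {
                if (events->triple_fault.pending)
                        kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                else
                        kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
        }

On the next KVM_RUN, the pending request would then be handled like any
other triple fault: KVM_EXIT_SHUTDOWN for L1, or a synthesized exit to L1
when the vcpu is in guest mode.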
Xiaoyao Li March 1, 2022, 5:30 a.m. UTC | #19
On 3/1/2022 12:32 PM, Jim Mattson wrote:
> On Mon, Feb 28, 2022 at 5:41 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>
>> On 2/28/2022 10:30 PM, Jim Mattson wrote:
>>> On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>
>>>> On 2/26/2022 10:24 PM, Jim Mattson wrote:
>>>>> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>
>>>>>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
>>>>>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>>>>>>>>
>>>>>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>>>>
>>>>>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
>>>>>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't like the idea of making things up without notifying userspace
>>>>>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
>>>>>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>>>>>>>>>>> notify window was exceeded? If this information isn't reported to
>>>>>>>>>>>>> userspace, I have no way of getting the information to the customer.
>>>>>>>>>>>>
>>>>>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
>>>>>>>>>>>> reusing triple fault?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
>>>>>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
>>>>>>>>>>> to L1.
>>>>>>>>>>
>>>>>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
>>>>>>>>>
>>>>>>>>> IMO, a well written VMM (in L1) should handle it correctly.
>>>>>>>>>
>>>>>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
>>>>>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
>>>>>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
>>>>>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
>>>>>>>>> is possible:
>>>>>>>>>
>>>>>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>>>>>>>>>                     handle in b)
>>>>>>>>>             } else {
>>>>>>>>>                     report unexpected vm exit reason to userspace;
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>> b)      similar handler like we implement in KVM:
>>>>>>>>>             if (!vm_context_invalid)
>>>>>>>>>                     re-enter guest;
>>>>>>>>>             else
>>>>>>>>>                     report to userspace;
>>>>>>>>>
>>>>>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
>>>>>>>>> unsupported exit reason
>>>>>>>>>
>>>>>>>>> As long as it belongs to any case above, I think L1 can handle it
>>>>>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
>>>>>>>>> L1 VMM, in my opinion.
>>>>>>>>
>>>>>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
>>>>>>>
>>>>>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
>>>>>>> and it will not handle this case well:
>>>>>>>
>>>>>>>             if (exit_reason < kvm_vmx_max_exit_handlers
>>>>>>>                 && kvm_vmx_exit_handlers[exit_reason])
>>>>>>>                     return kvm_vmx_exit_handlers[exit_reason](vcpu);
>>>>>>>             else {
>>>>>>>                     WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
>>>>>>>                     kvm_queue_exception(vcpu, UD_VECTOR);
>>>>>>>                     return 1;
>>>>>>>             }
>>>>>>>
>>>>>>> At least there's an L1 kernel log message for the first unexpected
>>>>>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
>>>>>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
>>>>>>
>>>>>> At least there is a message to tell L1 a notify VM exit is triggered in
>>>>>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
>>>>>> exit with invalid_context, which is malicious to L0 and L1.
>>>>>
>>>>> There is only an L1 kernel log message *the first time*. That's not
>>>>> good enough. And this is just one of the myriad of possible L1
>>>>> hypervisors.
>>>>>
>>>>>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
>>>>>> it's caused by Notify VM exit with invalid context. Triple fault needs
>>>>>> to be extended and L1 kernel needs to be enlightened. It doesn't help
>>>>>> old guest kernel.
>>>>>>
>>>>>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
>>>>>> it's enlightened. But it doesn't help old guest kernel.
>>>>>>
>>>>>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
>>>>>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
>>>>>> from L2" and keep all kinds of L1 VMM happy, especially for those with
>>>>>> old kernel versions.
>>>>>
>>>>> I agree that there is no way to make every conceivable L1 happy.
>>>>> That's why the information needs to be surfaced to the L0 userspace. I
>>>>> contend that any time L0 kvm violates the architectural specification
>>>>> in its emulation of L1 or L2, the L0 userspace *must* be informed.
>>>>
>>>> We can make the design to exit to userspace on notify vm exit
>>>> unconditionally with exit_qualification passed, then userspace can take
>>>> the same action like what this patch does in KVM that
>>>>
>>>>     - re-enter guest when context_invalid is false;
>>>>     - stop running the guest if context_invalid is true; (userspace can
>>>> definitely re-enter the guest in this case, but it needs to take the
>>>> fall on this)
>>>>
>>>> Then, for nested case, L0 needs to enable it transparently for L2 if
>>>> this feature is enabled for L1 guest (the reason as we all agreed that
>>>> cannot allow L1 to escape just by creating a L2). Then what should KVM
>>>> do when notify vm exit from L2?
>>>>
>>>>     - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
>>>> same action:
>>>>           - re-enter if context-invalid is false;
>>>>           - kill L1 if context-invalid is true; (I don't know if there is any
>>>> interface for L0 userspace to kill L2). Then it opens the potential door
>>>> for malicious user to kill L1 by creating a L2 to trigger fatal notify
>>>> vm exit. If you guys accept it, we can implement in this way.
>>>>
>>>>
>>>> in conclusion, we have below solution:
>>>>
>>>> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
>>>> from L2 when L2 triggers notify vm exit with invalid context. Neither of
>>>> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
>>>> exit. There is only kernel log in L0, which seems not accessible for L1
>>>> user or L2 guest.
>>>
>>> You are correct on that last point, and I feel that I cannot stress it
>>> enough. In a typical environment, the L0 kernel log is only available
>>> to the administrator of the L0 host.
>>>
>>>> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
>>>> with invalid context. The drawback is, old L1 hypervisor is not
>>>> enlightened of it and maybe misbehave on it.
>>>>
>>>>       b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
>>>> tell it's caused by fatal notify vm exit from L2. It has the same
>>>> drawback that old hypervisor has no idea of it and maybe misbehave on it.
>>>>
>>>> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
>>>> L2. Then it may open the door for L1 user to kill L1.
>>>>
>>>> Do you have any better solution other than above? If no, we need to pick
>>>> one from above though it cannot make everyone happy.
>>>
>>> Yes, I believe I have a better solution. We obviously need an API for
>>> userspace to synthesize a SHUTDOWN event for a vCPU.
>>
>> Can you elaborate on it? Do you mean userspace to inject a synthesized
>> SHUTDOWN to guest? If so, I have no idea how it will work.
> 
> It can probably be implemented as an extension of KVM_SET_VCPU_EVENTS
> that invokes kvm_make_request(KVM_REQ_TRIPLE_FAULT).

Then, you mean:

1. notify VM exit from the guest;
2. exit to userspace on the notify VM exit;
3. a. if context_invalid, userspace injects a SHUTDOWN into the vcpu to
request KVM_REQ_TRIPLE_FAULT; go to step 4;
   b. if !context_invalid, re-run the vcpu; steps 4 and 5 don't apply;
4. exit to userspace again with KVM_EXIT_SHUTDOWN due to the triple fault;
5. userspace stops running the vcpu/VM.

Then why not handle it as KVM_EXIT_SHUTDOWN directly in 3.a? I don't
see the point of having userspace inject a TRIPLE_FAULT into KVM.
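
For concreteness, the L0 userspace side of steps 2-5 might look roughly
like the fragment below; KVM_EXIT_NOTIFY, the notify.flags field and
KVM_NOTIFY_CONTEXT_INVALID are placeholder names for whatever ABI would
eventually be exposed, and request_triple_fault()/stop_vm() stand in for
VMM-specific helpers:

        /*
         * Hypothetical L0 userspace exit handling; the names noted above
         * are assumptions, only KVM_EXIT_SHUTDOWN is an existing exit
         * reason.
         */
        static void handle_exit(struct kvm_run *run, int vcpu_fd)
        {
                switch (run->exit_reason) {
                case KVM_EXIT_NOTIFY:
                        if (run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID)
                                request_triple_fault(vcpu_fd);  /* step 3.a */
                        /* else step 3.b: simply re-enter the vcpu */
                        break;
                case KVM_EXIT_SHUTDOWN:
                        stop_vm();                              /* step 5 */
                        break;
                }
        }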
Jim Mattson March 1, 2022, 9:57 p.m. UTC | #20
On Mon, Feb 28, 2022 at 9:30 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>
> On 3/1/2022 12:32 PM, Jim Mattson wrote:
> > On Mon, Feb 28, 2022 at 5:41 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>
> >> On 2/28/2022 10:30 PM, Jim Mattson wrote:
> >>> On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>
> >>>> On 2/26/2022 10:24 PM, Jim Mattson wrote:
> >>>>> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>>>
> >>>>>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
> >>>>>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
> >>>>>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't like the idea of making things up without notifying userspace
> >>>>>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
> >>>>>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
> >>>>>>>>>>>>> notify window was exceeded? If this information isn't reported to
> >>>>>>>>>>>>> userspace, I have no way of getting the information to the customer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
> >>>>>>>>>>>> reusing triple fault?
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
> >>>>>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
> >>>>>>>>>>> to L1.
> >>>>>>>>>>
> >>>>>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
> >>>>>>>>>
> >>>>>>>>> IMO, a well written VMM (in L1) should handle it correctly.
> >>>>>>>>>
> >>>>>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
> >>>>>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
> >>>>>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
> >>>>>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
> >>>>>>>>> is possible:
> >>>>>>>>>
> >>>>>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
> >>>>>>>>>                     handle in b)
> >>>>>>>>>             } else {
> >>>>>>>>>                     report unexpected vm exit reason to userspace;
> >>>>>>>>>             }
> >>>>>>>>>
> >>>>>>>>> b)      similar handler like we implement in KVM:
> >>>>>>>>>             if (!vm_context_invalid)
> >>>>>>>>>                     re-enter guest;
> >>>>>>>>>             else
> >>>>>>>>>                     report to userspace;
> >>>>>>>>>
> >>>>>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
> >>>>>>>>> unsupported exit reason
> >>>>>>>>>
> >>>>>>>>> As long as it belongs to any case above, I think L1 can handle it
> >>>>>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
> >>>>>>>>> L1 VMM, in my opinion.
> >>>>>>>>
> >>>>>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
> >>>>>>>
> >>>>>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
> >>>>>>> and it will not handle this case well:
> >>>>>>>
> >>>>>>>             if (exit_reason < kvm_vmx_max_exit_handlers
> >>>>>>>                 && kvm_vmx_exit_handlers[exit_reason])
> >>>>>>>                     return kvm_vmx_exit_handlers[exit_reason](vcpu);
> >>>>>>>             else {
> >>>>>>>                     WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
> >>>>>>>                     kvm_queue_exception(vcpu, UD_VECTOR);
> >>>>>>>                     return 1;
> >>>>>>>             }
> >>>>>>>
> >>>>>>> At least there's an L1 kernel log message for the first unexpected
> >>>>>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
> >>>>>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
> >>>>>>
> >>>>>> At least there is a message to tell L1 a notify VM exit is triggered in
> >>>>>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
> >>>>>> exit with invalid_context, which is malicious to L0 and L1.
> >>>>>
> >>>>> There is only an L1 kernel log message *the first time*. That's not
> >>>>> good enough. And this is just one of the myriad of possible L1
> >>>>> hypervisors.
> >>>>>
> >>>>>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
> >>>>>> it's caused by Notify VM exit with invalid context. Triple fault needs
> >>>>>> to be extended and L1 kernel needs to be enlightened. It doesn't help
> >>>>>> old guest kernel.
> >>>>>>
> >>>>>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
> >>>>>> it's enlightened. But it doesn't help old guest kernel.
> >>>>>>
> >>>>>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
> >>>>>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
> >>>>>> from L2" and keep all kinds of L1 VMM happy, especially for those with
> >>>>>> old kernel versions.
> >>>>>
> >>>>> I agree that there is no way to make every conceivable L1 happy.
> >>>>> That's why the information needs to be surfaced to the L0 userspace. I
> >>>>> contend that any time L0 kvm violates the architectural specification
> >>>>> in its emulation of L1 or L2, the L0 userspace *must* be informed.
> >>>>
> >>>> We can make the design to exit to userspace on notify vm exit
> >>>> unconditionally with exit_qualification passed, then userspace can take
> >>>> the same action like what this patch does in KVM that
> >>>>
> >>>>     - re-enter guest when context_invalid is false;
> >>>>     - stop running the guest if context_invalid is true; (userspace can
> >>>> definitely re-enter the guest in this case, but it needs to take the
> >>>> fall on this)
> >>>>
> >>>> Then, for nested case, L0 needs to enable it transparently for L2 if
> >>>> this feature is enabled for L1 guest (the reason as we all agreed that
> >>>> cannot allow L1 to escape just by creating a L2). Then what should KVM
> >>>> do when notify vm exit from L2?
> >>>>
> >>>>     - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
> >>>> same action:
> >>>>           - re-enter if context-invalid is false;
> >>>>           - kill L1 if context-invalid is true; (I don't know if there is any
> >>>> interface for L0 userspace to kill L2). Then it opens the potential door
> >>>> for malicious user to kill L1 by creating a L2 to trigger fatal notify
> >>>> vm exit. If you guys accept it, we can implement in this way.
> >>>>
> >>>>
> >>>> in conclusion, we have below solution:
> >>>>
> >>>> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
> >>>> from L2 when L2 triggers notify vm exit with invalid context. Neither of
> >>>> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
> >>>> exit. There is only kernel log in L0, which seems not accessible for L1
> >>>> user or L2 guest.
> >>>
> >>> You are correct on that last point, and I feel that I cannot stress it
> >>> enough. In a typical environment, the L0 kernel log is only available
> >>> to the administrator of the L0 host.
> >>>
> >>>> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
> >>>> with invalid context. The drawback is, old L1 hypervisor is not
> >>>> enlightened of it and maybe misbehave on it.
> >>>>
> >>>>       b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
> >>>> tell it's caused by fatal notify vm exit from L2. It has the same
> >>>> drawback that old hypervisor has no idea of it and maybe misbehave on it.
> >>>>
> >>>> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
> >>>> L2. Then it may open the door for L1 user to kill L1.
> >>>>
> >>>> Do you have any better solution other than above? If no, we need to pick
> >>>> one from above though it cannot make everyone happy.
> >>>
> >>> Yes, I believe I have a better solution. We obviously need an API for
> >>> userspace to synthesize a SHUTDOWN event for a vCPU.
> >>
> >> Can you elaborate on it? Do you mean userspace to inject a synthesized
> >> SHUTDOWN to guest? If so, I have no idea how it will work.
> >
> > It can probably be implemented as an extension of KVM_SET_VCPU_EVENTS
> > that invokes kvm_make_request(KVM_REQ_TRIPLE_FAULT).
>
> Then, you mean
>
> 1. notify vm exit from guest;
> 2. exit to userspace on notify vm exit;
> 3. a. if context_invalid, inject SHUTDOWN to vcpu from userspace to
> request KVM_REQ_TRIPLE_FAULT; goto step 4;
>     b. if !context_invalid, re-run vcpu; no step 4 and 5;
> 4. exit to userspace again with KVM_EXIT_SHUTDOWN due to triple fault;
> 5. userspace stop running the vcpu/VM
>
> Then why not handle it as KVM_EXIT_SHUTDOWN directly in 3.a ? I don't
> get the point of userspace to inject TRIPLE_FAULT to KVM.

Sure, that should work, as long as L0 userspace is notified of the
emulation error.

Going back to something you said previously:

>> In addition, to avoid breaking legacy userspace, the NOTIFY VM-exit should be opt-in.

> Yes, it's designed as opt-in already that the feature is off by default.

I meant that userspace should opt in, per VM. I believe your design is
opt-in by the system administrator, host-wide.
Chenyi Qiang March 2, 2022, 2:15 a.m. UTC | #21
On 3/2/2022 5:57 AM, Jim Mattson wrote:
> On Mon, Feb 28, 2022 at 9:30 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>
>> On 3/1/2022 12:32 PM, Jim Mattson wrote:
>>> On Mon, Feb 28, 2022 at 5:41 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>
>>>> On 2/28/2022 10:30 PM, Jim Mattson wrote:
>>>>> On Sun, Feb 27, 2022 at 11:10 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>
>>>>>> On 2/26/2022 10:24 PM, Jim Mattson wrote:
>>>>>>> On Fri, Feb 25, 2022 at 10:24 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>>>
>>>>>>>> On 2/26/2022 12:53 PM, Jim Mattson wrote:
>>>>>>>>> On Fri, Feb 25, 2022 at 8:25 PM Jim Mattson <jmattson@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 25, 2022 at 8:07 PM Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2/25/2022 11:13 PM, Paolo Bonzini wrote:
>>>>>>>>>>>> On 2/25/22 16:12, Xiaoyao Li wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't like the idea of making things up without notifying userspace
>>>>>>>>>>>>>>> that this is fictional. How is my customer running nested VMs supposed
>>>>>>>>>>>>>>> to know that L2 didn't actually shutdown, but L0 killed it because the
>>>>>>>>>>>>>>> notify window was exceeded? If this information isn't reported to
>>>>>>>>>>>>>>> userspace, I have no way of getting the information to the customer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then, maybe a dedicated software define VM exit for it instead of
>>>>>>>>>>>>>> reusing triple fault?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Second thought, we can even just return Notify VM exit to L1 to tell
>>>>>>>>>>>>> L2 causes Notify VM exit, even thought Notify VM exit is not exposed
>>>>>>>>>>>>> to L1.
>>>>>>>>>>>>
>>>>>>>>>>>> That might cause NULL pointer dereferences or other nasty occurrences.
>>>>>>>>>>>
>>>>>>>>>>> IMO, a well written VMM (in L1) should handle it correctly.
>>>>>>>>>>>
>>>>>>>>>>> L0 KVM reports no Notify VM Exit support to L1, so L1 runs without
>>>>>>>>>>> setting Notify VM exit. If a L2 causes notify_vm_exit with
>>>>>>>>>>> invalid_vm_context, L0 just reflects it to L1. In L1's view, there is no
>>>>>>>>>>> support of Notify VM Exit from VMX MSR capability. Following L1 handler
>>>>>>>>>>> is possible:
>>>>>>>>>>>
>>>>>>>>>>> a)      if (notify_vm_exit available & notify_vm_exit enabled) {
>>>>>>>>>>>                      handle in b)
>>>>>>>>>>>              } else {
>>>>>>>>>>>                      report unexpected vm exit reason to userspace;
>>>>>>>>>>>              }
>>>>>>>>>>>
>>>>>>>>>>> b)      similar handler like we implement in KVM:
>>>>>>>>>>>              if (!vm_context_invalid)
>>>>>>>>>>>                      re-enter guest;
>>>>>>>>>>>              else
>>>>>>>>>>>                      report to userspace;
>>>>>>>>>>>
>>>>>>>>>>> c)      no Notify VM Exit related code (e.g. old KVM), it's treated as
>>>>>>>>>>> unsupported exit reason
>>>>>>>>>>>
>>>>>>>>>>> As long as it belongs to any case above, I think L1 can handle it
>>>>>>>>>>> correctly. Any nasty occurrence should be caused by incorrect handler in
>>>>>>>>>>> L1 VMM, in my opinion.
>>>>>>>>>>
>>>>>>>>>> Please test some common hypervisors (e.g. ESXi and Hyper-V).
>>>>>>>>>
>>>>>>>>> I took a look at KVM in Linux v4.9 (one of our more popular guests),
>>>>>>>>> and it will not handle this case well:
>>>>>>>>>
>>>>>>>>>              if (exit_reason < kvm_vmx_max_exit_handlers
>>>>>>>>>                  && kvm_vmx_exit_handlers[exit_reason])
>>>>>>>>>                      return kvm_vmx_exit_handlers[exit_reason](vcpu);
>>>>>>>>>              else {
>>>>>>>>>                      WARN_ONCE(1, "vmx: unexpected exit reason 0x%x\n", exit_reason);
>>>>>>>>>                      kvm_queue_exception(vcpu, UD_VECTOR);
>>>>>>>>>                      return 1;
>>>>>>>>>              }
>>>>>>>>>
>>>>>>>>> At least there's an L1 kernel log message for the first unexpected
>>>>>>>>> NOTIFY VM-exit, but after that, there is silence. Just a completely
>>>>>>>>> inexplicable #UD in L2, assuming that L2 is resumable at this point.
>>>>>>>>
>>>>>>>> At least there is a message to tell L1 a notify VM exit is triggered in
>>>>>>>> L2. Yes, the inexplicable #UD won't be hit unless L2 triggers Notify VM
>>>>>>>> exit with invalid_context, which is malicious to L0 and L1.
>>>>>>>
>>>>>>> There is only an L1 kernel log message *the first time*. That's not
>>>>>>> good enough. And this is just one of the myriad of possible L1
>>>>>>> hypervisors.
>>>>>>>
>>>>>>>> If we use triple_fault (i.e., shutdown), then no info to tell L1 that
>>>>>>>> it's caused by Notify VM exit with invalid context. Triple fault needs
>>>>>>>> to be extended and L1 kernel needs to be enlightened. It doesn't help
>>>>>>>> old guest kernel.
>>>>>>>>
>>>>>>>> If we use Machine Check, it's somewhat same inexplicable to L2 unless
>>>>>>>> it's enlightened. But it doesn't help old guest kernel.
>>>>>>>>
>>>>>>>> Anyway, for Notify VM exit with invalid context from L2, I don't see a
>>>>>>>> good solution to tell L1 VMM it's a "Notify VM exit with invalid context
>>>>>>>> from L2" and keep all kinds of L1 VMM happy, especially for those with
>>>>>>>> old kernel versions.
>>>>>>>
>>>>>>> I agree that there is no way to make every conceivable L1 happy.
>>>>>>> That's why the information needs to be surfaced to the L0 userspace. I
>>>>>>> contend that any time L0 kvm violates the architectural specification
>>>>>>> in its emulation of L1 or L2, the L0 userspace *must* be informed.
>>>>>>
>>>>>> We can make the design to exit to userspace on notify vm exit
>>>>>> unconditionally with exit_qualification passed, then userspace can take
>>>>>> the same action like what this patch does in KVM that
>>>>>>
>>>>>>      - re-enter guest when context_invalid is false;
>>>>>>      - stop running the guest if context_invalid is true; (userspace can
>>>>>> definitely re-enter the guest in this case, but it needs to take the
>>>>>> fall on this)
>>>>>>
>>>>>> Then, for nested case, L0 needs to enable it transparently for L2 if
>>>>>> this feature is enabled for L1 guest (the reason as we all agreed that
>>>>>> cannot allow L1 to escape just by creating a L2). Then what should KVM
>>>>>> do when notify vm exit from L2?
>>>>>>
>>>>>>      - Exit to L0 userspace on L2's notify vm exit. L0 userspace takes the
>>>>>> same action:
>>>>>>            - re-enter if context-invalid is false;
>>>>>>            - kill L1 if context-invalid is true; (I don't know if there is any
>>>>>> interface for L0 userspace to kill L2). Then it opens the potential door
>>>>>> for malicious user to kill L1 by creating a L2 to trigger fatal notify
>>>>>> vm exit. If you guys accept it, we can implement in this way.
>>>>>>
>>>>>>
>>>>>> in conclusion, we have below solution:
>>>>>>
>>>>>> 1. Take this patch as is. The drawback is L1 VMM receives a triple_fault
>>>>>> from L2 when L2 triggers notify vm exit with invalid context. Neither of
>>>>>> L1 VMM, L1 userspace, nor L2 kernel know it's caused due to notify vm
>>>>>> exit. There is only kernel log in L0, which seems not accessible for L1
>>>>>> user or L2 guest.
>>>>>
>>>>> You are correct on that last point, and I feel that I cannot stress it
>>>>> enough. In a typical environment, the L0 kernel log is only available
>>>>> to the administrator of the L0 host.
>>>>>
>>>>>> 2. a) Inject notify vm exit back to L1 if L2 triggers notify vm exit
>>>>>> with invalid context. The drawback is, old L1 hypervisor is not
>>>>>> enlightened of it and maybe misbehave on it.
>>>>>>
>>>>>>        b) Inject a synthesized SHUTDOWN exit to L1, with additional info to
>>>>>> tell it's caused by fatal notify vm exit from L2. It has the same
>>>>>> drawback that old hypervisor has no idea of it and maybe misbehave on it.
>>>>>>
>>>>>> 3. Exit to L0 usersapce unconditionally no matter it's caused from L1 or
>>>>>> L2. Then it may open the door for L1 user to kill L1.
>>>>>>
>>>>>> Do you have any better solution other than above? If no, we need to pick
>>>>>> one from above though it cannot make everyone happy.
>>>>>
>>>>> Yes, I believe I have a better solution. We obviously need an API for
>>>>> userspace to synthesize a SHUTDOWN event for a vCPU.
>>>>
>>>> Can you elaborate on it? Do you mean userspace to inject a synthesized
>>>> SHUTDOWN to guest? If so, I have no idea how it will work.
>>>
>>> It can probably be implemented as an extension of KVM_SET_VCPU_EVENTS
>>> that invokes kvm_make_request(KVM_REQ_TRIPLE_FAULT).
>>
>> Then, you mean
>>
>> 1. notify vm exit from guest;
>> 2. exit to userspace on notify vm exit;
>> 3. a. if context_invalid, inject SHUTDOWN to vcpu from userspace to
>> request KVM_REQ_TRIPLE_FAULT; goto step 4;
>>      b. if !context_invalid, re-run vcpu; no step 4 and 5;
>> 4. exit to userspace again with KVM_EXIT_SHUTDOWN due to triple fault;
>> 5. userspace stop running the vcpu/VM
>>
>> Then why not handle it as KVM_EXIT_SHUTDOWN directly in 3.a ? I don't
>> get the point of userspace to inject TRIPLE_FAULT to KVM.
> 
> Sure, that should work, as long as L0 userspace is notified of the
> emulation error.
> 
> Going back to something you said previously:

So, after adding the nested handling case, can we summarize the whole
working flow as:

1. notify VM exit from the guest;
2. a. if !context_invalid, resume the vcpu, no further processing;
   b. if context_invalid, exit to userspace;
3. userspace injects a SHUTDOWN event via KVM_SET_VCPU_EVENTS to request
KVM_REQ_TRIPLE_FAULT;
4. a. if !is_guest_mode(vcpu), exit to userspace again with
KVM_EXIT_SHUTDOWN due to the triple fault; L1 shuts down;
   b. if is_guest_mode(vcpu), synthesize a nested triple fault to L1;
L2 shuts down;

> 
>>> In addition, to avoid breaking legacy userspace, the NOTIFY VM-exit should be opt-in.
> 
>> Yes, it's designed as opt-in already that the feature is off by default.
> 
> I meant that userspace should opt-in, per VM. I believe your design is
> opt-in by system administrator, host-wide.

OK, we will change it to a per-VM control.
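
As a rough sketch, such a per-VM opt-in could follow the usual
KVM_ENABLE_CAP pattern; the capability name below and the use of args[0]
for the notify window are placeholders, not a settled ABI:

        /*
         * Hypothetical per-VM enablement from the L0 VMM; assumes
         * <linux/kvm.h> and an open VM fd.  KVM_CAP_X86_NOTIFY_VMEXIT is
         * a placeholder name.
         */
        struct kvm_enable_cap cap = {
                .cap  = KVM_CAP_X86_NOTIFY_VMEXIT,
                .args = { 128 * 1024 },  /* notify window, in crystal clock units */
        };

        if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
                perror("KVM_ENABLE_CAP");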
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 713e08f62385..3df68fea9f22 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1281,6 +1281,7 @@  struct kvm_vcpu_stat {
 	u64 directed_yield_attempted;
 	u64 directed_yield_successful;
 	u64 guest_mode;
+	u64 notify_window_exits;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..9104c85a973f 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -74,6 +74,7 @@ 
 #define SECONDARY_EXEC_TSC_SCALING              VMCS_CONTROL_BIT(TSC_SCALING)
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
+#define SECONDARY_EXEC_NOTIFY_VM_EXITING	VMCS_CONTROL_BIT(NOTIFY_VM_EXITING)
 
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
@@ -269,6 +270,7 @@  enum vmcs_field {
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
 	PLE_GAP                         = 0x00004020,
 	PLE_WINDOW                      = 0x00004022,
+	NOTIFY_WINDOW                   = 0x00004024,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -555,6 +557,11 @@  enum vm_entry_failure_code {
 #define EPT_VIOLATION_EXECUTABLE	(1 << EPT_VIOLATION_EXECUTABLE_BIT)
 #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
+/*
+ * Exit Qualifications for NOTIFY VM EXIT
+ */
+#define NOTIFY_VM_CONTEXT_INVALID     BIT(0)
+
 /*
  * VM-instruction error numbers
  */
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..15f0f2ab4f95 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -84,5 +84,6 @@ 
 #define VMX_FEATURE_USR_WAIT_PAUSE	( 2*32+ 26) /* Enable TPAUSE, UMONITOR, UMWAIT in guest */
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
+#define VMX_FEATURE_NOTIFY_VM_EXITING	( 2*32+ 31) /* VM-Exit when no event windows after notify window */
 
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 946d761adbd3..ef4c80f6553e 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -91,6 +91,7 @@ 
 #define EXIT_REASON_UMWAIT              67
 #define EXIT_REASON_TPAUSE              68
 #define EXIT_REASON_BUS_LOCK            74
+#define EXIT_REASON_NOTIFY              75
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -153,7 +154,8 @@ 
 	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
 	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
 	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
-	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }
+	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
+	{ EXIT_REASON_NOTIFY,                "NOTIFY"}
 
 #define VMX_EXIT_REASON_FLAGS \
 	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..fff010d24ad0 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -14,6 +14,7 @@  extern bool __read_mostly enable_unrestricted_guest;
 extern bool __read_mostly enable_ept_ad_bits;
 extern bool __read_mostly enable_pml;
 extern int __read_mostly pt_mode;
+extern int __read_mostly notify_window;
 
 #define PT_MODE_SYSTEM		0
 #define PT_MODE_HOST_GUEST	1
@@ -417,4 +418,10 @@  static inline u64 vmx_supported_debugctl(void)
 	return debugctl;
 }
 
+static inline bool cpu_has_notify_vm_exiting(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_NOTIFY_VM_EXITING;
+}
+
 #endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1dfe23963a9e..f306b642c3e1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2177,6 +2177,9 @@  static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
 	if (cpu_has_vmx_encls_vmexit())
 		vmcs_write64(ENCLS_EXITING_BITMAP, INVALID_GPA);
 
+	if (notify_window >= 0)
+		vmcs_write32(NOTIFY_WINDOW, notify_window);
+
 	/*
 	 * Set the MSR load/store lists to match L0's settings.  Only the
 	 * addresses are constant (for vmcs02), the counts can change based
@@ -4213,8 +4216,16 @@  static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 		/*
 		 * Transfer the event that L0 or L1 may wanted to inject into
 		 * L2 to IDT_VECTORING_INFO_FIELD.
+		 * L0 will synthesize a nested TRIPLE_FAULT to kill L2 when
+		 * a notify VM exit occurred in L2 and NOTIFY_VM_CONTEXT_INVALID
+		 * is set in the exit qualification. In this case, if the notify
+		 * VM exit occurred incident to delivery of a vectored event,
+		 * the IDT vectoring info is recorded in the VMCS. Drop the
+		 * pending event in vmcs12, otherwise the L1 VMM will exit to
+		 * userspace with an internal error due to the delivery event.
 		 */
-		vmcs12_save_pending_event(vcpu, vmcs12);
+		if (to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_NOTIFY)
+			vmcs12_save_pending_event(vcpu, vmcs12);
 
 		/*
 		 * According to spec, there's no need to store the guest's
@@ -6080,6 +6091,9 @@  static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
 			SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE);
 	case EXIT_REASON_ENCLS:
 		return nested_vmx_exit_handled_encls(vcpu, vmcs12);
+	case EXIT_REASON_NOTIFY:
+		return nested_cpu_has2(vmcs12,
+			SECONDARY_EXEC_NOTIFY_VM_EXITING);
 	default:
 		return true;
 	}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b183dfc41d74..c8f1c2f83a8a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -207,6 +207,15 @@  module_param(ple_window_max, uint, 0444);
 int __read_mostly pt_mode = PT_MODE_SYSTEM;
 module_param(pt_mode, int, S_IRUGO);
 
+/*
+ * Set the default to -1 to disable notify VM exit.
+ * The admin can set a non-negative value to enable notify VM exit.
+ * A recommended value is 128K, which is a threshold large enough
+ * to avoid false positives on some platforms.
+ */
+int __read_mostly notify_window = -1;
+module_param(notify_window, int, 0444);
+
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
@@ -2479,7 +2488,8 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			SECONDARY_EXEC_PT_USE_GPA |
 			SECONDARY_EXEC_PT_CONCEAL_VMX |
 			SECONDARY_EXEC_ENABLE_VMFUNC |
-			SECONDARY_EXEC_BUS_LOCK_DETECTION;
+			SECONDARY_EXEC_BUS_LOCK_DETECTION |
+			SECONDARY_EXEC_NOTIFY_VM_EXITING;
 		if (cpu_has_sgx())
 			opt2 |= SECONDARY_EXEC_ENCLS_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
@@ -4369,6 +4379,9 @@  static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!vcpu->kvm->arch.bus_lock_detection_enabled)
 		exec_control &= ~SECONDARY_EXEC_BUS_LOCK_DETECTION;
 
+	if (notify_window < 0)
+		exec_control &= ~SECONDARY_EXEC_NOTIFY_VM_EXITING;
+
 	return exec_control;
 }
 
@@ -4410,6 +4423,9 @@  static void init_vmcs(struct vcpu_vmx *vmx)
 		vmx->ple_window_dirty = true;
 	}
 
+	if (notify_window >= 0)
+		vmcs_write32(NOTIFY_WINDOW, notify_window);
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -5691,6 +5707,40 @@  static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_notify(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qual = vmx_get_exit_qual(vcpu);
+
+	++vcpu->stat.notify_window_exits;
+	pr_warn_ratelimited("Notify window exits at address: 0x%lx\n",
+			    kvm_rip_read(vcpu));
+
+	if (!(exit_qual & NOTIFY_VM_CONTEXT_INVALID)) {
+		/*
+		 * Notify VM exit happened while executing iret from NMI,
+		 * "blocked by NMI" bit has to be set before next VM entry.
+		 */
+		if (enable_vnmi &&
+		    (exit_qual & INTR_INFO_UNBLOCK_NMI))
+			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+				      GUEST_INTR_STATE_NMI);
+
+		return 1;
+	}
+
+	if (is_guest_mode(vcpu)) {
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		return 1;
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_NO_EVENT_WINDOW;
+	vcpu->run->internal.ndata = 1;
+	vcpu->run->internal.data[0] = exit_qual;
+
+	return 0;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -5748,6 +5798,7 @@  static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
 	[EXIT_REASON_ENCLS]		      = handle_encls,
 	[EXIT_REASON_BUS_LOCK]                = handle_bus_lock_vmexit,
+	[EXIT_REASON_NOTIFY]		      = handle_notify,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -6112,7 +6163,8 @@  static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
-	     exit_reason.basic != EXIT_REASON_TASK_SWITCH)) {
+	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
+	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
 		int ndata = 3;
 
 		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -7975,6 +8027,9 @@  static __init int hardware_setup(void)
 
 	kvm_has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();
 
+	if (!cpu_has_notify_vm_exiting())
+		notify_window = -1;
+
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
 	if (enable_ept)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6552360d8888..06a74561d44e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -289,7 +289,8 @@  const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
 	STATS_DESC_COUNTER(VCPU, nested_run),
 	STATS_DESC_COUNTER(VCPU, directed_yield_attempted),
 	STATS_DESC_COUNTER(VCPU, directed_yield_successful),
-	STATS_DESC_ICOUNTER(VCPU, guest_mode)
+	STATS_DESC_ICOUNTER(VCPU, guest_mode),
+	STATS_DESC_COUNTER(VCPU, notify_window_exits),
 };
 
 const struct kvm_stats_header kvm_vcpu_stats_header = {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5191b57e1562..20ee68b4ac14 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -280,6 +280,8 @@  struct kvm_xen_exit {
 #define KVM_INTERNAL_ERROR_DELIVERY_EV	3
 /* Encounter unexpected vm-exit reason */
 #define KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON	4
+/* Encounter notify vm-exit */
+#define KVM_INTERNAL_ERROR_NO_EVENT_WINDOW   5
 
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0)