Message ID: 1518791030-31765-2-git-send-email-karahmed@amazon.de (mailing list archive)
State: New, archived
On 16/02/2018 15:23, KarimAllah Ahmed wrote:
> On exit to L0 user-space, always exit from L2 to L1 and synchronize the
> state properly for L1. This ensures that user-space only ever sees L1
> state. It also allows L1 to be saved and resumed properly. Obviously
> horrible things will still happen to the L2 guest. This will be handled
> in a separate patch.
>
> There is only a single case which requires a bit of extra care: when the
> decision to switch to user space happens while handling an L1
> VMRESUME/VMLAUNCH (i.e. pending_nested_run). In order to handle this
> as cleanly as possible without major restructuring, we simply do not exit
> to user-space in this case and give L2 another chance to actually run. We
> also request an immediate exit to ensure that an exit to user space will
> still happen for the L2.
>
> The only reason I can see where an exit to user space will occur while L2
> is running is because of a pending signal. This is how user space preempts
> KVM_RUN in order to save the state. L2 exits are either handled in the L0
> kernel or reflected to L1, and are not handled in L0 user-space.
>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>

We discussed this with Jim about one year ago and then again last
January. While I (in 2017) and David H. (in 2018) also thought about
doing an L2->L1 exit like this, Jim quickly got me to change my
mind---it doesn't really seem like a good idea compared to doing full
checkpointing of VMX state. You can find the discussion at
https://patchwork.kernel.org/patch/9454799/.

Of course, Jim's series (first posted Nov 2016) is way more complex than
yours, but the good news is that most of his changes have already been
merged; the only ones missing are:

https://patchwork.kernel.org/patch/9454799/
[7/8] kvm: nVMX: Introduce KVM_CAP_VMX_STATE

https://patchwork.kernel.org/patch/9454797/
[8/8] kvm: nVMX: Defer gpa->hpa lookups for set_vmx_state

The main request was to make [7/8] a bit more generic so that it can be
applied to SVM as well. That's pretty simple though.

Thanks,

Paolo
On 02/16/2018 03:52 PM, Paolo Bonzini wrote:
> On 16/02/2018 15:23, KarimAllah Ahmed wrote:
>> On exit to L0 user-space, always exit from L2 to L1 and synchronize the
>> state properly for L1. This ensures that user-space only ever sees L1
>> state. [...]
>>
>> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
>
> We discussed this with Jim about one year ago and then again last
> January. While I (in 2017) and David H. (in 2018) also thought about
> doing an L2->L1 exit like this, Jim quickly got me to change my
> mind---it doesn't really seem like a good idea compared to doing full
> checkpointing of VMX state. You can find the discussion at
> https://patchwork.kernel.org/patch/9454799/.
>
> Of course, Jim's series (first posted Nov 2016) is way more complex than
> yours, but the good news is that most of his changes have already been
> merged; the only ones missing are:
>
> https://patchwork.kernel.org/patch/9454799/
> [7/8] kvm: nVMX: Introduce KVM_CAP_VMX_STATE
>
> https://patchwork.kernel.org/patch/9454797/
> [8/8] kvm: nVMX: Defer gpa->hpa lookups for set_vmx_state

Oh! Thank you for pointing this out. Somehow I did not notice any of
this :)

I was also thinking about doing a full save of the VMX state, then I
decided to do the switch instead.

In any case, looking forward to seeing those bits in master.

> The main request was to make [7/8] a bit more generic so that it can be
> applied to SVM as well. That's pretty simple though.
>
> Thanks,
>
> Paolo

Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
Now that I have looked at Jim's patch and also went through *most* of
the comments, I think I realized that my approach is reasonable, and in
fact I no longer see any of the downsides mentioned in the other thread.
Especially the one about handling the VMLAUNCH case: the code now
handles this cleanly, using an approach that is already available in the
code base (there is precedent).

The only thing that is not handled, AFAICT, is:

> The other unfortunate thing about flushing the "current" VMCS12 state
> to guest memory on each L2->userspace transition is that much of this
> state is in the VMCS02. So, it's not just a matter of writing a
> VMCS12_SIZE blob to guest memory; first, the cached VMCS12 has to be
> updated from the VMCS02 by calling sync_vmcs12(). This will be
> particularly bad for double-nesting, where L1 kvm has to take all of
> those VMREAD VM-exits.

... which is something I can actually fix if needed, but is there really
anyone doing this today? Do we actually need to optimize for this at
all?

Is there anything else that I am missing?

So what are the upsides of my approach:

1- It ensures that user-space tools that do not understand nesting can
   still see the expected guest state when querying guest state, or even
   when trying to read memory, translate an address, etc.
2- It is very simple and does not require a whole lot of state in
   user-space.
3- It is even rebased on master :) (Ok, maybe this is not a technical
   reason :D)

Thoughts?

On a side note: I have also fixed the VM-entry issue that I mentioned in
the commit message, and I have already done hundreds of save/resume
cycles successfully.

On Fri, 2018-02-16 at 16:23 +0100, KarimAllah Ahmed wrote:
> [...]
On Sun, Feb 18, 2018 at 9:44 PM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> Is there anything else that I am missing?
I haven't looked at your entire patch, but it looks like you are
synthesizing a VM-exit from L2 to L1 for "external interrupt." What if
the “external-interrupt exiting” VM-execution control in vmcs12 is
clear? Then L1 will not be expecting a VM-exit for "external
interrupt."
On 19/02/2018 06:44, KarimAllah Ahmed wrote:
> 1- It ensures that user-space tools that do not understand nesting
>    can still see the expected guest state when querying guest state,
>    or even when trying to read memory, translate an address, etc.

How does it work when L1 assigns an MMIO region or PIO region for direct
L2 access, and then L2 requests MMIO from userspace? You cannot do a
vmexit in that case.

> 2- It is very simple and does not require a whole lot of state in
>    user-space.

It still requires _some_ state. If you need a new API, the amount of
state that the API saves/restores is not particularly interesting.

Paolo
On Wed, 2018-02-21 at 10:58 +0100, Paolo Bonzini wrote:
> On 19/02/2018 06:44, KarimAllah Ahmed wrote:
> > 1- It ensures that user-space tools that do not understand nesting
> >    can still see the expected guest state when querying guest state,
> >    or even when trying to read memory, translate an address, etc.
>
> How does it work when L1 assigns an MMIO region or PIO region for direct
> L2 access, and then L2 requests MMIO from userspace? You cannot do a
> vmexit in that case.

Ah, right! I was testing with a real device that is passed through to
L1, and with QEMU/KVM running as the L1 hypervisor that emulates e1000
for L2 on L1.

Technically I can solve this problem, but it would not satisfy point (1)
stated above. So the solution would be slightly more ugly (or
significantly more).

> > 2- It is very simple and does not require a whole lot of state in
> >    user-space.
>
> It still requires _some_ state. If you need a new API, the amount of
> state that the API saves/restores is not particularly interesting.

I thought at some point about creating a new MSR and adding it to the
list of MSRs to save. The issue you mentioned above pretty much killed
this approach for me :)

Ironically, it still works fine in our environment, but it will not work
fine for the general use-case.
On 21/02/2018 13:59, Raslan, KarimAllah wrote:
> Ironically, it still works fine in our environment, but it will not
> work fine for the general use-case.

Yeah, because you have so many assigned devices and don't really have
much to do in userspace. But it could still break with an L1 hypervisor
that is not Xen or KVM or anything like that.

Paolo
On Tue, 2018-02-20 at 09:19 -0800, Jim Mattson wrote:
> On Sun, Feb 18, 2018 at 9:44 PM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> > Is there anything else that I am missing?
>
> I haven't looked at your entire patch, but it looks like you are
> synthesizing a VM-exit from L2 to L1 for "external interrupt." What if
> the "external-interrupt exiting" VM-execution control in vmcs12 is
> clear? Then L1 will not be expecting a VM-exit for "external
> interrupt."

Yup, I thought about that, but then I hoped that all L1 hypervisors
would not mind this :)

Anyway, your approach is indeed superior and seems to fix all of the
outlined problems.
On Fri, Feb 16, 2018 at 04:23:33PM +0100, KarimAllah Ahmed wrote:
> On 02/16/2018 03:52 PM, Paolo Bonzini wrote:
> > [...]
> >
> > https://patchwork.kernel.org/patch/9454799/
> > [7/8] kvm: nVMX: Introduce KVM_CAP_VMX_STATE
> >
> > https://patchwork.kernel.org/patch/9454797/
> > [8/8] kvm: nVMX: Defer gpa->hpa lookups for set_vmx_state
>
> Oh! Thank you for pointing this out. Somehow I did not notice any of this :)
>
> I was also thinking about doing a full save of the VMX state, then I
> decided to do the switch instead.
>
> In any case, looking forward to seeing those bits in master.

.. Is somebody working on this, or would it make sense to have a couple
of folks work together on this?
On 27/03/2018 21:23, Konrad Rzeszutek Wilk wrote:
>>> https://patchwork.kernel.org/patch/9454799/
>>> [7/8] kvm: nVMX: Introduce KVM_CAP_VMX_STATE
>>>
>>> https://patchwork.kernel.org/patch/9454797/
>>> [8/8] kvm: nVMX: Defer gpa->hpa lookups for set_vmx_state
>>
>> Oh! Thank you for pointing this out. Somehow I did not notice any of this :)
>>
>> I was also thinking about doing a full save of the VMX state, then I
>> decided to do the switch instead.
>>
>> In any case, looking forward to seeing those bits in master.
>
> .. Is somebody working on this, or would it make sense to have a couple
> of folks work together on this?

I was hoping for Karim to beat me, but I plan to look at it in April.

Paolo
On Tue, 2018-03-27 at 22:16 +0200, Paolo Bonzini wrote:
> On 27/03/2018 21:23, Konrad Rzeszutek Wilk wrote:
> > [...]
> >
> > .. Is somebody working on this, or would it make sense to have a couple
> > of folks work together on this?
>
> I was hoping for Karim to beat me, but I plan to look at it in April.

I actually did not touch them because I thought that Jim would send
another round of the patches.

Anyway, if Jim is OK with it, I can pick them up and resend.
On 27/03/2018 23:51, Raslan, KarimAllah wrote:
>>> .. Is somebody working on this, or would it make sense to have a couple
>>> of folks work together on this?
>>
>> I was hoping for Karim to beat me, but I plan to look at it in April.
>
> I actually did not touch them because I thought that Jim would send
> another round of the patches.
>
> Anyway, if Jim is OK with it, I can pick them up and resend.

I think you can assume that, since the patches have been posted over a
year ago. You have my blessing. :)

Paolo
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 318a414..2c8be56 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -961,6 +961,8 @@ struct kvm_x86_ops {
 			       struct msr_bitmap_range *whitelist);
 
 	void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
+	void (*prepare_exit_user)(struct kvm_vcpu *vcpu);
+	bool (*allow_exit_user)(struct kvm_vcpu *vcpu);
 	void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
 	void (*vcpu_put)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 52539be..22eb0dc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2130,6 +2130,42 @@ static unsigned long segment_base(u16 selector)
 }
 #endif
 
+static bool vmx_allow_exit_user(struct kvm_vcpu *vcpu)
+{
+	return !to_vmx(vcpu)->nested.nested_run_pending;
+}
+
+static void vmx_prepare_exit_user(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vmx->nested.current_vmptr == -1ull)
+		return;
+
+	/*
+	 * If L2 is running no need to update vmcs12 from shadow VMCS.
+	 * Just force an exit from L2 to L1
+	 */
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * Pretend that an external interrupt occurred while L2 is
+		 * running to cleanly exit into L1.
+		 */
+		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+
+		/* Switch from L2 MMU to L1 MMU */
+		kvm_mmu_reset_context(vcpu);
+	} else if (enable_shadow_vmcs) {
+		copy_shadow_to_vmcs12(vmx);
+	}
+
+	/* Flush VMCS12 to guest memory */
+	kvm_write_guest(vcpu->kvm, vmx->nested.current_vmptr,
+			get_vmcs12(vcpu), sizeof(*vmx->nested.cached_vmcs12));
+
+	return;
+}
+
 static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -12440,6 +12476,9 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 
 	.whitelist_msrs = vmx_whitelist_msrs,
 
+	.prepare_exit_user = vmx_prepare_exit_user,
+	.allow_exit_user = vmx_allow_exit_user,
+
 	.prepare_guest_switch = vmx_save_host_state,
 	.vcpu_load = vmx_vcpu_load,
 	.vcpu_put = vmx_vcpu_put,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cfbf39..8256a2d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -996,6 +996,12 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_rdpmc);
 
+static __always_inline bool should_exit_user(struct kvm_vcpu *vcpu)
+{
+	return signal_pending(current) && (kvm_x86_ops->allow_exit_user ?
+					   kvm_x86_ops->allow_exit_user(vcpu): true);
+}
+
 /*
  * List of msr numbers which we expose to userspace through KVM_GET_MSRS
  * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
@@ -7187,8 +7193,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
 		kvm_x86_ops->sync_pir_to_irr(vcpu);
 
-	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
-	    || need_resched() || signal_pending(current)) {
+	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) || need_resched()) {
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		smp_wmb();
 		local_irq_enable();
@@ -7198,6 +7203,20 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
+	if (signal_pending(current)) {
+		if (kvm_x86_ops->allow_exit_user &&
+		    kvm_x86_ops->allow_exit_user(vcpu)) {
+			vcpu->mode = OUTSIDE_GUEST_MODE;
+			smp_wmb();
+			local_irq_enable();
+			preempt_enable();
+			vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+			r = 1;
+			goto cancel_injection;
+		} else
+			req_immediate_exit = true;
+	}
+
 	kvm_load_guest_xcr0(vcpu);
 
 	if (req_immediate_exit) {
@@ -7364,7 +7383,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 
 		kvm_check_async_pf_completion(vcpu);
 
-		if (signal_pending(current)) {
+		if (should_exit_user(vcpu)) {
 			r = -EINTR;
 			vcpu->run->exit_reason = KVM_EXIT_INTR;
 			++vcpu->stat.signal_exits;
@@ -7506,11 +7525,15 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	} else
 		WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
 
-	if (kvm_run->immediate_exit)
+	if (kvm_run->immediate_exit) {
 		r = -EINTR;
-	else
+	} else {
 		r = vcpu_run(vcpu);
 
+		if (kvm_x86_ops->prepare_exit_user)
+			kvm_x86_ops->prepare_exit_user(vcpu);
+	}
+
 out:
 	kvm_put_guest_fpu(vcpu);
 	post_kvm_run_save(vcpu);
On exit to L0 user-space, always exit from L2 to L1 and synchronize the
state properly for L1. This ensures that user-space only ever sees L1
state. It also allows L1 to be saved and resumed properly. Obviously
horrible things will still happen to the L2 guest. This will be handled
in a separate patch.

There is only a single case which requires a bit of extra care: when the
decision to switch to user space happens while handling an L1
VMRESUME/VMLAUNCH (i.e. pending_nested_run). In order to handle this as
cleanly as possible without major restructuring, we simply do not exit
to user-space in this case and give L2 another chance to actually run.
We also request an immediate exit to ensure that an exit to user space
will still happen for the L2.

The only reason I can see where an exit to user space will occur while
L2 is running is because of a pending signal. This is how user space
preempts KVM_RUN in order to save the state. L2 exits are either handled
in the L0 kernel or reflected to L1, and are not handled in L0
user-space.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx.c              | 39 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              | 33 ++++++++++++++++++++++++++++-----
 3 files changed, 69 insertions(+), 5 deletions(-)