Message ID: 20200424210316.848878-1-mstunes@vmware.com (mailing list archive)
State: New, archived
Series: Allow RDTSC and RDTSCP from userspace
On 4/24/20 2:03 PM, Mike Stunes wrote:
> I needed to allow RDTSC(P) from userspace and in early boot in order to
> get userspace started properly. Patch below.
>
> ---
> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
> during early boot. Move the rdtsc(p) #VC handler into common code and
> extend the #VC handlers.

Do SEV-ES guests _always_ #VC on rdtsc(p)?
On 4/24/20 4:24 PM, Dave Hansen wrote:
> On 4/24/20 2:03 PM, Mike Stunes wrote:
>> I needed to allow RDTSC(P) from userspace and in early boot in order to
>> get userspace started properly. Patch below.
>>
>> ---
>> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
>> during early boot. Move the rdtsc(p) #VC handler into common code and
>> extend the #VC handlers.
>
> Do SEV-ES guests _always_ #VC on rdtsc(p)?

Only if the hypervisor is intercepting those instructions.

Thanks,
Tom
On 4/24/20 2:27 PM, Tom Lendacky wrote:
> On 4/24/20 4:24 PM, Dave Hansen wrote:
>> On 4/24/20 2:03 PM, Mike Stunes wrote:
>>> I needed to allow RDTSC(P) from userspace and in early boot in order to
>>> get userspace started properly. Patch below.
>>>
>>> ---
>>> SEV-ES guests will need to execute rdtsc and rdtscp from userspace and
>>> during early boot. Move the rdtsc(p) #VC handler into common code and
>>> extend the #VC handlers.
>>
>> Do SEV-ES guests _always_ #VC on rdtsc(p)?
>
> Only if the hypervisor is intercepting those instructions.

Ahh, so any instruction that can have an instruction intercept set
potentially needs to be able to tolerate a #VC? Those instruction
intercepts are under the control of the (untrusted relative to the
guest) hypervisor, right?

From the main sev-es series:

+#ifdef CONFIG_AMD_MEM_ENCRYPT
+idtentry vmm_communication do_vmm_communication has_error_code=1
+#endif

Since this is set as non-paranoid, that both limits the instructions
that can be used in entry paths *and* limits the future architecture
from being able to add instructions that a current SEV-ES guest doesn't
know about. Does SEV-ES have versioning so guests can tell if they
might be subject to new interrupt intercepts for which they are not
prepared? I didn't see anything obvious in section 15.35 of the manual.

There's also a nugget in the manual that says:

> Similarly, the hypervisor should avoid setting intercept bits for
> events that would occur in the #VC handler (such as IRET).

That's a fun point because it means that the (untrusted) hypervisor can
cause endless faults. I *guess* we have mitigation for this with our
stack guard pages, but it's still a bit nasty that the hypervisor can
arbitrarily land a guest in the double-fault handler.

It just all seems a bit weak for the hypervisor to be considered
untrusted. But, it's _certainly_ a step in the right direction from SEV.
Hi Mike,

On Fri, Apr 24, 2020 at 02:03:16PM -0700, Mike Stunes wrote:
> I needed to allow RDTSC(P) from userspace and in early boot in order to
> get userspace started properly. Patch below.

Thanks, but this is not needed anymore. I removed the vc_context_filter
from the code. The emulation code is now capable of safely handling any
exception from user-space.

Regards,

	Joerg
Hi Dave,

On Fri, Apr 24, 2020 at 03:53:09PM -0700, Dave Hansen wrote:
> Ahh, so any instruction that can have an instruction intercept set
> potentially needs to be able to tolerate a #VC? Those instruction
> intercepts are under the control of the (untrusted relative to the
> guest) hypervisor, right?
>
> From the main sev-es series:
>
> +#ifdef CONFIG_AMD_MEM_ENCRYPT
> +idtentry vmm_communication do_vmm_communication has_error_code=1
> +#endif

The next version of the patch-set (which I will hopefully have ready
next week) will have this changed. The #VC exception handler uses an IST
stack and is set to paranoid=1 and shift_ist. The IST stacks for the #VC
handler are only allocated when SEV-ES is active.

> That's a fun point because it means that the (untrusted) hypervisor can
> cause endless faults. I *guess* we have mitigation for this with our
> stack guard pages, but it's still a bit nasty that the hypervisor can
> arbitrarily land a guest in the double-fault handler.
>
> It just all seems a bit weak for the hypervisor to be considered
> untrusted. But, it's _certainly_ a step in the right direction from SEV.

Yeah, a malicious hypervisor can do bad things to an SEV-ES VM, but it
can't easily steal its secrets from memory or registers. The #VC handler
does its best to just crash the VM if unexpected hypervisor behavior is
detected.

Regards,

	Joerg
On Sat, Apr 25, 2020 at 5:49 AM Joerg Roedel <jroedel@suse.de> wrote:
>
> Hi Dave,
>
> On Fri, Apr 24, 2020 at 03:53:09PM -0700, Dave Hansen wrote:
> > Ahh, so any instruction that can have an instruction intercept set
> > potentially needs to be able to tolerate a #VC? Those instruction
> > intercepts are under the control of the (untrusted relative to the
> > guest) hypervisor, right?
> >
> > From the main sev-es series:
> >
> > +#ifdef CONFIG_AMD_MEM_ENCRYPT
> > +idtentry vmm_communication do_vmm_communication has_error_code=1
> > +#endif
>
> The next version of the patch-set (which I will hopefully have ready
> next week) will have this changed. The #VC exception handler uses an IST
> stack and is set to paranoid=1 and shift_ist. The IST stacks for the #VC
> handler are only allocated when SEV-ES is active.

shift_ist is gross. What's it for? If it's not needed, I'd rather not
use it, and I eventually want to get rid of it for #DB as well.

--Andy
On Sat, Apr 25, 2020 at 11:15:35AM -0700, Andy Lutomirski wrote:
> shift_ist is gross. What's it for? If it's not needed, I'd rather not
> use it, and I eventually want to get rid of it for #DB as well.

The #VC handler needs to be able to nest, there is no way around that
for various reasons, the two most important ones are:

	1. The #VC -> NMI -> #VC case. #VCs can happen in the NMI
	   handler, for example (but not exclusively) for RDPMC.

	2. In case of an error the #VC handler needs to print out error
	   information by calling one of the printk wrappers. Those will
	   end up doing IO to some console/serial port/whatever which
	   will also cause #VC exceptions to emulate the access to the
	   output devices.

Using shift_ist is perfect for that, the only problem is the race
condition with the NMI handler, as shift_ist does not work well with
exceptions that can also trigger within the NMI handler. But I have
taken care of that for #VC.

Regards,

	Joerg
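For reference, the shift_ist mechanism under discussion worked by moving
the vector's IST slot in the TSS down by one stack-sized stride around
the handler call, so that a nested instance of the same exception lands
on a fresh stack. A simplified sketch of the idea (not the exact
entry_64.S macro; the stride and index names here are approximate):

	/* entry, before calling the C handler: */
	subq	$EXCEPTION_STKSZ, CPU_TSS_IST(IST_INDEX_VC)	/* nested #VC gets the next stack */
	call	do_vmm_communication
	/* exit: */
	addq	$EXCEPTION_STKSZ, CPU_TSS_IST(IST_INDEX_VC)	/* restore the original stack */

The race mentioned above is visible here: an NMI that arrives between
the exception and the subq still sees the unshifted IST entry.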
> On Apr 25, 2020, at 12:10 PM, Joerg Roedel <joro@8bytes.org> wrote:
>
> On Sat, Apr 25, 2020 at 11:15:35AM -0700, Andy Lutomirski wrote:
>> shift_ist is gross. What's it for? If it's not needed, I'd rather not
>> use it, and I eventually want to get rid of it for #DB as well.
>
> The #VC handler needs to be able to nest, there is no way around that
> for various reasons, the two most important ones are:
>
>	1. The #VC -> NMI -> #VC case. #VCs can happen in the NMI
>	   handler, for example (but not exclusively) for RDPMC.
>
>	2. In case of an error the #VC handler needs to print out error
>	   information by calling one of the printk wrappers. Those will
>	   end up doing IO to some console/serial port/whatever which
>	   will also cause #VC exceptions to emulate the access to the
>	   output devices.
>
> Using shift_ist is perfect for that, the only problem is the race
> condition with the NMI handler, as shift_ist does not work well with
> exceptions that can also trigger within the NMI handler. But I have
> taken care of that for #VC.

I assume the race you mean is:

#VC
Immediate NMI before IST gets shifted
#VC

Kaboom.

How are you dealing with this? Ultimately, I think that NMI will need
to turn off IST before engaging in any funny business. Let me ponder
this a bit.
On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> I assume the race you mean is:
>
> #VC
> Immediate NMI before IST gets shifted
> #VC
>
> Kaboom.
>
> How are you dealing with this? Ultimately, I think that NMI will need
> to turn off IST before engaging in any funny business. Let me ponder
> this a bit.

Right, I dealt with that by unconditionally shifting/unshifting the #VC
IST entry in do_nmi() (thanks to David Kaplan for the idea). It might
cause one of the IST stacks to be unused during nesting, but that is
fine. The stack memory for #VC is only allocated when SEV-ES is active
(in an SEV-ES VM).

Regards,

	Joerg
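A minimal sketch of that do_nmi() fixup, with invented helper names (the
actual patches differ in detail):

	noinstr void do_nmi(struct pt_regs *regs)
	{
		/*
		 * Point the #VC IST slot at a fresh stack, in case this
		 * NMI interrupted a #VC handler that had not shifted it
		 * yet.
		 */
		if (sev_es_active())
			vc_ist_shift();		/* subtract the stride from the TSS #VC IST entry */

		/* ... normal NMI handling, which may itself raise #VC ... */

		if (sev_es_active())
			vc_ist_unshift();	/* add the stride back */
	}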
On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>
> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > I assume the race you mean is:
> >
> > #VC
> > Immediate NMI before IST gets shifted
> > #VC
> >
> > Kaboom.
> >
> > How are you dealing with this? Ultimately, I think that NMI will need
> > to turn off IST before engaging in any funny business. Let me ponder
> > this a bit.
>
> Right, I dealt with that by unconditionally shifting/unshifting the #VC
> IST entry in do_nmi() (thanks to David Kaplan for the idea). It might
> cause one of the IST stacks to be unused during nesting, but that is
> fine. The stack memory for #VC is only allocated when SEV-ES is active
> (in an SEV-ES VM).

Blech. It probably works, but still, yuck. It's a bit sad that we seem
to be growing more and more poorly designed happens-anywhere exception
types at an alarming rate. We seem to have #NM, #MC, #VC, #HV, and #DB.
This doesn't really scale.

--Andy
On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
> >
> > On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
> > > I assume the race you mean is:
> > >
> > > #VC
> > > Immediate NMI before IST gets shifted
> > > #VC
> > >
> > > Kaboom.
> > >
> > > How are you dealing with this? Ultimately, I think that NMI will need
> > > to turn off IST before engaging in any funny business. Let me ponder
> > > this a bit.
> >
> > Right, I dealt with that by unconditionally shifting/unshifting the #VC
> > IST entry in do_nmi() (thanks to David Kaplan for the idea). It might
> > cause one of the IST stacks to be unused during nesting, but that is
> > fine. The stack memory for #VC is only allocated when SEV-ES is active
> > (in an SEV-ES VM).
>
> Blech. It probably works, but still, yuck. It's a bit sad that we seem
> to be growing more and more poorly designed happens-anywhere exception
> types at an alarming rate. We seem to have #NM, #MC, #VC, #HV, and #DB.
> This doesn't really scale.

I have a somewhat serious question: should we use IST for #VC at all?
As I understand it, Rome and Naples make it mandatory for hypervisors
to intercept #DB, which means that, due to the MOV SS mess, it's sort
of mandatory to use IST for #VC. But Milan fixes the #DB issue, so, if
we're running under a sufficiently sensible hypervisor, we don't need
IST for #VC.

So I think we have two choices:

1. Use IST for #VC and deal with all the mess that entails.

2. Say that SEV-ES client support on Rome and Naples is for development
only and do a quick boot-time check for whether #DB is intercepted.
(Just set TF and see what vector we get.) If #DB is intercepted, print
a very loud warning and refuse to boot unless some special
sev_es.insecure_development_mode or similar option is set.

#2 results in simpler and more robust entry code. #1 is more secure.

So my question is: will anyone actually use SEV-ES in production on
Rome or Naples? As I understand it, it's not really ready for prime
time on those chips. And do we care if the combination of a malicious
hypervisor and malicious guest userspace on Milan can compromise the
guest kernel? I don't think SEV-ES is really meant to resist a
concerted effort by the hypervisor to compromise the guest.

--Andy
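The boot-time probe sketched in option #2 could look roughly like this.
All of the plumbing here is hypothetical: it assumes temporary early
IDT handlers that record which vector fired in probe_vector and clear
TF in the saved flags before returning.

	static int __init probe_db_intercept(void)
	{
		probe_vector = -1;	/* filled in by the temporary handlers */

		asm volatile("pushfq\n\t"
			     "orq  $0x100, (%%rsp)\n\t"	/* set EFLAGS.TF */
			     "popfq\n\t"
			     "nop"			/* single-step trap fires after this */
			     ::: "memory", "cc");

		/* X86_TRAP_DB if #DB reached us, 29 (#VC) if it was intercepted */
		return probe_vector;
	}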
On 27/04/2020 18:37, Andy Lutomirski wrote:
> On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>>> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
>>>> I assume the race you mean is:
>>>>
>>>> #VC
>>>> Immediate NMI before IST gets shifted
>>>> #VC
>>>>
>>>> Kaboom.
>>>>
>>>> How are you dealing with this? Ultimately, I think that NMI will need
>>>> to turn off IST before engaging in any funny business. Let me ponder
>>>> this a bit.
>>> Right, I dealt with that by unconditionally shifting/unshifting the #VC
>>> IST entry in do_nmi() (thanks to David Kaplan for the idea). It might
>>> cause one of the IST stacks to be unused during nesting, but that is
>>> fine. The stack memory for #VC is only allocated when SEV-ES is active
>>> (in an SEV-ES VM).
>> Blech. It probably works, but still, yuck. It's a bit sad that we seem
>> to be growing more and more poorly designed happens-anywhere exception
>> types at an alarming rate. We seem to have #NM, #MC, #VC, #HV, and #DB.
>> This doesn't really scale.
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC. But Milan fixes the #DB issue, so, if
> we're running under a sufficiently sensible hypervisor, we don't need
> IST for #VC.
>
> So I think we have two choices:
>
> 1. Use IST for #VC and deal with all the mess that entails.
>
> 2. Say that SEV-ES client support on Rome and Naples is for development
> only and do a quick boot-time check for whether #DB is intercepted.
> (Just set TF and see what vector we get.) If #DB is intercepted, print
> a very loud warning and refuse to boot unless some special
> sev_es.insecure_development_mode or similar option is set.
>
> #2 results in simpler and more robust entry code. #1 is more secure.
>
> So my question is: will anyone actually use SEV-ES in production on
> Rome or Naples? As I understand it, it's not really ready for prime
> time on those chips. And do we care if the combination of a malicious
> hypervisor and malicious guest userspace on Milan can compromise the
> guest kernel? I don't think SEV-ES is really meant to resist a
> concerted effort by the hypervisor to compromise the guest.

More specifically, it is mandatory for hypervisors to intercept #DB to
defend against CVE-2015-8104, unless they're willing to trust the guest
not to tickle that corner case.

This is believed fixed with SEV-SNP, to allow the encrypted guest to use
debugging functionality without posing a DoS risk to the host. In this
case, the hypervisor is expected not to intercept #DB.

If #DB is intercepted, and #VC doesn't use IST, malicious userspace can
cause problems with a movss-delayed breakpoint over SYSCALL.

The question is basically whether it is worth going to the effort of
making #VC IST, with all the problems that entails, to cover one corner
case on earlier hardware.

Ultimately, this depends on whether anyone plans to put SEV-ES into
production on pre-SEV-SNP hardware, and whether developers using
pre-SEV-SNP hardware are happy with "don't run malicious userspace" or
"don't run malicious kernels and skip the #DB intercept" as a fair
tradeoff to avoid the #VC IST fun.

~Andrew
On 4/27/20 12:37 PM, Andy Lutomirski wrote:
> On Sat, Apr 25, 2020 at 3:10 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Sat, Apr 25, 2020 at 1:23 PM Joerg Roedel <joro@8bytes.org> wrote:
>>>
>>> On Sat, Apr 25, 2020 at 12:47:31PM -0700, Andy Lutomirski wrote:
>>>> I assume the race you mean is:
>>>>
>>>> #VC
>>>> Immediate NMI before IST gets shifted
>>>> #VC
>>>>
>>>> Kaboom.
>>>>
>>>> How are you dealing with this? Ultimately, I think that NMI will need
>>>> to turn off IST before engaging in any funny business. Let me ponder
>>>> this a bit.
>>>
>>> Right, I dealt with that by unconditionally shifting/unshifting the #VC
>>> IST entry in do_nmi() (thanks to David Kaplan for the idea). It might
>>> cause one of the IST stacks to be unused during nesting, but that is
>>> fine. The stack memory for #VC is only allocated when SEV-ES is active
>>> (in an SEV-ES VM).
>>
>> Blech. It probably works, but still, yuck. It's a bit sad that we seem
>> to be growing more and more poorly designed happens-anywhere exception
>> types at an alarming rate. We seem to have #NM, #MC, #VC, #HV, and #DB.
>> This doesn't really scale.
>
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC. But Milan fixes the #DB issue, so, if
> we're running under a sufficiently sensible hypervisor, we don't need
> IST for #VC.
>
> So I think we have two choices:
>
> 1. Use IST for #VC and deal with all the mess that entails.
>
> 2. Say that SEV-ES client support on Rome and Naples is for development
> only and do a quick boot-time check for whether #DB is intercepted.
> (Just set TF and see what vector we get.) If #DB is intercepted, print
> a very loud warning and refuse to boot unless some special
> sev_es.insecure_development_mode or similar option is set.
>
> #2 results in simpler and more robust entry code. #1 is more secure.
>
> So my question is: will anyone actually use SEV-ES in production on
> Rome or Naples? As I understand it, it's not really ready for prime
> time on those chips. And do we care if the combination of a malicious

Naples was limited in the number of encryption keys available for
guests (15), but Rome increased that significantly (509). SEV-ES is
ready on those chips - Rome more so with the increased key count given
the requirement that SEV and SEV-ES guests have non-overlapping ASID
ranges (which corresponds to key usage).

Thanks,
Tom

> hypervisor and malicious guest userspace on Milan can compromise the
> guest kernel? I don't think SEV-ES is really meant to resist a
> concerted effort by the hypervisor to compromise the guest.
>
> --Andy
On 4/25/20 5:49 AM, Joerg Roedel wrote:
>> That's a fun point because it means that the (untrusted) hypervisor can
>> cause endless faults. I *guess* we have mitigation for this with our
>> stack guard pages, but it's still a bit nasty that the hypervisor can
>> arbitrarily land a guest in the double-fault handler.
>>
>> It just all seems a bit weak for the hypervisor to be considered
>> untrusted. But, it's _certainly_ a step in the right direction from SEV.
>
> Yeah, a malicious hypervisor can do bad things to an SEV-ES VM, but it
> can't easily steal its secrets from memory or registers. The #VC handler
> does its best to just crash the VM if unexpected hypervisor behavior is
> detected.

This is the kind of design information that would be very useful to
reviewers. Will some of this information make it into the cover letter
eventually? Or, Documentation/?

Also, for the security purists, an SEV-ES host is still trusted (in the
same TCB as the guest). Truly guest-untrusted VMMs won't be available
until SEV-SNP, right?
On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> I have a somewhat serious question: should we use IST for #VC at all?
> As I understand it, Rome and Naples make it mandatory for hypervisors
> to intercept #DB, which means that, due to the MOV SS mess, it's sort
> of mandatory to use IST for #VC. But Milan fixes the #DB issue, so,
> if we're running under a sufficiently sensible hypervisor, we don't
> need IST for #VC.

The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
page ownership tracking between guest and host, so that the hypervisor
can't remap guest pages without the guest noticing.

If there is a violation of ownership, which can happen at any memory
access, there will be a #VC exception to notify the guest. And as this
can happen anywhere, for example on a carefully crafted stack page set
by userspace before doing SYSCALL, the only robust choice for #VC is to
use IST.

Regards,

	Joerg
On 28/04/2020 08:55, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
>> I have a somewhat serious question: should we use IST for #VC at all?
>> As I understand it, Rome and Naples make it mandatory for hypervisors
>> to intercept #DB, which means that, due to the MOV SS mess, it's sort
>> of mandatory to use IST for #VC. But Milan fixes the #DB issue, so,
>> if we're running under a sufficiently sensible hypervisor, we don't
>> need IST for #VC.
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
>
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

The kernel won't ever touch the guest stack before restoring %rsp in the
syscall path, but the (minimum 2) memory accesses required to save the
user %rsp and load the kernel stack may be subject to #VC exceptions, as
are instruction fetches at the head of the SYSCALL path.

So yes - #VC needs IST. Sorry for the noise.

(That said, it is unfortunate that the hypervisor messing with the
memory backing the guest #VC handler results in an infinite loop, rather
than an ability to cleanly terminate.)

~Andrew
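For context, the top of the 64-bit SYSCALL path looks roughly like this
(heavily simplified from arch/x86/entry/entry_64.S). Under SNP, the
hypervisor yanking the pages backing any of these fetches or stores
would raise #VC before a trustworthy stack exists, which is the point
being made above:

	entry_SYSCALL_64:
		swapgs
		/* save the user RSP - a per-cpu memory write */
		movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
		/* load the kernel stack - a per-cpu memory read */
		movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
		/* only now is there a kernel stack to push onto */
		pushq	$__USER_DS
		...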
Hi Andy,
On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> 1. Use IST for #VC and deal with all the mess that entails.
With the removal of IST shifting, I wonder what you would suggest for
how best to implement an NMI-safe IST handler with nesting support.
My current plan is to implement an IST handler which switches itself off
the IST stack as soon as possible, freeing it for re-use.
The flow would be roughly like this upon entering the handler:

	build_pt_regs();

	RSP = pt_regs->sp;

	if (RSP in VC_IST_stack)
		error("disallowed nesting")

	if (RSP in current_kernel_stack)
		RSP = round_down_to_8(RSP)
	else
		RSP = current_top_of_stack() // non-IST kernel stack

	copy_pt_regs(pt_regs, RSP);
	switch_stack_to(RSP);
To make this NMI safe, the NMI handler needs some logic too. Upon
entering NMI, it needs to check the return RSP, and if it is in the #VC
IST stack, it must do the above flow by itself and update the return RSP
and RIP. It needs to take into account the case when PT_REGS is not
fully populated on the return side.
Alternatively the NMI handler could save/restore the contents of the #VC
IST stack or just switch to a special #VC-in-NMI IST stack.
All in all it could get complicated, and imho shift_ist would have been
simpler, but who am I anyway...
Or maybe you have a better idea how to implement this, so I'd like to
hear your opinion first before I spend too many days implementing
something.
Regards,
Joerg
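A rough sketch of the NMI-side fixup described above, with helper names
invented for illustration (this is not code from any posted patch):

	static void nmi_fixup_vc_ist(struct pt_regs *regs)
	{
		if (!on_vc_ist_stack(regs->sp))
			return;

		/*
		 * Replay the interrupted #VC handler's stack switch on
		 * its behalf, then retarget the interrupted context so
		 * it resumes on the new stack. Extra care is needed when
		 * pt_regs was only partially populated at the point of
		 * interruption.
		 */
		regs->sp = vc_copy_frame_off_ist(regs);
	}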
On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> Hi Andy,
>
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> > 1. Use IST for #VC and deal with all the mess that entails.
>
> With the removal of IST shifting, I wonder what you would suggest for
> how best to implement an NMI-safe IST handler with nesting support.
>
> My current plan is to implement an IST handler which switches itself off
> the IST stack as soon as possible, freeing it for re-use.
>
> The flow would be roughly like this upon entering the handler:
>
>	build_pt_regs();
>
>	RSP = pt_regs->sp;
>
>	if (RSP in VC_IST_stack)
>		error("disallowed nesting")
>
>	if (RSP in current_kernel_stack)
>		RSP = round_down_to_8(RSP)
>	else
>		RSP = current_top_of_stack() // non-IST kernel stack
>
>	copy_pt_regs(pt_regs, RSP);
>	switch_stack_to(RSP);
>
> To make this NMI safe, the NMI handler needs some logic too. Upon
> entering NMI, it needs to check the return RSP, and if it is in the #VC
> IST stack, it must do the above flow by itself and update the return RSP
> and RIP. It needs to take into account the case when PT_REGS is not
> fully populated on the return side.
>
> Alternatively the NMI handler could save/restore the contents of the #VC
> IST stack or just switch to a special #VC-in-NMI IST stack.
>
> All in all it could get complicated, and imho shift_ist would have been
> simpler, but who am I anyway...
>
> Or maybe you have a better idea how to implement this, so I'd like to
> hear your opinion first before I spend too many days implementing
> something.

OK, excuse my ignorance, but I'm not seeing how that IST shifting
nonsense would've helped in the first place.

If I understand correctly the problem is:

	<#VC>
	  shift IST
	  <NMI>
	    ... does stuff
	    <#VC> # again, safe because the shift

But what happens if you get the NMI before your IST adjustment?

	<#VC>
	  <NMI>
	    ... does stuff
	    <#VC> # again, happily wrecks your earlier #VC
	  shift IST # whoopsy, too late

Either way around we get to fix this up in NMI (and any other IST
exception that can happen while in #VC, hello #MC). And more complexity
there is the very last thing we need :-(

There's no way you can fix up the IDT without getting an NMI first.

This entire exception model is fundamentally buggered :-/
On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
> On Mon, Apr 27, 2020 at 10:37:41AM -0700, Andy Lutomirski wrote:
> > I have a somewhat serious question: should we use IST for #VC at all?
> > As I understand it, Rome and Naples make it mandatory for hypervisors
> > to intercept #DB, which means that, due to the MOV SS mess, it's sort
> > of mandatory to use IST for #VC. But Milan fixes the #DB issue, so,
> > if we're running under a sufficiently sensible hypervisor, we don't
> > need IST for #VC.
>
> The reason for #VC being IST is not only #DB, but also SEV-SNP. SNP adds
> page ownership tracking between guest and host, so that the hypervisor
> can't remap guest pages without the guest noticing.
>
> If there is a violation of ownership, which can happen at any memory
> access, there will be a #VC exception to notify the guest. And as this
> can happen anywhere, for example on a carefully crafted stack page set
> by userspace before doing SYSCALL, the only robust choice for #VC is to
> use IST.

So what happens if this #VC triggers on the first access to the #VC
stack, because the malicious host has craftily mucked with only the #VC
IST stack page?

Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
you up.

AFAICT all of that is non-recoverable.
Hi Peter,

On Tue, Jun 23, 2020 at 12:45:59PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> > Or maybe you have a better idea how to implement this, so I'd like to
> > hear your opinion first before I spend too many days implementing
> > something.
>
> OK, excuse my ignorance, but I'm not seeing how that IST shifting
> nonsense would've helped in the first place.
>
> If I understand correctly the problem is:
>
>	<#VC>
>	  shift IST
>	  <NMI>
>	    ... does stuff
>	    <#VC> # again, safe because the shift
>
> But what happens if you get the NMI before your IST adjustment?

The v3 patchset implements an unconditional shift of the #VC IST entry
in the NMI handler, before it can trigger a #VC exception.

> Either way around we get to fix this up in NMI (and any other IST
> exception that can happen while in #VC, hello #MC). And more complexity
> there is the very last thing we need :-(

Yes, in whatever way this gets implemented, it needs some fixup in the
NMI handler. But that can happen in C code, so it does not make the
assembly more complex, at least.

> There's no way you can fix up the IDT without getting an NMI first.

Not sure what you mean by this.

> This entire exception model is fundamentally buggered :-/

Regards,

	Joerg
On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:
> Hi Peter,
>
> On Tue, Jun 23, 2020 at 12:45:59PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 11:45:19AM +0200, Joerg Roedel wrote:
> > > Or maybe you have a better idea how to implement this, so I'd like to
> > > hear your opinion first before I spend too many days implementing
> > > something.
> >
> > OK, excuse my ignorance, but I'm not seeing how that IST shifting
> > nonsense would've helped in the first place.
> >
> > If I understand correctly the problem is:
> >
> >	<#VC>
> >	  shift IST
> >	  <NMI>
> >	    ... does stuff
> >	    <#VC> # again, safe because the shift
> >
> > But what happens if you get the NMI before your IST adjustment?
>
> The v3 patchset implements an unconditional shift of the #VC IST entry
> in the NMI handler, before it can trigger a #VC exception.

Going by that other thread -- where you said that any memory access can
trigger a #VC, there just isn't such a guarantee.

> > Either way around we get to fix this up in NMI (and any other IST
> > exception that can happen while in #VC, hello #MC). And more complexity
> > there is the very last thing we need :-(
>
> Yes, in whatever way this gets implemented, it needs some fixup in the
> NMI handler. But that can happen in C code, so it does not make the
> assembly more complex, at least.
>
> > There's no way you can fix up the IDT without getting an NMI first.
>
> Not sure what you mean by this.

I was talking about the case where #VC would try and fix up its own IST.
On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
> So what happens if this #VC triggers on the first access to the #VC
> stack, because the malicious host has craftily mucked with only the #VC
> IST stack page?
>
> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
> you up.
>
> AFAICT all of that is non-recoverable.

I am not 100% sure, but I think if the #VC stack page is not validated,
the #VC should be promoted to a #DF.

Note that this is an issue only with secure nested paging (SNP), which
is not enabled yet with this patch-set. When it gets enabled a stack
recursion check in the #VC handler is needed which panics the VM. That
also fixes the #VC-in-early-NMI problem.

Regards,

	Joerg
On Tue, Jun 23, 2020 at 01:14:43PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:
> > The v3 patchset implements an unconditional shift of the #VC IST entry
> > in the NMI handler, before it can trigger a #VC exception.
>
> Going by that other thread -- where you said that any memory access can
> trigger a #VC, there just isn't such a guarantee.

As I wrote in the other mail, this can only happen when SNP gets enabled
(which is follow-on work to this) and is handled by a stack recursion
check in the #VC handler.

The reason I mentioned the #VC-anywhere case is to make it more clear
why #VC needs an IST handler.

Regards,

	Joerg
On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

But you cannot do a recursion check in #VC, because the NMI can happen
on the first instruction of #VC, before we can increment our counter,
and then the #VC can happen on NMI because the IST stack is a goner, and
we're fscked again (or on a per-cpu variable we touch in our elaborate
NMI setup, etc..).

There is no way I can see SNP-#VC 'work'. The best I can come up with is
'mostly', but do you like your bridges/dikes/etc.. to be mostly ok? Or
do you want a guarantee they'll actually work?

I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
specific is utter crap.
On Tue, Jun 23, 2020 at 01:43:24PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:14:43PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 01:11:07PM +0200, Joerg Roedel wrote:
> >
> > > The v3 patchset implements an unconditional shift of the #VC IST entry
> > > in the NMI handler, before it can trigger a #VC exception.
> >
> > Going by that other thread -- where you said that any memory access can
> > trigger a #VC, there just isn't such a guarantee.
>
> As I wrote in the other mail, this can only happen when SNP gets enabled
> (which is follow-on work to this) and is handled by a stack recursion
> check in the #VC handler.
>
> The reason I mentioned the #VC-anywhere case is to make it more clear
> why #VC needs an IST handler.

If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
you to only make it IST if/when you try and make SNP happen, not before.
On 23/06/2020 12:30, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:07:06PM +0200, Peter Zijlstra wrote:
>> On Tue, Apr 28, 2020 at 09:55:12AM +0200, Joerg Roedel wrote:
>> So what happens if this #VC triggers on the first access to the #VC
>> stack, because the malicious host has craftily mucked with only the #VC
>> IST stack page?
>>
>> Or on the NMI IST stack, then we get #VC in NMI before the NMI can fix
>> you up.
>>
>> AFAICT all of that is non-recoverable.
> I am not 100% sure, but I think if the #VC stack page is not validated,
> the #VC should be promoted to a #DF.
>
> Note that this is an issue only with secure nested paging (SNP), which
> is not enabled yet with this patch-set. When it gets enabled a stack
> recursion check in the #VC handler is needed which panics the VM. That
> also fixes the #VC-in-early-NMI problem.

There are cases which are definitely non-recoverable.

For both ES and SNP, a malicious hypervisor can mess with the guest
physmap to make the NMI, #VC and #DF stacks all alias.

For ES, this had better result in the #DF handler deciding that crashing
is the way out, whereas for SNP, this had better escalate to Shutdown.

What matters here is the security model in SNP. The hypervisor is relied
upon for availability (because it could simply refuse to schedule the
VM), but market/business forces will cause it to do its best to keep the
VM running. Therefore, the security model is simply(?) that the
hypervisor can't do anything to undermine the confidentiality or
integrity of the VM.

Crashing out hard if the hypervisor is misbehaving is acceptable. In a
cloud, I as a customer would (threaten to?) take my credit card
elsewhere, while for enterprise, I'd shout at my virtualisation vendor
until a fix happened (also perhaps threaten to take my credit card
elsewhere). Therefore, it is reasonable to build on the expectation that
the hypervisor won't be messing around with remapping stacks/etc.

Most #VC's are synchronous with guest actions (they equate to actions
which would have caused a VMExit), so you can safely reason about when
the first #VC might occur, presuming no funny business with the frames
backing any memory operands touched.

~Andrew
On Tue, Jun 23, 2020 at 01:48:18PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:
> But you cannot do a recursion check in #VC, because the NMI can happen
> on the first instruction of #VC, before we can increment our counter,
> and then the #VC can happen on NMI because the IST stack is a goner, and
> we're fscked again (or on a per-cpu variable we touch in our elaborate
> NMI setup, etc..).

No, the recursion check is fine, because overwriting an already used IST
stack doesn't matter (as long as it can be detected) if we are going to
panic anyway. It doesn't matter because the kernel will not leave the
currently running handler anymore.

I agree there is no way to keep the system running if that happens, but
that is also not what is wanted. If stack recursion happens, something
malicious from the HV side is going on, and all the kernel needs to be
able to do is to safely and reliably detect the situation and panic the
VM to prevent any data corruption or loss or even leakage.

> I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
> specific is utter crap.

I agree, but don't forget the most prominent underlying reason for IST:
The SYSCALL gap. If SYSCALL would switch stacks most of those issues
would not exist. IST would still be needed because there are no task
gates in x86-64, but still...

Regards,

	Joerg
On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> you to only make it IST if/when you try and make SNP happen, not before.

It is not the only reason, when ES guests gain debug register support
then #VC also needs to be IST, because #DB can be promoted into #VC
then, and as #DB is IST for a reason, #VC needs to be too.

Besides that, I am not a fan of delegating problems I already see coming
to future-Joerg and future-Peter; better to deal with them now, if at
all possible, and be safe later.

Regards,

	Joerg
On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
> There are cases which are definitely non-recoverable.
>
> For both ES and SNP, a malicious hypervisor can mess with the guest
> physmap to make the NMI, #VC and #DF stacks all alias.
>
> For ES, this had better result in the #DF handler deciding that crashing
> is the way out, whereas for SNP, this had better escalate to Shutdown.

> Crashing out hard if the hypervisor is misbehaving is acceptable.

Then I'm thinking the only sensible option is to crash hard for any SNP
#VC from kernel mode.

Sadly that doesn't help with #VC needing to be IST :-( IST is such a
frigging nightmare.
On Tue, Jun 23, 2020 at 02:04:33PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:48:18PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 01:30:07PM +0200, Joerg Roedel wrote:
>
> > But you cannot do a recursion check in #VC, because the NMI can happen
> > on the first instruction of #VC, before we can increment our counter,
> > and then the #VC can happen on NMI because the IST stack is a goner, and
> > we're fscked again (or on a per-cpu variable we touch in our elaborate
> > NMI setup, etc..).
>
> No, the recursion check is fine, because overwriting an already used IST
> stack doesn't matter (as long as it can be detected) if we are going to
> panic anyway. It doesn't matter because the kernel will not leave the
> currently running handler anymore.

You only have that guarantee when any SNP #VC from kernel is an
automatic panic. But in that case, what's the point of having the
recursion count?

> > I'll keep repeating this, x86_64 exceptions are a trainwreck, and IST in
> > specific is utter crap.
>
> I agree, but don't forget the most prominent underlying reason for IST:
> The SYSCALL gap. If SYSCALL would switch stacks most of those issues
> would not exist. IST would still be needed because there are no task
> gates in x86-64, but still...

We could all go back to int80 ;-)

/me runs like heck
On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> > If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> > you to only make it IST if/when you try and make SNP happen, not before.
>
> It is not the only reason, when ES guests gain debug register support
> then #VC also needs to be IST, because #DB can be promoted into #VC
> then, and as #DB is IST for a reason, #VC needs to be too.

Didn't I read somewhere that that is only so for Rome/Naples but not for
the later chips (Milan) which have #DB pass-through?

> Besides that, I am not a fan of delegating problems I already see coming
> to future-Joerg and future-Peter; better to deal with them now, if at
> all possible, and be safe later.

Well, we could just say no :-) At some point in the very near future
this house of cards is going to implode.

We're talking about the 3rd case where the only reason things 'work' is
because we'll have to panic():

 - #MC
 - #DB with BUS LOCK DEBUG EXCEPTION
 - #VC SNP

(and it ain't a happy accident they're all IST)

Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
folks? How come we're getting more and more of them? (/me puts fingers
in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)
On Tue, Jun 23, 2020 at 02:52:01PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:04:33PM +0200, Joerg Roedel wrote:
> > No, the recursion check is fine, because overwriting an already used IST
> > stack doesn't matter (as long as it can be detected) if we are going to
> > panic anyway. It doesn't matter because the kernel will not leave the
> > currently running handler anymore.
>
> You only have that guarantee when any SNP #VC from kernel is an
> automatic panic. But in that case, what's the point of having the
> recursion count?

It is not a recursion count, it is a stack-recursion check. Basically
walk down the stack and look if your current stack is already in use.
Yes, this can be optimized, but that is what is needed.

IIRC the current prototype code for SNP just pre-validates all memory in
the VM and doesn't support moving pages around on the host. So any #VC
SNP exception would be fatal, yes.

In a scenario with on-demand validation of guest pages and support for
guest-assisted page-moving on the HV side it would be more complicated.
Basically all memory that is accessed during #VC exception handling must
stay validated at all times, including the IST stack.

So saying this, I don't understand why _all_ SNP #VC exceptions from
kernel space must be fatal?

Regards,

	Joerg
On 23/06/2020 13:47, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
>
>> There are cases which are definitely non-recoverable.
>>
>> For both ES and SNP, a malicious hypervisor can mess with the guest
>> physmap to make the NMI, #VC and #DF stacks all alias.
>>
>> For ES, this had better result in the #DF handler deciding that crashing
>> is the way out, whereas for SNP, this had better escalate to Shutdown.
>> Crashing out hard if the hypervisor is misbehaving is acceptable.
> Then I'm thinking the only sensible option is to crash hard for any SNP
> #VC from kernel mode.
>
> Sadly that doesn't help with #VC needing to be IST :-( IST is such a
> frigging nightmare.

I presume you mean any #VC caused by RMP faults (i.e. something went
wrong with the memory owner/etc metadata)?

If so, then yes. Any failure here is a bug in the kernel or hypervisor
(and needs fixing) or a malicious hypervisor, and the guest should
terminate for its own safety.

~Andrew
On Tue, Jun 23, 2020 at 03:40:03PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 02:52:01PM +0200, Peter Zijlstra wrote:
> > You only have that guarantee when any SNP #VC from kernel is an
> > automatic panic. But in that case, what's the point of having the
> > recursion count?
>
> It is not a recursion count, it is a stack-recursion check. Basically
> walk down the stack and look if your current stack is already in use.
> Yes, this can be optimized, but that is what is needed.
>
> IIRC the current prototype code for SNP just pre-validates all memory in
> the VM and doesn't support moving pages around on the host. So any #VC
> SNP exception would be fatal, yes.
>
> In a scenario with on-demand validation of guest pages and support for
> guest-assisted page-moving on the HV side it would be more complicated.
> Basically all memory that is accessed during #VC exception handling must
> stay validated at all times, including the IST stack.
>
> So saying this, I don't understand why _all_ SNP #VC exceptions from
> kernel space must be fatal?

Ah, because I hadn't thought of the stack-recursion check.

So basically when your exception frame points to your own IST, you die.
That sounds like something we should have in generic IST code.
On Tue, Jun 23, 2020 at 03:03:22PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> > On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> > > If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> > > you to only make it IST if/when you try and make SNP happen, not before.
> >
> > It is not the only reason, when ES guests gain debug register support
> > then #VC also needs to be IST, because #DB can be promoted into #VC
> > then, and as #DB is IST for a reason, #VC needs to be too.
>
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

Probably, not sure which chips will get debug register virtualization
under SEV-ES. But even when it is supported, the HV can (and sometimes
will) intercept #DB, which then causes it to be promoted to #VC.

> We're talking about the 3rd case where the only reason things 'work' is
> because we'll have to panic():
>
> - #MC

Okay, #MC is special and can only be handled on a best-effort basis, as
#MC could happen anytime, also while already executing the #MC handler.

> - #DB with BUS LOCK DEBUG EXCEPTION

If I understand the problem correctly, this can be solved by moving off
the IST stack to the current task stack in the #DB handler, like I plan
to do for #VC, no?

> - #VC SNP

This has to panic for other reasons that can't be worked around. It
boils down to detecting that the HV is doing something fishy and bailing
out to avoid further harm (like in the #MC handler).

Regards,

	Joerg
On Tue, Jun 23, 2020 at 03:59:16PM +0200, Peter Zijlstra wrote:
> So basically when your exception frame points to your own IST, you die.
> That sounds like something we should have in generic IST code.

Something like this... #DF already dies and NMI is 'magic'

---
 arch/x86/entry/common.c         |  7 +++++++
 arch/x86/include/asm/idtentry.h | 12 +++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index af0d57ed5e69..e38e4f34c90c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -742,6 +742,13 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
 	__nmi_exit();
 }
 
+noinstr void idtentry_validate_ist(struct pt_regs *regs)
+{
+	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
+	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
+		die("IST stack recursion", regs, 0);
+}
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 4e399f120ff8..974c1a4eacbb 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -19,6 +19,8 @@ void idtentry_exit_cond_rcu(struct pt_regs *regs, bool rcu_exit);
 bool idtentry_enter_nmi(struct pt_regs *regs);
 void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
 
+void idtentry_validate_ist(struct pt_regs *regs);
+
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *		      No error code pushed by hardware
@@ -322,7 +324,15 @@ static __always_inline void __##func(struct pt_regs *regs)
  * Maps to DEFINE_IDTENTRY_RAW
  */
 #define DEFINE_IDTENTRY_IST(func)					\
-	DEFINE_IDTENTRY_RAW(func)
+static __always_inline void __##func(struct pt_regs *regs);		\
+									\
+__visible noinstr void func(struct pt_regs *regs)			\
+{									\
+	idtentry_validate_ist(regs);					\
+	__##func(regs);							\
+}									\
+									\
+static __always_inline void __##func(struct pt_regs *regs)
 
 /**
  * DEFINE_IDTENTRY_NOIST - Emit code for NOIST entry points which
On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
> +noinstr void idtentry_validate_ist(struct pt_regs *regs)
> +{
> +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
> +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
> +		die("IST stack recursion", regs, 0);
> +}

Yes, this is a start, it doesn't cover the case where the NMI stack is
in-between, so I think you need to walk down regs->sp too. The dumpstack
code already has some logic for this.

	Joerg
On Tue, Jun 23, 2020 at 04:49:40PM +0200, Joerg Roedel wrote:
> > We're talking about the 3rd case where the only reason things 'work' is
> > because we'll have to panic():
> >
> > - #MC
>
> Okay, #MC is special and can only be handled on a best-effort basis, as
> #MC could happen anytime, also while already executing the #MC handler.

I think the hardware has a MCE-mask bit somewhere. Flaky though because
clearing it isn't 'atomic' with IRET, so there's a 'funny' window.

It also interacts really bad with the NMI handler. If we get an #MC
early in the NMI, where we hard-rely on the NMI-mask being set to set-up
the recursion stack, then the #MC IRET will clear the NMI-mask, and
we're toast.

Andy has wild and crazy ideas, but I don't think we need more crazy
here.

#VC SNP has a similar problem vs NMI, that needs to die() irrespective
of the #VC IST recursion.

> > - #DB with BUS LOCK DEBUG EXCEPTION
>
> If I understand the problem correctly, this can be solved by moving off
> the IST stack to the current task stack in the #DB handler, like I plan
> to do for #VC, no?

Hmm, probably. Would take a bit of care, but should be doable.

> > - #VC SNP
>
> This has to panic for other reasons that can't be worked around. It
> boils down to detecting that the HV is doing something fishy and bailing
> out to avoid further harm (like in the #MC handler).

Right, but it doesn't take away that IST-any-time vectors are
fundamentally screwy. Both the MCE and NMI have masks that are, as per
the above, differently funny, but the other ISTs do not. Also, even if
they had masks, the interaction between them is still screwy.

#VC would've been so much better if it would've had a mask bit
somewhere, then at least we could've had the exception entry covered.
Another #VC with the mask set should probably result in #DF or Shutdown,
but that's all water under the bridge I suspect.
On 23/06/2020 14:03, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>> you to only make it IST if/when you try and make SNP happen, not before.
>> It is not the only reason, when ES guests gain debug register support
>> then #VC also needs to be IST, because #DB can be promoted into #VC
>> then, and as #DB is IST for a reason, #VC needs to be too.
> Didn't I read somewhere that that is only so for Rome/Naples but not for
> the later chips (Milan) which have #DB pass-through?

I don't know about hardware timelines, but some future part can now opt
in to having debug registers as part of the encrypted state, and swapped
by VMExit, which would make debug facilities generally usable, and
supposedly safe to the #DB infinite loop issues, at which point the
hypervisor need not intercept #DB.

It's worth noting that on current parts, the hypervisor can set up debug
facilities on behalf of the guest (or behind its back) as the DR state
is unencrypted, but that attempting to intercept #DB will redirect to
#VC inside the guest and cause fun.

(Also spare a thought for 32bit kernels which have to cope with
userspace singlestepping the SYSENTER path with every #DB turning into
#VC.)

>> Besides that, I am not a fan of delegating problems I already see coming
>> to future-Joerg and future-Peter; better to deal with them now, if at
>> all possible, and be safe later.
> Well, we could just say no :-) At some point in the very near future
> this house of cards is going to implode.

What currently exists is a picture of a house of cards in front of
something which has fallen down.

> Did someone forget to pass the 'ISTs are *EVIL*' memo to the hardware
> folks? How come we're getting more and more of them?

I have tried to get this point across. Then again - it's far easier for
the software folk in the same company as the hardware folk to make this
point.

> (/me puts fingers
> in ears and goes la-la-la-la in anticipation of Andrew mentioning CET)

I wasn't going to bring it up, but seeing as you have - while there are
prohibitively-complicating issues preventing it from working on native,
I don't see any point even considering it for the mess which is #VC, or
the even bigger mess which is #HV.

~Andrew
On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
> > +noinstr void idtentry_validate_ist(struct pt_regs *regs)
> > +{
> > +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
> > +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
> > +		die("IST stack recursion", regs, 0);
> > +}
>
> Yes, this is a start, it doesn't cover the case where the NMI stack is
> in-between, so I think you need to walk down regs->sp too.

That shouldn't be possible with the current code, I think.

> The dumpstack code already has some logic for this.

Reliability of that depends on the unwinder, I wouldn't want the guess
unwinder to OOPS me by accident.
On 23/06/2020 16:16, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:49:40PM +0200, Joerg Roedel wrote:
>>> We're talking about the 3rd case where the only reason things 'work' is
>>> because we'll have to panic():
>>>
>>> - #MC
>> Okay, #MC is special and can only be handled on a best-effort basis, as
>> #MC could happen anytime, also while already executing the #MC handler.
> I think the hardware has a MCE-mask bit somewhere. Flaky though because
> clearing it isn't 'atomic' with IRET, so there's a 'funny' window.

MSR_MCG_STATUS.MCIP, and yes - any #MC before that point will
immediately Shutdown. Any #MC between that point and IRET will clobber
its IST stack and end up sad.

> It also interacts really bad with the NMI handler. If we get an #MC
> early in the NMI, where we hard-rely on the NMI-mask being set to set-up
> the recursion stack, then the #MC IRET will clear the NMI-mask, and
> we're toast.
>
> Andy has wild and crazy ideas, but I don't think we need more crazy
> here.

Want, certainly not. Need, maybe :-/

>>> - #DB with BUS LOCK DEBUG EXCEPTION
>> If I understand the problem correctly, this can be solved by moving off
>> the IST stack to the current task stack in the #DB handler, like I plan
>> to do for #VC, no?
> Hmm, probably. Would take a bit of care, but should be doable.

Andy and I have spent some time considering this.

Having exactly one vector move off IST and onto an in-use task-stack is
doable, I think, so long as it can sort out self-recursion protections.

Having more than one vector do this is very complicated. You've got to
take care to walk up the list of IST-nesting to see if any interrupted
context is in the middle of trying to copy themselves onto the stack, so
you don't clobber the frame they're in the middle of copying.

~Andrew
On Tue, Jun 23, 2020 at 05:23:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
> > On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
> > > +noinstr void idtentry_validate_ist(struct pt_regs *regs)
> > > +{
> > > +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
> > > +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
> > > +		die("IST stack recursion", regs, 0);
> > > +}
> >
> > Yes, this is a start, it doesn't cover the case where the NMI stack is
> > in-between, so I think you need to walk down regs->sp too.
>
> That shouldn't be possible with the current code, I think.

To clarify, we have: NMI, MCE, DB and DF.

DF (with the exception of ESPFIX) is fatal.

MCE from kernel is fatal (which is what makes the MCE in NMI 'work')

NMI and DB clear DR7, which avoids DB in NMI.

So that leaves: NMI in DB, and that works.
On Tue, Jun 23, 2020 at 05:23:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
> > On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
> > > +noinstr void idtentry_validate_ist(struct pt_regs *regs)
> > > +{
> > > +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
> > > +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
> > > +		die("IST stack recursion", regs, 0);
> > > +}
> >
> > Yes, this is a start, it doesn't cover the case where the NMI stack is
> > in-between, so I think you need to walk down regs->sp too.
>
> That shouldn't be possible with the current code, I think.

Not with the current code, but possibly with SNP #VC exceptions:

	-> First #VC

	-> NMI before the #VC handler has switched off its IST stack
	   (now on the NMI IST stack)

	-> Second SNP #VC exception before the NMI handler did the #VC
	   stack check (because the HV messed around with some pages
	   touched there).

In the second #VC you use the same IST stack as in the first #VC, but
with the NMI stack in-between.

> Reliability of that depends on the unwinder, I wouldn't want the guess
> unwinder to OOPS me by accident.

It doesn't use the full unwinder, it just assumes that there is a
pt_regs struct at the top of every kernel stack and walks through them
until SP points to a user-space stack.

As long as the assumption that there is a pt_regs struct on top of every
stack holds, this should be safe. The assumption might be wrong when an
exception happens during SYSCALL/SYSENTER entry, when the return frame
is not written by hardware.

	Joerg
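A sketch of that walk, with invented helpers; the real dumpstack code
does something similar when it classifies stacks:

	/*
	 * Hypothetical stack-recursion check: hop from stack to stack
	 * via the pt_regs assumed to sit at the top of each kernel
	 * stack.
	 */
	static bool vc_ist_stack_in_use(struct pt_regs *regs)
	{
		unsigned long sp = regs->sp;

		while (!on_userspace_stack(sp)) {
			if (on_vc_ist_stack(sp))
				return true;	/* our IST stack is already in use */

			/*
			 * The pt_regs at the top of this stack describes
			 * the context it interrupted; continue the walk
			 * there.
			 */
			sp = regs_at_stack_top(sp)->sp;
		}

		return false;
	}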
On 23/06/2020 16:23, Peter Zijlstra wrote:
> On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
>> On Tue, Jun 23, 2020 at 04:53:44PM +0200, Peter Zijlstra wrote:
>>> +noinstr void idtentry_validate_ist(struct pt_regs *regs)
>>> +{
>>> +	if ((regs->sp & ~(EXCEPTION_STKSZ-1)) ==
>>> +	    (_RET_IP_ & ~(EXCEPTION_STKSZ-1)))
>>> +		die("IST stack recursion", regs, 0);
>>> +}
>> Yes, this is a start, it doesn't cover the case where the NMI stack is
>> in-between, so I think you need to walk down regs->sp too.
> That shouldn't be possible with the current code, I think.

NMI; #MC; anything which IRETs but isn't fatal - #DB, or #BP from
patching, #GP from *_safe(), etc; NMI.

Sure it's a corner case, but did you hear that IST is evil?

~Andrew

P.S. did you also hear that with Rowhammer, userspace has a nonzero
quantity of control over generating #MC, depending on how ECC is
configured on the platform.
On Tue, Jun 23, 2020 at 12:51:03PM +0100, Andrew Cooper wrote:
> Crashing out hard if the hypervisor is misbehaving is acceptable. In a
> cloud, I as a customer would (threaten to?) take my credit card
> elsewhere, while for enterprise, I'd shout at my virtualisation vendor
> until a fix happened (also perhaps threaten to take my credit card
> elsewhere).

This is called customer credit-card-enforced bug fixing.
On Tue, Jun 23, 2020 at 04:39:26PM +0100, Andrew Cooper wrote:
> On 23/06/2020 16:23, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 04:59:14PM +0200, Joerg Roedel wrote:
> >> Yes, this is a start, it doesn't cover the case where the NMI stack is
> >> in-between, so I think you need to walk down regs->sp too.
> > That shouldn't be possible with the current code, I think.
>
> NMI; #MC; anything which IRETs but isn't fatal - #DB, or #BP from
> patching, #GP from *_safe(), etc; NMI
>
> Sure, it's a corner case, but did you hear that IST is evil?

Isn't current #MC unconditionally fatal from kernel? But yes, I was sorta aware people want that changed.

And yes, NMI can recurse, mostly on #BP and #PF. Like I wrote, it's broken vs. #MC.

But Joerg was talking about IST recursion with NMI in the middle, something like: #DB, NMI, #DB, and not already being fatal. This one in particular is ruled out by #DB itself clearing DR7 (but NMI would also do that).

> P.S. did you also hear that with Rowhammer, userspace has a nonzero
> quantity of control over generating #MC, depending on how ECC is
> configured on the platform.

Yes, excellent stuff.
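For reference, a sketch of the DR7 trick mentioned here, roughly the idea behind the kernel's local_db_save()/local_db_restore() pair (the names and shape below are illustrative, not the actual kernel code): an IST handler that clears DR7 on entry cannot take a hardware-breakpoint #DB while it runs.

static unsigned long dr7_save_and_clear(void)
{
	unsigned long dr7;

	/* read DR7, then disable all hardware breakpoints */
	asm volatile("mov %%dr7, %0" : "=r" (dr7));
	if (dr7)
		asm volatile("mov %0, %%dr7" : : "r" (0UL));
	return dr7;
}

static void dr7_restore(unsigned long dr7)
{
	if (dr7)
		asm volatile("mov %0, %%dr7" : : "r" (dr7));
}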
On Tue, Jun 23, 2020 at 05:38:55PM +0200, Joerg Roedel wrote:
> On Tue, Jun 23, 2020 at 05:23:26PM +0200, Peter Zijlstra wrote:
> > Reliability of that depends on the unwinder; I wouldn't want the guess
> > unwinder to OOPS me by accident.
>
> It doesn't use the full unwinder; it just assumes that there is a
> pt_regs struct at the top of every kernel stack and walks through them
> until SP points to a user-space stack.

The IRQ and SoftIRQ stacks don't have that, I think. Only the task and exception stacks.
On 6/23/20 8:52 AM, Peter Zijlstra wrote: > Isn't current #MC unconditionally fatal from kernel? But yes, I was > sorta aware people want that changed. Not unconditionally. copy_to_iter_mcsafe() is a good example of one thing we _can_ handle.
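As a sketch of that recoverable-#MC pattern, using the API name from around this time (it has since been renamed): memcpy_mcsafe() returns the number of bytes left uncopied when a machine check is consumed mid-copy, so a caller can return an error instead of taking the machine down.

static int read_possibly_poisoned(void *dst, const void *src, size_t len)
{
	/* returns 0 on success, or bytes not copied on a consumed #MC */
	unsigned long rem = memcpy_mcsafe(dst, src, len);

	return rem ? -EIO : 0;
}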
On Tue, Jun 23, 2020 at 04:32:22PM +0100, Andrew Cooper wrote:
> MSR_MCG_STATUS.MCIP, and yes - any #MC before that point will
> immediately Shutdown. Any #MC between that point and IRET will clobber
> its IST stack and end up sad.

Well, at some point we should simply accept that we're living a little on the edge. That is, until we get an IRET that takes a mask of vectors to re-enable: #MC, NMI and whatever else. It would be even better if that mask were configurable.
On Tue, Jun 23, 2020 at 09:03:56AM -0700, Dave Hansen wrote:
> On 6/23/20 8:52 AM, Peter Zijlstra wrote:
> > Isn't current #MC unconditionally fatal from kernel? But yes, I was
> > sorta aware people want that changed.
>
> Not unconditionally. copy_to_iter_mcsafe() is a good example of one
> thing we _can_ handle.

Urgh, I thought that stuff was still pending.

Anyway, the important thing is that it is fatal if we hit it early in the NMI handler. Which I think still holds.
On Tue, Jun 23, 2020 at 04:39:26PM +0100, Andrew Cooper wrote:
> P.S. did you also hear that with Rowhammer, userspace has a nonzero
> quantity of control over generating #MC, depending on how ECC is
> configured on the platform.

Where does that #MC point to? Can it control which address the bits get flipped for, i.e., make the #MC appear to have been generated for an address in kernel space?
On Tue, Jun 23, 2020 at 8:23 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 23/06/2020 14:03, Peter Zijlstra wrote:
> > On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
> >> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
> >>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
> >>> you to only make it IST if/when you try and make SNP happen, not before.
> >> It is not the only reason, when ES guests gain debug register support
> >> then #VC also needs to be IST, because #DB can be promoted into #VC
> >> then, and as #DB is IST for a reason, #VC needs to be too.
> > Didn't I read somewhere that that is only so for Rome/Naples but not for
> > the later chips (Milan) which have #DB pass-through?
>
> I don't know about hardware timelines, but some future part can now opt
> in to having debug registers as part of the encrypted state, and swapped
> by VMExit, which would make debug facilities generally usable, and
> supposedly safe from the #DB infinite loop issues, at which point the
> hypervisor need not intercept #DB for safety reasons.
>
> It's worth noting that on current parts, the hypervisor can set up debug
> facilities on behalf of the guest (or behind its back) as the DR state
> is unencrypted, but that attempting to intercept #DB will redirect to
> #VC inside the guest and cause fun. (Also spare a thought for 32bit
> kernels which have to cope with userspace singlestepping the SYSENTER
> path with every #DB turning into #VC.)

What do you mean 32-bit? 64-bit kernels have exactly the same problem. At least the stack is okay, though.

Anyway, since I'm way behind on this thread, here are some thoughts:

First, I plan to implement actual precise recursion detection for the IST stacks. We'll be able to reliably panic when disallowed recursion happens.

Second, I don't object *that* strongly to switching to a second #VC stack if an NMI or MCE happens, but we really need to make sure we cover *all* the bases. And #VC is distressingly close to "happens at all kinds of unfortunate times and the guest doesn't actually have much ability to predict it" right now. So we have #VC + #DB + #VC, #VC + NMI + #VC, #VC + MCE + #VC, and even worse options. So doing the shift reliably is not necessarily possible in a clean way.

Let me contemplate. And maybe produce some code soon.
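One way to realize the precise recursion detection Andy mentions, as an illustrative sketch only (the per-CPU counter is a made-up name, and a real implementation would track state per IST vector rather than one global depth):

static DEFINE_PER_CPU(unsigned int, ist_depth);	/* hypothetical */

noinstr void ist_enter_check(void)
{
	/* re-entry means a second exception landed on a live IST stack */
	if (this_cpu_inc_return(ist_depth) > 1)
		panic("IST stack recursion");
}

noinstr void ist_exit_check(void)
{
	this_cpu_dec(ist_depth);
}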
On 23/06/2020 19:26, Andy Lutomirski wrote:
> On Tue, Jun 23, 2020 at 8:23 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 23/06/2020 14:03, Peter Zijlstra wrote:
>>> On Tue, Jun 23, 2020 at 02:12:37PM +0200, Joerg Roedel wrote:
>>>> On Tue, Jun 23, 2020 at 01:50:14PM +0200, Peter Zijlstra wrote:
>>>>> If SNP is the sole reason #VC needs to be IST, then I'd strongly urge
>>>>> you to only make it IST if/when you try and make SNP happen, not before.
>>>> It is not the only reason, when ES guests gain debug register support
>>>> then #VC also needs to be IST, because #DB can be promoted into #VC
>>>> then, and as #DB is IST for a reason, #VC needs to be too.
>>> Didn't I read somewhere that that is only so for Rome/Naples but not for
>>> the later chips (Milan) which have #DB pass-through?
>> I don't know about hardware timelines, but some future part can now opt
>> in to having debug registers as part of the encrypted state, and swapped
>> by VMExit, which would make debug facilities generally usable, and
>> supposedly safe from the #DB infinite loop issues, at which point the
>> hypervisor need not intercept #DB for safety reasons.
>>
>> It's worth noting that on current parts, the hypervisor can set up debug
>> facilities on behalf of the guest (or behind its back) as the DR state
>> is unencrypted, but that attempting to intercept #DB will redirect to
>> #VC inside the guest and cause fun. (Also spare a thought for 32bit
>> kernels which have to cope with userspace singlestepping the SYSENTER
>> path with every #DB turning into #VC.)
> What do you mean 32-bit? 64-bit kernels have exactly the same
> problem. At least the stack is okay, though.

:)

AMD-like CPUs disallow SYSENTER/SYSEXIT in Long Mode, and raise #UD, even from a compatibility mode segment. 64bit kernels only have this problem on Intel-like CPUs.

(It is a massive shame that between everyone's attempts, there are 0 "fast system call" instructions with sane semantics, but it is several decades too late to fix this problem...)

~Andrew
diff --git a/arch/x86/boot/compressed/sev-es.c b/arch/x86/boot/compressed/sev-es.c index 53c65fc09341..1d0290cc46c1 100644 --- a/arch/x86/boot/compressed/sev-es.c +++ b/arch/x86/boot/compressed/sev-es.c @@ -158,6 +158,10 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long exit_code) case SVM_EXIT_CPUID: result = vc_handle_cpuid(boot_ghcb, &ctxt); break; + case SVM_EXIT_RDTSC: + case SVM_EXIT_RDTSCP: + result = vc_handle_rdtsc(boot_ghcb, &ctxt, exit_code); + break; default: result = ES_UNSUPPORTED; break; diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c index a632b8f041ec..373ced468659 100644 --- a/arch/x86/kernel/sev-es-shared.c +++ b/arch/x86/kernel/sev-es-shared.c @@ -442,3 +442,26 @@ static enum es_result vc_handle_cpuid(struct ghcb *ghcb, return ES_OK; } + +static enum es_result vc_handle_rdtsc(struct ghcb *ghcb, + struct es_em_ctxt *ctxt, + unsigned long exit_code) +{ + bool rdtscp = (exit_code == SVM_EXIT_RDTSCP); + enum es_result ret; + + ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0); + if (ret != ES_OK) + return ret; + + if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) && + (!rdtscp || ghcb_is_valid_rcx(ghcb)))) + return ES_VMM_ERROR; + + ctxt->regs->ax = ghcb->save.rax; + ctxt->regs->dx = ghcb->save.rdx; + if (rdtscp) + ctxt->regs->cx = ghcb->save.rcx; + + return ES_OK; +} diff --git a/arch/x86/kernel/sev-es.c b/arch/x86/kernel/sev-es.c index 409a7a2aa630..82199527d012 100644 --- a/arch/x86/kernel/sev-es.c +++ b/arch/x86/kernel/sev-es.c @@ -815,29 +815,6 @@ static enum es_result vc_handle_wbinvd(struct ghcb *ghcb, return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_WBINVD, 0, 0); } -static enum es_result vc_handle_rdtsc(struct ghcb *ghcb, - struct es_em_ctxt *ctxt, - unsigned long exit_code) -{ - bool rdtscp = (exit_code == SVM_EXIT_RDTSCP); - enum es_result ret; - - ret = sev_es_ghcb_hv_call(ghcb, ctxt, exit_code, 0, 0); - if (ret != ES_OK) - return ret; - - if (!(ghcb_is_valid_rax(ghcb) && ghcb_is_valid_rdx(ghcb) && - (!rdtscp || ghcb_is_valid_rcx(ghcb)))) - return ES_VMM_ERROR; - - ctxt->regs->ax = ghcb->save.rax; - ctxt->regs->dx = ghcb->save.rdx; - if (rdtscp) - ctxt->regs->cx = ghcb->save.rcx; - - return ES_OK; -} - static enum es_result vc_handle_rdpmc(struct ghcb *ghcb, struct es_em_ctxt *ctxt) { enum es_result ret; @@ -1001,6 +978,8 @@ static enum es_result vc_context_filter(struct pt_regs *regs, long exit_code) /* List of #VC exit-codes we support in user-space */ case SVM_EXIT_EXCP_BASE ... SVM_EXIT_LAST_EXCP: case SVM_EXIT_CPUID: + case SVM_EXIT_RDTSC: + case SVM_EXIT_RDTSCP: r = ES_OK; break; default:
Hi Joerg, I needed to allow RDTSC(P) from userspace and in early boot in order to get userspace started properly. Patch below. --- SEV-ES guests will need to execute rdtsc and rdtscp from userspace and during early boot. Move the rdtsc(p) #VC handler into common code and extend the #VC handlers. Signed-off-by: Mike Stunes <mstunes@vmware.com> --- arch/x86/boot/compressed/sev-es.c | 4 ++++ arch/x86/kernel/sev-es-shared.c | 23 +++++++++++++++++++++++ arch/x86/kernel/sev-es.c | 25 ++----------------------- 3 files changed, 29 insertions(+), 23 deletions(-)
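For completeness, a small userspace demo of what this patch enables in an SEV-ES guest: when the hypervisor intercepts RDTSC/RDTSCP, each instruction below raises #VC and gets emulated through the GHCB by the handlers added above. Plain C, nothing kernel-specific:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtscp(uint32_t *aux)
{
	uint32_t lo, hi;

	asm volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (*aux));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint32_t aux;
	uint64_t t1 = rdtsc();
	uint64_t t2 = rdtscp(&aux);

	printf("rdtsc=%llu rdtscp=%llu aux=%u\n",
	       (unsigned long long)t1, (unsigned long long)t2, aux);
	return 0;
}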