Message ID | 90f87aa8-09da-1453-bd82-c722465c2881@suse.com (mailing list archive) |
---|---|
State | Superseded |
Series | x86: guest MSR access handling tweaks |
On Fri, Mar 12, 2021 at 08:54:46AM +0100, Jan Beulich wrote:
> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
> handler early enough to cover for example the rdmsrl_safe() of
> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to speak of the unguarded read
> of MSR_K7_HWCR later in the same function). The respective change
> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
> backported to 4.14, but no further - presumably since it wasn't really
> easy because of other dependencies.
>
> Therefore, to prevent our change in the handling of guest MSR accesses
> from rendering PV Linux 4.13 and older unusable on at least AMD
> systems, make the raising of #GP on these paths conditional upon the
> guest having installed a handler, provided of course the MSR can be
> read in the first place (we would have raised #GP in that case even
> before). Producing zero for reads isn't necessarily correct and may
> trip code trying to detect presence of MSRs early, but since such
> detection logic won't work without a #GP handler anyway, this ought to
> be a fair workaround.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>

I think the approach is fine:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Some comments below.

> ---
> (projected v4: re-base over Roger's change)
> v3: Use temporary variable for probing. Document the behavior (in a
>     public header, for the lack of a better place).
> v2: Probe MSR read. Exclude hypervisor range. Avoid issuing two log
>     messages (in debug builds). Don't alter WRMSR behavior.
> ---
> While I didn't myself observe or find similar WRMSR side issues, I'm
> nevertheless not convinced we can get away without also making the WRMSR
> path somewhat more permissive again, e.g. tolerating attempts to set
> bits which are already set. But of course this would require keeping in
> sync for which MSRs we "fake" reads, as then a kernel attempt to set a
> bit may also appear as an attempt to clear others (because of the zero
> value that we gave it for the read). Roger validly points out that
> making behavior dependent upon MSR values has its own downsides, so
> simply depending on MSR readability is another option (with, in turn,
> its own undesirable effects, e.g. for write-only MSRs).
>
> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -874,7 +874,8 @@ static int read_msr(unsigned int reg, ui
>      struct vcpu *curr = current;
>      const struct domain *currd = curr->domain;
>      const struct cpuid_policy *cp = currd->arch.cpuid;
> -    bool vpmu_msr = false;
> +    bool vpmu_msr = false, warn = false;
> +    uint64_t tmp;
>      int ret;
>
>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
> @@ -882,7 +883,7 @@ static int read_msr(unsigned int reg, ui
>          if ( ret == X86EMUL_EXCEPTION )
>              x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

You might want to move the injection of the exception to the done
label?

So that we can avoid the call to x86_emul_reset_event.

>
> -        return ret;
> +        goto done;
>      }
>
>      switch ( reg )
> @@ -986,7 +987,7 @@ static int read_msr(unsigned int reg, ui
>          }
>          /* fall through */
>      default:
> -        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
> +        warn = true;
>          break;
>
>      normal:
> @@ -995,7 +996,19 @@ static int read_msr(unsigned int reg, ui
>          return X86EMUL_OKAY;
>      }
>
> -    return X86EMUL_UNHANDLEABLE;
> + done:
> +    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
> +         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
> +    {
> +        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
> +        *val = 0;
> +        x86_emul_reset_event(ctxt);
> +        ret = X86EMUL_OKAY;
> +    }
> +    else if ( warn )
> +        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);

I think you could add:

if ( rc == X86EMUL_EXCEPTION )
    x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

> +
> +    return ret;
> }
>
> static int write_msr(unsigned int reg, uint64_t val,
> --- a/xen/include/public/arch-x86/xen.h
> +++ b/xen/include/public/arch-x86/xen.h
> @@ -143,6 +143,12 @@ typedef unsigned long xen_ulong_t;
>  * Level == 1: Kernel may enter
>  * Level == 2: Kernel may enter
>  * Level == 3: Everyone may enter
> + *
> + * Note: For compatibility with kernels not setting up exception handlers
> + *       early enough, Xen will avoid trying to inject #GP (and hence crash
> + *       the domain) when an RDMSR would require this, but no handler was
> + *       set yet. The precise conditions are implementation specific, and

You can drop the 'yet' here I think? As even if a handler has been set
and then removed we would still prevent injecting a #GP AFAICT. Not a
native speaker anyway, so I might be wrong on that one.

> + *       new code shouldn't rely on such behavior anyway.

I would use a stronger mustn't here instead of shouldn't.

Thanks, Roger.
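[Editor's note: for readers following the exchange, below is a minimal sketch
of what Roger's suggestion would amount to - it is not the actual patch, it is
abbreviated to the lines that change, and identifiers are taken from the quoted
hunks. The idea is that the early-exit path stops injecting #GP itself, so no
x86_emul_reset_event() is needed when the fallback later downgrades the error;
the injection happens once, at the end.]

    /* Early exit: no injection here any more, hence nothing to reset later. */
    if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
        goto done;

    /* ... unchanged MSR switch ... */

 done:
    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
    {
        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
        *val = 0;
        ret = X86EMUL_OKAY;  /* x86_emul_reset_event(ctxt) no longer needed */
    }
    else if ( warn )
        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);

    /* Single injection point, as suggested in the review above. */
    if ( ret == X86EMUL_EXCEPTION )
        x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);

    return ret;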
On 12.03.2021 10:08, Roger Pau Monné wrote:
> On Fri, Mar 12, 2021 at 08:54:46AM +0100, Jan Beulich wrote:
>> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
>> handler early enough to cover for example the rdmsrl_safe() of
>> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to speak of the unguarded read
>> of MSR_K7_HWCR later in the same function). The respective change
>> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
>> backported to 4.14, but no further - presumably since it wasn't really
>> easy because of other dependencies.
>>
>> Therefore, to prevent our change in the handling of guest MSR accesses
>> from rendering PV Linux 4.13 and older unusable on at least AMD
>> systems, make the raising of #GP on these paths conditional upon the
>> guest having installed a handler, provided of course the MSR can be
>> read in the first place (we would have raised #GP in that case even
>> before). Producing zero for reads isn't necessarily correct and may
>> trip code trying to detect presence of MSRs early, but since such
>> detection logic won't work without a #GP handler anyway, this ought to
>> be a fair workaround.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
>
> I think the approach is fine:
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> --- a/xen/arch/x86/pv/emul-priv-op.c
>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>> @@ -874,7 +874,8 @@ static int read_msr(unsigned int reg, ui
>>      struct vcpu *curr = current;
>>      const struct domain *currd = curr->domain;
>>      const struct cpuid_policy *cp = currd->arch.cpuid;
>> -    bool vpmu_msr = false;
>> +    bool vpmu_msr = false, warn = false;
>> +    uint64_t tmp;
>>      int ret;
>>
>>      if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
>> @@ -882,7 +883,7 @@ static int read_msr(unsigned int reg, ui
>>          if ( ret == X86EMUL_EXCEPTION )
>>              x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);
>
> You might want to move the injection of the exception to the done
> label?
>
> So that we can avoid the call to x86_emul_reset_event.

At the expense of slightly more code churn, yes, perhaps. I have to
admit though that it feels less prone to future issues to me to have an
unconditional x86_emul_reset_event() on that path.

>> --- a/xen/include/public/arch-x86/xen.h
>> +++ b/xen/include/public/arch-x86/xen.h
>> @@ -143,6 +143,12 @@ typedef unsigned long xen_ulong_t;
>>  * Level == 1: Kernel may enter
>>  * Level == 2: Kernel may enter
>>  * Level == 3: Everyone may enter
>> + *
>> + * Note: For compatibility with kernels not setting up exception handlers
>> + *       early enough, Xen will avoid trying to inject #GP (and hence crash
>> + *       the domain) when an RDMSR would require this, but no handler was
>> + *       set yet. The precise conditions are implementation specific, and
>
> You can drop the 'yet' here I think? As even if a handler has been set
> and then removed we would still prevent injecting a #GP AFAICT. Not a
> native speaker anyway, so I might be wrong on that one.

Well, I've put it there intentionally to leave room (effectively trying
to further emphasize "implementation specific") for us to indeed only
behave this way if no handler was ever set (as opposed to a handler
having got set and then zapped again).

>> + *       new code shouldn't rely on such behavior anyway.
>
> I would use a stronger mustn't here instead of shouldn't.

Fine with me. I've been using "mustn't" in a number of cases in the past
and was told I'm using it too often, sounding sort of impolite. I guess
I'll switch to "may not", which was suggested to me as the better
replacement.

Jan
On 12/03/2021 07:54, Jan Beulich wrote:
> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
> handler early enough to cover for example the rdmsrl_safe() of
> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to speak of the unguarded read
> of MSR_K7_HWCR later in the same function). The respective change
> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
> backported to 4.14, but no further - presumably since it wasn't really
> easy because of other dependencies.
>
> Therefore, to prevent our change in the handling of guest MSR accesses
> from rendering PV Linux 4.13 and older unusable on at least AMD
> systems, make the raising of #GP on these paths conditional upon the
> guest having installed a handler, provided of course the MSR can be
> read in the first place (we would have raised #GP in that case even
> before). Producing zero for reads isn't necessarily correct and may
> trip code trying to detect presence of MSRs early, but since such
> detection logic won't work without a #GP handler anyway, this ought to
> be a fair workaround.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Release-Acked-by: Ian Jackson <iwj@xenproject.org>

I am still of the firm belief that this is the wrong course of action.

It is deliberately papering over error handling bugs which, in the
NetBSD case, literally created memory corruption scenarios. (Yes - that
was a WRMSR not RDMSR, but the general point still stands, particularly
in light of your expectation to do the same to the WRMSR.)

It is one thing to not realise that we have a trainwreck here. It's
totally another to take wilful action to keep current and all future
guests broken in the same way.

The *only* case where it is acceptable to skip error handling is if the
VM admin has specifically signed their life away to say that they'll
accept the, now discovered, potential-memory-corruption consequences.

Roger's patch already does this.

~Andrew
On 12.03.2021 12:19, Andrew Cooper wrote:
> On 12/03/2021 07:54, Jan Beulich wrote:
>> Prior to 4.15, Linux, when running in PV mode, did not install a #GP
>> handler early enough to cover for example the rdmsrl_safe() of
>> MSR_K8_TSEG_ADDR in bsp_init_amd() (not to speak of the unguarded read
>> of MSR_K7_HWCR later in the same function). The respective change
>> (42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
>> backported to 4.14, but no further - presumably since it wasn't really
>> easy because of other dependencies.
>>
>> Therefore, to prevent our change in the handling of guest MSR accesses
>> from rendering PV Linux 4.13 and older unusable on at least AMD
>> systems, make the raising of #GP on these paths conditional upon the
>> guest having installed a handler, provided of course the MSR can be
>> read in the first place (we would have raised #GP in that case even
>> before). Producing zero for reads isn't necessarily correct and may
>> trip code trying to detect presence of MSRs early, but since such
>> detection logic won't work without a #GP handler anyway, this ought to
>> be a fair workaround.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
>
> I am still of the firm belief that this is the wrong course of action.
>
> It is deliberately papering over error handling bugs which, in the
> NetBSD case, literally created memory corruption scenarios. (Yes - that
> was a WRMSR not RDMSR, but the general point still stands, particularly
> in light of your expectation to do the same to the WRMSR.)
>
> It is one thing to not realise that we have a trainwreck here. It's
> totally another to take wilful action to keep current and all future
> guests broken in the same way.
>
> The *only* case where it is acceptable to skip error handling is if the
> VM admin has specifically signed their life away to say that they'll
> accept the, now discovered, potential-memory-corruption consequences.
>
> Roger's patch already does this.

With _much_ bigger impact - it requires changing the behavior for the
entire lifetime of the domain, rather than just very early boot. And as
you may have seen, despite my fear that it may not be enough, Roger and
I have agreed to leave the WRMSR path alone.

Jan
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -874,7 +874,8 @@ static int read_msr(unsigned int reg, ui
     struct vcpu *curr = current;
     const struct domain *currd = curr->domain;
     const struct cpuid_policy *cp = currd->arch.cpuid;
-    bool vpmu_msr = false;
+    bool vpmu_msr = false, warn = false;
+    uint64_t tmp;
     int ret;
 
     if ( (ret = guest_rdmsr(curr, reg, val)) != X86EMUL_UNHANDLEABLE )
@@ -882,7 +883,7 @@ static int read_msr(unsigned int reg, ui
         if ( ret == X86EMUL_EXCEPTION )
             x86_emul_hw_exception(TRAP_gp_fault, 0, ctxt);
 
-        return ret;
+        goto done;
     }
 
     switch ( reg )
@@ -986,7 +987,7 @@ static int read_msr(unsigned int reg, ui
         }
         /* fall through */
     default:
-        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
+        warn = true;
         break;
 
     normal:
@@ -995,7 +996,19 @@ static int read_msr(unsigned int reg, ui
         return X86EMUL_OKAY;
     }
 
-    return X86EMUL_UNHANDLEABLE;
+ done:
+    if ( ret != X86EMUL_OKAY && !curr->arch.pv.trap_ctxt[X86_EXC_GP].address &&
+         (reg >> 16) != 0x4000 && !rdmsr_safe(reg, tmp) )
+    {
+        gprintk(XENLOG_WARNING, "faking RDMSR 0x%08x\n", reg);
+        *val = 0;
+        x86_emul_reset_event(ctxt);
+        ret = X86EMUL_OKAY;
+    }
+    else if ( warn )
+        gdprintk(XENLOG_WARNING, "RDMSR 0x%08x unimplemented\n", reg);
+
+    return ret;
 }
 
 static int write_msr(unsigned int reg, uint64_t val,
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -143,6 +143,12 @@ typedef unsigned long xen_ulong_t;
  * Level == 1: Kernel may enter
  * Level == 2: Kernel may enter
  * Level == 3: Everyone may enter
+ *
+ * Note: For compatibility with kernels not setting up exception handlers
+ *       early enough, Xen will avoid trying to inject #GP (and hence crash
+ *       the domain) when an RDMSR would require this, but no handler was
+ *       set yet. The precise conditions are implementation specific, and
+ *       new code shouldn't rely on such behavior anyway.
  */
 #define TI_GET_DPL(_ti)    ((_ti)->flags & 3)
 #define TI_GET_IF(_ti)     ((_ti)->flags & 4)
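[Editor's note: for reference, the condition the new code keys off,
curr->arch.pv.trap_ctxt[X86_EXC_GP].address, is populated when the PV guest
registers its exception handlers via the set_trap_table hypercall. Below is a
rough guest-side sketch only, using the trap_info layout from Xen's public
headers; the include path follows Linux's Xen interface headers, and the
handler symbol, table name, and function name are made up for illustration.]

#include <xen/interface/xen.h>      /* struct trap_info, FLAT_KERNEL_CS */

extern void guest_gp_entry(void);   /* hypothetical guest #GP entry point */

static struct trap_info early_traps[] = {
    /* vector, flags (DPL etc.), cs, address */
    { 13, 0, FLAT_KERNEL_CS, (unsigned long)guest_gp_entry },  /* #GP */
    {  0, 0, 0, 0 }                                            /* terminator */
};

static void install_early_traps(void)
{
    /* Once this installs a non-zero #GP handler address, Xen's
     * trap_ctxt[X86_EXC_GP].address check is satisfied and the
     * "faking RDMSR" fallback no longer applies to this vCPU. */
    HYPERVISOR_set_trap_table(early_traps);
}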
Prior to 4.15, Linux, when running in PV mode, did not install a #GP
handler early enough to cover for example the rdmsrl_safe() of
MSR_K8_TSEG_ADDR in bsp_init_amd() (not to speak of the unguarded read
of MSR_K7_HWCR later in the same function). The respective change
(42b3a4cb5609 "x86/xen: Support early interrupts in xen pv guests") was
backported to 4.14, but no further - presumably since it wasn't really
easy because of other dependencies.

Therefore, to prevent our change in the handling of guest MSR accesses
from rendering PV Linux 4.13 and older unusable on at least AMD
systems, make the raising of #GP on these paths conditional upon the
guest having installed a handler, provided of course the MSR can be
read in the first place (we would have raised #GP in that case even
before). Producing zero for reads isn't necessarily correct and may
trip code trying to detect presence of MSRs early, but since such
detection logic won't work without a #GP handler anyway, this ought to
be a fair workaround.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
---
(projected v4: re-base over Roger's change)
v3: Use temporary variable for probing. Document the behavior (in a
    public header, for the lack of a better place).
v2: Probe MSR read. Exclude hypervisor range. Avoid issuing two log
    messages (in debug builds). Don't alter WRMSR behavior.
---
While I didn't myself observe or find similar WRMSR side issues, I'm
nevertheless not convinced we can get away without also making the
WRMSR path somewhat more permissive again, e.g. tolerating attempts to
set bits which are already set. But of course this would require
keeping in sync for which MSRs we "fake" reads, as then a kernel
attempt to set a bit may also appear as an attempt to clear others
(because of the zero value that we gave it for the read). Roger validly
points out that making behavior dependent upon MSR values has its own
downsides, so simply depending on MSR readability is another option
(with, in turn, its own undesirable effects, e.g. for write-only MSRs).
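[Editor's note: for context, the guest-side code the workaround caters to looks
roughly like the following - a paraphrase of the early bsp_init_amd() logic in
Linux's arch/x86/kernel/cpu/amd.c, not a verbatim copy; variable names and the
surrounding structure are illustrative.]

u64 tseg, value;

/* rdmsrl_safe() relies on the kernel's #GP fixup machinery; before commit
 * 42b3a4cb5609 a PV guest had not yet registered a #GP handler with Xen at
 * this point, so a #GP injected by the hypervisor would crash the domain
 * instead of simply making the read fail gracefully. */
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
        /* ... use tseg to reserve the TSEG region ... */
}

/* Later in the same function: an unguarded read, with no fixup at all. */
rdmsrl(MSR_K7_HWCR, value);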