[v2,1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

Message ID 130a3b7ef4788baae3a6fe71293ab17442bc9a0a.1442793572.git.luto@kernel.org (mailing list archive)
State New, archived
Commit Message

Andy Lutomirski Sept. 21, 2015, 12:02 a.m. UTC
This demotes an OOPS and likely panic due to a failed non-"safe" MSR
access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
We still write a pr_info entry unconditionally for debugging.

To be clear, this type of failure should *not* happen.  This patch
exists to minimize the chance of nasty undebuggable failures on
systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/traps.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

Comments

Linus Torvalds Sept. 21, 2015, 12:15 a.m. UTC | #1
On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
> This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
> We still write a pr_info entry unconditionally for debugging.

No, this is wrong.

If you really want to do something like this, then just make all MSR
reads safe. So the only difference between "safe" and "unsafe" is that
the unsafe version just doesn't check the return value, and silently
just returns zero for reads (or writes nothing).

To quote Obi-Wan: "Use the exception table, Luke".

Because decoding instructions is just too ugly. We'll do it for CPU
errata where we might have to do it for user space code too (ie the
AMD prefetch mess), but for code that _we_ control? Hell no.

So NAK on this.

                   Linus
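
A minimal userspace sketch of the exception-table idea referred to above, for readers unfamiliar with it. It models a table, sorted by faulting instruction address, that maps each address to a resume ("fixup") address. The names `ex_entry`, `ex_search` and `ex_fixup` are hypothetical and chosen for illustration only; the real kernel table uses a different (relative-offset) encoding and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one exception table entry. */
struct ex_entry {
	uintptr_t insn;   /* address of an instruction that may fault */
	uintptr_t fixup;  /* address to resume at if it does fault */
};

/* Binary search over a table sorted by insn; NULL if ip has no entry. */
static const struct ex_entry *
ex_search(const struct ex_entry *tbl, size_t n, uintptr_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == ip)
			return &tbl[mid];
		if (tbl[mid].insn < ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/*
 * Model of the fault path: if the faulting ip has an entry, redirect
 * execution to the fixup address and report the fault as handled (1).
 * Otherwise leave ip alone and report it as unhandled (0).
 */
static int ex_fixup(const struct ex_entry *tbl, size_t n, uintptr_t *ip)
{
	const struct ex_entry *e = ex_search(tbl, n, *ip);

	if (!e)
		return 0;
	*ip = e->fixup;
	return 1;
}
```

The point of the design is that no instruction decoding is needed: the fault handler only asks whether the faulting address was annotated in advance.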
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Lutomirski Sept. 21, 2015, 1:13 a.m. UTC | #2
On Sep 20, 2015 5:15 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
>
> On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
> > We still write a pr_info entry unconditionally for debugging.
>
> No, this is wrong.
>
> If you really want to do something like this, then just make all MSR
> reads safe. So the only difference between "safe" and "unsafe" is that
> the unsafe version just doesn't check the return value, and silently
> just returns zero for reads (or writes nothing).
>
> To quote Obi-Wan: "Use the exception table, Luke".
>
> Because decoding instructions is just too ugly. We'll do it for CPU
> errata where we might have to do it for user space code too (ie the
> AMD prefetch mess), but for code that _we_ control? Hell no.
>
> So NAK on this.

My personal preference is to just not do this at all.  A couple people
disagree.  If we make the unsafe variants not oops, then I think we
want to have the nice loud warning, since these issues are bugs if
they happen.

We could certainly use the exception table for this, but it'll result
in bigger code, since each MSR access will need an exception table
entry and an associated fixup to call some helper that warns and sets
the result to zero.

I'd be happy to implement that, but only if it'll be applied.
Otherwise I'd rather just drop this patch and keep the rest of the
series.

--Andy
Ingo Molnar Sept. 21, 2015, 8:46 a.m. UTC | #3
* Andy Lutomirski <luto@amacapital.net> wrote:

> On Sep 20, 2015 5:15 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
> >
> > On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> > > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
> > > We still write a pr_info entry unconditionally for debugging.
> >
> > No, this is wrong.
> >
> > If you really want to do something like this, then just make all MSR reads 
> > safe. So the only difference between "safe" and "unsafe" is that the unsafe 
> > version just doesn't check the return value, and silently just returns zero 
> > for reads (or writes nothing).
> >
> > To quote Obi-Wan: "Use the exception table, Luke".
> >
> > Because decoding instructions is just too ugly. We'll do it for CPU errata 
> > where we might have to do it for user space code too (ie the AMD prefetch 
> > mess), but for code that _we_ control? Hell no.
> >
> > So NAK on this.
> 
> My personal preference is to just not do this at all.  A couple people disagree.  
> If we make the unsafe variants not oops, then I think we want to have the nice 
> loud warning, since these issues are bugs if they happen.
> 
> We could certainly use the exception table for this, but it'll result in bigger 
> code, since each MSR access will need an exception table entry and an associated 
> fixup to call some helper that warns and sets the result to zero.
> 
> I'd be happy to implement that, but only if it'll be applied. Otherwise I'd 
> rather just drop this patch and keep the rest of the series.

Linus, what's your preference?

Due to the bug mentioned earlier in this thread all MSR reads are currently 'safe' 
on all the major Linux distros (which all have CONFIG_PARAVIRT=y), i.e. by 
'fixing' them we'd reintroduce random crashes into various fragile pieces of 
code...

To add insult to injury, the current 'silently safe by accident' MSR code isn't so 
safe: because it leaves the result of the read uninitialized...

To fix this all I'd really like to have:

 - safe MSR reads by default (i.e. never boot crash the kernel on some rare 
   condition - which to most users is either a silent boot hang or an instant 
   restart). Historically we had a stream of 'silly boot crashes' due to MSR reads 
   that generate a #GPF. They make Linux less usable around the edges, especially 
   in the x86 non-server (desktop) space where most hardware vendors are either 
   openly Linux hostile, or, at best, Linux oblivious.

 - proper result-zeroing behavior on exceptions

 - and we should also generate _some_ sort of warning when MSR exceptions happen
   in an 'unintended' fashion.

Maybe the warning could be put under a (default-enabled) config option for the 
size conscious.

Or we could extend exception table entry encoding to include a 'warning bit', to 
not bloat the kernel. If the exception handler code encounters such an exception 
it would generate a one-time warning for that entry, but otherwise not crash the 
kernel and continue execution with an all-zeroes result for the MSR read.

Thanks,

	Ingo
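
A rough userspace sketch of the 'warning bit' proposal above, assuming a per-entry flag and a per-entry "already warned" marker. Everything here (`EX_FLAG_WARN`, `ex_entry_w`, `fake_regs`, `fixup_msr_fault`) is hypothetical and exists only to illustrate the control flow: warn once per entry, zero the result, and resume at the fixup address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define EX_FLAG_WARN 0x1u	/* hypothetical "warn on use" flag */

struct ex_entry_w {
	uintptr_t insn;		/* address of the instruction that may fault */
	uintptr_t fixup;	/* address to resume at */
	unsigned int flags;	/* EX_FLAG_WARN or 0 */
	bool warned;		/* set after the first warning for this entry */
};

/* Tiny stand-in for the saved register state at the fault. */
struct fake_regs {
	uintptr_t ip;
	uint32_t ax, dx, cx;
};

static int warnings_emitted;	/* counts warnings, for demonstration */

static bool fixup_msr_fault(struct ex_entry_w *tbl, size_t n,
			    struct fake_regs *regs)
{
	for (size_t i = 0; i < n; i++) {
		struct ex_entry_w *e = &tbl[i];

		if (e->insn != regs->ip)
			continue;

		/* Warn at most once per flagged entry. */
		if ((e->flags & EX_FLAG_WARN) && !e->warned) {
			e->warned = true;
			warnings_emitted++;
			fprintf(stderr, "unchecked MSR access, MSR 0x%x\n",
				(unsigned int)regs->cx);
		}

		/* Continue with an all-zeroes result for the read. */
		regs->ax = 0;
		regs->dx = 0;
		regs->ip = e->fixup;
		return true;
	}
	return false;	/* no entry: the fault is not handled here */
}
```

A linear scan is used only to keep the sketch short; a sorted table with binary search would serve equally well.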
Paolo Bonzini Sept. 21, 2015, 12:27 p.m. UTC | #4
On 21/09/2015 10:46, Ingo Molnar wrote:
> Or we could extend exception table entry encoding to include a 'warning bit', to 
> not bloat the kernel. If the exception handler code encounters such an exception 
> it would generate a one-time warning for that entry, but otherwise not crash the 
> kernel and continue execution with an all-zeroes result for the MSR read.

The 'warning bit' already exists, it is the opcode that caused the fault. :)

The concern about bloat is a good one.  However, why is it necessary to
keep native_*_msr* inline?  If they are moved out-of-line, using the
exception table becomes the obvious solution and doesn't cause bloat
anymore.

Paolo
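
To illustrate the remark that the faulting opcode itself identifies the access: RDMSR is encoded as the bytes 0f 32 and WRMSR as 0f 30, so the two can be told apart by looking at the two bytes at the faulting address. The sketch below is a userspace illustration with hypothetical names (`msr_op`, `classify_msr_insn`); it compares bytes individually so that it does not depend on host byte order.

```c
#include <stdint.h>

enum msr_op { MSR_OP_NONE, MSR_OP_RDMSR, MSR_OP_WRMSR };

/*
 * Classify the two bytes at ip as an unprefixed RDMSR (0f 32),
 * an unprefixed WRMSR (0f 30), or neither.
 */
static enum msr_op classify_msr_insn(const uint8_t *ip)
{
	if (ip[0] != 0x0f)
		return MSR_OP_NONE;
	if (ip[1] == 0x32)
		return MSR_OP_RDMSR;
	if (ip[1] == 0x30)
		return MSR_OP_WRMSR;
	return MSR_OP_NONE;
}
```

Read as a little-endian 16-bit value, the same byte pairs are 0x320f and 0x300f.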
Linus Torvalds Sept. 21, 2015, 4:36 p.m. UTC | #5
On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Linus, what's your preference?

So quite frankly, is there any reason we don't just implement
native_read_msr() as just

   unsigned long long native_read_msr(unsigned int msr)
   {
      int err;
      unsigned long long val;

      val = native_read_msr_safe(msr, &err);
      WARN_ON_ONCE(err);
      return val;
   }

Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
done with it. I don't see the downside.

How many msr reads are *so* critical that the function call
overhead would matter? Get rid of the inline version of the _safe()
thing too, and put that thing there too.

                  Linus
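
For readers who want to experiment with the shape of the wrapper suggested above outside the kernel, here is a userspace model. It is only a sketch under stated assumptions: `model_read_msr_safe` is a stub that pretends exactly one MSR (0x10) exists, and `MODEL_WARN_ON_ONCE` is a simplified stand-in that warns once per call site. None of the `model_` names are kernel APIs.

```c
#include <stdio.h>

static int model_warn_count;	/* number of warnings printed so far */

/* Simplified stand-in: warn the first time cond is true at this call site. */
#define MODEL_WARN_ON_ONCE(cond)					\
	do {								\
		static int warned;					\
		if ((cond) && !warned) {				\
			warned = 1;					\
			model_warn_count++;				\
			fprintf(stderr, "WARNING: %s\n", #cond);	\
		}							\
	} while (0)

/* Stub "safe" read: MSR 0x10 exists, every other MSR reports an error. */
static unsigned long long model_read_msr_safe(unsigned int msr, int *err)
{
	if (msr == 0x10) {
		*err = 0;
		return 0x1234ULL;
	}
	*err = -5;	/* arbitrary non-zero error code for the model */
	return 0;	/* failed reads yield zero */
}

/* Out-of-line "unsafe" read built on the safe one, as proposed above. */
unsigned long long model_read_msr(unsigned int msr)
{
	int err;
	unsigned long long val;

	val = model_read_msr_safe(msr, &err);
	MODEL_WARN_ON_ONCE(err);
	return val;
}
```

Calling `model_read_msr(0x10)` returns the stub value silently, while reading any other MSR returns zero and warns on the first failure only.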
Arjan van de Ven Sept. 21, 2015, 4:49 p.m. UTC | #6
On 9/21/2015 9:36 AM, Linus Torvalds wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> Linus, what's your preference?
>
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
>
>     unsigned long long native_read_msr(unsigned int msr)
>     {
>        int err;
>        unsigned long long val;
>
>        val = native_read_msr_safe(msr, &err);
>        WARN_ON_ONCE(err);
>        return val;
>     }
>
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.
>
> How many msr reads are *so* critical that the function call
> overhead would matter?

if anything qualifies it'd be switch_to() and friends.

note that I'm not entirely happy about the notion of "safe" MSRs.
They're safe as in "won't fault".
Reading random MSRs isn't a generic safe operation though, but the name sort of gives people
the impression that it is. Even with _safe variants, you still need to KNOW the MSR exists (by means
of CPUID or similar) unfortunately.


Linus Torvalds Sept. 21, 2015, 5:27 p.m. UTC | #7
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>> How many msr reads are *so* critical that the function call
>> overhead would matter?
>
> if anything qualifies it'd be switch_to() and friends.

Is there anything else than the FS/GS_BASE thing (possibly hidden
behind inlines etc that I didn't get from a quick grep)? And why is
that sometimes using the "safe" version (in do_arch_prctl()), and
sometimes not (switch_to())?

I'm not convinced that mess is a good argument for the status quo ;)

> note that I'm not entirely happy about the notion of "safe" MSRs.
> They're safe as in "won't fault".

I wouldn't object to renaming them.

                Linus
Andy Lutomirski Sept. 21, 2015, 5:43 p.m. UTC | #8
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 9/21/2015 9:36 AM, Linus Torvalds wrote:
>>
>> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>>
>>> Linus, what's your preference?
>>
>>
>> So quite frankly, is there any reason we don't just implement
>> native_read_msr() as just
>>
>>     unsigned long long native_read_msr(unsigned int msr)
>>     {
>>        int err;
>>        unsigned long long val;
>>
>>        val = native_read_msr_safe(msr, &err);
>>        WARN_ON_ONCE(err);
>>        return val;
>>     }
>>
>> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
>> done with it. I don't see the downside.
>>
>> How many msr reads are *so* critical that the function call
>> overhead would matter?
>
>
> if anything qualifies it'd be switch_to() and friends.

And maybe the KVM user return notifier.  Unfortunately, switch_to
might gain another two MSR accesses at some point if we decide to fix
the bugs in there.  Sigh.

>
> note that I'm not entirely happy about the notion of "safe" MSRs.
> They're safe as in "won't fault".
> Reading random MSRs isn't a generic safe operation though, but the name sort
> of gives people
> the impression that it is. Even with _safe variants, you still need to KNOW
> the MSR exists (by means
> of CPUID or similar) unfortunately.
>

I tend to agree.

Anyway, the fully out-of-line approach isn't obviously a bad idea, and
it simplifies the whole mess (we can drop most of the paravirt
patches, too).  I'll give it a try and see what happens.

--Andy
Andy Lutomirski Sept. 21, 2015, 6:16 p.m. UTC | #9
On Mon, Sep 21, 2015 at 9:36 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> Linus, what's your preference?
>
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
>
>    unsigned long long native_read_msr(unsigned int msr)
>    {
>       int err;
>       unsigned long long val;
>
>       val = native_read_msr_safe(msr, &err);
>       WARN_ON_ONCE(err);
>       return val;
>    }
>
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.

In the interest of sanity, I want to drop the "native_", too, since
there appear to be few or no good use cases for native_read_msr as
such.  I'm tempted to add new functions read_msr and write_msr that
forward to rdmsrl_safe and wrmsrl_safe.

It looks like the msr helpers are every bit as bad as the TSC helpers
used to be :(

--Andy
Borislav Petkov Sept. 21, 2015, 6:36 p.m. UTC | #10
On Mon, Sep 21, 2015 at 11:16:30AM -0700, Andy Lutomirski wrote:
> In the interest of sanity, I want to drop the "native_", too, since
> there appear to be few or no good use cases for native_read_msr as
> such.  I'm tempted to add new functions read_msr and write_msr that
> forward to rdmsrl_safe and wrmsrl_safe.

Just change the msr_read/msr_write() ones in arch/x86/lib/msr.c to take
a u64 and you're there.

> It looks like the msr helpers are every bit as bad as the TSC helpers
> used to be :(

Yap.
Linus Torvalds Sept. 21, 2015, 6:47 p.m. UTC | #11
On Mon, Sep 21, 2015 at 11:16 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> In the interest of sanity, I want to drop the "native_", too

Yes. I think the only reason it exists is to have that wrapper layer
for PV. And that argument just goes away if you just make the
non-inline helper function do all the PV logic directly.

I really suspect we should do this for a *lot* of the PV ops. Yeah,
some are so performance-critical that we probably do have a good
reason for the inline indirections etc (historical example: native
spin-unlock, which traditionally could be done as a single store
instruction), but I suspect a lot of the PV indirection is for this
kind of "historical wrapper model" reason, and it often makes it
really hard to see what is going on because you have to go through
several layers of indirection, often in different files.

                      Linus
Ingo Molnar Sept. 22, 2015, 7:14 a.m. UTC | #12
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Linus, what's your preference?
> 
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
> 
>    unsigned long long native_read_msr(unsigned int msr)
>    {
>       int err;
>       unsigned long long val;
> 
>       val = native_read_msr_safe(msr, &err);
>       WARN_ON_ONCE(err);
>       return val;
>    }
> 
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.

Absolutely!

> How many msr reads are *so* critical that the function call overhead would 
> matter? Get rid of the inline version of the _safe() thing too, and put that 
> thing there too.

Only a very low number of them is performance critical (because even 
hw-accelerated MSR accesses are generally slow so we try to avoid MSR accesses in 
fast paths as much as possible, via shadowing, etc.) - and in the few cases where 
we have to access an MSR in a fast path we can do those separately.

I'm only worried about the 'default' APIs, i.e. rdmsr() that is used throughout 
arch/x86/ over a hundred times, not about performance critical code paths that get 
enough testing and enough attention in general.

Thanks,

	Ingo
Paolo Bonzini Sept. 22, 2015, 8:12 a.m. UTC | #13
On 21/09/2015 19:43, Andy Lutomirski wrote:
> And maybe the KVM user return notifier.

No, not really.  If anything, the place in KVM where it makes a
difference is vmx_save_host_state, which is also only using
always-present MSRs.  But don't care about KVM.

First clean it up, then we can add back inline versions like __rdmsr or
rdmsr_fault or rdmsr_unsafe or whatever.

Paolo
Peter Zijlstra Sept. 30, 2015, 1:10 p.m. UTC | #14
On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > Linus, what's your preference?
> 
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
> 
>    unsigned long long native_read_msr(unsigned int msr)
>    {
>       int err;
>       unsigned long long val;
> 
>       val = native_read_msr_safe(msr, &err);
>       WARN_ON_ONCE(err);
>       return val;
>    }
> 
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.
> 
> How many msr reads are *so* critical that the function call
> overhead would matter? Get rid of the inline version of the _safe()
> thing too, and put that thing there too.

There are a few in the perf code, and esp. on cores without a stack
engine the call overhead is noticeable. Also note that the perf MSRs are
generally optimized MSRs and less slow (we cannot say fast, they're
still MSRs) than regular MSRs.
Ingo Molnar Sept. 30, 2015, 2:01 p.m. UTC | #15
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
> > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
> > >
> > > Linus, what's your preference?
> > 
> > So quite frankly, is there any reason we don't just implement
> > native_read_msr() as just
> > 
> >    unsigned long long native_read_msr(unsigned int msr)
> >    {
> >       int err;
> >       unsigned long long val;
> > 
> >       val = native_read_msr_safe(msr, &err);
> >       WARN_ON_ONCE(err);
> >       return val;
> >    }
> > 
> > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> > done with it. I don't see the downside.
> > 
> > How many msr reads are *so* critical that the function call
> > overhead would matter? Get rid of the inline version of the _safe()
> > thing too, and put that thing there too.
> 
> There are a few in the perf code, and esp. on cores without a stack engine the 
> call overhead is noticeable. Also note that the perf MSRs are generally 
> optimized MSRs and less slow (we cannot say fast, they're still MSRs) than 
> regular MSRs.

These could still be open coded in an inlined fashion, like the scheduler usage.

Thanks,

	Ingo
Andy Lutomirski Sept. 30, 2015, 6:04 p.m. UTC | #16
On Wed, Sep 30, 2015 at 7:01 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
>> > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> > >
>> > > Linus, what's your preference?
>> >
>> > So quite frankly, is there any reason we don't just implement
>> > native_read_msr() as just
>> >
>> >    unsigned long long native_read_msr(unsigned int msr)
>> >    {
>> >       int err;
>> >       unsigned long long val;
>> >
>> >       val = native_read_msr_safe(msr, &err);
>> >       WARN_ON_ONCE(err);
>> >       return val;
>> >    }
>> >
>> > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
>> > done with it. I don't see the downside.
>> >
>> > How many msr reads are *so* critical that the function call
>> > overhead would matter? Get rid of the inline version of the _safe()
>> > thing too, and put that thing there too.
>>
>> There are a few in the perf code, and esp. on cores without a stack engine the
>> call overhead is noticeable. Also note that the perf MSRs are generally
>> optimized MSRs and less slow (we cannot say fast, they're still MSRs) than
>> regular MSRs.
>
> These could still be open coded in an inlined fashion, like the scheduler usage.
>

We could have a raw_rdmsr for those.

OTOH, I'm still not 100% convinced that this warn-but-don't-die
behavior is worth the effort.  This isn't a frequent source of bugs to
my knowledge, and we don't try to recover from incorrect cr writes,
out-of-bounds MMIO, etc, so do we really gain much by rigging a
recovery mechanism for rdmsr and wrmsr failures for code that doesn't
use the _safe variants?

--Andy

> Thanks,
>
>         Ingo
H. Peter Anvin Sept. 30, 2015, 6:32 p.m. UTC | #17
On 09/21/2015 09:36 AM, Linus Torvalds wrote:
> 
> How many msr reads are *so* critical that the function call
> overhead would matter? Get rid of the inline version of the _safe()
> thing too, and put that thing there too.
> 

Probably only the ones that may go in the context switch path.

	-hpa


Ingo Molnar Oct. 1, 2015, 7:15 a.m. UTC | #18
* Andy Lutomirski <luto@amacapital.net> wrote:

> > These could still be open coded in an inlined fashion, like the scheduler usage.
> 
> We could have a raw_rdmsr for those.
> 
> OTOH, I'm still not 100% convinced that this warn-but-don't-die behavior is 
> worth the effort.  This isn't a frequent source of bugs to my knowledge, and we 
> don't try to recover from incorrect cr writes, out-of-bounds MMIO, etc, so do we 
> really gain much by rigging a recovery mechanism for rdmsr and wrmsr failures 
> for code that doesn't use the _safe variants?

It's just the general principle really: don't crash the kernel on bootup. There's 
few things more user hostile than that.

Also, this would maintain the status quo: since we now (accidentally) don't crash 
the kernel on distro kernels (but silently and unsafely ignore the faulting 
instruction), we should not regress that behavior (by adding the chance to crash 
again), but improve upon it.

Thanks,

	Ingo
Andy Lutomirski March 11, 2016, 4:48 p.m. UTC | #19
On Thu, Oct 1, 2015 at 12:15 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@amacapital.net> wrote:
>
>> > These could still be open coded in an inlined fashion, like the scheduler usage.
>>
>> We could have a raw_rdmsr for those.
>>
>> OTOH, I'm still not 100% convinced that this warn-but-don't-die behavior is
>> worth the effort.  This isn't a frequent source of bugs to my knowledge, and we
>> don't try to recover from incorrect cr writes, out-of-bounds MMIO, etc, so do we
>> really gain much by rigging a recovery mechanism for rdmsr and wrmsr failures
>> for code that doesn't use the _safe variants?
>
> It's just the general principle really: don't crash the kernel on bootup. There's
> few things more user hostile than that.
>
> Also, this would maintain the status quo: since we now (accidentally) don't crash
> the kernel on distro kernels (but silently and unsafely ignore the faulting
> instruction), we should not regress that behavior (by adding the chance to crash
> again), but improve upon it.

Just a heads up: the extable improvements in tip:ras/core make it
straightforward to get the best of all worlds: explicit failure
handling (written in C!), no fast path overhead whatsoever, and no new
garbage in the exception handlers.

Patches coming once I test them.

>
> Thanks,
>
>         Ingo
Ingo Molnar March 12, 2016, 4:02 p.m. UTC | #20
* Andy Lutomirski <luto@amacapital.net> wrote:

> On Thu, Oct 1, 2015 at 12:15 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Andy Lutomirski <luto@amacapital.net> wrote:
> >
> >> > These could still be open coded in an inlined fashion, like the scheduler usage.
> >>
> >> We could have a raw_rdmsr for those.
> >>
> >> OTOH, I'm still not 100% convinced that this warn-but-don't-die behavior is
> >> worth the effort.  This isn't a frequent source of bugs to my knowledge, and we
> >> don't try to recover from incorrect cr writes, out-of-bounds MMIO, etc, so do we
> >> really gain much by rigging a recovery mechanism for rdmsr and wrmsr failures
> >> for code that doesn't use the _safe variants?
> >
> > It's just the general principle really: don't crash the kernel on bootup. There's
> > few things more user hostile than that.
> >
> > Also, this would maintain the status quo: since we now (accidentally) don't crash
> > the kernel on distro kernels (but silently and unsafely ignore the faulting
> > instruction), we should not regress that behavior (by adding the chance to crash
> > again), but improve upon it.
> 
> Just a heads up: the extable improvements in tip:ras/core make it
> straightforward to get the best of all worlds: explicit failure
> handling (written in C!), no fast path overhead whatsoever, and no new
> garbage in the exception handlers.

I _knew_ I should have merged them into tip:x86/mm, not tip:ras/core ;-)

I had a quick look at your new MSR series and I'm very happy with that direction!

Thanks,

	Ingo

Patch

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 346eec73f7db..f82987643e32 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -437,6 +437,58 @@  exit_trap:
 	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
 }
 
+static bool paper_over_kernel_gpf(struct pt_regs *regs)
+{
+	/*
+	 * Try to decode the opcode that failed.  So far, we only care
+	 * about boring two-byte unprefixed opcodes, so we don't need
+	 * the full instruction decoder machinery.
+	 */
+	u16 opcode;
+
+	if (probe_kernel_read(&opcode, (const void *)regs->ip, sizeof(opcode)))
+		return false;
+
+	if (opcode == 0x320f) {
+		/* RDMSR */
+		pr_info("bad kernel RDMSR from non-existent MSR 0x%x",
+			(unsigned int)regs->cx);
+		if (!panic_on_oops) {
+			WARN_ON_ONCE(true);
+
+			/*
+			 * Pretend that RDMSR worked and returned zero.  We
+			 * chose zero because zero seems less likely to
+			 * cause further malfunctions than any other value.
+			 */
+			regs->ax = 0;
+			regs->dx = 0;
+			regs->ip += 2;
+			return true;
+		} else {
+			/* Don't fix it up. */
+			return false;
+		}
+	} else if (opcode == 0x300f) {
+		/* WRMSR */
+		pr_info("bad kernel WRMSR writing 0x%08x%08x to MSR 0x%x",
+			(unsigned int)regs->dx, (unsigned int)regs->ax,
+			(unsigned int)regs->cx);
+		if (!panic_on_oops) {
+			WARN_ON_ONCE(true);
+
+			/* Pretend it worked and carry on. */
+			regs->ip += 2;
+			return true;
+		} else {
+			/* Don't fix it up. */
+			return false;
+		}
+	}
+
+	return false;
+}
+
 dotraplinkage void
 do_general_protection(struct pt_regs *regs, long error_code)
 {
@@ -456,6 +508,9 @@  do_general_protection(struct pt_regs *regs, long error_code)
 		if (fixup_exception(regs))
 			return;
 
+		if (paper_over_kernel_gpf(regs))
+			return;
+
 		tsk->thread.error_code = error_code;
 		tsk->thread.trap_nr = X86_TRAP_GP;
 		if (notify_die(DIE_GPF, "general protection fault", regs, error_code,