x86: show remote CPU state upon fatal NMI

Message ID	576031C802000078000F4D9F@prv-mh.provo.novell.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xen.org> Message-Id: <576031C802000078000F4D9F@prv-mh.provo.novell.com> Date: Tue, 14 Jun 2016 08:33:12 -0600 From: "Jan Beulich" <JBeulich@suse.com> To: "xen-devel" <xen-devel@lists.xenproject.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=__PartF2C4C8B8.1__=" Cc: Andrew Cooper <andrew.cooper3@citrix.com> Subject: [Xen-devel] [PATCH] x86: show remote CPU state upon fatal NMI Precedence: list Errors-To: xen-devel-bounces@lists.xen.org Sender: "Xen-devel" <xen-devel-bounces@lists.xen.org>

Jan Beulich June 14, 2016, 2:33 p.m. UTC

Quite frequently the watchdog would hit an innocent CPU, e.g. one
trying to acquire a spin lock a remote CPU holds for extended periods
of time, or a random CPU in TSC calbration rendezvous. In such cases
the register and stack dump for that CPU doesn't really help in the
analysis of the problem.

To keep things reasonable on large systems, only log CS:RIP by default.
This can be overridden via a new extension to the "nmi=" command line
option such that full register/stack state will get dumped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
x86: show remote CPU state upon fatal NMI

Quite frequently the watchdog would hit an innocent CPU, e.g. one
trying to acquire a spin lock a remote CPU holds for extended periods
of time, or a random CPU in TSC calbration rendezvous. In such cases
the register and stack dump for that CPU doesn't really help in the
analysis of the problem.

To keep things reasonable on large systems, only log CS:RIP by default.
This can be overridden via a new extension to the "nmi=" command line
option such that full register/stack state will get dumped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1155,7 +1155,7 @@ Use the MWAIT idle driver (with model sp
 of the ACPI based one.
 
 ### nmi
-> `= ignore | dom0 | fatal`
+> `= ignore | dom0 | fatal [,show-all]`
 
 > Default: `fatal` for a debug build, or `dom0` for a non-debug build
 
@@ -1163,6 +1163,9 @@ Specify what Xen should do in the event
 `ignore` discards the error; `dom0` causes Xen to report the error to
 dom0, while 'fatal' causes Xen to print diagnostics and then hang.
 
+The `show-all` modifier forces all CPUs' full state to be dumped upon
+fatal NMIs (normally a result of the watchdog kicking in).
+
 ### noapic
 
 Instruct Xen to ignore any IOAPICs that are present in the system, and
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -84,10 +84,11 @@
  *  dom0:   The NMI is virtualised to DOM0.
  *  ignore: The NMI error is cleared and ignored.
  */
+static char __read_mostly opt_nmi[16] =
 #ifdef NDEBUG
-static char __read_mostly opt_nmi[10] = "dom0";
+ "dom0";
 #else
-static char __read_mostly opt_nmi[10] = "fatal";
+ "fatal";
 #endif
 string_param("nmi", opt_nmi);
 
@@ -525,6 +526,35 @@ void vcpu_show_execution_state(struct vc
     vcpu_unpause(v);
 }
 
+static cpumask_t nmi_show_state_mask;
+static bool_t opt_nmi_show_all;
+
+static int __init get_nmi_show_all(void)
+{
+    const char *s = strchr(opt_nmi, ',');
+
+    if ( s && !strcmp(s + 1, "show-all") )
+        opt_nmi_show_all = 1;
+
+    return 0;
+}
+presmp_initcall(get_nmi_show_all);
+
+static int nmi_show_execution_state(const struct cpu_user_regs *regs, int cpu)
+{
+    if ( !cpumask_test_cpu(cpu, &nmi_show_state_mask) )
+        return 0;
+
+    if ( opt_nmi_show_all )
+        show_execution_state(regs);
+    else
+        printk(XENLOG_ERR "CPU%d @ %04x:%08lx (%pS)\n", cpu, regs->cs, regs->rip,
+               guest_mode(regs) ? _p(regs->rip) : NULL);
+    cpumask_clear_cpu(cpu, &nmi_show_state_mask);
+
+    return 1;
+}
+
 static const char *trapstr(unsigned int trapnr)
 {
     static const char * const strings[] = {
@@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
             printk("Faulting linear address: %p\n", _p(cr2));
             show_page_walk(cr2);
         }
+        else if ( trapnr == TRAP_nmi )
+        {
+            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
+                           cpumask_of(smp_processor_id()));
+            set_nmi_callback(nmi_show_execution_state);
+            smp_send_nmi_allbutself();
+            while ( !cpumask_empty(&nmi_show_state_mask) )
+                cpu_relax();
+        }
     }
 
     panic("FATAL TRAP: vector = %d (%s)\n"

Andrew Cooper June 14, 2016, 3:03 p.m. UTC | #1

On 14/06/16 15:33, Jan Beulich wrote:
> Quite frequently the watchdog would hit an innocent CPU, e.g. one
> trying to acquire a spin lock a remote CPU holds for extended periods
> of time, or a random CPU in TSC calbration rendezvous. In such cases
> the register and stack dump for that CPU doesn't really help in the
> analysis of the problem.
>
> To keep things reasonable on large systems, only log CS:RIP by default.
> This can be overridden via a new extension to the "nmi=" command line
> option such that full register/stack state will get dumped.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1155,7 +1155,7 @@ Use the MWAIT idle driver (with model sp
>  of the ACPI based one.
>  
>  ### nmi
> -> `= ignore | dom0 | fatal`
> +> `= ignore | dom0 | fatal [,show-all]`
>  
>  > Default: `fatal` for a debug build, or `dom0` for a non-debug build
>  
> @@ -1163,6 +1163,9 @@ Specify what Xen should do in the event
>  `ignore` discards the error; `dom0` causes Xen to report the error to
>  dom0, while 'fatal' causes Xen to print diagnostics and then hang.
>  
> +The `show-all` modifier forces all CPUs' full state to be dumped upon
> +fatal NMIs (normally a result of the watchdog kicking in).
> +
>  ### noapic
>  
>  Instruct Xen to ignore any IOAPICs that are present in the system, and
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -84,10 +84,11 @@
>   *  dom0:   The NMI is virtualised to DOM0.
>   *  ignore: The NMI error is cleared and ignored.
>   */
> +static char __read_mostly opt_nmi[16] =
>  #ifdef NDEBUG
> -static char __read_mostly opt_nmi[10] = "dom0";
> + "dom0";
>  #else
> -static char __read_mostly opt_nmi[10] = "fatal";
> + "fatal";
>  #endif
>  string_param("nmi", opt_nmi);
>  
> @@ -525,6 +526,35 @@ void vcpu_show_execution_state(struct vc
>      vcpu_unpause(v);
>  }
>  
> +static cpumask_t nmi_show_state_mask;
> +static bool_t opt_nmi_show_all;
> +
> +static int __init get_nmi_show_all(void)
> +{
> +    const char *s = strchr(opt_nmi, ',');
> +
> +    if ( s && !strcmp(s + 1, "show-all") )
> +        opt_nmi_show_all = 1;
> +
> +    return 0;
> +}
> +presmp_initcall(get_nmi_show_all);
> +
> +static int nmi_show_execution_state(const struct cpu_user_regs *regs, int cpu)
> +{
> +    if ( !cpumask_test_cpu(cpu, &nmi_show_state_mask) )
> +        return 0;
> +
> +    if ( opt_nmi_show_all )
> +        show_execution_state(regs);
> +    else
> +        printk(XENLOG_ERR "CPU%d @ %04x:%08lx (%pS)\n", cpu, regs->cs, regs->rip,
> +               guest_mode(regs) ? _p(regs->rip) : NULL);
> +    cpumask_clear_cpu(cpu, &nmi_show_state_mask);

I would clear the mask before printing state.  Given the nature of this
handler, it liable to contend sufficiently on the console lock to induce
the further watchdog timeout.

> +
> +    return 1;
> +}
> +
>  static const char *trapstr(unsigned int trapnr)
>  {
>      static const char * const strings[] = {
> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>              printk("Faulting linear address: %p\n", _p(cr2));
>              show_page_walk(cr2);
>          }
> +        else if ( trapnr == TRAP_nmi )
> +        {
> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
> +                           cpumask_of(smp_processor_id()));
> +            set_nmi_callback(nmi_show_execution_state);
> +            smp_send_nmi_allbutself();

This would cause far less spinlock contention as

for_each_cpu( cpu, nmi_show_state_mask )
    smp_send_nmi(cpu);

I realise this involves introducing a new smp function, but it would
substantially reduce contention on the console lock.


I would recommend moving this clause into nmi_watchdog_tick(), so that
it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
have anything interesting to print from here.  I would also recommend
disabling the watchdog before IPI'ing.

~Andrew

Jan Beulich June 15, 2016, 7:55 a.m. UTC | #2

>>> On 14.06.16 at 17:03, <andrew.cooper3@citrix.com> wrote:
> On 14/06/16 15:33, Jan Beulich wrote:
>> @@ -525,6 +526,35 @@ void vcpu_show_execution_state(struct vc
>>      vcpu_unpause(v);
>>  }
>>  
>> +static cpumask_t nmi_show_state_mask;
>> +static bool_t opt_nmi_show_all;
>> +
>> +static int __init get_nmi_show_all(void)
>> +{
>> +    const char *s = strchr(opt_nmi, ',');
>> +
>> +    if ( s && !strcmp(s + 1, "show-all") )
>> +        opt_nmi_show_all = 1;
>> +
>> +    return 0;
>> +}
>> +presmp_initcall(get_nmi_show_all);
>> +
>> +static int nmi_show_execution_state(const struct cpu_user_regs *regs, int cpu)
>> +{
>> +    if ( !cpumask_test_cpu(cpu, &nmi_show_state_mask) )
>> +        return 0;
>> +
>> +    if ( opt_nmi_show_all )
>> +        show_execution_state(regs);
>> +    else
>> +        printk(XENLOG_ERR "CPU%d @ %04x:%08lx (%pS)\n", cpu, regs->cs, regs->rip,
>> +               guest_mode(regs) ? _p(regs->rip) : NULL);
>> +    cpumask_clear_cpu(cpu, &nmi_show_state_mask);
> 
> I would clear the mask before printing state.  Given the nature of this
> handler, it liable to contend sufficiently on the console lock to induce
> the further watchdog timeout.

I had it that way for a brief period of time, but it's wrong: It would
let the master continue its bringing down of the host before all CPUs
managed to print their stuff. And I don't see any issue with spin lock
contention here at all - performance is certainly not of any concern.

>> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>>              printk("Faulting linear address: %p\n", _p(cr2));
>>              show_page_walk(cr2);
>>          }
>> +        else if ( trapnr == TRAP_nmi )
>> +        {
>> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
>> +                           cpumask_of(smp_processor_id()));
>> +            set_nmi_callback(nmi_show_execution_state);
>> +            smp_send_nmi_allbutself();
> 
> This would cause far less spinlock contention as
> 
> for_each_cpu( cpu, nmi_show_state_mask )
>     smp_send_nmi(cpu);
> 
> I realise this involves introducing a new smp function, but it would
> substantially reduce contention on the console lock.

Again, I don't see why lock contention would matter here. And then
I also don't see how sending the IPIs individually would make matters
significantly better: The sending will surely finish much faster than
the printing.

> I would recommend moving this clause into nmi_watchdog_tick(), so that
> it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
> have anything interesting to print from here.  I would also recommend
> disabling the watchdog before IPI'ing.

And indeed I would have wanted it there, but I can't see how it can
reasonably be put there: fatal_trap() doesn't return, so we can't put
it after. And we definitely want to get state of the local CPU out
before we try to log state of any of the remote CPUs. So the only
options I see would be to
- somehow specially flag the regs structure, but that feels hackish
  (among other aspects nmi_watchdog_tick() has that parameter
  const qualified for the very reason that it isn't supposed to fiddle
  with it),
- introduce a global flag.

Jan

Andrew Cooper June 15, 2016, 11:03 a.m. UTC | #3

On 15/06/16 08:55, Jan Beulich wrote:
>>> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>>>              printk("Faulting linear address: %p\n", _p(cr2));
>>>              show_page_walk(cr2);
>>>          }
>>> +        else if ( trapnr == TRAP_nmi )
>>> +        {
>>> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
>>> +                           cpumask_of(smp_processor_id()));
>>> +            set_nmi_callback(nmi_show_execution_state);
>>> +            smp_send_nmi_allbutself();
>> This would cause far less spinlock contention as
>>
>> for_each_cpu( cpu, nmi_show_state_mask )
>>     smp_send_nmi(cpu);
>>
>> I realise this involves introducing a new smp function, but it would
>> substantially reduce contention on the console lock.
> Again, I don't see why lock contention would matter here. And then
> I also don't see how sending the IPIs individually would make matters
> significantly better: The sending will surely finish much faster than
> the printing.

Contention is a problem because you have replaced the NMI callback, and
the watchdog is still running.  Especially if sync_console is in effect,
you are liable to incur a further timeout, queueing up more NMIs.

Although now I think of it, that won't matter so long as the NMIs don't
nest.

The one advantage of sending the NMIs in order is that the information
dump will happen in order, which is slightly more use than having them
in a random order on a large machine.

>
>> I would recommend moving this clause into nmi_watchdog_tick(), so that
>> it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
>> have anything interesting to print from here.  I would also recommend
>> disabling the watchdog before IPI'ing.
> And indeed I would have wanted it there, but I can't see how it can
> reasonably be put there: fatal_trap() doesn't return, so we can't put
> it after. And we definitely want to get state of the local CPU out
> before we try to log state of any of the remote CPUs. So the only
> options I see would be to
> - somehow specially flag the regs structure, but that feels hackish
>   (among other aspects nmi_watchdog_tick() has that parameter
>   const qualified for the very reason that it isn't supposed to fiddle
>   with it),
> - introduce a global flag.

How about a boolean flag to fatal_trap()?  It doesn't have many callers,
and this kind of printing might also be useful for some MCEs.

~Andrew

Jan Beulich June 15, 2016, 12:59 p.m. UTC | #4

>>> On 15.06.16 at 13:03, <andrew.cooper3@citrix.com> wrote:
> On 15/06/16 08:55, Jan Beulich wrote:
>>>> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>>>>              printk("Faulting linear address: %p\n", _p(cr2));
>>>>              show_page_walk(cr2);
>>>>          }
>>>> +        else if ( trapnr == TRAP_nmi )
>>>> +        {
>>>> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
>>>> +                           cpumask_of(smp_processor_id()));
>>>> +            set_nmi_callback(nmi_show_execution_state);
>>>> +            smp_send_nmi_allbutself();
>>> This would cause far less spinlock contention as
>>>
>>> for_each_cpu( cpu, nmi_show_state_mask )
>>>     smp_send_nmi(cpu);
>>>
>>> I realise this involves introducing a new smp function, but it would
>>> substantially reduce contention on the console lock.
>> Again, I don't see why lock contention would matter here. And then
>> I also don't see how sending the IPIs individually would make matters
>> significantly better: The sending will surely finish much faster than
>> the printing.
> 
> Contention is a problem because you have replaced the NMI callback, and
> the watchdog is still running.  Especially if sync_console is in effect,
> you are liable to incur a further timeout, queueing up more NMIs.
> 
> Although now I think of it, that won't matter so long as the NMIs don't
> nest.
> 
> The one advantage of sending the NMIs in order is that the information
> dump will happen in order, which is slightly more use than having them
> in a random order on a large machine.

How that? All the NMIs will still arrive at about the same time, so
while some low numbered CPUs may indeed get their state printed
in order, higher numbered ones may still make it into the lock region
in any order. (And no, building upon ticket locks making randomness
much less likely is neither an option, nor would it really help: Just
think of a lower numbered CPU first having to come out of a deep
C-state or running at a much lower P-state than a higher numbered
one.)

>>> I would recommend moving this clause into nmi_watchdog_tick(), so that
>>> it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
>>> have anything interesting to print from here.  I would also recommend
>>> disabling the watchdog before IPI'ing.
>> And indeed I would have wanted it there, but I can't see how it can
>> reasonably be put there: fatal_trap() doesn't return, so we can't put
>> it after. And we definitely want to get state of the local CPU out
>> before we try to log state of any of the remote CPUs. So the only
>> options I see would be to
>> - somehow specially flag the regs structure, but that feels hackish
>>   (among other aspects nmi_watchdog_tick() has that parameter
>>   const qualified for the very reason that it isn't supposed to fiddle
>>   with it),
>> - introduce a global flag.
> 
> How about a boolean flag to fatal_trap()?  It doesn't have many callers,
> and this kind of printing might also be useful for some MCEs.

Ah, right, there indeed aren't that many. Can you qualify "some"
a bit better, so that maybe I can have the patch pass true there
right away?

Jan

Andrew Cooper June 15, 2016, 1:15 p.m. UTC | #5

On 15/06/16 13:59, Jan Beulich wrote:
>>>> On 15.06.16 at 13:03, <andrew.cooper3@citrix.com> wrote:
>> On 15/06/16 08:55, Jan Beulich wrote:
>>>>> @@ -570,6 +600,15 @@ void fatal_trap(const struct cpu_user_re
>>>>>              printk("Faulting linear address: %p\n", _p(cr2));
>>>>>              show_page_walk(cr2);
>>>>>          }
>>>>> +        else if ( trapnr == TRAP_nmi )
>>>>> +        {
>>>>> +            cpumask_andnot(&nmi_show_state_mask, &cpu_online_map,
>>>>> +                           cpumask_of(smp_processor_id()));
>>>>> +            set_nmi_callback(nmi_show_execution_state);
>>>>> +            smp_send_nmi_allbutself();
>>>> This would cause far less spinlock contention as
>>>>
>>>> for_each_cpu( cpu, nmi_show_state_mask )
>>>>     smp_send_nmi(cpu);
>>>>
>>>> I realise this involves introducing a new smp function, but it would
>>>> substantially reduce contention on the console lock.
>>> Again, I don't see why lock contention would matter here. And then
>>> I also don't see how sending the IPIs individually would make matters
>>> significantly better: The sending will surely finish much faster than
>>> the printing.
>> Contention is a problem because you have replaced the NMI callback, and
>> the watchdog is still running.  Especially if sync_console is in effect,
>> you are liable to incur a further timeout, queueing up more NMIs.
>>
>> Although now I think of it, that won't matter so long as the NMIs don't
>> nest.
>>
>> The one advantage of sending the NMIs in order is that the information
>> dump will happen in order, which is slightly more use than having them
>> in a random order on a large machine.
> How that? All the NMIs will still arrive at about the same time, so
> while some low numbered CPUs may indeed get their state printed
> in order, higher numbered ones may still make it into the lock region
> in any order. (And no, building upon ticket locks making randomness
> much less likely is neither an option, nor would it really help: Just
> think of a lower numbered CPU first having to come out of a deep
> C-state or running at a much lower P-state than a higher numbered
> one.)

Hmm true.  There isn't a acknowledgement of the start of the NMI
handler, and putting one in sounds like far more effort and fragility
than it is worth.

>
>>>> I would recommend moving this clause into nmi_watchdog_tick(), so that
>>>> it doesn't get involved for non-watchdog NMIs.  IOCK/SERR NMIs won't
>>>> have anything interesting to print from here.  I would also recommend
>>>> disabling the watchdog before IPI'ing.
>>> And indeed I would have wanted it there, but I can't see how it can
>>> reasonably be put there: fatal_trap() doesn't return, so we can't put
>>> it after. And we definitely want to get state of the local CPU out
>>> before we try to log state of any of the remote CPUs. So the only
>>> options I see would be to
>>> - somehow specially flag the regs structure, but that feels hackish
>>>   (among other aspects nmi_watchdog_tick() has that parameter
>>>   const qualified for the very reason that it isn't supposed to fiddle
>>>   with it),
>>> - introduce a global flag.
>> How about a boolean flag to fatal_trap()?  It doesn't have many callers,
>> and this kind of printing might also be useful for some MCEs.
> Ah, right, there indeed aren't that many. Can you qualify "some"
> a bit better, so that maybe I can have the patch pass true there
> right away?

Any MCE which is in-practice synchronous might benefit.  I wouldn't
worry about updating the MCE callers as part of this.

~Andrew

x86: show remote CPU state upon fatal NMI

Commit Message

Comments

Patch