
[RFC,x86/mce] Make mce_timed_out() identify holdout CPUs

Message ID 20210106174102.GA23874@paulmck-ThinkPad-P72 (mailing list archive)
State New, archived

Commit Message

Paul E. McKenney Jan. 6, 2021, 5:41 p.m. UTC
The "Timeout: Not all CPUs entered broadcast exception handler" message
will appear from time to time given enough systems, but this message does
not identify which CPUs failed to enter the broadcast exception handler.
This information would be valuable if available, for example, in order to
correlate with other hardware-oriented error messages.  This commit
therefore maintains a cpumask_t of CPUs that have entered this handler,
and prints out which ones failed to enter in the event of a timeout.

Build-tested only.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: <linux-edac@vger.kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Comments

Borislav Petkov Jan. 6, 2021, 6:32 p.m. UTC | #1
On Wed, Jan 06, 2021 at 09:41:02AM -0800, Paul E. McKenney wrote:
> The "Timeout: Not all CPUs entered broadcast exception handler" message
> will appear from time to time given enough systems, but this message does
> not identify which CPUs failed to enter the broadcast exception handler.
> This information would be valuable if available, for example, in order to
> correlated with other hardware-oriented error messages.

Because you're expecting that the CPUs which have not entered the
exception handler might have gotten stuck earlier, and that's the
correlation there?

> This commit

That's a tautology. :)

> therefore maintains a cpumask_t of CPUs that have entered this handler,
> and prints out which ones failed to enter in the event of a timeout.
> Build-tested only.
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: <x86@kernel.org>
> Cc: <linux-edac@vger.kernel.org>
> Reported-by: Jonathan Lemon <bsd@fb.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1c..44d2b99 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -878,6 +878,12 @@ static atomic_t mce_executing;
>  static atomic_t mce_callin;
>  
>  /*
> + * Track which CPUs entered and not in order to print holdouts.
> + */
> +static cpumask_t mce_present_cpus;
> +static cpumask_t mce_missing_cpus;
> +
> +/*
>   * Check if a timeout waiting for other CPUs happened.
>   */
>  static int mce_timed_out(u64 *t, const char *msg)
> @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
>  	if (!mca_cfg.monarch_timeout)
>  		goto out;
>  	if ((s64)*t < SPINUNIT) {
> -		if (mca_cfg.tolerant <= 1)
> +		if (mca_cfg.tolerant <= 1) {
> +			if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> +				pr_info("%s: MCE holdout CPUs: %*pbl\n",
> +					__func__, cpumask_pr_args(&mce_missing_cpus));
>  			mce_panic(msg, NULL, NULL);
> +		}
>  		cpu_missing = 1;
>  		return 1;
>  	}
> @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
>  	 * is updated before mce_callin.
>  	 */
>  	order = atomic_inc_return(&mce_callin);

Doesn't a single mce_callin_mask suffice?
Luck, Tony Jan. 6, 2021, 6:39 p.m. UTC | #2
> The "Timeout: Not all CPUs entered broadcast exception handler" message
> will appear from time to time given enough systems, but this message does
> not identify which CPUs failed to enter the broadcast exception handler.
> This information would be valuable if available, for example, in order to
> correlated with other hardware-oriented error messages.  This commit
> therefore maintains a cpumask_t of CPUs that have entered this handler,
> and prints out which ones failed to enter in the event of a timeout.

I tried doing this a while back, but found that in my test case where I forced
an error that would cause both threads from one core to be "missing", the
output was highly unpredictable. Some random number of extra CPUs were
reported as missing. After I added some extra breadcrumbs it became clear
that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
but some got hung up at various points beyond the entry point. My only theory
was that they were trying to snoop caches from the dead core (or access some
other resource held by the dead core) and so they hung too.

Your code is much neater than mine ... and perhaps works in other cases, but
maybe the message needs to allow for the fact that some of the cores that
are reported missing may just be collateral damage from the initial problem.

If I get time in the next day or two, I'll run my old test against your code to
see what happens.

-Tony
Paul E. McKenney Jan. 6, 2021, 7:13 p.m. UTC | #3
On Wed, Jan 06, 2021 at 07:32:44PM +0100, Borislav Petkov wrote:
> On Wed, Jan 06, 2021 at 09:41:02AM -0800, Paul E. McKenney wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlated with other hardware-oriented error messages.
> 
> Because you're expecting that the CPUs which have not entered the
> exception handler might have stuck earlier and that's the correlation
> there?

Or that there might have been any number of other error messages that
flagged that CPU.  For a few examples, watchdogs, hung tasks, and RCU
CPU stall warnings.

> > This commit
> 
> That's a tautology. :)

Not yet, it isn't!  Well, except in -rcu.  ;-)

> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
> > Build-tested only.
> > 
> > Cc: Tony Luck <tony.luck@intel.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: "H. Peter Anvin" <hpa@zytor.com>
> > Cc: <x86@kernel.org>
> > Cc: <linux-edac@vger.kernel.org>
> > Reported-by: Jonathan Lemon <bsd@fb.com>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> > index 13d3f1c..44d2b99 100644
> > --- a/arch/x86/kernel/cpu/mce/core.c
> > +++ b/arch/x86/kernel/cpu/mce/core.c
> > @@ -878,6 +878,12 @@ static atomic_t mce_executing;
> >  static atomic_t mce_callin;
> >  
> >  /*
> > + * Track which CPUs entered and not in order to print holdouts.
> > + */
> > +static cpumask_t mce_present_cpus;
> > +static cpumask_t mce_missing_cpus;
> > +
> > +/*
> >   * Check if a timeout waiting for other CPUs happened.
> >   */
> >  static int mce_timed_out(u64 *t, const char *msg)
> > @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
> >  	if (!mca_cfg.monarch_timeout)
> >  		goto out;
> >  	if ((s64)*t < SPINUNIT) {
> > -		if (mca_cfg.tolerant <= 1)
> > +		if (mca_cfg.tolerant <= 1) {
> > +			if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> > +				pr_info("%s: MCE holdout CPUs: %*pbl\n",
> > +					__func__, cpumask_pr_args(&mce_missing_cpus));
> >  			mce_panic(msg, NULL, NULL);
> > +		}
> >  		cpu_missing = 1;
> >  		return 1;
> >  	}
> > @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
> >  	 * is updated before mce_callin.
> >  	 */
> >  	order = atomic_inc_return(&mce_callin);
> 
> Doesn't a single mce_callin_mask suffice?

You are suggesting dropping mce_missing_cpus and just doing this?

if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))

I was worried (perhaps unnecessarily) about the possibility of CPUs
checking in during the printout operation, which would set rather than
clear the bit.  But perhaps the possible false positives that Tony points
out make this race not worth worrying about.

Thoughts?

							Thanx, Paul
Paul E. McKenney Jan. 6, 2021, 7:17 p.m. UTC | #4
On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlated with other hardware-oriented error messages.  This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
> 
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
> but some got hung up at various points beyond the entry point. My only theory
> was that they were trying to snoop caches from the dead core (or access some
> other resource held by the dead core) and so they hung too.
> 
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.

Understood.  The system is probably not in the best shape if this code
is ever executed, after all.  ;-)

So how about like this?

	pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

Easy enough if so!

> If I get time in the next day or two, I'll run my old test against your code to
> see what happens.

Thank you very much in advance!

For my own testing, is this still the right thing to use?

	https://github.com/andikleen/mce-inject

							Thanx, Paul
Luck, Tony Jan. 6, 2021, 10:49 p.m. UTC | #5
On Wed, Jan 06, 2021 at 11:17:08AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > > will appear from time to time given enough systems, but this message does
> > > not identify which CPUs failed to enter the broadcast exception handler.
> > > This information would be valuable if available, for example, in order to
> > > correlated with other hardware-oriented error messages.  This commit
> > > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > > and prints out which ones failed to enter in the event of a timeout.
> > 
> > I tried doing this a while back, but found that in my test case where I forced
> > an error that would cause both threads from one core to be "missing", the
> > output was highly unpredictable. Some random number of extra CPUs were
> > reported as missing. After I added some extra breadcrumbs it became clear
> > that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
> > but some got hung up at various points beyond the entry point. My only theory
> > was that they were trying to snoop caches from the dead core (or access some
> > other resource held by the dead core) and so they hung too.
> > 
> > Your code is much neater than mine ... and perhaps works in other cases, but
> > maybe the message needs to allow for the fact that some of the cores that
> > are reported missing may just be collateral damage from the initial problem.
> 
> Understood.  The system is probably not in the best shape if this code
> is ever executed, after all.  ;-)
> 
> So how about like this?
> 
> 	pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

That looks fine.
> 
> Easy enough if so!
> 
> > If I get time in the next day or two, I'll run my old test against your code to
> > see what happens.

I got time today (plenty of meetings in which to run experiments in background).

This code:

-               if (mca_cfg.tolerant <= 1)
+               if (mca_cfg.tolerant <= 1) {
+                       if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+                               pr_info("%s: MCE holdout CPUs: %*pbl\n",
+                                       __func__, cpumask_pr_args(&mce_missing_cpus));
                        mce_panic(msg, NULL, NULL);

didn't trigger ... so maybe that cpumask_andnot() didn't return the value you expected?

I added a:

+                       pr_info("%s: MCE present CPUs: %*pbl\n", __func__, cpumask_pr_args(&mce_present_cpus));

to check that the mask was being set correctly, and saw:

[  219.329767] mce: mce_timed_out: MCE present CPUs: 0-23,48-119,144-191

So every core of socket 1 failed to show up for this test.
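
For readers unfamiliar with the %*pbl output in that log line: it prints the set bits of a bitmap as a comma-separated range list ("0-23,48-119,144-191"). Below is a userspace sketch of that rendering, limited to a 64-bit mask; it is illustrative only and is not the kernel's bitmap_print implementation.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Render the set bits of a 64-bit mask as a %*pbl-style range list,
 * e.g. 0x0F0F -> "0-3,8-11".  Returns a static buffer for brevity. */
static const char *mask_to_ranges(uint64_t mask)
{
	static char buf[256];
	size_t pos = 0;
	int cpu = 0;

	buf[0] = '\0';
	while (cpu < 64) {
		if (!((mask >> cpu) & 1)) {
			cpu++;
			continue;
		}
		int start = cpu;
		while (cpu < 64 && ((mask >> cpu) & 1))
			cpu++;
		if (start == cpu - 1)
			pos += snprintf(buf + pos, sizeof(buf) - pos, "%s%d",
					pos ? "," : "", start);
		else
			pos += snprintf(buf + pos, sizeof(buf) - pos, "%s%d-%d",
					pos ? "," : "", start, cpu - 1);
	}
	return buf;
}

/* Convenience check used in testing this sketch. */
static int ranges_is(uint64_t mask, const char *expect)
{
	return strcmp(mask_to_ranges(mask), expect) == 0;
}
```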

> For my own testing, is this still the right thing to use?
> 
> 	https://github.com/andikleen/mce-inject

That fakes up errors (by hooking into the mce_rdmsr() code to return arbitrary
user supplied values).  The plus side of this is that you can fake any error
signature without needing special h/w or f/w. The downside is that it is all fake
and you can't create situations where some CPUs don't show up in the machine
check handler.

-Tony
Paul E. McKenney Jan. 6, 2021, 11:23 p.m. UTC | #6
On Wed, Jan 06, 2021 at 02:49:18PM -0800, Luck, Tony wrote:
> On Wed, Jan 06, 2021 at 11:17:08AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > > > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > > > will appear from time to time given enough systems, but this message does
> > > > not identify which CPUs failed to enter the broadcast exception handler.
> > > > This information would be valuable if available, for example, in order to
> > > > correlated with other hardware-oriented error messages.  This commit
> > > > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > > > and prints out which ones failed to enter in the event of a timeout.
> > > 
> > > I tried doing this a while back, but found that in my test case where I forced
> > > an error that would cause both threads from one core to be "missing", the
> > > output was highly unpredictable. Some random number of extra CPUs were
> > > reported as missing. After I added some extra breadcrumbs it became clear
> > > that pretty much all the CPUs (except the missing pair) entered do_machine_check(),
> > > but some got hung up at various points beyond the entry point. My only theory
> > > was that they were trying to snoop caches from the dead core (or access some
> > > other resource held by the dead core) and so they hung too.
> > > 
> > > Your code is much neater than mine ... and perhaps works in other cases, but
> > > maybe the message needs to allow for the fact that some of the cores that
> > > are reported missing may just be collateral damage from the initial problem.
> > 
> > Understood.  The system is probably not in the best shape if this code
> > is ever executed, after all.  ;-)
> > 
> > So how about like this?
> > 
> > 	pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
> 
> That looks fine.
> > 
> > Easy enough if so!
> > 
> > > If I get time in the next day or two, I'll run my old test against your code to
> > > see what happens.
> 
> I got time today (plenty of meetings in which to run experiments in background).

Thank you very much!

> This code:
> 
> -               if (mca_cfg.tolerant <= 1)
> +               if (mca_cfg.tolerant <= 1) {
> +                       if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
> +                               pr_info("%s: MCE holdout CPUs: %*pbl\n",
> +                                       __func__, cpumask_pr_args(&mce_missing_cpus));
>                         mce_panic(msg, NULL, NULL);
> 
> didn't trigger ... so maybe that cpumask_andnot() didn't return the value you expected?
> 
> I added a:
> 
> +                       pr_info("%s: MCE present CPUs: %*pbl\n", __func__, cpumask_pr_args(&mce_present_cpus));
> 
> to check that the mask was being set correctly, and saw:
> 
> [  219.329767] mce: mce_timed_out: MCE present CPUs: 0-23,48-119,144-191
> 
> So the every core of socket 1 failed to show up for this test.

I'll say that cpumask_andnot() didn't return the value I expected!
Mostly because idiot here somehow interpreted "If *@dstp is empty,
returns 0, else returns 1" as "Returns true if *dstp is empty".  So the
check is backwards.
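
The inverted return value is easy to model in plain C. Here is a userspace sketch (illustrative names, not kernel code) of cpumask_andnot()'s contract: it computes dst = src1 & ~src2 and returns nonzero exactly when the resulting mask is non-empty, which is the opposite of a "returns 0 on success" reading.

```c
#include <stdint.h>

struct mask_result {
	uint64_t dst;	/* src1 & ~src2 */
	int ret;	/* nonzero when dst is NON-empty */
};

/* Userspace model of cpumask_andnot() semantics on a 64-bit mask.
 * With src1 = online CPUs and src2 = CPUs that checked in, a nonzero
 * return means holdout CPUs exist, so that is the case to print. */
static struct mask_result mask_andnot(uint64_t src1, uint64_t src2)
{
	struct mask_result r;

	r.dst = src1 & ~src2;
	r.ret = (r.dst != 0);
	return r;
}
```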

Please see below for an updated patch.

> > For my own testing, is this still the right thing to use?
> > 
> > 	https://github.com/andikleen/mce-inject
> 
> That fakes up errors (by hooking into the mce_rdmsr() code to return arbitrary
> user supplied values).  The plus side of this is that you can fake any error
> signature without needing special h/w or f/w. The downside is that it is all fake
> and you can't create situations where some CPUs don't show up in the machine
> check handler.

So I would need to modify the code to test the code.  I have done worse
things, I suppose.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

x86/mce: Make mce_timed_out() identify holdout CPUs

The "Timeout: Not all CPUs entered broadcast exception handler" message
will appear from time to time given enough systems, but this message does
not identify which CPUs failed to enter the broadcast exception handler.
This information would be valuable if available, for example, in order to
correlate with other hardware-oriented error messages.  This commit
therefore maintains a cpumask_t of CPUs that have entered this handler,
and prints out which ones failed to enter in the event of a timeout.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: <linux-edac@vger.kernel.org>
[ paulmck: Fix cpumask_andnot() check + message per Tony Luck feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..7a6e1f3 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,12 @@ static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered and not in order to print holdouts.
+ */
+static cpumask_t mce_present_cpus;
+static cpumask_t mce_missing_cpus;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1)
+		if (mca_cfg.tolerant <= 1) {
+			if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+				pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
+					__func__, cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
+		}
 		cpu_missing = 1;
 		return 1;
 	}
@@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
 	 * is updated before mce_callin.
 	 */
 	order = atomic_inc_return(&mce_callin);
+	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
 
 	/*
 	 * Wait for everyone.
@@ -1114,6 +1125,7 @@ static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
+	cpumask_clear(&mce_present_cpus);
 	barrier();
 
 	/*
@@ -2712,6 +2724,7 @@ static void mce_reset(void)
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
+	cpumask_clear(&mce_present_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)
Luck, Tony Jan. 7, 2021, 12:26 a.m. UTC | #7
> Please see below for an updated patch.

Yes. That worked:

[   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
[   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

I guess that more than one CPU hit the timeout and so your new message was printed twice
before the panic code took over?

Once again, the whole of socket 1 is MIA rather than just the pair of threads on one of the cores there.
But that's a useful improvement (eliminating the other three sockets on this system).

Tested-by: Tony Luck <tony.luck@intel.com>

-Tony
Paul E. McKenney Jan. 7, 2021, 12:41 a.m. UTC | #8
On Thu, Jan 07, 2021 at 12:26:19AM +0000, Luck, Tony wrote:
> > Please see below for an updated patch.
> 
> Yes. That worked:
> 
> [   78.946069] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946151] mce: mce_timed_out: MCE holdout CPUs (may include false positives): 24-47,120-143
> [   78.946153] Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
> 
> I guess that more than one CPU hit the timeout and so your new message was printed twice
> before the panic code took over?

Could well be.

It would be easy to add a flag that allowed only one CPU to print the
message.  Does that make sense?  (See off-the-cuff probably-broken
delta patch below for one approach.)
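
The gate in the delta patch leans on atomic_xchg() returning the previous value: only the first CPU to flip the gate from 0 to 1 sees 0 come back and wins the right to print. A userspace C11 sketch of the same first-caller-wins pattern (illustrative names, not the kernel code):

```c
#include <stdatomic.h>

/* One-shot print gate: the first caller gets 1 (print), everyone
 * else gets 0.  The reset mirrors the mce_end()/mce_reset() clearing
 * in the delta patch, re-arming the gate for the next event. */
static atomic_int mce_print_gate;

static int i_should_print(void)
{
	return atomic_exchange(&mce_print_gate, 1) == 0;
}

static void gate_reset(void)
{
	atomic_store(&mce_print_gate, 0);
}
```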

> Once again, the whole of socket 1 is MIA rather than just the pair of threads on one of the cores there.
> But that's a useful improvement (eliminating the other three sockets on this system).
> 
> Tested-by: Tony Luck <tony.luck@intel.com>

Thank you very much!  I will apply this.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 7a6e1f3..b46ac56 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -882,6 +882,7 @@ static atomic_t mce_callin;
  */
 static cpumask_t mce_present_cpus;
 static cpumask_t mce_missing_cpus;
+static atomic_t mce_missing_cpus_gate;
 
 /*
  * Check if a timeout waiting for other CPUs happened.
@@ -900,7 +901,7 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1) {
+		if (mca_cfg.tolerant <= 1 && !atomic_xchg(&mce_missing_cpus_gate, 1)) {
 			if (cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
 				pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",
 					__func__, cpumask_pr_args(&mce_missing_cpus));
@@ -1017,6 +1018,7 @@ static int mce_start(int *no_way_out)
 	 */
 	order = atomic_inc_return(&mce_callin);
 	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 
 	/*
 	 * Wait for everyone.
@@ -1126,6 +1128,7 @@ static int mce_end(int order)
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 	barrier();
 
 	/*
@@ -2725,6 +2728,7 @@ static void mce_reset(void)
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
 	cpumask_clear(&mce_present_cpus);
+	atomic_set(&mce_missing_cpus_gate, 0);
 }
 
 static int fake_panic_get(void *data, u64 *val)
Borislav Petkov Jan. 7, 2021, 7:07 a.m. UTC | #9
On Wed, Jan 06, 2021 at 11:13:53AM -0800, Paul E. McKenney wrote:
> Not yet, it isn't!  Well, except in -rcu.  ;-)

Of course it is - saying "This commit" in this commit's commit message
is very much a tautology. :-)

> You are suggesting dropping mce_missing_cpus and just doing this?
> 
> if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))

Yes.

And pls don't call it "holdout CPUs" and change the order so that it is
more user-friendly (yap, you don't need __func__ either):

[   78.946153] mce: Not all CPUs (24-47,120-143) entered the broadcast exception handler.
[   78.946153] Kernel panic - not syncing: Timeout: MCA synchronization.

or so.

And that's fine if it appears twice as long as it is the same info - the
MCA code is one complex mess so you can probably guess why I'd like to
have new stuff added to it be as simplistic as possible.

> I was worried (perhaps unnecessarily) about the possibility of CPUs
> checking in during the printout operation, which would set rather than
> clear the bit.  But perhaps the possible false positives that Tony points
> out make this race not worth worrying about.
> 
> Thoughts?

Yah, apparently, it is not going to be a precise report as you wanted it
to be but at least it'll tell you which *sockets* you can rule out, if
not cores.

:-)
Paul E. McKenney Jan. 7, 2021, 5:08 p.m. UTC | #10
On Thu, Jan 07, 2021 at 08:07:24AM +0100, Borislav Petkov wrote:
> On Wed, Jan 06, 2021 at 11:13:53AM -0800, Paul E. McKenney wrote:
> > Not yet, it isn't!  Well, except in -rcu.  ;-)
> 
> Of course it is - saying "This commit" in this commit's commit message
> is very much a tautology. :-)

Tautology?  Maybe self-referential?  ;-)

> > You are suggesting dropping mce_missing_cpus and just doing this?
> > 
> > if (!cpumask_andnot(&mce_present_cpus, cpu_online_mask, &mce_present_cpus))
> 
> Yes.

I could drop mce_present_cpus, and then initialize mce_missing_cpus
to CPU_MASK_ALL, and have each CPU clear its bit on entry using
cpumask_clear_cpu().  Then cpumask_and() it with cpu_online_mask and
print it out.  That allows late-arriving CPUs to be properly accounted
for, most of the time, anyway.
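
That inverted-mask scheme can be sketched in userspace C as follows (illustrative names on a 64-bit mask; the real code uses cpumask_t, cpumask_clear_cpu(), and cpumask_and()): start with every bit set, meaning "all CPUs missing", have each CPU clear its own bit on entry, and report the leftover bits ANDed with the online mask. A CPU that checks in late simply clears its bit and drops out of the report.

```c
#include <stdint.h>

/* All bits set = no CPU has checked in yet. */
static uint64_t mce_missing = ~0ull;

/* Each CPU clears its own bit on entry to the handler. */
static void mce_cpu_entered(int cpu)
{
	mce_missing &= ~(1ull << cpu);
}

/* Holdouts are whatever is still set among the online CPUs. */
static uint64_t mce_report(uint64_t online_mask)
{
	return mce_missing & online_mask;
}

/* Mirrors cpumask_setall() in mce_end()/mce_reset(). */
static void mce_mask_reset(void)
{
	mce_missing = ~0ull;
}
```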

> And pls don't call it "holdout CPUs"

How about "missing CPUs"?  That is what I used in the meantime, please
see below.  If you have something you would prefer that it be called,
please let me know.

>                                      and change the order so that it is
> more user-friendly (yap, you don't need __func__ either):
> 
> [   78.946153] mce: Not all CPUs (24-47,120-143) entered the broadcast exception handler.
> [   78.946153] Kernel panic - not syncing: Timeout: MCA synchronization.
> 
> or so.

I removed __func__, but the pr_info() already precedes the mce_panic().
Do I need a udelay() after the pr_info() or some such?  If so, how long
should it spin?

> And that's fine if it appears twice as long as it is the same info - the
> MCA code is one complex mess so you can probably guess why I'd like to
> have new stuff added to it be as simplistic as possible.

Works for me.  ;-)

> > I was worried (perhaps unnecessarily) about the possibility of CPUs
> > checking in during the printout operation, which would set rather than
> > clear the bit.  But perhaps the possible false positives that Tony points
> > out make this race not worth worrying about.
> > 
> > Thoughts?
> 
> Yah, apparently, it is not going to be a precise report as you wanted it
> to be but at least it'll tell you which *sockets* you can rule out, if
> not cores.
> 
> :-)

Some information is usually better than none.  And I bet that failing
hardware is capable of all sorts of tricks at all sorts of levels.  ;-)

Updated patch below.  Is this what you had in mind?

							Thanx, Paul

------------------------------------------------------------------------

commit 4b4b57692fdd3b111098eda94df7529f85c54406
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Dec 23 17:04:19 2020 -0800

    x86/mce: Make mce_timed_out() identify holdout CPUs
    
    The "Timeout: Not all CPUs entered broadcast exception handler" message
    will appear from time to time given enough systems, but this message does
    not identify which CPUs failed to enter the broadcast exception handler.
    This information would be valuable if available, for example, in order to
    correlate with other hardware-oriented error messages.  This commit
    therefore maintains a cpumask_t of CPUs that have entered this handler,
    and prints out which ones failed to enter in the event of a timeout.
    
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: <x86@kernel.org>
    Cc: <linux-edac@vger.kernel.org>
    [ paulmck: Fix cpumask_andnot() check per Tony Luck testing and feedback. ]
    [ paulmck: Apply Borislav Petkov feedback. ]
    Reported-by: Jonathan Lemon <bsd@fb.com>
    Tested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..c83331b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,11 @@ static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered and not in order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +899,12 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1)
+		if (mca_cfg.tolerant <= 1) {
+			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+				pr_info("MCE missing CPUs (may include false positives): %*pbl\n",
+					cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
+		}
 		cpu_missing = 1;
 		return 1;
 	}
@@ -1006,6 +1015,7 @@ static int mce_start(int *no_way_out)
 	 * is updated before mce_callin.
 	 */
 	order = atomic_inc_return(&mce_callin);
+	cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
 
 	/*
 	 * Wait for everyone.
@@ -1114,6 +1124,7 @@ static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
+	cpumask_setall(&mce_missing_cpus);
 	barrier();
 
 	/*
@@ -2712,6 +2723,7 @@ static void mce_reset(void)
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
+	cpumask_setall(&mce_missing_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)
Borislav Petkov Jan. 8, 2021, 12:31 p.m. UTC | #11
On Thu, Jan 07, 2021 at 09:08:44AM -0800, Paul E. McKenney wrote:
> Some information is usually better than none.  And I bet that failing
> hardware is capable of all sorts of tricks at all sorts of levels.  ;-)

Tell me about it.

> Updated patch below.  Is this what you had in mind?

Ok, so I've massaged it into the below locally while taking another
detailed look. Made the pr_info pr_emerg and poked at the text more, as
I do. :)

Lemme know if something else needs to be adjusted, otherwise I'll queue
it.

Thx.

---
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Wed Dec 23 17:04:19 2020 -0800

    x86/mce: Make mce_timed_out() identify holdout CPUs
    
    The
    
      "Timeout: Not all CPUs entered broadcast exception handler"
    
    message will appear from time to time given enough systems, but this
    message does not identify which CPUs failed to enter the broadcast
    exception handler. This information would be valuable if available,
    for example, in order to correlate with other hardware-oriented error
    messages.
    
    Add a cpumask of CPUs which maintains which CPUs have entered this
    handler, and print out which ones failed to enter in the event of a
    timeout.
    
     [ bp: Massage. ]
    
    Reported-by: Jonathan Lemon <bsd@fb.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Tested-by: Tony Luck <tony.luck@intel.com>
    Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1cbda17..6c81d0998e0a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -877,6 +877,12 @@ static atomic_t mce_executing;
  */
 static atomic_t mce_callin;
 
+/*
+ * Track which CPUs entered the MCA broadcast synchronization and which not in
+ * order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
 /*
  * Check if a timeout waiting for other CPUs happened.
  */
@@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1)
+		if (mca_cfg.tolerant <= 1) {
+			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+					 cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
+		}
 		cpu_missing = 1;
 		return 1;
 	}
@@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
 	 * is updated before mce_callin.
 	 */
 	order = atomic_inc_return(&mce_callin);
+	cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
 
 	/*
 	 * Wait for everyone.
@@ -1114,6 +1125,7 @@ static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
+	cpumask_setall(&mce_missing_cpus);
 	barrier();
 
 	/*
@@ -2712,6 +2724,7 @@ static void mce_reset(void)
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
+	cpumask_setall(&mce_missing_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)
Paul E. McKenney Jan. 8, 2021, 2:55 p.m. UTC | #12
On Fri, Jan 08, 2021 at 01:31:56PM +0100, Borislav Petkov wrote:
> On Thu, Jan 07, 2021 at 09:08:44AM -0800, Paul E. McKenney wrote:
> > Some information is usually better than none.  And I bet that failing
> > hardware is capable of all sorts of tricks at all sorts of levels.  ;-)
> 
> Tell me about it.
> 
> > Updated patch below.  Is this what you had in mind?
> 
> Ok, so I've massaged it into the below locally while taking another
> detailed look. Made the pr_info pr_emerg and poked at the text more, as
> I do. :)
> 
> Lemme know if something else needs to be adjusted, otherwise I'll queue
> it.

Looks good to me!  I agree that your change to the pr_emerg() string is
much better than my original.  And good point on your added comment,
plus it was fun to see that my original "holdouts" wording has not
completely vanished.  ;-)

Thank you very much!!!

							Thanx, Paul

> Thx.
> 
> ---
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date:   Wed Dec 23 17:04:19 2020 -0800
> 
>     x86/mce: Make mce_timed_out() identify holdout CPUs
>     
>     The
>     
>       "Timeout: Not all CPUs entered broadcast exception handler"
>     
>     message will appear from time to time given enough systems, but this
>     message does not identify which CPUs failed to enter the broadcast
>     exception handler. This information would be valuable if available,
>     for example, in order to correlate with other hardware-oriented error
>     messages.
>     
>     Add a cpumask of CPUs which maintains which CPUs have entered this
>     handler, and print out which ones failed to enter in the event of a
>     timeout.
>     
>      [ bp: Massage. ]
>     
>     Reported-by: Jonathan Lemon <bsd@fb.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>     Signed-off-by: Borislav Petkov <bp@suse.de>
>     Tested-by: Tony Luck <tony.luck@intel.com>
>     Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 13d3f1cbda17..6c81d0998e0a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -877,6 +877,12 @@ static atomic_t mce_executing;
>   */
>  static atomic_t mce_callin;
>  
> +/*
> + * Track which CPUs entered the MCA broadcast synchronization and which not in
> + * order to print holdouts.
> + */
> +static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
> +
>  /*
>   * Check if a timeout waiting for other CPUs happened.
>   */
> @@ -894,8 +900,12 @@ static int mce_timed_out(u64 *t, const char *msg)
>  	if (!mca_cfg.monarch_timeout)
>  		goto out;
>  	if ((s64)*t < SPINUNIT) {
> -		if (mca_cfg.tolerant <= 1)
> +		if (mca_cfg.tolerant <= 1) {
> +			if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
> +				pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
> +					 cpumask_pr_args(&mce_missing_cpus));
>  			mce_panic(msg, NULL, NULL);
> +		}
>  		cpu_missing = 1;
>  		return 1;
>  	}
> @@ -1006,6 +1016,7 @@ static int mce_start(int *no_way_out)
>  	 * is updated before mce_callin.
>  	 */
>  	order = atomic_inc_return(&mce_callin);
> +	cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
>  
>  	/*
>  	 * Wait for everyone.
> @@ -1114,6 +1125,7 @@ static int mce_end(int order)
>  reset:
>  	atomic_set(&global_nwo, 0);
>  	atomic_set(&mce_callin, 0);
> +	cpumask_setall(&mce_missing_cpus);
>  	barrier();
>  
>  	/*
> @@ -2712,6 +2724,7 @@ static void mce_reset(void)
>  	atomic_set(&mce_executing, 0);
>  	atomic_set(&mce_callin, 0);
>  	atomic_set(&global_nwo, 0);
> +	cpumask_setall(&mce_missing_cpus);
>  }
>  
>  static int fake_panic_get(void *data, u64 *val)
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Jan. 8, 2021, 4:57 p.m. UTC | #13
On Fri, Jan 08, 2021 at 06:55:14AM -0800, Paul E. McKenney wrote:
> Looks good to me!  I agree that your change to the pr_emerg() string is
> much better than my original.

Well, the rest of the MCE code uses pr_emerg on that path so...

> And good point on your added comment, plus it was fun to see that my
> original "holdouts" wording has not completely vanished. ;-)

I had to leave it in seeing how you love that formulation. :-P

> Thank you very much!!!

Thanks too Paul!

/me queues.

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 13d3f1c..44d2b99 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -878,6 +878,12 @@  static atomic_t mce_executing;
 static atomic_t mce_callin;
 
 /*
+ * Track which CPUs entered and not in order to print holdouts.
+ */
+static cpumask_t mce_present_cpus;
+static cpumask_t mce_missing_cpus;
+
+/*
  * Check if a timeout waiting for other CPUs happened.
  */
 static int mce_timed_out(u64 *t, const char *msg)
@@ -894,8 +900,12 @@  static int mce_timed_out(u64 *t, const char *msg)
 	if (!mca_cfg.monarch_timeout)
 		goto out;
 	if ((s64)*t < SPINUNIT) {
-		if (mca_cfg.tolerant <= 1)
+		if (mca_cfg.tolerant <= 1) {
+			if (!cpumask_andnot(&mce_missing_cpus, cpu_online_mask, &mce_present_cpus))
+				pr_info("%s: MCE holdout CPUs: %*pbl\n",
+					__func__, cpumask_pr_args(&mce_missing_cpus));
 			mce_panic(msg, NULL, NULL);
+		}
 		cpu_missing = 1;
 		return 1;
 	}
@@ -1006,6 +1016,7 @@  static int mce_start(int *no_way_out)
 	 * is updated before mce_callin.
 	 */
 	order = atomic_inc_return(&mce_callin);
+	cpumask_set_cpu(smp_processor_id(), &mce_present_cpus);
 
 	/*
 	 * Wait for everyone.
@@ -1114,6 +1125,7 @@  static int mce_end(int order)
 reset:
 	atomic_set(&global_nwo, 0);
 	atomic_set(&mce_callin, 0);
+	cpumask_clear(&mce_present_cpus);
 	barrier();
 
 	/*
@@ -2712,6 +2724,7 @@  static void mce_reset(void)
 	atomic_set(&mce_executing, 0);
 	atomic_set(&mce_callin, 0);
 	atomic_set(&global_nwo, 0);
+	cpumask_clear(&mce_present_cpus);
 }
 
 static int fake_panic_get(void *data, u64 *val)