[RFC,5/5] GHES: Make NMI handler have a single reader

Message ID 1427448178-20689-6-git-send-email-bp@alien8.de (mailing list archive)
State RFC, archived

Commit Message

Borislav Petkov March 27, 2015, 9:22 a.m. UTC
From: Jiri Kosina <jkosina@suse.cz>

Since GHES sources are global, we theoretically need only a single CPU
reading them per NMI instead of a thundering herd of CPUs waiting on a
spinlock in NMI context for no reason at all.

Do that.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/acpi/apei/ghes.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Jiri Kosina April 1, 2015, 7:45 a.m. UTC | #1
On Fri, 27 Mar 2015, Borislav Petkov wrote:

> From: Jiri Kosina <jkosina@suse.cz>
> 
> Since GHES sources are global, we theoretically need only a single CPU
> reading them per NMI instead of a thundering herd of CPUs waiting on a
> spinlock in NMI context for no reason at all.

I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
really doesn't matter which CPU is reading the registers), but looking at 
the code more it actually seems that this is really the right thing to do.

Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
with the registers, performs apei_read() (which is ghes-source specific, 
but not CPU-specific) and unmaps the page again.

There is nothing that would make this CPU-specific. Adding Huang Ying (the 
original author of the code) to confirm this. Huang?

> Do that.

I think this should indeed be pushed forward. It fixes horrible spinlock 
contention on systems which are under NMI storm (such as when perf is 
active) unrelated to GHES.

Thanks,
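For reference, the read path Jiri describes reduces to roughly the following per-source sequence. This is a condensed sketch based on the 2015-era ghes.c/APEI helpers, not a verbatim excerpt: error handling and length checks are elided and the function name is made up, but the calls it strings together are the ones named above.

/* Sketch of the NMI-context estatus read: nothing below depends on
 * which CPU executes it -- the status address comes from the source's
 * HEST entry, not from any per-CPU state. */
static int ghes_read_estatus_sketch(struct ghes *ghes)
{
	u64 buf_paddr;
	void __iomem *vaddr;

	/* source-specific, CPU-agnostic register described by firmware */
	apei_read(&buf_paddr, &ghes->generic->error_status_address);
	if (!buf_paddr)
		return -ENOENT;

	/* exclusive NMI-mode mapping of the page holding the block */
	vaddr = ghes_ioremap_pfn_nmi(buf_paddr >> PAGE_SHIFT);
	memcpy_fromio(ghes->estatus, vaddr + (buf_paddr & ~PAGE_MASK),
		      sizeof(*ghes->estatus));
	ghes_iounmap_nmi(vaddr);

	return ghes->estatus->block_status ? 0 : -ENOENT;
}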
Borislav Petkov April 1, 2015, 1:49 p.m. UTC | #2
On Wed, Apr 01, 2015 at 09:45:53AM +0200, Jiri Kosina wrote:
> On Fri, 27 Mar 2015, Borislav Petkov wrote:
> 
> > From: Jiri Kosina <jkosina@suse.cz>
> > 
> > Since GHES sources are global, we theoretically need only a single CPU
> > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > spinlock in NMI context for no reason at all.
> 
> I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> really doesn't matter which CPU is reading the registers), but looking at 
> the code more it actually seems that this is really the right thing to do.
> 
> Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> with the registers, performs apei_read() (which is ghes-source specific, 
> but not CPU-specific) and unmaps the page again.
> 
> There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> original author of the code) to confirm this. Huang?
> 
> > Do that.
> 
> I think this should indeed be pushed forward. It fixes horrible spinlock 
> contention on systems which are under NMI storm (such as when perf is 
> active) unrelated to GHES.

Right, so I tested injecting an error without your patch and saw the same
behavior. So it all points at global sources, AFAICT. It would be cool,
though, if someone who knows the fw confirmed it unambiguously.
Jiri Kosina April 23, 2015, 8:39 a.m. UTC | #3
On Wed, 1 Apr 2015, Borislav Petkov wrote:

> > > From: Jiri Kosina <jkosina@suse.cz>
> > > 
> > > Since GHES sources are global, we theoretically need only a single CPU
> > > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > > spinlock in NMI context for no reason at all.
> > 
> > I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> > really doesn't matter which CPU is reading the registers), but looking at 
> > the code more it actually seems that this is really the right thing to do.
> > 
> > Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> > with the registers, performs apei_read() (which is ghes-source specific, 
> > but not CPU-specific) and unmaps the page again.
> > 
> > There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> > original author of the code) to confirm this. Huang?
> > 
> > > Do that.
> > 
> > I think this should indeed be pushed forward. It fixes horrible spinlock 
> > contention on systems which are under NMI storm (such as when perf is 
> > active) unrelated to GHES.
> 
> Right, so I tested injecting an error without your patch and saw the same
> behavior. So it all points at global sources, AFAICT. It would be cool,
> though, if someone who knows the fw confirmed it unambiguously.

Three weeks have passed, therefore I find this an appropriate time for a 
friendly ping :)

Rafael? Naoya? Huang?

This fixes a spinlock contention problem in NMI context observed on real HW, so 
it would be really nice to have it fixed.

Thanks,
Borislav Petkov April 23, 2015, 8:59 a.m. UTC | #4
On Thu, Apr 23, 2015 at 10:39:58AM +0200, Jiri Kosina wrote:
> Three weeks have passed, therefore I find this an appropriate time for a 
> friendly ping :)
> 
> Rafael? Naoya? Huang?
> 
> This fixes a spinlock contention problem in NMI context observed on real HW, so 
> it would be really nice to have it fixed.

I think we should apply this.

Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
accesses. It iterates over the list of ghes sources which do NMI
notification but those sources are the *same* regardless of which core
does the access as their addresses are per-source, i.e. in that

	struct acpi_generic_address error_status_address;

thing.

And it is a safe bet to say that all that error information is
serialized in the firmware for the error source to consume.

So I'm going to route this through the RAS tree unless Rafael wants to
take it.

Ok?

Tony, objections?
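The per-source address Boris points at comes straight from the source's HEST entry. A condensed sketch of the relationship, with fields trimmed and the layout recalled from the 2015-era ghes.h/actbl1.h rather than quoted verbatim:

/* Each GHES instance wraps one firmware-described error source. */
struct ghes {
	struct acpi_hest_generic *generic;	/* the HEST entry */
	struct acpi_hest_generic_status *estatus;
	struct list_head list;			/* on the ghes_nmi list */
	/* ... */
};

/*
 * generic->error_status_address is a struct acpi_generic_address,
 * i.e. a firmware-fixed location. Reading through it yields the same
 * error status block no matter which CPU does the read -- the source
 * is global, so any single reader will do.
 */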
Tony Luck April 23, 2015, 6 p.m. UTC | #5
> I think we should apply this.
>
> Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
> accesses

This looks to be true.

> Tony, objections?

No objections.

-Tony
Lv Zheng April 27, 2015, 3:16 a.m. UTC | #6
Hi,

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Friday, March 27, 2015 5:23 PM
> 
> From: Jiri Kosina <jkosina@suse.cz>
> 
> Since GHES sources are global, we theoretically need only a single CPU
> reading them per NMI instead of a thundering herd of CPUs waiting on a
> spinlock in NMI context for no reason at all.
> 
> Do that.
> 
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  drivers/acpi/apei/ghes.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 94a44bad5576..2bfd53cbfe80 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -729,10 +729,10 @@ static struct llist_head ghes_estatus_llist;
>  static struct irq_work ghes_proc_irq_work;
> 
>  /*
> - * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
> - * mutual exclusion.
> + * NMI may be triggered on any CPU, so ghes_in_nmi is used for
> + * having only one concurrent reader.
>   */
> -static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
> +static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
> 
>  static LIST_HEAD(ghes_nmi);
> 
> @@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  	struct ghes *ghes;
>  	int sev, ret = NMI_DONE;
> 
> -	raw_spin_lock(&ghes_nmi_lock);
> +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> +		return ret;
> +

Just a simple question.
Why not just use cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.

Thanks and best regards
-Lv

>  	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
>  		if (ghes_read_estatus(ghes, 1)) {
>  			ghes_clear_estatus(ghes);
> @@ -863,7 +865,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	irq_work_queue(&ghes_proc_irq_work);
>  #endif
> -	raw_spin_unlock(&ghes_nmi_lock);
> +	atomic_dec(&ghes_in_nmi);
>  	return ret;
>  }
> 
> --
> 2.3.3

Borislav Petkov April 27, 2015, 8:46 a.m. UTC | #7
On Mon, Apr 27, 2015 at 03:16:00AM +0000, Zheng, Lv wrote:
> > @@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> >  	struct ghes *ghes;
> >  	int sev, ret = NMI_DONE;
> > 
> > -	raw_spin_lock(&ghes_nmi_lock);
> > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > +		return ret;
> > +
> 
> Just a simple question.
> Why not just use cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.

What do you think atomic_add_unless ends up doing:

#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
	.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
671:
	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP

And you need to atomic_dec() so that another reader can enter, i.e. how
the exclusion primitive works.

Or did you have something else in mind?
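To see the shape of the primitive outside the kernel, here is a small user-space model with C11 atomics -- an illustration, not the kernel code; try_enter()/leave() are invented names. atomic_add_unless(&v, 1, 1) amounts to a plain load plus a compare-and-swap that is skipped entirely when the gate is already taken, and atomic_dec() reopens the gate:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int in_nmi;

static bool try_enter(void)
{
	int c = atomic_load(&in_nmi);
	if (c == 1)		/* fast path: read-only, no locked op */
		return false;
	/* slow path: one locked compare-and-swap, as in the kernel */
	return atomic_compare_exchange_strong(&in_nmi, &c, c + 1);
}

static void leave(void)
{
	atomic_fetch_sub(&in_nmi, 1);	/* the atomic_dec() equivalent */
}

int main(void)
{
	bool first = try_enter();	/* true: the gate is now taken */
	bool second = try_enter();	/* false: bails on the plain load */

	printf("first=%d second=%d\n", first, second);
	leave();			/* reopen for the next "NMI" */
	return 0;
}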
Borislav Petkov April 27, 2015, 8:23 p.m. UTC | #8
On Thu, Apr 23, 2015 at 06:00:15PM +0000, Luck, Tony wrote:
> > I think we should apply this.
> >
> > Here's why: nothing in the ghes_notify_nmi() handler does CPU-specific
> > accesses
> 
> This looks to be true.
> 
> > Tony, objections?
> 
> No objections.

Thanks, queued for 4.2, pending one last test I'm doing on a machine
which says

[   24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
[   24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
[   24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs

during boot.
Lv Zheng April 28, 2015, 12:44 a.m. UTC | #9
Hi,

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Monday, April 27, 2015 4:47 PM
> 
> On Mon, Apr 27, 2015 at 03:16:00AM +0000, Zheng, Lv wrote:
> > > @@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> > >  	struct ghes *ghes;
> > >  	int sev, ret = NMI_DONE;
> > >
> > > -	raw_spin_lock(&ghes_nmi_lock);
> > > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > > +		return ret;
> > > +
> >
> > Just a simple question.
> > Why not just use cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.
> 
> What do you think atomic_add_unless ends up doing:
> 
> #APP
> # 177 "./arch/x86/include/asm/atomic.h" 1
> 	.pushsection .smp_locks,"a"
> .balign 4
> .long 671f - .
> .popsection
> 671:
> 	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
> # 0 "" 2
> #NO_APP
> 
> And you need to atomic_dec() so that another reader can enter, i.e. how
> the exclusion primitive works.
> 
> Or did you have something else in mind?

My mistake.
I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.

But IMO, atomic_add_unless() is implemented via cmpxchg on many architectures.
And it might be better to use it directly here, which is a bit faster, as you actually only need one value switch here.

Thanks and best regards
-Lv

> 
> --
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
Lv Zheng April 28, 2015, 2:24 a.m. UTC | #10
Hi,

> From: Zheng, Lv
> Sent: Tuesday, April 28, 2015 8:44 AM
> 
> Hi,
> 
> > From: Borislav Petkov [mailto:bp@alien8.de]
> > Sent: Monday, April 27, 2015 4:47 PM
> >
> > On Mon, Apr 27, 2015 at 03:16:00AM +0000, Zheng, Lv wrote:
> > > > @@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
> > > >  	struct ghes *ghes;
> > > >  	int sev, ret = NMI_DONE;
> > > >
> > > > -	raw_spin_lock(&ghes_nmi_lock);
> > > > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > > > +		return ret;
> > > > +
> > >
> > > Just a simple question.
> > > Why not just use cmpxchg here instead of atomic_add_unless so that no atomic_dec will be needed.
> >
> > What do you think atomic_add_unless ends up doing:
> >
> > 	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
> >
> > And you need to atomic_dec() so that another reader can enter, i.e. how
> > the exclusion primitive works.
> >
> > Or did you have something else in mind?
> 
> My mistake.
> I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.

Let me correct, it should be atomic_cmpxchg() and atomic_set() here as you only need to switch between 0 and 1.
Sorry for the noise.

Thanks and best regards
-Lv
Borislav Petkov April 28, 2015, 7:38 a.m. UTC | #11
On Tue, Apr 28, 2015 at 02:24:16AM +0000, Zheng, Lv wrote:
> > > #APP
> > > # 177 "./arch/x86/include/asm/atomic.h" 1
> > > 	.pushsection .smp_locks,"a"
> > > .balign 4
> > > .long 671f - .
> > > .popsection
> > > 671:
> > > 	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37056, MEM[(volatile u32 *)&ghes_in_nmi]
> > > # 0 "" 2
> > > #NO_APP
> > >
> > > And you need to atomic_dec() so that another reader can enter, i.e. how
> > > the exclusion primitive works.
> > >
> > > Or did you have something else in mind?
> > 
> > My mistake.
> > I mean cmpxchg() and xchg() (or atomic_cmpxchg() and atomic_xchg()) pair here, so nothing can be reduced.
> 
> Let me correct, it should be atomic_cmpxchg() and atomic_set() here as you only need to switch between 0 and 1.
> Sorry for the noise.

I still don't understand what you want from me here. You need to go into
more detail.
Lv Zheng April 28, 2015, 1:38 p.m. UTC | #12
Hi,

I was talking about this patch.

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Friday, March 27, 2015 5:23 PM
> Subject: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
> 
> From: Jiri Kosina <jkosina@suse.cz>
> 
> Since GHES sources are global, we theoretically need only a single CPU
> reading them per NMI instead of a thundering herd of CPUs waiting on a
> spinlock in NMI context for no reason at all.
> 
> Do that.
> 
> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  drivers/acpi/apei/ghes.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 94a44bad5576..2bfd53cbfe80 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -729,10 +729,10 @@ static struct llist_head ghes_estatus_llist;
>  static struct irq_work ghes_proc_irq_work;
> 
>  /*
> - * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
> - * mutual exclusion.
> + * NMI may be triggered on any CPU, so ghes_in_nmi is used for
> + * having only one concurrent reader.
>   */
> -static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
> +static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
> 
>  static LIST_HEAD(ghes_nmi);
> 
> @@ -840,7 +840,9 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  	struct ghes *ghes;
>  	int sev, ret = NMI_DONE;
> 
> -	raw_spin_lock(&ghes_nmi_lock);
> +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> +		return ret;
> +

if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
	return ret;

>  	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
>  		if (ghes_read_estatus(ghes, 1)) {
>  			ghes_clear_estatus(ghes);
> @@ -863,7 +865,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	irq_work_queue(&ghes_proc_irq_work);
>  #endif
> -	raw_spin_unlock(&ghes_nmi_lock);
> +	atomic_dec(&ghes_in_nmi);

atomic_set(&ghes_in_nmi, 0);

It seems most of the drivers (under drivers/) are written this way,
while users of atomic_add_unless() are rare.

Can this work for you?

Thanks and best regards
-Lv

>  	return ret;
>  }
> 
> --
> 2.3.3

Borislav Petkov April 28, 2015, 1:59 p.m. UTC | #13
On Tue, Apr 28, 2015 at 01:38:41PM +0000, Zheng, Lv wrote:
> > -	raw_spin_lock(&ghes_nmi_lock);
> > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > +		return ret;
> > +
> 
> if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
> 	return ret;

Ok, now I understand what you mean.

We absolutely want to use atomic_add_unless() because we get to save us
the expensive

	LOCK; CMPXCHG

if the value was already 1. Which is exactly what this patch is trying
to avoid - a thundering herd of cores CMPXCHGing a global variable.

I.e.,

	movl	ghes_in_nmi(%rip), %ecx	# MEM[(const int *)&ghes_in_nmi], c
	cmpl	$1, %ecx	#, c
	je	.L311	#,				<--- exit here if ghes_in_nmi == 1.
	leal	1(%rcx), %edx	#, D.37163
	movl	%ecx, %eax	# c, c
#APP
# 177 "./arch/x86/include/asm/atomic.h" 1
	.pushsection .smp_locks,"a"
.balign 4
.long 671f - .
.popsection
671:
	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37163, MEM[(volatile u32 *)&ghes_in_nmi]
# 0 "" 2
#NO_APP
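The structure behind that asm is worth spelling out. Simplified from the era's x86 __atomic_add_unless() (likely()/unlikely() annotations dropped, name changed for this sketch), the primitive compares a plain load before ever issuing the locked instruction:

/* Returns the old value; atomic_add_unless(v, 1, 1) succeeds only if
 * that old value was not 1. */
static inline int atomic_add_unless_sketch(atomic_t *v, int a, int u)
{
	int c, old;

	c = atomic_read(v);	/* plain MOV, no LOCK */
	for (;;) {
		if (c == u)	/* already taken: no locked op at all */
			break;
		old = atomic_cmpxchg(v, c, c + a);
		if (old == c)	/* we won the race */
			break;
		c = old;	/* lost it; retry with the fresh value */
	}
	return c;
}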
Don Zickus April 28, 2015, 2:30 p.m. UTC | #14
On Wed, Apr 01, 2015 at 09:45:53AM +0200, Jiri Kosina wrote:
> On Fri, 27 Mar 2015, Borislav Petkov wrote:
> 
> > From: Jiri Kosina <jkosina@suse.cz>
> > 
> > Since GHES sources are global, we theoretically need only a single CPU
> > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > spinlock in NMI context for no reason at all.
> 
> I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> really doesn't matter which CPU is reading the registers), but looking at 
> the code more it actually seems that this is really the right thing to do.
> 
> Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> with the registers, performs apei_read() (which is ghes-source specific, 
> but not CPU-specific) and unmaps the page again.
> 
> There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> original author of the code) to confirm this. Huang?

Hi,

I believe the answer to this question is no, they are not global but
instead external.  All external NMIs are routed to one cpu, normally cpu0.
This spinlock was made global to handle the 'someday' case of hotplugging
the bsp cpu (cpu0).

The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
problem.  I tried using an irq_workqueue to work around quirks there and
PeterZ wasn't very fond of it (though he couldn't think of a better way to
solve it).

This patch seems interesting but you might still run into the thundering
herd problem with the global spinlock in
arch/x86/kernel/nmi.c::default_do_nmi().  That functions grabs a global
spinlock before processing the external NMI list (which GHES is a part of).


So I am assuming this patch solves the 'thundering herd' problem by
minimizing all the useless writes the spinlock would do for each cpu that
noticed it had no work to do?

In that case, I am in favor of this solution and would like to apply a
similar solution to arch/x86/kernel/nmi.c, to see if that helps there too.

Cheers,
Don


> 
> > Do that.
> 
> I think this should indeed be pushed forward. It fixes horrible spinlock 
> contention on systems which are under NMI storm (such as when perf is 
> active) unrelated to GHES.
> 
> Thanks,
> 
> -- 
> Jiri Kosina
> SUSE Labs
Don Zickus April 28, 2015, 2:42 p.m. UTC | #15
On Tue, Apr 28, 2015 at 10:30:09AM -0400, Don Zickus wrote:
> On Wed, Apr 01, 2015 at 09:45:53AM +0200, Jiri Kosina wrote:
> > On Fri, 27 Mar 2015, Borislav Petkov wrote:
> > 
> > > From: Jiri Kosina <jkosina@suse.cz>
> > > 
> > > Since GHES sources are global, we theoretically need only a single CPU
> > > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > > spinlock in NMI context for no reason at all.
> > 
> > I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> > really doesn't matter which CPU is reading the registers), but looking at 
> > the code more it actually seems that this is really the right thing to do.
> > 
> > Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> > with the registers, performs apei_read() (which is ghes-source specific, 
> > but not CPU-specific) and unmaps the page again.
> > 
> > There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> > original author of the code) to confirm this. Huang?
> 
> Hi,
> 
> I believe the answer to this question is no, they are not global but
> instead external.  All external NMIs are routed to one cpu, normally cpu0.
> This spinlock was made global to handle the 'someday' case of hotplugging
> the bsp cpu (cpu0).
> 
> The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
> problem.  I tried using an irq_workqueue to work around quirks there and
> PeterZ wasn't very fond of it (though he couldn't think of a better way to
> solve it).
> 
> This patch seems interesting but you might still run into the thundering
> herd problem with the global spinlock in
> arch/x86/kernel/nmi.c::default_do_nmi().  That functions grabs a global
> spinlock before processing the external NMI list (which GHES is a part of).

Grr, I misspoke.  I sent a patchset a year ago to split out internal and
external NMIs to simplify the problem.  So I wrote the above paragraph
thinking the GHES NMI handler was wrapped with the external NMI spinlock,
when in fact it isn't.  However, with perf running with lots of events, it
is possible to start 'swallowing' NMIs which requires passing through the
spinlock I just mentioned.  This might cause random delays in your
measurements and is still worth modifying.

Cheers,
Don

> 
> 
> So I am assuming this patch solves the 'thundering herd' problem by
> minimizing all the useless writes the spinlock would do for each cpu that
> noticed it had no work to do?
> 
> In that case, I am in favor of this solution and would like to apply a
> similar solution to arch/x86/kernel/nmi.c, to see if that helps there too.
> 
> Cheers,
> Don
> 
> 
> > 
> > > Do that.
> > 
> > I think this should indeed be pushed forward. It fixes horrible spinlock 
> > contention on systems which are under NMI storm (such as when perf is 
> > active) unrelated to GHES.
> > 
> > Thanks,
> > 
> > -- 
> > Jiri Kosina
> > SUSE Labs
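For orientation, the dispatch order being discussed, condensed from the 2015-era arch/x86/kernel/nmi.c (a sketch: back-to-back/swallowing logic and error paths are elided). ghes_notify_nmi() sits on the NMI_LOCAL list, which runs unserialized and before the nmi_reason_lock section guarding the legacy port-0x61 reasons:

static void default_do_nmi(struct pt_regs *regs)
{
	unsigned char reason;
	int handled;

	/* NMI_LOCAL handlers -- perf, GHES, ... -- polled on every NMI */
	handled = nmi_handle(NMI_LOCAL, regs, false);
	if (handled)
		return;

	raw_spin_lock(&nmi_reason_lock);	/* global */
	reason = x86_platform.get_nmi_reason();	/* legacy port 0x61 */
	if (reason & NMI_REASON_MASK) {
		/* SERR / IO_CHECK external NMIs handled under the lock */
		raw_spin_unlock(&nmi_reason_lock);
		return;
	}
	raw_spin_unlock(&nmi_reason_lock);

	/* nobody claimed it */
	nmi_handle(NMI_UNKNOWN, regs, false);
}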
Borislav Petkov April 28, 2015, 2:55 p.m. UTC | #16
On Tue, Apr 28, 2015 at 10:30:09AM -0400, Don Zickus wrote:
> On Wed, Apr 01, 2015 at 09:45:53AM +0200, Jiri Kosina wrote:
> > On Fri, 27 Mar 2015, Borislav Petkov wrote:
> > 
> > > From: Jiri Kosina <jkosina@suse.cz>
> > > 
> > > Since GHES sources are global, we theoretically need only a single CPU
> > > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > > spinlock in NMI context for no reason at all.
> > 
> > I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> > really doesn't matter which CPU is reading the registers), but looking at 
> > the code more it actually seems that this is really the right thing to do.
> > 
> > Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> > with the registers, performs apei_read() (which is ghes-source specific, 
> > but not CPU-specific) and unmaps the page again.
> > 
> > There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> > original author of the code) to confirm this. Huang?
> 
> Hi,
> 
> I believe the answer to this question is no, they are not global but
> instead external.  All external NMIs are routed to one cpu, normally cpu0.

Actually, the things we're talking about here are not global NMIs but
global error sources. I.e., the GHES sources which use an NMI to report
errors with and which we noodle over in ghes_notify_nmi() twice:

	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
		...

> This spinlock was made global to handle the 'someday' case of hotplugging
> the bsp cpu (cpu0).
> 
> The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
> problem.  I tried using an irq_workqueue to work around quirks there and
> PeterZ wasn't very fond of it (though he couldn't think of a better way to
> solve it).
> 
> This patch seems interesting but you might still run into the thundering
> herd problem with the global spinlock in
> arch/x86/kernel/nmi.c::default_do_nmi().  That functions grabs a global
> spinlock before processing the external NMI list (which GHES is a part of).

nmi_reason_lock? And our handler is registered with
register_nmi_handler() and so we iterate over it in nmi_handle(). It'd
be interesting to know what NMI reason we get for those GHES NMIs so
that we know whether nmi_handle() is called under the lock or not...

> So I am assuming this patch solves the 'thundering herd' problem by
> minimizing all the useless writes the spinlock would do for each cpu that
> noticed it had no work to do?

Not that spinlock. It used to take another one in ghes_notify_nmi().
Removing it solved the issue. There were even boxes which would do:

[   24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
[   24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
[   24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs

just like that during boot.

It was basically a lot of purely useless lock grabbing for no reason
whatsoever.

Thanks.
Don Zickus April 28, 2015, 3:35 p.m. UTC | #17
On Tue, Apr 28, 2015 at 04:55:48PM +0200, Borislav Petkov wrote:
> On Tue, Apr 28, 2015 at 10:30:09AM -0400, Don Zickus wrote:
> > On Wed, Apr 01, 2015 at 09:45:53AM +0200, Jiri Kosina wrote:
> > > On Fri, 27 Mar 2015, Borislav Petkov wrote:
> > > 
> > > > From: Jiri Kosina <jkosina@suse.cz>
> > > > 
> > > > Since GHES sources are global, we theoretically need only a single CPU
> > > > reading them per NMI instead of a thundering herd of CPUs waiting on a
> > > > spinlock in NMI context for no reason at all.
> > > 
> > > I originally wasn't 100% sure whether GHES sources are global (i.e. if it 
> > > really doesn't matter which CPU is reading the registers), but looking at 
> > > the code more it actually seems that this is really the right thing to do.
> > > 
> > > Rationale: ghes_ioremap_pfn_nmi() always ioremaps() (exclusively) the page 
> > > with the registers, performs apei_read() (which is ghes-source specific, 
> > > but not CPU-specific) and unmaps the page again.
> > > 
> > > There is nothing that would make this CPU-specific. Adding Huang Ying (the 
> > > original author of the code) to confirm this. Huang?
> > 
> > Hi,
> > 
> > I believe the answer to this question is no, they are not global but
> > instead external.  All external NMIs are routed to one cpu, normally cpu0.
> 
> Actually, the things we're talking about here are not global NMIs but
> global error sources. I.e., the GHES sources which use an NMI to report
> errors with and which we noodle over in ghes_notify_nmi() twice:
> 
> 	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {

Sure, most systems use NMI to report them I believe (at least the ones I
have played with).  Regardless, I have hit performance problems on this lock
while doing perf testing.

I tried to work around it by splitting the NMI list into an
nmi_register(INTERNAL,...) and nmi_register(EXTERNAL,...) to allow perf to
avoid hitting this lock.  But the fallout of that change included missed
NMIs and various solutions from that resulted in complexities that blocked
inclusion last year.

Your solution seems much simpler. :-)

> 		...
> 
> > This spinlock was made global to handle the 'someday' case of hotplugging
> > the bsp cpu (cpu0).
> > 
> > The other external NMIs (IO_CHECK and SERR) suffer from the same spinlock
> > problem.  I tried using an irq_workqueue to work around quirks there and
> > PeterZ wasn't very fond of it (though he couldn't think of a better way to
> > solve it).
> > 
> > This patch seems interesting but you might still run into the thundering
> > herd problem with the global spinlock in
> > arch/x86/kernel/nmi.c::default_do_nmi().  That functions grabs a global
> > spinlock before processing the external NMI list (which GHES is a part of).
> 
> nmi_reason_lock? And our handler is registered with
> register_nmi_handler() and so we iterate over it in nmi_handle(). It'd
> be interesting to know what NMI reason we get for those GHES NMIs so
> that we know whether nmi_handle() is called under the lock or not...

I followed up in another email stating I misspoke.  I forgot this still
uses the NMI_LOCAL shared NMI.  So every perf NMI, will also call the GHES
handler to make sure NMIs did not piggy back each other.  So I don't believe
the NMI reason lock is called a majority of the time (except when the NMI is
swallowed, but that is under heavy perf load...).


> 
> > So I am assuming this patch solves the 'thundering herd' problem by
> > minimizing all the useless writes the spinlock would do for each cpu that
> > noticed it had no work to do?
> 
> Not that spinlock. It used to take another one in ghes_notify_nmi().
> Removing it solved the issue. There were even boxes which would do:
> 
> [   24.332560] INFO: NMI handler (ghes_notify_nmi) took too long to run: 3.265 msecs
> [   24.332567] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.946 msecs
> [   24.332568] INFO: NMI handler (ghes_notify_nmi) took too long to run: 5.948 msecs
> 
> just like that during boot.
> 
> It was basically a lot of purely useless lock grabbing for no reason
> whatsoever.

Yes, I meant the 'raw_spin_lock(&ghes_nmi_lock);' one.  I poorly typed my
email and confused you into thinking I was referring to the wrong one.

Well, it isn't exactly no reason, but more along the lines of 'we do not
know who gets the external NMI, so everyone tries to talk with the GHES
firmware'.  But I think that is just me squabbling.

We both agree the mechanics of the spinlock are overkill here and cause much
cache contention.  Simplifying it to just 'reads' and return removes most of
the problem.

Cheers,
Don
Borislav Petkov April 28, 2015, 4:22 p.m. UTC | #18
On Tue, Apr 28, 2015 at 11:35:21AM -0400, Don Zickus wrote:
> Your solution seems much simpler. :-)

... and I love simpler :-)

> I followed up in another email stating I misspoke.  I forgot this still
> uses the NMI_LOCAL shared NMI.  So every perf NMI, will also call the GHES
> handler to make sure NMIs did not piggy back each other.  So I don't believe

And this is something we should really fix - perf and RAS should
not have anything to do with each other. But I don't know the NMI
code well enough to even have an idea how. I don't even know whether we can
differentiate NMIs, hell, I can't imagine the hardware giving us a
different NMI reason through get_nmi_reason(). Maybe that byte returned
from NMI_REASON_PORT is too small and hangs on too much legacy crap to
even be usable. Questions over questions...

> the NMI reason lock is called a majority of the time (except when the NMI is
> swallowed, but that is under heavy perf load...).

..

> We both agree the mechanics of the spinlock are overkill here and cause much
> cache contention.  Simplifying it to just 'reads' and return removes most of
> the problem.

Right.
Don Zickus April 28, 2015, 6:44 p.m. UTC | #19
On Tue, Apr 28, 2015 at 06:22:29PM +0200, Borislav Petkov wrote:
> On Tue, Apr 28, 2015 at 11:35:21AM -0400, Don Zickus wrote:
> > Your solution seems much simpler. :-)
> 
> ... and I love simpler :-)
> 
> > I followed up in another email stating I misspoke.  I forgot this still
> > uses the NMI_LOCAL shared NMI.  So every perf NMI, will also call the GHES
> > handler to make sure NMIs did not piggy back each other.  So I don't believe
> 
> And this is something we should really fix - perf and RAS should
> not have anything to do with each other. But I don't know the NMI
> code well enough to even have an idea how. I don't even know whether we can
> differentiate NMIs, hell, I can't imagine the hardware giving us a
> different NMI reason through get_nmi_reason(). Maybe that byte returned
> from NMI_REASON_PORT is too small and hangs on too much legacy crap to
> even be usable. Questions over questions...

:-)  Well, let me first clear up some of your questions. 

RAS doesn't go through the legacy ports (ie get_nmi_reason()).  Instead it
triggers the external NMI through a different bit (ioapic I think).

The nmi code has no idea what io_remap'ed address apei is using to map its
error handling register that GHES uses.  Unlike the legacy port which is
always port 0x61.

So NMI is basically a shared interrupt, with no ability to discern
who sent the interrupt (and, even worse, no ability to know how _many_ were sent, as
the NMI is edge triggered instead of level triggered).  As a result we rely
on the NMI handlers to talk to their address space/registers to determine if
they were the source of the interrupt.

Now I can agree that perf and RAS have nothing to do with each other, but
they both use NMI to interrupt.  Perf is fortunate enough to be internal to
each cpu and therefore needs no global lock unlike GHES (hence part of the
problem).

The only way to determine who sent the NMI is to have each handler read its
register, which is time consuming for GHES.

Of course, we could go back to playing tricks knowing that external NMIs
like GHES and IO_CHECK/SERR are only routed to one cpu (cpu0 mainly) and
optimize things that way, but that inhibits the bsp cpu hotplugging folks.



I also played tricks like last year's patchset that split out the
nmi_handlers into LOCAL and EXTERNAL queues.  Perf would be part of the
LOCAL queue while GHES was part of the EXTERNAL queue.  The thought was to
never touch the EXTERNAL queue if perf claimed an NMI.  This led to all
sorts of missed external NMIs, so it didn't work out.


Anyway, any ideas or thoughts for improvement are always welcomed. :-) 


Cheers,
Don

> 
> > the NMI reason lock is called a majority of the time (except when the NMI is
> > swallowed, but that is under heavy perf load...).
> 
> ..
> 
> > We both agree the mechanics of the spinlock are overkill here and cause much
> > cache contention.  Simplifying it to just 'reads' and return removes most of
> > the problem.
> 
> Right.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
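The "each handler probes its own registers" point is visible in the dispatcher itself. A condensed sketch of nmi_handle() from the same era (registration locking and duration tracking elided):

static int nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
{
	struct nmi_desc *desc = nmi_to_desc(type);
	struct nmiaction *a;
	int handled = 0;

	rcu_read_lock();
	/* No source ID from the hardware: poll every registered handler
	 * and let each one probe its own device to claim the NMI. */
	list_for_each_entry_rcu(a, &desc->head, list)
		handled += a->handler(type, regs);
	rcu_read_unlock();

	return handled;		/* 0 => unknown/unclaimed NMI */
}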
Lv Zheng April 29, 2015, 12:24 a.m. UTC | #20
Hi,

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Tuesday, April 28, 2015 9:59 PM
> Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
> 
> On Tue, Apr 28, 2015 at 01:38:41PM +0000, Zheng, Lv wrote:
> > > -	raw_spin_lock(&ghes_nmi_lock);
> > > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > > +		return ret;
> > > +
> >
> > if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
> > 	return ret;
> 
> Ok, now I understand what you mean.
> 
> We absolutely want to use atomic_add_unless() because we get to save us
> the expensive
> 
> 	LOCK; CMPXCHG
> 
> if the value was already 1. Which is exactly what this patch is trying
> to avoid - a thundering herd of cores CMPXCHGing a global variable.

IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipeline hazard.

Thanks and best regards
-Lv

> 
> I.e.,
> 
> 	movl	ghes_in_nmi(%rip), %ecx	# MEM[(const int *)&ghes_in_nmi], c
> 	cmpl	$1, %ecx	#, c
> 	je	.L311	#,				<--- exit here if ghes_in_nmi == 1.
> 	leal	1(%rcx), %edx	#, D.37163
> 	movl	%ecx, %eax	# c, c
> #APP
> # 177 "./arch/x86/include/asm/atomic.h" 1
> 	.pushsection .smp_locks,"a"
> .balign 4
> .long 671f - .
> .popsection
> 671:
> 	lock; cmpxchgl %edx,ghes_in_nmi(%rip)	# D.37163, MEM[(volatile u32 *)&ghes_in_nmi]
> # 0 "" 2
> #NO_APP
> 
> --
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
Lv Zheng April 29, 2015, 12:49 a.m. UTC | #21
Hi,

> From: Zheng, Lv
> Sent: Wednesday, April 29, 2015 8:25 AM
> Subject: RE: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
> 
> Hi,
> 
> > From: Borislav Petkov [mailto:bp@alien8.de]
> > Sent: Tuesday, April 28, 2015 9:59 PM
> > Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
> >
> > On Tue, Apr 28, 2015 at 01:38:41PM +0000, Zheng, Lv wrote:
> > > > -	raw_spin_lock(&ghes_nmi_lock);
> > > > +	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
> > > > +		return ret;
> > > > +
> > >
> > > if (atomic_cmpxchg(&ghes_in_nmi, 0, 1))
> > > 	return ret;
> >
> > Ok, now I understand what you mean.
> >
> > We absolutely want to use atomic_add_unless() because we get to save us
> > the expensive
> >
> > 	LOCK; CMPXCHG
> >
> > if the value was already 1. Which is exactly what this patch is trying
> > to avoid - a thundering herd of cores CMPXCHGing a global variable.
> 
> IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
> And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipeline hazard.

If you mean the LOCK prefix, I understand now.

Thanks and best regards
-Lv
Borislav Petkov April 29, 2015, 8:13 a.m. UTC | #22
On Wed, Apr 29, 2015 at 12:49:59AM +0000, Zheng, Lv wrote:
> > > We absolutely want to use atomic_add_unless() because we get to save us
> > > the expensive
> > >
> > > 	LOCK; CMPXCHG
> > >
> > > if the value was already 1. Which is exactly what this patch is trying
> > > to avoid - a thundering herd of cores CMPXCHGing a global variable.
> > 
> > IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
> > And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipeline hazard.

Even if CMPXCHG is being split into several microops, they all still
need to flow down the pipe and require resources and tracking. And you
only know at retire time what the CMP result is and can "discard" the
XCHG part. Provided the uarch is smart enough to do that.

This is probably why CMPXCHG needs 5,6,7,10,22,... cycles depending on
uarch and vendor, if I can trust Agner Fog's tables. And I bet those
numbers are best-case only and in real-life they probably tend to fall
out even worse.

CMP needs only 1. On almost every uarch and vendor. And even that cycle
probably gets hidden with a good branch predictor.

> If you mean the LOCK prefix, I understand now.

And that makes it several times worse: 22, 40, 80, ... cycles.
Lv Zheng April 30, 2015, 8:05 a.m. UTC | #23
Hi,

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Wednesday, April 29, 2015 4:14 PM
> Subject: Re: [RFC PATCH 5/5] GHES: Make NMI handler have a single reader
> 
> On Wed, Apr 29, 2015 at 12:49:59AM +0000, Zheng, Lv wrote:
> > > > We absolutely want to use atomic_add_unless() because we get to save us
> > > > the expensive
> > > >
> > > > 	LOCK; CMPXCHG
> > > >
> > > > if the value was already 1. Which is exactly what this patch is trying
> > > > to avoid - a thundering herd of cores CMPXCHGing a global variable.
> > >
> > > IMO, on most architectures, the "cmp" part should work just like what you've done with "if".
> > > And on some architectures, if the "xchg" doesn't happen, the "cmp" part even won't cause a pipeline hazard.
> 
> Even if CMPXCHG is being split into several microops, they all still
> need to flow down the pipe and require resources and tracking. And you
> only know at retire time what the CMP result is and can "discard" the
> XCHG part. Provided the uarch is smart enough to do that.
> 
> This is probably why CMPXCHG needs 5,6,7,10,22,... cycles depending on
> uarch and vendor, if I can trust Agner Fog's tables. And I bet those
> numbers are best-case only and in real-life they probably tend to fall
> out even worse.
> 
> CMP needs only 1. On almost every uarch and vendor. And even that cycle
> probably gets hidden with a good branch predictor.

Are there any such data around the SC and LL (MIPS)?

> 
> > If you mean the LOCK prefix, I understand now.
> 
> And that makes it several times worse: 22, 40, 80, ... cycles.

I'm OK if the code still keeps the readability then.

Thanks and best regards
-Lv

> 
> --
> Regards/Gruss,
>     Boris.
> 
> ECO tip #101: Trim your mails when you reply.
> --
Borislav Petkov April 30, 2015, 8:48 a.m. UTC | #24
On Thu, Apr 30, 2015 at 08:05:12AM +0000, Zheng, Lv wrote:
> Are there any such data around the SC and LL (MIPS)?

What, you can't search the internet yourself?
Lv Zheng May 2, 2015, 12:34 a.m. UTC | #25
Hi,

> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Thursday, April 30, 2015 4:49 PM
> 
> On Thu, Apr 30, 2015 at 08:05:12AM +0000, Zheng, Lv wrote:
> > Are there any such data around the SC and LL (MIPS)?
> 
> What, you can't search the internet yourself?

I mean if LL can do what you exactly did with "if" and no ABA problem can break the atomic_add_unless() users.
Then no atomic_cmpxchg() users might be broken for the same reason.
Why don't you put an "if" in the atomic_cmpxchg() to allow other users to have the same benefit because of the data?

But this is out of the context for this patch.
I actually don't have any API usage preference now.

Thanks
-Lv
Borislav Petkov May 4, 2015, 3:40 p.m. UTC | #26
On Tue, Apr 28, 2015 at 02:44:28PM -0400, Don Zickus wrote:
> RAS doesn't go through the legacy ports (ie get_nmi_reason()).  Instead it
> triggers the external NMI through a different bit (ioapic I think).

Well, I see it getting registered with __register_nmi_handler() which
adds it to the NMI_LOCAL type, i.e., ghes_notify_nmi() gets called by

default_do_nmi
|-> nmi_handle(NMI_LOCAL, regs, b2b);

AFAICT.

Which explains also the issue we were seeing as that handler is called
on each NMI, even when the machine is running a perf workload.

> The nmi code has no idea what io_remap'ed address apei is using to map its
> error handling register that GHES uses.  Unlike the legacy port which is
> always port 0x61.
> 
> So, with NMI being basically a shared interrupt, with no ability to discern
> who sent the interrupt (and even worse no ability to know how _many_ were sent as
> the NMI is edge triggered instead of level triggered).  As a result we rely
> on the NMI handlers to talk to their address space/registers to determine if
> they were they source of the interrupt.

I was afraid it would be something like that. We probably should poke hw
people to extend that NMI fun so that we can know who caused it.

<snip stuff I agree with>

> Anyway, any ideas or thoughts for improvement are always welcomed. :-)

Yeah, I'm afraid without hw support, that won't be doable. We need the
hw to tell us who caused the NMI. Otherwise we'll be round-robining
(:-)) through handlers like nuts.
Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 94a44bad5576..2bfd53cbfe80 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -729,10 +729,10 @@  static struct llist_head ghes_estatus_llist;
 static struct irq_work ghes_proc_irq_work;
 
 /*
- * NMI may be triggered on any CPU, so ghes_nmi_lock is used for
- * mutual exclusion.
+ * NMI may be triggered on any CPU, so ghes_in_nmi is used for
+ * having only one concurrent reader.
  */
-static DEFINE_RAW_SPINLOCK(ghes_nmi_lock);
+static atomic_t ghes_in_nmi = ATOMIC_INIT(0);
 
 static LIST_HEAD(ghes_nmi);
 
@@ -840,7 +840,9 @@  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 	struct ghes *ghes;
 	int sev, ret = NMI_DONE;
 
-	raw_spin_lock(&ghes_nmi_lock);
+	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
+		return ret;
+
 	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
 		if (ghes_read_estatus(ghes, 1)) {
 			ghes_clear_estatus(ghes);
@@ -863,7 +865,7 @@  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 #ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
 	irq_work_queue(&ghes_proc_irq_work);
 #endif
-	raw_spin_unlock(&ghes_nmi_lock);
+	atomic_dec(&ghes_in_nmi);
 	return ret;
 }