diff mbox

Linux 2.6.39-rc3

Message ID 20110415131152.GJ18463@8bytes.org (mailing list archive)
State New, archived
Headers show

Commit Message

Joerg Roedel April 15, 2011, 1:11 p.m. UTC
On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
>  we definitely want to also understand the reason for things not
> working, even if we do revert..

Okay, here it is.

After experimenting with different configurations for the north-bridge
it turned out that a GART related MCE fires at the time the machine
reboots. BIOSes configure the machine to sync-flood in that case which
causes a reboot.

After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
errors can happen if devices (speculativly) access GART ranges mapped
invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
at all. But unfortunatly some BIOSes (including the one on my laptop)
forget to do this.

Below is a patch which disables these errors if the BIOS didn't do it.
It fixes the problem on my site.

Alexandre, can you try this patch on your machine too, please?

Regards,

	Joerg

From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
From: Joerg Roedel <joerg.roedel@amd.com>
Date: Fri, 15 Apr 2011 14:47:40 +0200
Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it

This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do is (or is just too old). Letting
these errors enabled can cause a sync-flood on the CPU
causing a reboot.

This patch is the fix for

	https://bugzilla.kernel.org/show_bug.cgi?id=33012

on my machine.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
 arch/x86/include/asm/msr-index.h |    4 ++++
 arch/x86/kernel/cpu/amd.c        |   19 +++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

Comments

Ingo Molnar April 15, 2011, 1:16 p.m. UTC | #1
* Joerg Roedel <joro@8bytes.org> wrote:

> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> >  we definitely want to also understand the reason for things not
> > working, even if we do revert..
> 
> Okay, here it is.
> 
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
> 
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculativly) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunatly some BIOSes (including the one on my laptop)
> forget to do this.
> 
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.

Ok, but how did the allocation changes start triggering this error in 
v2.6.39-rc1? There must still be some layout specific thing here, right?
Do we understand the details of that as well?

Thanks,

	Ingo
Andreas Herrmann April 15, 2011, 2:04 p.m. UTC | #2
On Fri, Apr 15, 2011 at 03:11:52PM +0200, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> >  we definitely want to also understand the reason for things not
> > working, even if we do revert..
> 
> Okay, here it is.
> 
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
> 
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculativly) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunatly some BIOSes (including the one on my laptop)
> forget to do this.
> 
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.
> 
> Alexandre, can you try this patch on your machine too, please?
> 
> Regards,
> 
> 	Joerg
> 
> From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
> From: Joerg Roedel <joerg.roedel@amd.com>
> Date: Fri, 15 Apr 2011 14:47:40 +0200
> Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
> 
> This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
> the BIOS forgets to do is (or is just too old). Letting
> these errors enabled can cause a sync-flood on the CPU
> causing a reboot.
> 
> This patch is the fix for
> 
> 	https://bugzilla.kernel.org/show_bug.cgi?id=33012
> 
> on my machine.
> 
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>


Joerg,

What about tagging this patch for stable/longterm releases?

Potentially there are other cases where certain combinations of
hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
the BIOS doesn't follow the BKDG recommendation to mask these errors,
the system will hang/reboot. Thus I think having this quirk in .32 and
.38 (at least) is useful.


Andreas
Alexandre Demers April 15, 2011, 2:16 p.m. UTC | #3
On 11-04-15 09:11 AM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
>>  we definitely want to also understand the reason for things not
>> working, even if we do revert..
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculativly) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunatly some BIOSes (including the one on my laptop)
> forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.
>
> Alexandre, can you try this patch on your machine too, please?
>
> Regards,
>
> 	Joerg
>
> From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
> From: Joerg Roedel <joerg.roedel@amd.com>
> Date: Fri, 15 Apr 2011 14:47:40 +0200
> Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
>
> This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
> the BIOS forgets to do is (or is just too old). Letting
> these errors enabled can cause a sync-flood on the CPU
> causing a reboot.
>
> This patch is the fix for
>
> 	https://bugzilla.kernel.org/show_bug.cgi?id=33012
>
> on my machine.
>
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
>  arch/x86/include/asm/msr-index.h |    4 ++++
>  arch/x86/kernel/cpu/amd.c        |   19 +++++++++++++++++++
>  2 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index fd5a1f3..3cce714 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -96,11 +96,15 @@
>  #define MSR_IA32_MC0_ADDR		0x00000402
>  #define MSR_IA32_MC0_MISC		0x00000403
>  
> +#define MSR_AMD64_MC0_MASK		0xc0010044
> +
>  #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
>  #define MSR_IA32_MCx_STATUS(x)		(MSR_IA32_MC0_STATUS + 4*(x))
>  #define MSR_IA32_MCx_ADDR(x)		(MSR_IA32_MC0_ADDR + 4*(x))
>  #define MSR_IA32_MCx_MISC(x)		(MSR_IA32_MC0_MISC + 4*(x))
>  
> +#define MSR_AMD64_MCx_MASK(x)		(MSR_AMD64_MC0_MASK + (x))
> +
>  /* These are consecutive and not in the normal 4er MCE bank block */
>  #define MSR_IA32_MC0_CTL2		0x00000280
>  #define MSR_IA32_MCx_CTL2(x)		(MSR_IA32_MC0_CTL2 + (x))
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 3ecece0..3532d3b 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
>  	/* As a rule processors have APIC timer running in deep C states */
>  	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
>  		set_cpu_cap(c, X86_FEATURE_ARAT);
> +
> +	/*
> +	 * Disable GART TLB Walk Errors on Fam10h. We do this here
> +	 * because this is always needed when GART is enabled, even in a
> +	 * kernel which has no MCE support built in.
> +	 */
> +	if (c->x86 == 0x10) {
> +		/*
> +		 * BIOS should disable GartTlbWlk Errors themself. If
> +		 * it doesn't do it here as suggested by the BKDG.
> +		 *
> +		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
> +		 */
> +		u64 mask;
> +
> +		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +		mask |= (1 << 10);
> +		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +	}
>  }
>  
>  #ifdef CONFIG_X86_32
Ok, I'll test it today. Should I apply it on a clean rc3 without any of
the other patches?

BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
This could be useful in the future.

I'll keep you up to date.
Joerg Roedel April 15, 2011, 2:27 p.m. UTC | #4
On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> the other patches?

Yes, apply it just on -rc3 without any other patch.

> 
> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> This could be useful in the future.

Cool, thanks


	Joerg
Joerg Roedel April 15, 2011, 2:28 p.m. UTC | #5
On Fri, Apr 15, 2011 at 04:04:45PM +0200, Andreas Herrmann wrote:
> What about tagging this patch for stable/longterm releases?
> 
> Potentially there are other cases where certain combinations of
> hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
> the BIOS doesn't follow the BKDG recommendation to mask these errors,
> the system will hang/reboot. Thus I think having this quirk in .32 and
> .38 (at least) is useful.

Right, thats certainly a good idea. The problem is not specific to GPUs,
any other device can trigger this too.

	Joerg
Joerg Roedel April 15, 2011, 2:33 p.m. UTC | #6
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
> Ok, but how did the allocation changes start triggering this error in 
> v2.6.39-rc1? There must still be some layout specific thing here, right?
> Do we understand the details of that as well?

No, I must admit that I lack enough knowledge about the GPU hardware to
make an guess how this tanslation-request happened. All I can tell is
the address that was reported in the MCE, it is 0xa0001000 (==the second
page of the GART aperture).

Maybe Alex can help here. Alex, may it be possible that the GPU
generates DMA requests in the GTT area before the GTT is activated (or
the activation is completed)? Or can you imagine any other reason?

	Joerg
Joerg Roedel April 15, 2011, 3:46 p.m. UTC | #7
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
> Ok, but how did the allocation changes start triggering this error in 
> v2.6.39-rc1? There must still be some layout specific thing here, right?
> Do we understand the details of that as well?

Well, thinking again about this, the GPU likely generated this DMA
request before too (which has an address in the range configured for the
GTT on the card), but nobody noticed because they just hit main memory.
And with the allocation changes in 39-rc1 the GART aperture started to
be on the same address as the GTT (in their respective address spaces)
so that the DMA request hit the GART. This caused the MCE and the
sync-flood.
The open question is why the GPU generates a DMA request with an address
that is configured as the GTT base (+1 page) on the card.

	Joerg
Alex Deucher April 15, 2011, 4:11 p.m. UTC | #8
On Fri, Apr 15, 2011 at 10:33 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
>> Ok, but how did the allocation changes start triggering this error in
>> v2.6.39-rc1? There must still be some layout specific thing here, right?
>> Do we understand the details of that as well?
>
> No, I must admit that I lack enough knowledge about the GPU hardware to
> make an guess how this tanslation-request happened. All I can tell is
> the address that was reported in the MCE, it is 0xa0001000 (==the second
> page of the GART aperture).
>
> Maybe Alex can help here. Alex, may it be possible that the GPU
> generates DMA requests in the GTT area before the GTT is activated (or
> the activation is completed)? Or can you imagine any other reason?

It shouldn't.  The driver binds a dummy page to all entries in the
table at init time and whenever the actual pages are unbound.

Alex
Jerome Glisse April 15, 2011, 4:11 p.m. UTC | #9
On Fri, Apr 15, 2011 at 11:46 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
>> Ok, but how did the allocation changes start triggering this error in
>> v2.6.39-rc1? There must still be some layout specific thing here, right?
>> Do we understand the details of that as well?
>
> Well, thinking again about this, the GPU likely generated this DMA
> request before too (which has an address in the range configured for the
> GTT on the card), but nobody noticed because they just hit main memory.
> And with the allocation changes in 39-rc1 the GART aperture started to
> be on the same address as the GTT (in their respective address spaces)
> so that the DMA request hit the GART. This caused the MCE and the
> sync-flood.
> The open question is why the GPU generates a DMA request with an address
> that is configured as the GTT base (+1 page) on the card.
>
>        Joerg
>

Do you also got the write if you load radeon with radeon.no_wb=1 ?
I think at this address it's the wb page, or maybe the cp as wb likely
take only one page

Cheers,
Jerome
Alexandre Demers April 15, 2011, 6:59 p.m. UTC | #10
On 11-04-15 10:27 AM, Joerg Roedel wrote:
> On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
>> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
>> the other patches?
> Yes, apply it just on -rc3 without any other patch.
>
>> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
>> This could be useful in the future.
> Cool, thanks
>
>
> 	Joerg
The patch was applied and tested. It looks fine, I'm able to boot
without problem.
Ingo Molnar April 15, 2011, 7:06 p.m. UTC | #11
* Alexandre Demers <alexandre.f.demers@gmail.com> wrote:

> On 11-04-15 10:27 AM, Joerg Roedel wrote:
> > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
> >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> >> the other patches?
> > Yes, apply it just on -rc3 without any other patch.
> >
> >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> >> This could be useful in the future.
> > Cool, thanks
> >
> >
> > 	Joerg
> The patch was applied and tested. It looks fine, I'm able to boot
> without problem.

Joerg, mind submitting it with a changelog that includes everything we learned 
about this bug and all the Tested-by's in place?

Is anyone of the opinion that we should try to revert the allocation 
order/alignment changes in addition to this fix?

Thanks,

	Ingo
Yinghai Lu April 15, 2011, 7:18 p.m. UTC | #12
On 04/15/2011 12:06 PM, Ingo Molnar wrote:

> 
> Joerg, mind submitting it with a changelog that includes everything we learned 
> about this bug and all the Tested-by's in place?
> 
> Is anyone of the opinion that we should try to revert the allocation 
> order/alignment changes in addition to this fix?

We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup.

Joerg,
can you insert some dump code in the drm/radon code to find out which function cause the problem?

Thanks

Yinghai
H. Peter Anvin April 15, 2011, 8:22 p.m. UTC | #13
On 04/15/2011 12:18 PM, Yinghai Lu wrote:
> On 04/15/2011 12:06 PM, Ingo Molnar wrote:
> 
>>
>> Joerg, mind submitting it with a changelog that includes everything we learned 
>> about this bug and all the Tested-by's in place?
>>
>> Is anyone of the opinion that we should try to revert the allocation 
>> order/alignment changes in addition to this fix?
> 
> We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup.
> 
> Joerg,
> can you insert some dump code in the drm/radon code to find out which function cause the problem?
> 

Yes, I would like to make sure we don't just paper over a real bug
(again).  I think we still should talk Joerg's patch since it seems to
be the right thing to do anyway, but I do want to make sure we don't
have a memory-overwrite bug in the kernel that we're papering over.

	-hpa
Joerg Roedel April 16, 2011, noon UTC | #14
On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote:
> 
> * Alexandre Demers <alexandre.f.demers@gmail.com> wrote:
> 
> > On 11-04-15 10:27 AM, Joerg Roedel wrote:
> > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
> > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> > >> the other patches?
> > > Yes, apply it just on -rc3 without any other patch.
> > >
> > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> > >> This could be useful in the future.
> > > Cool, thanks
> > >
> > >
> > > 	Joerg
> > The patch was applied and tested. It looks fine, I'm able to boot
> > without problem.
> 
> Joerg, mind submitting it with a changelog that includes everything we learned 
> about this bug and all the Tested-by's in place?

Looks like I am too late, it is already applied. But the changelog
contains a link to the korg-bugzilla which has all information too. So
the information is not lost.

	Joerg
Joerg Roedel April 16, 2011, 12:01 p.m. UTC | #15
On Fri, Apr 15, 2011 at 12:18:02PM -0700, Yinghai Lu wrote:
> On 04/15/2011 12:06 PM, Ingo Molnar wrote:
> 
> > 
> > Joerg, mind submitting it with a changelog that includes everything we learned 
> > about this bug and all the Tested-by's in place?
> > 
> > Is anyone of the opinion that we should try to revert the allocation 
> > order/alignment changes in addition to this fix?
> 
> We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup.
> 
> Joerg,
> can you insert some dump code in the drm/radon code to find out which
> function cause the problem?

I am not a GPU expert, but I will see what I can find out.

	Joerg
Ingo Molnar April 16, 2011, 12:21 p.m. UTC | #16
* Joerg Roedel <joro@8bytes.org> wrote:

> On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote:
> > 
> > * Alexandre Demers <alexandre.f.demers@gmail.com> wrote:
> > 
> > > On 11-04-15 10:27 AM, Joerg Roedel wrote:
> > > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
> > > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> > > >> the other patches?
> > > > Yes, apply it just on -rc3 without any other patch.
> > > >
> > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> > > >> This could be useful in the future.
> > > > Cool, thanks
> > > >
> > > >
> > > > 	Joerg
> > > The patch was applied and tested. It looks fine, I'm able to boot
> > > without problem.
> > 
> > Joerg, mind submitting it with a changelog that includes everything we learned 
> > about this bug and all the Tested-by's in place?
> 
> Looks like I am too late, it is already applied. But the changelog
> contains a link to the korg-bugzilla which has all information too. So
> the information is not lost.

Yeah. In this case getting the fix into -rc4 in a timely manner looked more 
important than waiting for an updated changelog :-)

Thanks,

	Ingo
Joerg Roedel April 16, 2011, 4:35 p.m. UTC | #17
On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote:
> Do you also got the write if you load radeon with radeon.no_wb=1 ?
> I think at this address it's the wb page, or maybe the cp as wb likely
> take only one page

radeon.no_wb=1 makes no difference. The box still reboots.

	Joerg
diff mbox

Patch

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index fd5a1f3..3cce714 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -96,11 +96,15 @@ 
 #define MSR_IA32_MC0_ADDR		0x00000402
 #define MSR_IA32_MC0_MISC		0x00000403
 
+#define MSR_AMD64_MC0_MASK		0xc0010044
+
 #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
 #define MSR_IA32_MCx_STATUS(x)		(MSR_IA32_MC0_STATUS + 4*(x))
 #define MSR_IA32_MCx_ADDR(x)		(MSR_IA32_MC0_ADDR + 4*(x))
 #define MSR_IA32_MCx_MISC(x)		(MSR_IA32_MC0_MISC + 4*(x))
 
+#define MSR_AMD64_MCx_MASK(x)		(MSR_AMD64_MC0_MASK + (x))
+
 /* These are consecutive and not in the normal 4er MCE bank block */
 #define MSR_IA32_MC0_CTL2		0x00000280
 #define MSR_IA32_MCx_CTL2(x)		(MSR_IA32_MC0_CTL2 + (x))
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3ecece0..3532d3b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -615,6 +615,25 @@  static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	/* As a rule processors have APIC timer running in deep C states */
 	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
 		set_cpu_cap(c, X86_FEATURE_ARAT);
+
+	/*
+	 * Disable GART TLB Walk Errors on Fam10h. We do this here
+	 * because this is always needed when GART is enabled, even in a
+	 * kernel which has no MCE support built in.
+	 */
+	if (c->x86 == 0x10) {
+		/*
+		 * BIOS should disable GartTlbWlk Errors themself. If
+		 * it doesn't do it here as suggested by the BKDG.
+		 *
+		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
+		 */
+		u64 mask;
+
+		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
+		mask |= (1 << 10);
+		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
+	}
 }
 
 #ifdef CONFIG_X86_32