kernel oops and panic in acpi_atomic_read under 2.6.39.3. call trace included

Message ID 4E547B0F.6000001@intel.com (mailing list archive)
State New, archived

Commit Message

Huang, Ying Aug. 24, 2011, 4:16 a.m. UTC
Hi, Rick,

It appears that the panic occurs in acpi_atomic_read.  I think the most
likely cause is that the acpi_generic_address is not pre-mapped.  Can
you try the attached patch?

It prints the registers as they are mapped and accessed.  To use it, run
the following command before starting the workload:

dmesg | grep GHES

Then try to find something like

GHES: gar accessed: x, xxxx

in the kernel log when the panic occurs.
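
For anyone sifting a long serial-console capture for these messages, the
following is a minimal userspace sketch, not part of the patch.  It assumes
the two message formats the patch emits ("GHES: gar mapped: %d, 0x%llx" and
"GHES: gar accessed: %d, 0x%llx"); the names gar_event and parse_gar_line
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "GHES: gar ..." message as printed by the debug patch. */
struct gar_event {
	int accessed;               /* 1 = "gar accessed", 0 = "gar mapped" */
	int space_id;               /* address space id from the message */
	unsigned long long address; /* register address from the message */
};

/*
 * Parse one kernel log line.  A leading "[timestamp]" is skipped by
 * searching for the "GHES: gar " prefix.  Returns 0 on success and -1
 * if the line is not one of the two messages added by the patch.
 */
int parse_gar_line(const char *line, struct gar_event *out)
{
	const char *p;
	char kind[16];
	int space_id;
	unsigned long long address;

	assert(line != NULL && out != NULL);

	p = strstr(line, "GHES: gar ");
	if (p == NULL)
		return -1;
	if (sscanf(p, "GHES: gar %15[a-z]: %d, 0x%llx",
		   kind, &space_id, &address) != 3)
		return -1;

	if (strcmp(kind, "accessed") == 0)
		out->accessed = 1;
	else if (strcmp(kind, "mapped") == 0)
		out->accessed = 0;
	else
		return -1;

	out->space_id = space_id;
	out->address = address;
	return 0;
}
```

Feeding each captured line through parse_gar_line makes it easy to check
whether the last "gar accessed" address before a panic is one that was
reported as mapped earlier in the log.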

Best Regards,
Huang Ying

Comments

rick@microway.com Aug. 24, 2011, 10:18 p.m. UTC | #1
Hi Huang,

The original system needs to ship to our customer ASAP, and disabling ghes
is a sufficient workaround for now.  I have therefore set up an identical
system as a temporary master for another cluster to continue this testing.

I have applied your patch.  Here is the output of dmesg | grep GHES so far:


[    9.272198] GHES: gar mapped: 0, 0xbf7b5ff0
[    9.280782] GHES: gar mapped: 0, 0xbf7b6200
[    9.285102] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.

I have the serial console enabled and have restarted the stress tests.
I'll reply with the output once I get another panic.

Thanks!
Rick


--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

rick@microway.com Aug. 25, 2011, 3:47 p.m. UTC | #2
Hi Huang,

My new setup reproduced the panic.  However, I do not see any "gar accessed"
messages on it.  The "gar mapped" messages are in my previous email.  Here
is the latest call trace.  There is no GHES output prior to it:

[30348.824329] BUG: unable to handle kernel NULL pointer dereference at (null)
[30348.832197] IP: [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[30348.838144] PGD 605984067 PUD 6059de067 PMD 0
[30348.842654] Oops: 0000 [#1] PREEMPT SMP
[30348.846640] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
[30348.854555] CPU 13
[30348.856487] Modules linked in: md5 ipmi_devintf ipmi_si ipmi_msghandler nfsd lockd nfs_acl auth_rpcgss sunrpc ipt_MASQUERADE iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables af_packet edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf xfs dm_mod igb joydev ioatdma dca iTCO_wdt iTCO_vendor_support i7core_edac i2c_i801 edac_core ghes button hed sg pcspkr serio_raw ext4 jbd2 crc16 fan processor thermal thermal_sys ata_generic pata_atiixp arcmsr
[30348.904982]
[30348.906481] Pid: 27462, comm: cluster Not tainted 2.6.39.3-microwaycustom #8 Supermicro X8DTH-i/6/iF/6F/X8DTH
[30348.916458] RIP: 0010:[<ffffffff812a211d>]  [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[30348.924825] RSP: 0000:ffff88063fca7da8  EFLAGS: 00010046
[30348.930129] RAX: 0000000000000000 RBX: ffff88063fca7df0 RCX: 00000000bf7b6000
[30348.937251] RDX: 0000000000000000 RSI: 00000000bf7b6010 RDI: 00000000bf7b5ff0
[30348.944374] RBP: ffff88063fca7dd8 R08: 00000000bf7b7000 R09: 0000000000000000
[30348.951497] R10: 000000000000000a R11: 000000000000000b R12: ffffc90003044c20
[30348.958627] R13: 0000000000000000 R14: 00000000bf7b5ff0 R15: 0000000000000000
[30348.965758] FS:  0000000000000000(0000) GS:ffff88063fca0000(0000) knlGS:0000000000000000
[30348.973841] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[30348.979586] CR2: 0000000000000000 CR3: 00000006059db000 CR4: 00000000000006e0
[30348.986708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[30348.993838] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[30349.000961] Process cluster (pid: 27462, threadinfo ffff880605a02000, task ffff88061e8f8440)
[30349.009387] Stack:
[30349.011403]  0000000000000000 00000000bf7b5ff0 ffff88032ac0a940 ffff88032ac0a940
[30349.018879]  0000000000000001 ffffc90003044ca8 ffff88063fca7e18 ffffffffa0136235
[30349.026366]  0000000000000000 0000000000000000 ffff88032ac0a940 0000000000000000
[30349.033850] Call Trace:
[30349.036300]  <NMI>
[30349.038442]  [<ffffffffa0136235>] ghes_read_estatus+0x45/0x180 [ghes]
[30349.044882]  [<ffffffffa013660c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[30349.051148]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[30349.057065]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[30349.063762]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[30349.070286]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[30349.075415]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[30349.080287]  [<ffffffff8150b150>] nmi+0x20/0x30
[30349.084819]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[30349.090991]  <<EOE>>
[30349.093094]  <IRQ>
[30349.095424]  [<ffffffff81011568>] intel_pmu_disable_all+0x38/0xb0
[30349.101516]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[30349.107093]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[30349.113269]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[30349.118932]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[30349.124936]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[30349.130680]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[30349.136166]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[30349.142087]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[30349.147921]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[30349.154272]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[30349.160272]  <EOI>
[30349.162200] Code: fc 10 74 1f 77 08 41 80 fc 08 75 49 eb 0e 41 80 fc 20 74 17 41 80 fc 40 75 3b eb 15 8a 00 0f b6 c0 eb 11 66 8b 00 0f b7 c0 eb 09 <8b> 00 89 c0 eb 03 48 8b 00 48 89 03 e8 62 55 e2 ff eb 1d 41 0f
[30349.182456] RIP  [<ffffffff812a211d>] acpi_atomic_read+0x8d/0xcb
[30349.188490]  RSP <ffff88063fca7da8>
[30349.191977] CR2: 0000000000000000
[30349.195293] ---[ end trace 316c5d7ea544957e ]---
[30349.199904] Kernel panic - not syncing: Fatal exception in interrupt
[30349.206249] Pid: 27462, comm: cluster Tainted: G      D     2.6.39.3-microwaycustom #8
[30349.214156] Call Trace:
[30349.216605]  <NMI>  [<ffffffff815071ee>] panic+0x9b/0x1b0
[30349.222034]  [<ffffffff8150bb4a>] oops_end+0xea/0xf0
[30349.226997]  [<ffffffff81031dc3>] no_context+0xf3/0x260
[30349.232220]  [<ffffffff812569de>] ? number+0x31e/0x350
[30349.237360]  [<ffffffff81032055>] __bad_area_nosemaphore+0x125/0x1e0
[30349.243712]  [<ffffffff8103211e>] bad_area_nosemaphore+0xe/0x10
[30349.249633]  [<ffffffff8150dd10>] do_page_fault+0x500/0x5a0
[30349.255205]  [<ffffffff81258e0e>] ? vsnprintf+0x33e/0x5d0
[30349.260605]  [<ffffffff8107cd3a>] ? up+0x2a/0x50
[30349.265228]  [<ffffffff81056da9>] ? console_unlock+0x189/0x1e0
[30349.271057]  [<ffffffff8150ae95>] page_fault+0x25/0x30
[30349.276201]  [<ffffffff812a211d>] ? acpi_atomic_read+0x8d/0xcb
[30349.282029]  [<ffffffff812a20f0>] ? acpi_atomic_read+0x60/0xcb
[30349.287869]  [<ffffffffa0136235>] ghes_read_estatus+0x45/0x180 [ghes]
[30349.294311]  [<ffffffffa013660c>] ghes_notify_nmi+0xbc/0x190 [ghes]
[30349.300575]  [<ffffffff8150ddfd>] notifier_call_chain+0x4d/0x70
[30349.306494]  [<ffffffff8150de63>] __atomic_notifier_call_chain+0x43/0x60
[30349.313192]  [<ffffffff8150de91>] atomic_notifier_call_chain+0x11/0x20
[30349.319715]  [<ffffffff8150dece>] notify_die+0x2e/0x30
[30349.324853]  [<ffffffff8150b4f2>] do_nmi+0xa2/0x260
[30349.329727]  [<ffffffff8150b150>] nmi+0x20/0x30
[30349.334264]  [<ffffffff81029f6a>] ? native_write_msr_safe+0xa/0x10
[30349.340438]  <<EOE>>  <IRQ>  [<ffffffff81011568>] intel_pmu_disable_all+0x38/0xb0
[30349.347959]  [<ffffffff81010efa>] x86_pmu_disable+0x4a/0x50
[30349.353527]  [<ffffffff810ea842>] perf_event_task_tick+0x1a2/0x2a0
[30349.359705]  [<ffffffff81050750>] scheduler_tick+0x1b0/0x290
[30349.365366]  [<ffffffff81066c29>] update_process_times+0x69/0x80
[30349.371370]  [<ffffffff81088098>] tick_sched_timer+0x58/0x150
[30349.377114]  [<ffffffff8107b7ef>] __run_hrtimer+0x6f/0x250
[30349.382604]  [<ffffffff81088040>] ? tick_init_highres+0x20/0x20
[30349.388518]  [<ffffffff8107bf7a>] hrtimer_interrupt+0xda/0x230
[30349.394355]  [<ffffffff8101f5c6>] smp_apic_timer_interrupt+0x66/0xa0
[30349.400708]  [<ffffffff815120f3>] apic_timer_interrupt+0x13/0x20
[30349.406705]  <EOI>
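
A note for readers decoding the oops: in the Code: line, the byte in angle
brackets (<8b>) is the one RIP points at.  On x86-64, "8b 00" encodes a
32-bit load through RAX (mov (%rax),%eax), and the dump shows RAX and CR2
both zero, which matches the reported NULL pointer dereference.  The
neighbouring sequences "8a 00", "66 8b 00" and "48 8b 00" are the 8-, 16-
and 64-bit forms of the same load.  The sketch below spells out that
reading; load_width_bits is a hypothetical helper written for this note,
not a kernel function, and it recognises only these four encodings.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the width in bits of a "load through RAX" instruction that
 * starts at insn[0], or 0 if the bytes are not one of the four
 * encodings that appear in the Code: line of the oops.
 */
int load_width_bits(const unsigned char *insn, size_t len)
{
	assert(insn != NULL);

	/* 66 8b 00: operand-size prefix, mov (%rax),%ax */
	if (len >= 3 && insn[0] == 0x66 && insn[1] == 0x8b && insn[2] == 0x00)
		return 16;
	/* 48 8b 00: REX.W prefix, mov (%rax),%rax */
	if (len >= 3 && insn[0] == 0x48 && insn[1] == 0x8b && insn[2] == 0x00)
		return 64;
	/* 8a 00: mov (%rax),%al */
	if (len >= 2 && insn[0] == 0x8a && insn[1] == 0x00)
		return 8;
	/* 8b 00: mov (%rax),%eax */
	if (len >= 2 && insn[0] == 0x8b && insn[1] == 0x00)
		return 32;
	return 0;
}
```

Called on the bytes starting at the marked position (8b 00 89 c0 ...), the
helper reports a 32-bit load.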

Thanks,
Rick


Patch

---
 drivers/acpi/apei/ghes.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -299,6 +299,9 @@  static struct ghes *ghes_new(struct acpi
 		return ERR_PTR(-ENOMEM);
 	ghes->generic = generic;
 	rc = acpi_pre_map_gar(&generic->error_status_address);
+	pr_info(GHES_PFX "gar mapped: %d, 0x%llx\n",
+		generic->error_status_address.space_id,
+		generic->error_status_address.address);
 	if (rc)
 		goto err_free;
 	error_block_length = generic->error_block_length;
@@ -398,6 +401,9 @@  static int ghes_read_estatus(struct ghes
 	u32 len;
 	int rc;
 
+	pr_info(GHES_PFX "gar accessed: %d, 0x%llx\n",
+		g->error_status_address.space_id,
+		g->error_status_address.address);
 	rc = acpi_atomic_read(&buf_paddr, &g->error_status_address);
 	if (rc) {
 		if (!silent && printk_ratelimit())