diff mbox

4.18rc3 TX2 boot failure with "ACPICA: AML parser: attempt to continue loading table after error"

Message ID 6074707.jx9F3QFrWC@aspire.rjw.lan (mailing list archive)
State New, archived
Headers show

Commit Message

Rafael J. Wysocki July 8, 2018, 9:14 a.m. UTC
On Monday, July 2, 2018 11:41:42 PM CEST Jeremy Linton wrote:
> Hi,
> 
> I'm experiencing two problems with commit 5088814a6e931 which is 
> "ACPICA: AML parser: attempt to continue loading table after error"
> 
> The first is this boot failure on a thunderX2:
> 
> [   10.770098] ACPI Error: Ignore error and continue table load 
> (20180531/psobject-604)
>      [   10.777926] Unable to handle kernel NULL pointer dereference at 
> virtual address 0000000000000000
>      [   10.786809] Mem abort info:
>      [   10.789623]   ESR = 0x96000004
>      [   10.792702]   Exception class = DABT (current EL), IL = 32 bits
>      [   10.798682]   SET = 0, FnV = 0
>      [   10.801760]   EA = 0, S1PTW = 0
>      [   10.804925] Data abort info:
>      [   10.807827]   ISV = 0, ISS = 0x00000004
>      [   10.811698]   CM = 0, WnR = 0
>      [   10.814689] [0000000000000000] user address but active_mm is swapper
>      [   10.821108] Internal error: Oops: 96000004 [#1] SMP
>      [   10.826032] Modules linked in:
>      [   10.829113] CPU: 30 PID: 1 Comm: swapper/0 Not tainted 
> 4.18.0-rc3PPTT4k+ #53
>      [   10.836234] Hardware name: Default string Cavium 
> ThunderX2/Default string, BIOS L50_5.13_1.0.0 05/16/2018
>      [   10.845905] pstate: 00400009 (nzcv daif +PAN -UAO)
>      [   10.850746] pc : acpi_ps_peek_opcode+0x1c/0x40
>      [   10.855231] lr : acpi_ps_create_op+0x54/0x278
>      [   10.859627] sp : ffff000009a8ba30
>      [   10.862969] x29: ffff000009a8ba30 x28: 0000000054445353
>      [   10.868334] x27: 0000000000004008 x26: 0000000000000000
>      [   10.873698] x25: ffff000009767f23 x24: ffff000008d59000
>      [   10.879063] x23: ffff802672799030 x22: ffff000009a8bb28
>      [   10.884427] x21: 0000000000000000 x20: ffff000008d59000
>      [   10.889791] x19: ffff802672799030 x18: ffffffffffffffff
>      [   10.895155] x17: 0000000000000013 x16: 0000000000000000
>      [   10.900519] x15: ffff000008d59708 x14: 2d7463656a626f73
>      [   10.905883] x13: 702f313335303831 x12: 3032282064616f6c
>      [   10.911246] x11: 20656c6261742065 x10: 756e69746e6f6320
>      [   10.916610] x9 : 0000000000000058 x8 : ffff000008570998
>      [   10.921974] x7 : 203a726f72724520 x6 : 0000000000000334
>      [   10.927338] x5 : 0000000000000012 x4 : 0000000000000000
>      [   10.932701] x3 : 0000000000000000 x2 : ffff000009a8bb28
>      [   10.938065] x1 : 0000000000000000 x0 : ffff000008505790
>      [   10.943430] Process swapper/0 (pid: 1, stack limit = 
> 0x(____ptrval____))
>      [   10.950199] Call trace:
>      [   10.952663]  acpi_ps_peek_opcode+0x1c/0x40
>      [   10.956797]  acpi_ps_create_op+0x54/0x278
>      [   10.960842]  acpi_ps_parse_loop+0x1b4/0x6c8
>      [   10.965063]  acpi_ps_parse_aml+0xe0/0x2b4
>      [   10.969108]  acpi_ps_execute_table+0xa0/0x104
>      [   10.973505]  acpi_ns_execute_table+0x120/0x194
>      [   10.977989]  acpi_ns_parse_table+0x34/0x68
>      [   10.982122]  acpi_ns_load_table+0x4c/0xbc
>      [   10.986169]  acpi_tb_load_namespace+0x1d4/0x240
>      [   10.990744]  acpi_load_tables+0x50/0xbc
>      [   10.994614]  acpi_init+0xb8/0x374
>      [   10.997959]  do_one_initcall+0x54/0x208
>      [   11.001829]  kernel_init_freeable+0x224/0x300
>      [   11.006229]  kernel_init+0x18/0x118
>      [   11.009747]  ret_from_fork+0x10/0x18
>      [   11.013354] Code: aa0003f3 aa1e03e0 d503201f f9400661 (39400020)
>      [   11.019535] ---[ end trace 2bd8068593cf8acc ]---
>      [   11.024195] Kernel panic - not syncing: Fatal exception
>      [   11.029488] SMP: stopping secondary CPUs
>      [   11.033480] ---[ end Kernel panic - not syncing: Fatal exception 
> ]---
> 
> Which does appear to be the result of some bad data in the table, but it 
> was working with 4.17, and reverting this commit solves the problem.

Does the patch below make any difference?

---
 drivers/acpi/acpica/psobject.c |    3 +++
 1 file changed, 3 insertions(+)

Comments

Jeremy Linton July 9, 2018, 8:45 p.m. UTC | #1
Hi,

First thanks for the patch..

On 07/08/2018 04:14 AM, Rafael J. Wysocki wrote:
> On Monday, July 2, 2018 11:41:42 PM CEST Jeremy Linton wrote:
>> Hi,
>>
>> I'm experiencing two problems with commit 5088814a6e931 which is
>> "ACPICA: AML parser: attempt to continue loading table after error"
>>
>> The first is this boot failure on a thunderX2:
>>
>> [   10.770098] ACPI Error: Ignore error and continue table load

[trimming]

>> ]---
>>
>> Which does appear to be the result of some bad data in the table, but it
>> was working with 4.17, and reverting this commit solves the problem.
> 
> Does the patch below make any difference?
> 
> ---
>   drivers/acpi/acpica/psobject.c |    3 +++
>   1 file changed, 3 insertions(+)
> 
> Index: linux-pm/drivers/acpi/acpica/psobject.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/acpica/psobject.c
> +++ linux-pm/drivers/acpi/acpica/psobject.c
> @@ -39,6 +39,9 @@ static acpi_status acpi_ps_get_aml_opcod
>   	ACPI_FUNCTION_TRACE_PTR(ps_get_aml_opcode, walk_state);
>   
>   	walk_state->aml = walk_state->parser_state.aml;
> +	if (!walk_state->aml)
> +		return AE_CTRL_PARSE_CONTINUE;
> +

Well this seems to avoid the crash, but now it hangs right after on the 
"Ignore error and continue table load" message. I would post the DSDT 
but I'm borrowing this machine, and I understand the firmware to be an 
early release, so...

Anyway, now that I have access again, i'm turning up the debug 
messaging, let me see if that points at whats causing the hang.


Thanks,
Rafael J. Wysocki July 9, 2018, 9:28 p.m. UTC | #2
On Mon, Jul 9, 2018 at 10:45 PM, Jeremy Linton <jeremy.linton@arm.com> wrote:
> Hi,
>
> First thanks for the patch..
>
> On 07/08/2018 04:14 AM, Rafael J. Wysocki wrote:
>>
>> On Monday, July 2, 2018 11:41:42 PM CEST Jeremy Linton wrote:
>>>
>>> Hi,
>>>
>>> I'm experiencing two problems with commit 5088814a6e931 which is
>>> "ACPICA: AML parser: attempt to continue loading table after error"
>>>
>>> The first is this boot failure on a thunderX2:
>>>
>>> [   10.770098] ACPI Error: Ignore error and continue table load
>
>
> [trimming]
>
>>> ]---
>>>
>>> Which does appear to be the result of some bad data in the table, but it
>>> was working with 4.17, and reverting this commit solves the problem.
>>
>>
>> Does the patch below make any difference?
>>
>> ---
>>   drivers/acpi/acpica/psobject.c |    3 +++
>>   1 file changed, 3 insertions(+)
>>
>> Index: linux-pm/drivers/acpi/acpica/psobject.c
>> ===================================================================
>> --- linux-pm.orig/drivers/acpi/acpica/psobject.c
>> +++ linux-pm/drivers/acpi/acpica/psobject.c
>> @@ -39,6 +39,9 @@ static acpi_status acpi_ps_get_aml_opcod
>>         ACPI_FUNCTION_TRACE_PTR(ps_get_aml_opcode, walk_state);
>>         walk_state->aml = walk_state->parser_state.aml;
>> +       if (!walk_state->aml)
>> +               return AE_CTRL_PARSE_CONTINUE;
>> +
>
>
> Well this seems to avoid the crash, but now it hangs right after on the
> "Ignore error and continue table load" message.

Well, maybe we should just abort in that case.

I'm wondering what happens if you replace the return statement in the
patch above with

return_ACPI_STATUS(AE_AML_BAD_OPCODE)


> I would post the DSDT but
> I'm borrowing this machine, and I understand the firmware to be an early
> release, so...

Please send the output of acpidump from it to Erik in private.  That
will help him and Bob to make the code more robust.

> Anyway, now that I have access again, i'm turning up the debug messaging,
> let me see if that points at whats causing the hang.

OK, thanks!
Jeremy Linton July 10, 2018, 3:44 a.m. UTC | #3
Hi,

On 07/09/2018 04:28 PM, Rafael J. Wysocki wrote:
> On Mon, Jul 9, 2018 at 10:45 PM, Jeremy Linton <jeremy.linton@arm.com> wrote:
>> Hi,
>>
>> First thanks for the patch..
>>
>> On 07/08/2018 04:14 AM, Rafael J. Wysocki wrote:
>>>
>>> On Monday, July 2, 2018 11:41:42 PM CEST Jeremy Linton wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm experiencing two problems with commit 5088814a6e931 which is
>>>> "ACPICA: AML parser: attempt to continue loading table after error"
>>>>
>>>> The first is this boot failure on a thunderX2:
>>>>
>>>> [   10.770098] ACPI Error: Ignore error and continue table load
>>
>>
>> [trimming]
>>
>>>> ]---
>>>>
>>>> Which does appear to be the result of some bad data in the table, but it
>>>> was working with 4.17, and reverting this commit solves the problem.
>>>
>>>
>>> Does the patch below make any difference?
>>>
>>> ---
>>>    drivers/acpi/acpica/psobject.c |    3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> Index: linux-pm/drivers/acpi/acpica/psobject.c
>>> ===================================================================
>>> --- linux-pm.orig/drivers/acpi/acpica/psobject.c
>>> +++ linux-pm/drivers/acpi/acpica/psobject.c
>>> @@ -39,6 +39,9 @@ static acpi_status acpi_ps_get_aml_opcod
>>>          ACPI_FUNCTION_TRACE_PTR(ps_get_aml_opcode, walk_state);
>>>          walk_state->aml = walk_state->parser_state.aml;
>>> +       if (!walk_state->aml)
>>> +               return AE_CTRL_PARSE_CONTINUE;
>>> +
>>
>>
>> Well this seems to avoid the crash, but now it hangs right after on the
>> "Ignore error and continue table load" message.
> 
> Well, maybe we should just abort in that case.
> 
> I'm wondering what happens if you replace the return statement in the
> patch above with
> 
> return_ACPI_STATUS(AE_AML_BAD_OPCODE)

Yes, that is where I went when I applied the patch but I used 
AE_CTRL_TERMINATE, which terminates the loop in acpi_ps_parse_loop() and 
that appears to successfully finish/terminate the initial parsing pass. 
But, it then crashes in acpi_ns_lookup called via the 
acpi_walk_resources sequences that goes through ut_evalute_object() due 
to the path/scope_info->scope.node being ACPI_ROOT_OBJECT (-1) and 
bypassing the null check. Adding a ACPI_ROOT_OBJECT check as well as the 
null checks in acpi_ns_lookup results in a successful boot. Tracking 
down how the terminate (or whatever) is leaving the info->prefix_node 
(in acpi_ns_evaluate) set to ROOT_OBJECT instead of null, is something I 
don't yet understand.

Anyway, I tried Using BAD_OPCODE rather than TERMINATE and it seems to 
have the same basic result as PARSE_CONTINUE.
diff mbox

Patch

Index: linux-pm/drivers/acpi/acpica/psobject.c
===================================================================
--- linux-pm.orig/drivers/acpi/acpica/psobject.c
+++ linux-pm/drivers/acpi/acpica/psobject.c
@@ -39,6 +39,9 @@  static acpi_status acpi_ps_get_aml_opcod
 	ACPI_FUNCTION_TRACE_PTR(ps_get_aml_opcode, walk_state);
 
 	walk_state->aml = walk_state->parser_state.aml;
+	if (!walk_state->aml)
+		return AE_CTRL_PARSE_CONTINUE;
+
 	walk_state->opcode = acpi_ps_peek_opcode(&(walk_state->parser_state));
 
 	/*