
Unhandled level 2 translation fault on A72 board.

Message ID 56A7597D.6020609@huawei.com (mailing list archive)
State New, archived

Commit Message

Ding Tianhong Jan. 26, 2016, 11:33 a.m. UTC
On 2016/1/26 19:03, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>> I met this problem when running the hackbench test on A72 chip board:
>>
>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006 
>> pgd = ffffffc01a1f0000 
>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
>>
>> CPU: 1 PID: 4779 Comm: sh Tainted: G O 4.1.15+ #21 
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT) 
>> task: ffffffc0163cc500 ti: ffffffc083abc000 task.ti: ffffffc083abc000 
>> PC is at 0x7f96be0c80 
>> LR is at 0x7fb2684eb4 
>> pc : [<0000007f96be0c80>] lr : [<0000007fb2684eb4>] pstate: 60000000 
> 
> So here it's user space trying to execute from 0x7f96be0c80 (instruction
> abort).
> 
>> sh[4963]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000006
>> pgd = ffffffc0180c6000 
>> [00000000] *pgd=0000000015157003, *pud=0000000015157003, *pmd=0000000000000000 
>>
>> CPU: 0 PID: 4963 Comm: sh Tainted: G O 4.1.15+ #21 
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT) 
>> task: ffffffc0163cb980 ti: ffffffc0840c8000 task.ti: ffffffc0840c8000 
>> PC is at 0x42c0c8 
>> LR is at 0x42c03c 
>> pc : [<000000000042c0c8>] lr : [<000000000042c03c>] pstate: 80000000 
> 
> And here you have a null pointer dereference.
> 
>> If I run the benchmark only on cores within the same cluster, it
>> works fine and no error occurs, but as soon as I enable a core in a
>> different cluster, the faults appear.
>>
>> I remember hitting the same problem on the A57 and fixing it by
>> setting bit 6 of CPUECTLR_EL1 and enabling MN. This time I applied
>> the same settings and they have no effect. I have no idea what causes
>> this problem. Do the A57 and A72 really differ that much in their TLBs?
> 
> I can't tell for sure it's a TLB issue. The kernel page table dump shows
> *pmd being 0, so the fault is correctly called "level 2 translation
> fault". It also seems that there is no vma at this address, hence the
> kernel reports it as unhandled. It looks like data corruption which
> could be caused by cache or TLB incoherence. Just make sure the
> interconnect linking the two clusters is configured correctly by
> _firmware_ before Linux starts.
> 
Hi Catalin:

Thanks for the reply. I tried applying this patch as a test:

---
 arch/arm64/kernel/process.c | 9 +++++++++
 1 file changed, 9 insertions(+)

 	hw_breakpoint_thread_switch(next);
 	contextidr_thread_switch(next);
+	tlb_flush_thread(prev);
+
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
 	 * the thread migrates to a different CPU.

hackbench works fine with this patch, so I guess the old thread's TLB entries are not
being invalidated promptly, but I don't know why. Everything is fine on the A57.
Am I missing something?

Ding

Comments

Catalin Marinas Jan. 26, 2016, 11:44 a.m. UTC | #1
On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
> On 2016/1/26 19:03, Catalin Marinas wrote:
> > On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
> >> I met this problem when running the hackbench test on A72 chip board:
> >>
> >> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006 
> >> pgd = ffffffc01a1f0000 
> >> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
[...]
> > I can't tell for sure it's a TLB issue. The kernel page table dump shows
> > *pmd being 0, so the fault is correctly called "level 2 translation
> > fault". It also seems that there is no vma at this address, hence the
> > kernel reports it as unhandled. It looks like data corruption which
> > could be caused by cache or TLB incoherence. Just make sure the
> > interconnect linking the two clusters is configured correctly by
> > _firmware_ before Linux starts.
> 
> Thanks for the apply, I have try to apply this patch to test:
> 
> --- arch/arm64/kernel/process.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>  
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6391485..d7d8439 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
> : : "r" (tpidr), "r" (tpidrro));
> }
> +static void tlb_flush_thread(struct task_struct *prev)
> +{
> +/* Flush the prev task's TLB entries */
> +if (prev->mm)
> +flush_tlb_mm(prev->mm);
> +}
> +
> /*
>   * Thread switching.
>   */
> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
> hw_breakpoint_thread_switch(next);
> contextidr_thread_switch(next);
> +tlb_flush_thread(prev);
> +
> /*
> * Complete any pending TLB or cache maintenance on this CPU in case
> * the thread migrates to a different CPU.
> 
> The hackbench would work fine after this patch, so I guess that the old thread tlb may not be
> invalidate as soon as possible, but I don't know why, everything is fine on A57,
> Does I miss something?

It looks like the TLB invalidation messages may not get across the CCI
between clusters. I don't have the TRMs at hand but make sure all the
relevant bits in the CPUs and CCI are enabled.

BTW, which kernel version are you running? Is the firmware your own or
built around ARM Trusted Firmware?
Ding Tianhong Jan. 26, 2016, 1:18 p.m. UTC | #2
On 2016/1/26 19:44, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
>> On 2016/1/26 19:03, Catalin Marinas wrote:
>>> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>>>> I met this problem when running the hackbench test on A72 chip board:
>>>>
>>>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006 
>>>> pgd = ffffffc01a1f0000 
>>>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
> [...]
>>> I can't tell for sure it's a TLB issue. The kernel page table dump shows
>>> *pmd being 0, so the fault is correctly called "level 2 translation
>>> fault". It also seems that there is no vma at this address, hence the
>>> kernel reports it as unhandled. It looks like data corruption which
>>> could be caused by cache or TLB incoherence. Just make sure the
>>> interconnect linking the two clusters is configured correctly by
>>> _firmware_ before Linux starts.
>>
>> Thanks for the apply, I have try to apply this patch to test:
>>
>> --- arch/arm64/kernel/process.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>  
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index 6391485..d7d8439 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>> : : "r" (tpidr), "r" (tpidrro));
>> }
>> +static void tlb_flush_thread(struct task_struct *prev)
>> +{
>> +/* Flush the prev task's TLB entries */
>> +if (prev->mm)
>> +flush_tlb_mm(prev->mm);
>> +}
>> +
>> /*
>>   * Thread switching.
>>   */
>> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>> hw_breakpoint_thread_switch(next);
>> contextidr_thread_switch(next);
>> +tlb_flush_thread(prev);
>> +
>> /*
>> * Complete any pending TLB or cache maintenance on this CPU in case
>> * the thread migrates to a different CPU.
>>
>> The hackbench would work fine after this patch, so I guess that the old thread tlb may not be
>> invalidate as soon as possible, but I don't know why, everything is fine on A57,
>> Does I miss something?
> 
> It looks like the TLB invalidation messages may not get across the CCI
> between clusters. I don't have the TRMs at hand but make sure all the
> relevant bits in the CPUs and CCI are enabled.
> 
I have indeed checked them several times. I need more information and will check again.


> BTW, which kernel version are you running? Is the firmware your own or
> built around ARM Trusted Firmware?
I am running a 4.1 kernel, and the firmware is our own.

Ding
Jason Liu June 1, 2017, 10:52 a.m. UTC | #3
2016-01-26 21:18 GMT+08:00 Ding Tianhong <dingtianhong@huawei.com>:
> On 2016/1/26 19:44, Catalin Marinas wrote:
>> On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
>>> On 2016/1/26 19:03, Catalin Marinas wrote:
>>>> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>>>>> I met this problem when running the hackbench test on A72 chip board:
>>>>>
>>>>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
>>>>> pgd = ffffffc01a1f0000
>>>>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
>> [...]
>>>> I can't tell for sure it's a TLB issue. The kernel page table dump shows
>>>> *pmd being 0, so the fault is correctly called "level 2 translation
>>>> fault". It also seems that there is no vma at this address, hence the
>>>> kernel reports it as unhandled. It looks like data corruption which
>>>> could be caused by cache or TLB incoherence. Just make sure the
>>>> interconnect linking the two clusters is configured correctly by
>>>> _firmware_ before Linux starts.
>>>
>>> Thanks for the apply, I have try to apply this patch to test:
>>>
>>> --- arch/arm64/kernel/process.c | 9 +++++++++
>>> 1 file changed, 9 insertions(+)
>>>
>>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>>> index 6391485..d7d8439 100644
>>> --- a/arch/arm64/kernel/process.c
>>> +++ b/arch/arm64/kernel/process.c
>>> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>>> : : "r" (tpidr), "r" (tpidrro));
>>> }
>>> +static void tlb_flush_thread(struct task_struct *prev)
>>> +{
>>> +/* Flush the prev task's TLB entries */
>>> +if (prev->mm)
>>> +flush_tlb_mm(prev->mm);
>>> +}
>>> +
>>> /*
>>>   * Thread switching.
>>>   */
>>> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>> hw_breakpoint_thread_switch(next);
>>> contextidr_thread_switch(next);
>>> +tlb_flush_thread(prev);
>>> +
>>> /*
>>> * Complete any pending TLB or cache maintenance on this CPU in case
>>> * the thread migrates to a different CPU.
>>>
>>> The hackbench would work fine after this patch, so I guess that the old thread tlb may not be
>>> invalidate as soon as possible, but I don't know why, everything is fine on A57,
>>> Does I miss something?
>>
>> It looks like the TLB invalidation messages may not get across the CCI
>> between clusters. I don't have the TRMs at hand but make sure all the
>> relevant bits in the CPUs and CCI are enabled.
>>
> Indeed check them several times, and need more information, check it again.

How was this issue finally resolved? I searched the mailing list and found this
old email thread. Any response would be appreciated.


Jason Liu
>
>
>> BTW, which kernel version are you running? Is the firmware your own or
>> built around ARM Trusted Firmware?
> I use 4.1 kernel version, and the firmware is our own.
>
> Ding
>
>
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Ding Tianhong July 18, 2017, 1:20 a.m. UTC | #4
On 2017/6/1 18:52, Jason Liu wrote:
> 2016-01-26 21:18 GMT+08:00 Ding Tianhong <dingtianhong@huawei.com>:
>> On 2016/1/26 19:44, Catalin Marinas wrote:
>>> On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
>>>> On 2016/1/26 19:03, Catalin Marinas wrote:
>>>>> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>>>>>> I met this problem when running the hackbench test on A72 chip board:
>>>>>>
>>>>>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
>>>>>> pgd = ffffffc01a1f0000
>>>>>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
>>> [...]
>>>>> I can't tell for sure it's a TLB issue. The kernel page table dump shows
>>>>> *pmd being 0, so the fault is correctly called "level 2 translation
>>>>> fault". It also seems that there is no vma at this address, hence the
>>>>> kernel reports it as unhandled. It looks like data corruption which
>>>>> could be caused by cache or TLB incoherence. Just make sure the
>>>>> interconnect linking the two clusters is configured correctly by
>>>>> _firmware_ before Linux starts.
>>>>
>>>> Thanks for the apply, I have try to apply this patch to test:
>>>>
>>>> --- arch/arm64/kernel/process.c | 9 +++++++++
>>>> 1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>>>> index 6391485..d7d8439 100644
>>>> --- a/arch/arm64/kernel/process.c
>>>> +++ b/arch/arm64/kernel/process.c
>>>> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>>>> : : "r" (tpidr), "r" (tpidrro));
>>>> }
>>>> +static void tlb_flush_thread(struct task_struct *prev)
>>>> +{
>>>> +/* Flush the prev task's TLB entries */
>>>> +if (prev->mm)
>>>> +flush_tlb_mm(prev->mm);
>>>> +}
>>>> +
>>>> /*
>>>>   * Thread switching.
>>>>   */
>>>> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>>> hw_breakpoint_thread_switch(next);
>>>> contextidr_thread_switch(next);
>>>> +tlb_flush_thread(prev);
>>>> +
>>>> /*
>>>> * Complete any pending TLB or cache maintenance on this CPU in case
>>>> * the thread migrates to a different CPU.
>>>>
>>>> The hackbench would work fine after this patch, so I guess that the old thread tlb may not be
>>>> invalidate as soon as possible, but I don't know why, everything is fine on A57,
>>>> Does I miss something?
>>>
>>> It looks like the TLB invalidation messages may not get across the CCI
>>> between clusters. I don't have the TRMs at hand but make sure all the
>>> relevant bits in the CPUs and CCI are enabled.
>>>
>> Indeed check them several times, and need more information, check it again.
> 
> How this issue is resolved finally? I search the mail-list and find this old
> email thread. Any response will be appreciated.
> 

It is already fixed. The root cause was that the chip was not configured
correctly to broadcast TLB snoops, so we now set that up in the BIOS.

Thanks
Ding

> 
> Jason Liu
>>
>>
>>> BTW, which kernel version are you running? Is the firmware your own or
>>> built around ARM Trusted Firmware?
>> I use 4.1 kernel version, and the firmware is our own.
>>
>> Ding

Patch

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6391485..d7d8439 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -283,6 +283,13 @@  static void tls_thread_switch(struct task_struct *next)
 	: : "r" (tpidr), "r" (tpidrro));
 }
 
+static void tlb_flush_thread(struct task_struct *prev)
+{
+	/* Flush the prev task's TLB entries */
+	if (prev->mm)
+		flush_tlb_mm(prev->mm);
+}
+
 /*
  * Thread switching.
  */
@@ -296,6 +303,8 @@  struct task_struct *__switch_to(struct task_struct *prev,