
arm64: Flush the process's mm context TLB entries when switching

Message ID 534BCE80.3090406@huawei.com (mailing list archive)
State New, archived

Commit Message

Ding Tianhong April 14, 2014, 12:03 p.m. UTC
I hit a problem when migrating a process with the following steps:

1) The process is already running on core 0.
2) Set the CPU affinity of the process to 0x02 to move it to core 1;
   it works well.
3) Set the CPU affinity of the process to 0x01 to move it back to core 0;
   the problem occurs and the process is killed.
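
The migration in steps 2) and 3) can be driven from user space with
sched_setaffinity(2). A minimal sketch, assuming a Linux system with at least
two online CPUs; the helper names single_cpu_mask(), migrate_self() and
bounce_between_cores() are hypothetical, and this is not the fork_test program
itself:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Build an affinity mask holding only `cpu` (mask 0x01 is CPU 0, 0x02 is CPU 1). */
void single_cpu_mask(int cpu, cpu_set_t *set)
{
	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(set);
	CPU_SET(cpu, set);
}

/* Pin the calling process to `cpu`; returns 0 on success, -1 with errno set. */
int migrate_self(int cpu)
{
	cpu_set_t set;

	single_cpu_mask(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Move to core 1, then back to core 0, mirroring steps 2) and 3). */
int bounce_between_cores(void)
{
	if (migrate_self(1) != 0) {
		perror("sched_setaffinity(cpu 1)");
		return -1;
	}
	if (migrate_self(0) != 0) {
		perror("sched_setaffinity(cpu 0)");
		return -1;
	}
	return 0;
}
```

Calling bounce_between_cores() from the test process walks it through the
core 0 -> core 1 -> core 0 sequence; `taskset -p` applies the same masks to a
running process from a shell.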

---------------------------------------------------------------------

Aborting.../init: line 29:   434 Aborted                 setsid cttyhack sh
Console sh exited with 134, respawning...
fork_test[440]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x83000006
pgd = ffffffc01a505000
[00000000] *pgd=000000001a3f4003, *pmd=0000000000000000

CPU: 0 PID: 440 Comm: fork_test Not tainted 3.13.0+ #7
task: ffffffc01a41c800 ti: ffffffc01a55c000 task.ti: ffffffc01a55c000
PC is at 0x0
LR is at 0x0
pc : [<0000000000000000>] lr : [<0000000000000000>] pstate: 20000000
sp : 0000007fdeb1dc50
x29: 0000000000000000 x28: 0000000000000000
x27: 0000000000000000 x26: 0000000000000000
x25: 0000000000000000 x24: 0000000000000000
x23: 0000000000000000 x22: 0000000000000000
x21: 0000000000400570 x20: 0000000000000000
x19: 0000000000400570 x18: 0000007fdeb1d9e0
x17: 0000007fa7a65840 x16: 0000000000410a50
x15: 0000007fa7b3b028 x14: 0000000000000040
x13: 0000000000000090 x12: 000000000013c000
x11: 000000000002b028 x10: 0000000000000000
x9 : 00000000ffffffff x8 : 0000000000000104
x7 : 0000000000000000 x6 : 0000000000000000
x5 : 00000000fbad2a84 x4 : 0000000000000000
x3 : 0000000000000000 x2 : 0000000000000020
x1 : 0000007fa7b356f0 x0 : ffffffffffffffff

CPU: 0 PID: 440 Comm: fork_test Not tainted 3.13.0+ #7
Call trace:
[<ffffffc0000872b0>] dump_backtrace+0x0/0x12c
[<ffffffc0000873f0>] show_stack+0x14/0x1c
[<ffffffc000420e74>] dump_stack+0x70/0x90
[<ffffffc0000912d0>] __do_user_fault+0x48/0xf4
[<ffffffc0000914e4>] do_page_fault+0x168/0x378
[<ffffffc0000917b4>] do_translation_fault+0xc0/0xf0
[<ffffffc000081108>] do_mem_abort+0x3c/0x9c
Exception stack(0xffffffc01a55fe30 to 0xffffffc01a55ff50)
fe20:                                     00400570 00000000 00000000 00000000
fe40: ffffffff ffffffff 00000000 00000000 ffffffff ffffffff 000000dc 00000000
fe60: 00000003 00000004 00000000 00000000 00000000 00000000 000001bb 00000000
fe80: 00000000 00000000 00000000 0000007f 1a41c800 ffffffc0 00095508 ffffffc0
fea0: 00100100 00000000 00200200 00000000 fffffff6 00000000 00001000 00000000
fec0: deb1dc00 0000007f 000839ec ffffffc0 ffffffff ffffffff a7b356f0 0000007f
fee0: 00000020 00000000 00000000 00000000 00000000 00000000 fbad2a84 00000000
ff00: 00000000 00000000 00000000 00000000 00000104 00000000 ffffffff 00000000
ff20: 00000000 00000000 0002b028 00000000 0013c000 00000000 00000090 00000000
ff40: 00000040 00000000 a7b3b028 0000007f

---------------------------- cut here -----------------------------------

This is a very strange problem: the PC and LR are both 0, and the ESR is
0x83000006. That exception class (EC 0x20) is an Instruction Abort from a lower
Exception level, which is used for MMU faults generated by instruction accesses
and for synchronous external aborts, including synchronous parity errors.
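
The fields of that ESR value can be pulled apart with a few shifts and masks.
A minimal sketch, assuming the ARMv8-A ESR_ELx layout (EC in bits [31:26],
fault status code in bits [5:0]); the helper names are hypothetical, not
kernel functions:

```c
#include <assert.h>
#include <stdint.h>

#define ESR_EC_SHIFT	26	/* exception class lives in bits [31:26] */
#define ESR_EC_IABT_LOW	0x20u	/* Instruction Abort from a lower Exception level */
#define ESR_FSC_MASK	0x3fu	/* fault status code, ISS bits [5:0] */

/* Exception class of a 32-bit ESR value. */
unsigned int esr_ec(uint32_t esr)
{
	return esr >> ESR_EC_SHIFT;
}

/* Fault status code of an instruction or data abort. */
unsigned int esr_fsc(uint32_t esr)
{
	return esr & ESR_FSC_MASK;
}

/* Translation faults are encoded as 0b0001LL, where LL is the lookup level. */
int esr_is_translation_fault(uint32_t esr)
{
	return (esr_fsc(esr) & 0x3cu) == 0x04u;
}

/* Lookup level of a translation fault; only valid for translation faults. */
unsigned int esr_translation_level(uint32_t esr)
{
	assert(esr_is_translation_fault(esr));
	return esr_fsc(esr) & 0x3u;
}
```

For 0x83000006 this gives EC 0x20 and fault status code 0x06, i.e. a level 2
translation fault on an instruction fetch, which matches the "unhandled level 2
translation fault" line in the log.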

I tried to fix the problem by invalidating the process's TLB entries on every
switch. This makes the context stale so that a new one is picked, and with that
change the process works well.

So I think that in some situations, after a process switch, modifying TLB
entries on the new core does not tell all other cores to invalidate their old
TLB entries in the inner shareable domain. If the process is then scheduled onto
another core, the old TLB entries may cause MMU faults.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
---
 arch/arm64/kernel/process.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Will Deacon April 14, 2014, 1:01 p.m. UTC | #1
Hi Ding,

On Mon, Apr 14, 2014 at 01:03:12PM +0100, Ding Tianhong wrote:
> I met a problem when migrating process by following steps:
> 
> 1) The process was already running on core 0.
> 2) Set the CPU affinity of the process to 0x02 and move it to core 1,
>    it could work well.
> 3) Set the CPU affinity of the process to 0x01 and move it to core 0 again,
>    the problem occurs and the process was killed.

[...]

> It was a very strange problem that the PC and LR are both 0, and the esr is
> 0x83000006, it means that the used for instruction access generated MMU faults
> and synchronous external aborts, including synchronous parity errors.
> 
> I try to fix the problem by invalidating the process's TLB entries when switching,
> it will make the context stale and pick new one, and then it could work well.
> 
> So I think in some situation that after the process switching, the modification of
> the TLB entries in the new core didn't inform all other cores to invalidate the old
> TLB entries which was in the inner shareable caches, and then if the process schedule
> to another core, the old TLB entries may occur MMU faults.

Yes, it sounds like you don't have your TLBs configured correctly. Can you
confirm that your EL3 firmware is configuring TLB broadcasting correctly
please?

> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
> ---
>  arch/arm64/kernel/process.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6391485..d7d8439 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>  	: : "r" (tpidr), "r" (tpidrro));
>  }
>  
> +static void tlb_flush_thread(struct task_struct *prev)
> +{
> +	/* Flush the prev task's TLB entries */
> +	if (prev->mm)
> +		flush_tlb_mm(prev->mm);
> +}
> +
>  /*
>   * Thread switching.
>   */
> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>  	hw_breakpoint_thread_switch(next);
>  	contextidr_thread_switch(next);
>  
> +	tlb_flush_thread(prev);

NAK to the patch -- the architecture certainly doesn't require this, and
it's a huge hammer for what is more likely a firmware initialisation issue.

Will

Ding Tianhong April 15, 2014, 2 a.m. UTC | #2
On 2014/4/14 21:01, Will Deacon wrote:
> Hi Ding,
> 
> On Mon, Apr 14, 2014 at 01:03:12PM +0100, Ding Tianhong wrote:
>> I met a problem when migrating process by following steps:
>>
>> 1) The process was already running on core 0.
>> 2) Set the CPU affinity of the process to 0x02 and move it to core 1,
>>    it could work well.
>> 3) Set the CPU affinity of the process to 0x01 and move it to core 0 again,
>>    the problem occurs and the process was killed.
> 
> [...]
> 
>> It was a very strange problem that the PC and LR are both 0, and the esr is
>> 0x83000006, it means that the used for instruction access generated MMU faults
>> and synchronous external aborts, including synchronous parity errors.
>>
>> I try to fix the problem by invalidating the process's TLB entries when switching,
>> it will make the context stale and pick new one, and then it could work well.
>>
>> So I think in some situation that after the process switching, the modification of
>> the TLB entries in the new core didn't inform all other cores to invalidate the old
>> TLB entries which was in the inner shareable caches, and then if the process schedule
>> to another core, the old TLB entries may occur MMU faults.
> 
> Yes, it sounds like you don't have your TLBs configured correctly. Can you
> confirm that your EL3 firmware is configuring TLB broadcasting correctly
> please?
> 

Hi Will,

Do you mean SCR_EL3.NS?

>> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>> ---
>>  arch/arm64/kernel/process.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index 6391485..d7d8439 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>>  	: : "r" (tpidr), "r" (tpidrro));
>>  }
>>  
>> +static void tlb_flush_thread(struct task_struct *prev)
>> +{
>> +	/* Flush the prev task's TLB entries */
>> +	if (prev->mm)
>> +		flush_tlb_mm(prev->mm);
>> +}
>> +
>>  /*
>>   * Thread switching.
>>   */
>> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>  	hw_breakpoint_thread_switch(next);
>>  	contextidr_thread_switch(next);
>>  
>> +	tlb_flush_thread(prev);
> 
> NAK to the patch -- the architecture certainly doesn't require this, and
> it's a huge hammer for what is more likely a firmware initialisation issue.
> 
> Will
> 

Yes, I still have doubts about this patch myself. Thanks for your suggestion.

Regards
Ding


Will Deacon April 15, 2014, 8:02 a.m. UTC | #3
On Tue, Apr 15, 2014 at 03:00:24AM +0100, Ding Tianhong wrote:
> On 2014/4/14 21:01, Will Deacon wrote:
> > Hi Ding,
> > 
> > On Mon, Apr 14, 2014 at 01:03:12PM +0100, Ding Tianhong wrote:
> >> I met a problem when migrating process by following steps:
> >>
> >> 1) The process was already running on core 0.
> >> 2) Set the CPU affinity of the process to 0x02 and move it to core 1,
> >>    it could work well.
> >> 3) Set the CPU affinity of the process to 0x01 and move it to core 0 again,
> >>    the problem occurs and the process was killed.
> > 
> > [...]
> > 
> >> It was a very strange problem that the PC and LR are both 0, and the esr is
> >> 0x83000006, it means that the used for instruction access generated MMU faults
> >> and synchronous external aborts, including synchronous parity errors.
> >>
> >> I try to fix the problem by invalidating the process's TLB entries when switching,
> >> it will make the context stale and pick new one, and then it could work well.
> >>
> >> So I think in some situation that after the process switching, the modification of
> >> the TLB entries in the new core didn't inform all other cores to invalidate the old
> >> TLB entries which was in the inner shareable caches, and then if the process schedule
> >> to another core, the old TLB entries may occur MMU faults.
> > 
> > Yes, it sounds like you don't have your TLBs configured correctly. Can you
> > confirm that your EL3 firmware is configuring TLB broadcasting correctly
> > please?
> > 
> 
> Hi will:
> 
> Do you mean the SCR_EL3.NS?

No, there's usually a CPU-specific register (called something like actlr or
ectlr) which contains bit(s) to enable TLB broadcasting in hardware. Which
CPU are you using?

Will

Ding Tianhong April 15, 2014, 10:20 a.m. UTC | #4
On 2014/4/15 16:02, Will Deacon wrote:
> On Tue, Apr 15, 2014 at 03:00:24AM +0100, Ding Tianhong wrote:
>> On 2014/4/14 21:01, Will Deacon wrote:
>>> Hi Ding,
>>>
>>> On Mon, Apr 14, 2014 at 01:03:12PM +0100, Ding Tianhong wrote:
>>>> I met a problem when migrating process by following steps:
>>>>
>>>> 1) The process was already running on core 0.
>>>> 2) Set the CPU affinity of the process to 0x02 and move it to core 1,
>>>>    it could work well.
>>>> 3) Set the CPU affinity of the process to 0x01 and move it to core 0 again,
>>>>    the problem occurs and the process was killed.
>>>
>>> [...]
>>>
>>>> It was a very strange problem that the PC and LR are both 0, and the esr is
>>>> 0x83000006, it means that the used for instruction access generated MMU faults
>>>> and synchronous external aborts, including synchronous parity errors.
>>>>
>>>> I try to fix the problem by invalidating the process's TLB entries when switching,
>>>> it will make the context stale and pick new one, and then it could work well.
>>>>
>>>> So I think in some situation that after the process switching, the modification of
>>>> the TLB entries in the new core didn't inform all other cores to invalidate the old
>>>> TLB entries which was in the inner shareable caches, and then if the process schedule
>>>> to another core, the old TLB entries may occur MMU faults.
>>>
>>> Yes, it sounds like you don't have your TLBs configured correctly. Can you
>>> confirm that your EL3 firmware is configuring TLB broadcasting correctly
>>> please?
>>>
>>
>> Hi will:
>>
>> Do you mean the SCR_EL3.NS?
> 
> No, there's usually a CPU-specific register (called something like actlr or
> ectlr) which contains bit(s) to enable TLB broadcasting in hardware. Which
> CPU are you using?
> 
> Will
> 
Yes, I set CPUECTLR.SMP to 1, which enables the core to receive TLB broadcasts,
and that fixed the problem. Thanks for your help. I am using an arm64 Cortex-A57.

Regards
Ding
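
For reference, setting that bit amounts to a read-modify-write of the register
from firmware. A rough sketch, assuming the Cortex-A57 encoding of CPUECTLR_EL1
(S3_1_C15_C2_1) with the SMP enable bit at bit 6; the function names are
hypothetical, and the register access must run at EL3 (or at a lower level that
firmware has granted access to), typically early in each core's boot path:

```c
#include <stdint.h>

/* Assumption: Cortex-A57 places the SMP enable bit at bit 6 of CPUECTLR_EL1. */
#define CPUECTLR_SMPEN	(UINT64_C(1) << 6)

/* Pure helper: return the register value with SMPEN set, other bits untouched. */
uint64_t cpuectlr_with_smpen(uint64_t val)
{
	return val | CPUECTLR_SMPEN;
}

#if defined(__aarch64__)
/*
 * Read-modify-write CPUECTLR_EL1 through its generic encoding.
 * This traps unless it runs at a sufficiently privileged exception level.
 */
void cpuectlr_enable_smp(void)
{
	uint64_t val;

	__asm__ volatile("mrs %0, S3_1_C15_C2_1" : "=r" (val));
	val = cpuectlr_with_smpen(val);
	__asm__ volatile("msr S3_1_C15_C2_1, %0\n\tisb" : : "r" (val) : "memory");
}
#endif
```

Keeping the bit manipulation in cpuectlr_with_smpen() separates the part that
can be checked on any host from the privileged register access.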

Patch

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6391485..d7d8439 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -283,6 +283,13 @@  static void tls_thread_switch(struct task_struct *next)
 	: : "r" (tpidr), "r" (tpidrro));
 }
 
+static void tlb_flush_thread(struct task_struct *prev)
+{
+	/* Flush the prev task's TLB entries */
+	if (prev->mm)
+		flush_tlb_mm(prev->mm);
+}
+
 /*
  * Thread switching.
  */
@@ -296,6 +303,8 @@  struct task_struct *__switch_to(struct task_struct *prev,
 	hw_breakpoint_thread_switch(next);
 	contextidr_thread_switch(next);
 
+	tlb_flush_thread(prev);
+
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
 	 * the thread migrates to a different CPU.