diff mbox

A bug about system call on ARM

Message ID 20130530114112.GH7483@mudshark.cambridge.arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Will Deacon May 30, 2013, 11:41 a.m. UTC
On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > If you have some patch for this issue,
> > I can do the test for it .
> 
> I'll have a look at cooking something which uses an exception table entry
> to rewind the PC and retry the system call. That's simpler than directly
> injecting a user page fault from the system call path.

Ok, please can you try the following?

Will

--->8

Comments

Nicolas Pitre May 31, 2013, 3:54 a.m. UTC | #1
On Thu, 30 May 2013, Will Deacon wrote:

> On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> > On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > > If you have some patch for this issue,
> > > I can do the test for it .
> > 
> > I'll have a look at cooking something which uses an exception table entry
> > to rewind the PC and retry the system call. That's simpler than directly
> > injecting a user page fault from the system call path.
> 
> Ok, please can you try the following?
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> index bc5bc0a..855926e 100644
> --- a/arch/arm/kernel/entry-common.S
> +++ b/arch/arm/kernel/entry-common.S
> @@ -361,6 +361,15 @@ ENTRY(vector_swi)
>  	str	r8, [sp, #S_PSR]		@ Save CPSR
>  	str	r0, [sp, #S_OLD_R0]		@ Save OLD_R0
>  	zero_fp
> +	enable_irq
> +	ct_user_exit
> +
> +#ifdef CONFIG_ALIGNMENT_TRAP
> +	ldr	ip, __cr_alignment
> +	ldr	ip, [ip]
> +	mcr	p15, 0, ip, c1, c0		@ update control register
> +#endif

This is wrong.  you must set up the align bit in the control register 
_before_ enabling IRQs or an IRQ handler might run without alignment 
fixup.

Otherwise the patch looks good to me.


Nicolas
Will Deacon May 31, 2013, 8:45 a.m. UTC | #2
On Fri, May 31, 2013 at 04:54:56AM +0100, Nicolas Pitre wrote:
> On Thu, 30 May 2013, Will Deacon wrote:
> 
> > On Thu, May 30, 2013 at 10:09:49AM +0100, Will Deacon wrote:
> > > On Thu, May 30, 2013 at 02:41:42AM +0100, Wang, Yalin wrote:
> > > > If you have some patch for this issue,
> > > > I can do the test for it .
> > > 
> > > I'll have a look at cooking something which uses an exception table entry
> > > to rewind the PC and retry the system call. That's simpler than directly
> > > injecting a user page fault from the system call path.
> > 
> > Ok, please can you try the following?
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
> > index bc5bc0a..855926e 100644
> > --- a/arch/arm/kernel/entry-common.S
> > +++ b/arch/arm/kernel/entry-common.S
> > @@ -361,6 +361,15 @@ ENTRY(vector_swi)
> >  	str	r8, [sp, #S_PSR]		@ Save CPSR
> >  	str	r0, [sp, #S_OLD_R0]		@ Save OLD_R0
> >  	zero_fp
> > +	enable_irq
> > +	ct_user_exit
> > +
> > +#ifdef CONFIG_ALIGNMENT_TRAP
> > +	ldr	ip, __cr_alignment
> > +	ldr	ip, [ip]
> > +	mcr	p15, 0, ip, c1, c0		@ update control register
> > +#endif
> 
> This is wrong.  you must set up the align bit in the control register 
> _before_ enabling IRQs or an IRQ handler might run without alignment 
> fixup.

Okey doke, I can fix that up. I thought it was only needed for the network
layer, but I suppose they have interrupts over there too :)

> Otherwise the patch looks good to me.

Thanks Nicolas.

Will
Russell King - ARM Linux June 3, 2013, 10:18 a.m. UTC | #3
On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> +	/*
> +	 * We may have faulted trying to load the SWI instruction due to
> +	 * concurrent page aging on another CPU. In this case, return
> +	 * back to the swi instruction and fault the page back.
> +	 */
> +9001:
> +	sub	lr, lr, #4
> +	str	lr, [sp, #S_PC]
> +	b	ret_fast_syscall
> +#endif

The comment is wrong.  If we get here, it means that the fault from
trying to loading the instruction can't be fixed up.  Arguably, that
should result in a SIGSEGV being sent immediately, but we'll get to
that when we then try to re-load the instruction.

What it means is that the page we were trying to execute has been
unmapped beneath us.

BTW, I notice that the kernel oops was never posted to the list, so it's
impossible for other people following this thread to see what the real
problem is...
Will Deacon June 3, 2013, 10:27 a.m. UTC | #4
Hi Russell,

On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > +	/*
> > +	 * We may have faulted trying to load the SWI instruction due to
> > +	 * concurrent page aging on another CPU. In this case, return
> > +	 * back to the swi instruction and fault the page back.
> > +	 */
> > +9001:
> > +	sub	lr, lr, #4
> > +	str	lr, [sp, #S_PC]
> > +	b	ret_fast_syscall
> > +#endif
> 
> The comment is wrong.  If we get here, it means that the fault from
> trying to loading the instruction can't be fixed up.  Arguably, that
> should result in a SIGSEGV being sent immediately, but we'll get to
> that when we then try to re-load the instruction.

Why would we kill the application in this case? The reported problem is
where one CPU ages the page containing the swi instruction (mkold =>
clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
the swi and the kernel trying to read the immediate. The VMA is fine.

> What it means is that the page we were trying to execute has been
> unmapped beneath us.

Yes, as a result of the kernel aging it.

> BTW, I notice that the kernel oops was never posted to the list, so it's
> impossible for other people following this thread to see what the real
> problem is...

It was sent as an attachment I think, so I've pasted the log below (you can
see CPU0 unmapping the page from which CPU1 is assumedly executing).

Will

--->8

<1>[44330.850628] Unable to handle kernel paging request at virtual address 4020841c
<1>[44330.850750] pgd = c490c000
<1>[44330.850841] [4020841c] *pgd=84451831, *pte=bf05859d, *ppte=00000000
<0>[44330.851055] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
<4>[44330.851146] Modules linked in: hid_sony(O)
<4>[44330.851330] CPU: 1    Tainted: G        W  O  (3.4.0-perf-gf496dca-01162-gcbcc62b #1)
<4>[44330.851421] PC is at vector_swi+0x28/0x88
<4>[44330.851482] LR is at 0x40208420
<4>[44330.851604] pc : [<c000dfe8>]    lr : [<40208420>]    psr: 60000093
<4>[44330.851604] sp : e0601fb0  ip : 40092f78  fp : befcd6fc
<4>[44330.851757] r10: 4005a040  r9 : 4012fa70  r8 : 60000010
<4>[44330.851818] r7 : 00000107  r6 : 00000003  r5 : 4005a030  r4 : 5746d378
<4>[44330.851909] r3 : 5883fbf8  r2 : 00000000  r1 : befcd6d0  r0 : 00000001
<4>[44330.851971] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
<4>[44330.852093] Control: 10c5787d  Table: 84b0c06a  DAC: 00000015
<4>[44330.852154] 
<4>[44330.852154] PC: 0xc000df68:
<4>[44330.852306] df68  e31100ff 1afffff0 e59d1040 e5bde03c e16ff001 f57ff01f e95d7fff e1a00000
<4>[44330.852825] df88  e28dd00c e1b0f00e eb025aba e1a096ad e1a09689 e5991000 e3a08001 e3110c03
<4>[44330.853405] dfa8  0affffec e1a0100d e3a00001 eb000911 eaffffe8 e320f000 e24dd048 e88d1fff
<4>[44330.853954] dfc8  e28d803c e9486000 e14f8000 e58de03c e58d8040 e58d0044 e3180020 13a0a000
<4>[44330.854534] dfe8  051ea004 e59fc0ac e59cc000 ee01cf10 f1080080 e1a096ad e1a09689 e28f809c
<4>[44330.855053] e008  e3daa4ff 122a7609 159f808c e599a000 e92d0030 e31a0c03 1a000008 e3570f5f
<4>[44330.855602] e028  e24fee13 3798f107 e28d1008 e3a08000 e357080f e2270000 2a000fac ea022fd4
<4>[44330.856152] e048  e1a02007 e28d1008 e3a00000 eb0008e9 e28fe014 e1a07000 e28d1008 e3570f5f
<4>[44330.856732] 
<4>[44330.856732] SP: 0xe0601f30:
<4>[44330.856823] 1f30  00000000 00000000 c0d44f98 00000001 e18fce48 00000002 e565fa80 00000000
<4>[44330.857373] 1f50  c000dfe8 60000093 ffffffff e0601f9c 60000010 c0731518 00000001 befcd6d0
<4>[44330.857891] 1f70  00000000 5883fbf8 5746d378 4005a030 00000003 00000107 60000010 4012fa70
<4>[44330.858471] 1f90  4005a040 befcd6fc 40092f78 e0601fb0 40208420 c000dfe8 60000093 ffffffff
<4>[44330.859021] 1fb0  00000001 befcd6d0 00000000 5883fbf8 5746d378 4005a030 00000003 00000107
<4>[44330.859570] 1fd0  befcd6e8 4012fa70 4005a040 befcd6fc 40092f78 befcd6c8 4008aef9 40208420
<4>[44330.860089] 1ff0  60000010 00000001 ffe6e6e6 ffe6e6e6 00000000 00000000 00000000 00000000
<4>[44330.860669] 2010  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[44330.861218] Process ndroid.settings (pid: 25518, stack limit = 0xe06002f0)
<0>[44330.861279] Stack: (0xe0601fb0 to 0xe0602000)
<0>[44330.861401] 1fa0:                                     00000001 befcd6d0 00000000 5883fbf8
<0>[44330.861462] 1fc0: 5746d378 4005a030 00000003 00000107 befcd6e8 4012fa70 4005a040 befcd6fc
<0>[44330.861584] 1fe0: 40092f78 befcd6c8 4008aef9 40208420 60000010 00000001 ffe6e6e6 ffe6e6e6
<0>[44330.861707] Code: e58d8040 e58d0044 e3180020 13a0a000 (051ea004) 
<4>[44330.861890] ---[ end trace da227214a82491c0 ]---
<0>[44330.862012] Kernel panic - not syncing: Fatal exception
<2>[44330.862073] CPU0: stopping
<4>[44330.862195] [<c0014df8>] (unwind_backtrace+0x0/0x11c) from [<c001339c>] (handle_IPI+0x110/0x224)
<4>[44330.862256] [<c001339c>] (handle_IPI+0x110/0x224) from [<c000868c>] (gic_handle_irq+0x104/0x110)
<4>[44330.862378] [<c000868c>] (gic_handle_irq+0x104/0x110) from [<c0731580>] (__irq_svc+0x40/0x70)
<4>[44330.862470] Exception stack(0xc7439b10 to 0xc7439b58)
<4>[44330.862531] 9b00:                                     00000001 bf05f000 00000004 1a55b000
<4>[44330.862653] 9b20: bf05f59d d2b4ac08 000bf05f ca5ca03c 00000000 c7439c24 4020f000 00000001
<4>[44330.862744] 9b40: c0f2ab50 c7439b58 c0123038 c0123040 00000113 ffffffff
<4>[44330.862836] [<c0731580>] (__irq_svc+0x40/0x70) from [<c0123040>] (memblock_is_memory+0x18/0x20)
<4>[44330.862958] [<c0123040>] (memblock_is_memory+0x18/0x20) from [<c00198b4>] (__sync_icache_dcache+0x40/0x9c)
<4>[44330.863049] [<c00198b4>] (__sync_icache_dcache+0x40/0x9c) from [<c0121a44>] (ptep_clear_flush_young+0x3c/0x60)
<4>[44330.863171] [<c0121a44>] (ptep_clear_flush_young+0x3c/0x60) from [<c011d3b0>] (page_referenced_one+0x6c/0xfc)
<4>[44330.863294] [<c011d3b0>] (page_referenced_one+0x6c/0xfc) from [<c011e9e4>] (page_referenced+0x1a8/0x200)
<4>[44330.863416] [<c011e9e4>] (page_referenced+0x1a8/0x200) from [<c010522c>] (shrink_active_list.isra.49+0x1ec/0x2e8)
<4>[44330.863477] [<c010522c>] (shrink_active_list.isra.49+0x1ec/0x2e8) from [<c0106554>] (shrink_mem_cgroup_zone+0x338/0x4c4)
<4>[44330.863599] [<c0106554>] (shrink_mem_cgroup_zone+0x338/0x4c4) from [<c010740c>] (try_to_free_pages+0x2a0/0x570)
<4>[44330.863690] [<c010740c>] (try_to_free_pages+0x2a0/0x570) from [<c00fbff0>] (__alloc_pages_nodemask+0x424/0x758)
<4>[44330.863812] [<c00fbff0>] (__alloc_pages_nodemask+0x424/0x758) from [<c0116dcc>] (handle_pte_fault+0x184/0x7c8)
<4>[44330.863934] [<c0116dcc>] (handle_pte_fault+0x184/0x7c8) from [<c011751c>] (handle_mm_fault+0x10c/0x128)
<4>[44330.864057] [<c011751c>] (handle_mm_fault+0x10c/0x128) from [<c0732cc0>] (do_page_fault+0x180/0x3c0)
<4>[44330.864148] [<c0732cc0>] (do_page_fault+0x180/0x3c0) from [<c000847c>] (do_DataAbort+0x134/0x1a8)
<4>[44330.864270] [<c000847c>] (do_DataAbort+0x134/0x1a8) from [<c07316f4>] (__dabt_usr+0x34/0x40)
<4>[44330.864331] Exception stack(0xc7439fb0 to 0xc7439ff8)
<4>[44330.864453] 9fa0:                                     40baa000 40fab040 0002d440 00000000
<4>[44330.864545] 9fc0: 41ed0f10 00000001 408d72c0 41ed0f2c 0000001c 00000000 40fab400 00000001
<4>[44330.864606] 9fe0: 00000380 59c70e10 00000400 40209364 20000010 ffffffff
Russell King - ARM Linux June 3, 2013, 10:45 a.m. UTC | #5
On Mon, Jun 03, 2013 at 11:27:23AM +0100, Will Deacon wrote:
> Hi Russell,
> 
> On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> > On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > > +	/*
> > > +	 * We may have faulted trying to load the SWI instruction due to
> > > +	 * concurrent page aging on another CPU. In this case, return
> > > +	 * back to the swi instruction and fault the page back.
> > > +	 */
> > > +9001:
> > > +	sub	lr, lr, #4
> > > +	str	lr, [sp, #S_PC]
> > > +	b	ret_fast_syscall
> > > +#endif
> > 
> > The comment is wrong.  If we get here, it means that the fault from
> > trying to loading the instruction can't be fixed up.  Arguably, that
> > should result in a SIGSEGV being sent immediately, but we'll get to
> > that when we then try to re-load the instruction.
> 
> Why would we kill the application in this case? The reported problem is
> where one CPU ages the page containing the swi instruction (mkold =>
> clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
> the swi and the kernel trying to read the immediate. The VMA is fine.

If you mark the instruction was a user-accessing instruction, the kernel
will handle the resulting exception, trying to make the page accessible.
If it is successful, then execution resumes as normal at the faulting
instruction and continues as if nothing happened.

If it can't make the page accessible (eg, out of memory) the exception
handler path (your code above) will be called instead.  Normal action in
that case would be for a system call to return -EFAULT, but in this case
we can't know what the syscall was, so we don't know if userspace will
even pay attention to the returned error code.  In any case, if the page
is no longer accessible, it's going to end up being killed by a SEGV
when we eventually return to userspace anyway.

> > What it means is that the page we were trying to execute has been
> > unmapped beneath us.
> 
> Yes, as a result of the kernel aging it.

No - see above.  The exception path is for more serious conditions than
that.
Will Deacon June 3, 2013, 12:39 p.m. UTC | #6
On Mon, Jun 03, 2013 at 11:45:34AM +0100, Russell King - ARM Linux wrote:
> On Mon, Jun 03, 2013 at 11:27:23AM +0100, Will Deacon wrote:
> > On Mon, Jun 03, 2013 at 11:18:09AM +0100, Russell King - ARM Linux wrote:
> > > On Thu, May 30, 2013 at 12:41:12PM +0100, Will Deacon wrote:
> > > > +#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
> > > > +	/*
> > > > +	 * We may have faulted trying to load the SWI instruction due to
> > > > +	 * concurrent page aging on another CPU. In this case, return
> > > > +	 * back to the swi instruction and fault the page back.
> > > > +	 */
> > > > +9001:
> > > > +	sub	lr, lr, #4
> > > > +	str	lr, [sp, #S_PC]
> > > > +	b	ret_fast_syscall
> > > > +#endif
> > > 
> > > The comment is wrong.  If we get here, it means that the fault from
> > > trying to loading the instruction can't be fixed up.  Arguably, that
> > > should result in a SIGSEGV being sent immediately, but we'll get to
> > > that when we then try to re-load the instruction.
> > 
> > Why would we kill the application in this case? The reported problem is
> > where one CPU ages the page containing the swi instruction (mkold =>
> > clears L_PTE_YOUNG => write 0 to the pte) in between the other CPU executing
> > the swi and the kernel trying to read the immediate. The VMA is fine.
> 
> If you mark the instruction was a user-accessing instruction, the kernel
> will handle the resulting exception, trying to make the page accessible.
> If it is successful, then execution resumes as normal at the faulting
> instruction and continues as if nothing happened.
> 
> If it can't make the page accessible (eg, out of memory) the exception
> handler path (your code above) will be called instead.  Normal action in
> that case would be for a system call to return -EFAULT, but in this case
> we can't know what the syscall was, so we don't know if userspace will
> even pay attention to the returned error code.  In any case, if the page
> is no longer accessible, it's going to end up being killed by a SEGV
> when we eventually return to userspace anyway.

Yes, of course, the fault handling will sort out non-fatal faults for us, so
I'll update the comment.

Thanks,

Will
diff mbox

Patch

diff --git a/arch/arm/kernel/entry-common.S b/arch/arm/kernel/entry-common.S
index bc5bc0a..855926e 100644
--- a/arch/arm/kernel/entry-common.S
+++ b/arch/arm/kernel/entry-common.S
@@ -361,6 +361,15 @@  ENTRY(vector_swi)
 	str	r8, [sp, #S_PSR]		@ Save CPSR
 	str	r0, [sp, #S_OLD_R0]		@ Save OLD_R0
 	zero_fp
+	enable_irq
+	ct_user_exit
+
+#ifdef CONFIG_ALIGNMENT_TRAP
+	ldr	ip, __cr_alignment
+	ldr	ip, [ip]
+	mcr	p15, 0, ip, c1, c0		@ update control register
+#endif
+	get_thread_info tsk
 
 	/*
 	 * Get the system call number.
@@ -375,9 +384,9 @@  ENTRY(vector_swi)
 #ifdef CONFIG_ARM_THUMB
 	tst	r8, #PSR_T_BIT
 	movne	r10, #0				@ no thumb OABI emulation
-	ldreq	r10, [lr, #-4]			@ get SWI instruction
+ USER(	ldreq	r10, [lr, #-4]		)	@ get SWI instruction
 #else
-	ldr	r10, [lr, #-4]			@ get SWI instruction
+ USER(	ldr	r10, [lr, #-4]		)	@ get SWI instruction
 #endif
 #ifdef CONFIG_CPU_ENDIAN_BE8
 	rev	r10, r10			@ little endian instruction
@@ -392,22 +401,13 @@  ENTRY(vector_swi)
 	/* Legacy ABI only, possibly thumb mode. */
 	tst	r8, #PSR_T_BIT			@ this is SPSR from save_user_regs
 	addne	scno, r7, #__NR_SYSCALL_BASE	@ put OS number in
-	ldreq	scno, [lr, #-4]
+ USER(	ldreq	scno, [lr, #-4]		)
 
 #else
 	/* Legacy ABI only. */
-	ldr	scno, [lr, #-4]			@ get SWI instruction
-#endif
-
-#ifdef CONFIG_ALIGNMENT_TRAP
-	ldr	ip, __cr_alignment
-	ldr	ip, [ip]
-	mcr	p15, 0, ip, c1, c0		@ update control register
+ USER(	ldr	scno, [lr, #-4]		)	@ get SWI instruction
 #endif
-	enable_irq
-	ct_user_exit
 
-	get_thread_info tsk
 	adr	tbl, sys_call_table		@ load syscall table pointer
 
 #if defined(CONFIG_OABI_COMPAT)
@@ -442,6 +442,18 @@  local_restart:
 	eor	r0, scno, #__NR_SYSCALL_BASE	@ put OS number back
 	bcs	arm_syscall	
 	b	sys_ni_syscall			@ not private func
+
+#if defined(CONFIG_OABI_COMPAT) || !defined(CONFIG_AEABI)
+	/*
+	 * We may have faulted trying to load the SWI instruction due to
+	 * concurrent page aging on another CPU. In this case, return
+	 * back to the swi instruction and fault the page back.
+	 */
+9001:
+	sub	lr, lr, #4
+	str	lr, [sp, #S_PC]
+	b	ret_fast_syscall
+#endif
 ENDPROC(vector_swi)
 
 	/*