ARM: Fix "external abort on non-linefetch" kernel panic caused by userspace

Message ID 20240706032005.122654-1-wanglinhui@huawei.com (mailing list archive)
State New, archived

Commit Message

wanglinhui July 6, 2024, 3:20 a.m. UTC
0x16800000 is a peripheral physical address that supports only
4-byte-aligned access.

Map 0x16800000 into user space through /dev/mem. Userspace then
unexpectedly read four bytes from 0x16800001 (through the corresponding
virtual address), which caused the kernel to panic with
"external abort on non-linefetch":

  Unhandled fault: external abort on non-linefetch (0x1018) at 0x0100129b
  [0100129b] *pgd=85038831, *pte=16801703, *ppte=16801e33
  Internal error: : 1018 [#1] SMP ARM
  ...
  CPU: 2 PID: xxxx Comm: xxxx Tainted: G           O      5.10.0 #1
  Hardware name: Hisilicon A9
  PC is at do_alignment_ldrstr+0xb8/0x100
  LR is at 0xc1f203fc
  psr: 200f0313
  sp : c7081ed4  ip : 00000008  fp : 00000011
  r10: b42250c8  r9 : c7081f0c  r8 : c7081fb0
  r7 : 0100129b  r6 : 00000004  r5 : 00000000  r4 : e5908000
  r3 : 00000000  r2 : c7081f0c  r1 : 200f0210  r0 : 0100129b
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
  Control: 1ac5387d  Table: 82c3c04a  DAC: 55555555
  Process LcnNCoreTask (pid: 4049, stack limit = 0x14066b0e)
  Call trace:
    do_alignment_ldrstr
    --do_alignment
    ----do_DataAbort
    ------__dabt_usr

This access triggers two data aborts. The first is taken in user mode
when the unaligned address is accessed. The second is taken in kernel
mode when the alignment handler actually accesses the peripheral
address, and it crashes the kernel. However, the second data abort is
raised at the following location:

  ```
  #define __get8_unaligned_check(ins, val, addr, err) \
  	__asm__(\
   ARM("1: "ins" %1, [%2], #1\n") \ <-- Second data abort is triggered here
   THUMB("1: "ins" %1, [%2]\n") \
   THUMB(" add %2, %2, #1\n") \
  	"2:\n" \
  	" .pushsection .text.fixup,\"ax\"\n" \
  ```

This instruction has an exception table entry, so the fault can be fixed up.
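
As an illustration of what "can be fixed up" means, here is a minimal
user-space sketch of an exception table lookup. It is a simplified model,
not the kernel's implementation: struct ex_entry, struct fake_regs and the
model_* functions are hypothetical names, and the linear search is an
assumption made for brevity. If the faulting PC matches an entry,
execution is redirected to the fixup address instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one exception table entry. */
struct ex_entry {
	uintptr_t insn;   /* address of an instruction that may fault */
	uintptr_t fixup;  /* address to resume at if it does fault */
};

/* Hypothetical stand-in for the saved register state. */
struct fake_regs {
	uintptr_t pc;
};

/* Linear search for brevity; returns the matching entry or NULL. */
static const struct ex_entry *
model_search_extable(const struct ex_entry *tbl, size_t n, uintptr_t pc)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return &tbl[i];
	return NULL;
}

/* Returns 1 and redirects pc if an entry exists, 0 otherwise. */
static int model_fixup_exception(const struct ex_entry *tbl, size_t n,
				 struct fake_regs *regs)
{
	const struct ex_entry *e = model_search_extable(tbl, n, regs->pc);

	if (!e)
		return 0;
	regs->pc = e->fixup;
	return 1;
}
```

In this model, a fault at an address without an entry leaves pc untouched
and reports failure, which corresponds to the unhandled path.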

A second test also shows that "external abort on non-linefetch" needs
to be fixed up.

As before, map 0x16800000 into user space through /dev/mem. Then pass
0x16800001 (that is, its virtual address) to the kernel as the buffer
of a 1-byte write() system call. This also causes the kernel to panic
with "external abort on non-linefetch":

  Unhandled fault: external abort on non-linefetch (0x1018) at 0xb6f95000
  [b6f95000] *pgd=83fb6831, *pte=16800783, *ppte=16800e33
  Internal error: : 1018 [#1] SMP ARM
  ...
  CPU: 1 PID: xxxx Comm: xxxx Tainted: G           O      5.10.0 #1
  Hardware name: Hisilicon A9
  PC is at __get_user_1+0x14/0x20
  LR is at iov_iter_fault_in_readable+0x7c/0x198
  psr: 800b0213
  sp : c195be18  ip : 00000001  fp : c35a2478
  r10: c06b5260  r9 : 00000000  r8 : c356fee0
  r7 : ffffe000  r6 : b6f95000  r5 : 00000001  r4 : c195bf10
  r3 : b6f95000  r2 : f7f95000  r1 : beffffff  r0 : b6f95000
  Call trace looks like:
    __get_user_1
    --iov_iter_fault_in_readable
    ----generic_perform_write
    ------__generic_file_write_iter
    --------generic_file_write_iter

The location of the instruction that triggers the data abort
is as follows:
  ```
  ENTRY(__get_user_1)
  	check_uaccess r0, 1, r1, r2, __get_user_bad
  1: TUSER(ldrb) r2, [r0] <-- Data abort is triggered here
  	mov r0, #0
  	ret lr
  ENDPROC(__get_user_1)
  _ASM_NOKPROBE(__get_user_1)
  ```
This instruction also has an exception table entry, so the fault can be fixed up.

An address passed in from user space should not crash the kernel.
Therefore, call fixup_exception() to fix up such exceptions.

Fixes: 136848d4ca9c ("ARM: LPAE: Move the FSR definitions to separate files")

Signed-off-by: wanglinhui <wanglinhui@huawei.com>
---
 arch/arm/mm/fault.c      | 9 +++++++++
 arch/arm/mm/fsr-2level.c | 4 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Comments

Russell King (Oracle) July 6, 2024, 7:24 a.m. UTC | #1
On Sat, Jul 06, 2024 at 11:20:05AM +0800, wanglinhui wrote:
> 0x16800000 is a peripheral physical address that supports only
> 4-byte-aligned access.
> 
> Use /dev/mem to enable the user space to access 0x16800000. Then userspace
> unexpectedly tried to read four bytes from 0x16800001 (actually access
> its virtual address), which caused the kernel to trigger an
> "external abort on non-linefetch" panic:
> 
>   Unhandled fault: external abort on non-linefetch (0x1018) at 0x0100129b
>   [0100129b] *pgd=85038831, *pte=16801703, *ppte=16801e33
>   Internal error: : 1018 [#1] SMP ARM
>   ...
>   CPU: 2 PID: xxxx Comm: xxxx Tainted: G           O      5.10.0 #1
>   Hardware name: Hisilicon A9
>   PC is at do_alignment_ldrstr+0xb8/0x100
>   LR is at 0xc1f203fc
>   psr: 200f0313
>   sp : c7081ed4  ip : 00000008  fp : 00000011
>   r10: b42250c8  r9 : c7081f0c  r8 : c7081fb0
>   r7 : 0100129b  r6 : 00000004  r5 : 00000000  r4 : e5908000
>   r3 : 00000000  r2 : c7081f0c  r1 : 200f0210  r0 : 0100129b
>   Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>   Control: 1ac5387d  Table: 82c3c04a  DAC: 55555555
>   Process LcnNCoreTask (pid: 4049, stack limit = 0x14066b0e)
>   Call trace:
>     do_alignment_ldrstr
>     --do_alignment
>     ----do_DataAbort
>     ------__dabt_usr
> 
> It triggers a data abort exception twice. The first time occurs when
> an unaligned address is accessed in user mode. The second time occurs
> when the peripheral address is actually accessed in kernel mode,
> and it crashes the kernel. However, the code location for the second
> data abort is as follows:
> 
>   ```
>   #define __get8_unaligned_check(ins, val, addr, err) \
>   	__asm__(\
>    ARM("1: "ins" %1, [%2], #1\n") \ <-- Second data abort is triggered here
>    THUMB("1: "ins" %1, [%2]\n") \
>    THUMB(" add %2, %2, #1\n") \
>   	"2:\n" \
>   	" .pushsection .text.fixup,\"ax\"\n" \
>   ```
> 
> It is an exception table entry that can be fixed up.
> 
> There is another test that indicates that
> "external abort on non-linefetch" needs to be fixed up.
> 
> Similarly, use /dev/mem to map 0x16800000 to the user space.
> Pass 0x16800001 (actually passes its virtual address) to the
> kernel via the write() system call and write 1 byte.
> It also causes the kernel to trigger an
> "external abort on non-linefetch" panic:
> 
>   Unhandled fault: external abort on non-linefetch (0x1018) at 0xb6f95000
>   [b6f95000] *pgd=83fb6831, *pte=16800783, *ppte=16800e33
>   Internal error: : 1018 [#1] SMP ARM
>   ...
>   CPU: 1 PID: xxxx Comm: xxxx Tainted: G           O      5.10.0 #1
>   Hardware name: Hisilicon A9
>   PC is at __get_user_1+0x14/0x20
>   LR is at iov_iter_fault_in_readable+0x7c/0x198
>   psr: 800b0213
>   sp : c195be18  ip : 00000001  fp : c35a2478
>   r10: c06b5260  r9 : 00000000  r8 : c356fee0
>   r7 : ffffe000  r6 : b6f95000  r5 : 00000001  r4 : c195bf10
>   r3 : b6f95000  r2 : f7f95000  r1 : beffffff  r0 : b6f95000
>   Call trace looks like:
>     __get_user_1
>     --iov_iter_fault_in_readable
>     ----generic_perform_write
>     ------__generic_file_write_iter
>     --------generic_file_write_iter
> 
> The location of the instruction that triggers the data abort
> is as follows:
>   ```
>   ENTRY(__get_user_1)
>   	check_uaccess r0, 1, r1, r2, __get_user_bad
>   1: TUSER(ldrb) r2, [r0] <-- Data abort is triggered here
>   	mov r0, #0
>   	ret lr
>   ENDPROC(__get_user_1)
>   _ASM_NOKPROBE(__get_user_1)
>   ```
> It is also an exception table entry that can be fixed up.
> 
> Address passed in from user space should not crash the kernel.
> Therefore, fixup_exception() is added to fix up such exception.

NAK because:

1) you're using /dev/mem which requires privileges - you're holding
   the gun, pointing it at your foot.

2) you're performing an unaligned access to a device which is
   architecturally not permitted - you're pulling the trigger.

It's not surprising that the result is you've shot yourself in the
foot!

If you access /dev/mem, then you need to know what you're doing and
you must access it according to the requirements of the memory space
you are accessing, otherwise undefined behaviour will occur - not
only architecturally, but also by the kernel.
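
To make "access it according to the requirements of the memory space"
concrete, here is a minimal user-space sketch of a checked 32-bit
register read. The helper name reg_read32 and its parameters are
hypothetical; it assumes base is itself 4-byte aligned (as a pointer
returned by mmap() would be) and rejects offsets that are unaligned or
outside the mapped window before touching the memory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: read one 32-bit register at byte offset `off`
 * inside a mapped window of `len` bytes starting at `base`.
 * Returns 0 on success, -EINVAL for an unaligned offset and
 * -ERANGE for an offset outside the window.
 */
static int reg_read32(const volatile void *base, size_t len, size_t off,
		      uint32_t *out)
{
	assert(base != NULL && out != NULL);

	if (off % sizeof(uint32_t) != 0)
		return -EINVAL;
	if (len < sizeof(uint32_t) || off > len - sizeof(uint32_t))
		return -ERANGE;

	/* Single aligned 32-bit load from the window. */
	*out = *(const volatile uint32_t *)((const volatile uint8_t *)base + off);
	return 0;
}
```

With such a wrapper, a request like the 4-byte read at offset 1 in the
report above would be refused in user space instead of reaching the
device.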
wanglinhui July 6, 2024, 10:21 a.m. UTC | #2
The original problem is that the device address is mapped through UIO.
Then something unexpected happens that causes an unaligned access to
the device address.
For simplicity, I used /dev/mem for testing.

Yes, it's a privileged operation, but I don't think it should crash
the kernel.
It would be better to have the process exit on an exception signal.
What do you think?

The coredump file can then be obtained when the process exits,
which gives more information for fixing the bug.


Patch

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 67c425341a95..55776dcde015 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -558,6 +558,15 @@  do_bad(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	return 1;
 }
 
+static int do_fixup_exception(unsigned long addr, unsigned int fsr,
+					struct pt_regs *regs)
+{
+	if (fixup_exception(regs))
+		return 0;
+
+	return 1;
+}
+
 struct fsr_info {
 	int	(*fn)(unsigned long addr, unsigned int fsr, struct pt_regs *regs);
 	int	sig;
diff --git a/arch/arm/mm/fsr-2level.c b/arch/arm/mm/fsr-2level.c
index f2be95197265..a80444db9b3e 100644
--- a/arch/arm/mm/fsr-2level.c
+++ b/arch/arm/mm/fsr-2level.c
@@ -12,9 +12,9 @@  static struct fsr_info fsr_info[] = {
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"section translation fault"	   },
 	{ do_bad,		SIGBUS,	 0,		"external abort on linefetch"	   },
 	{ do_page_fault,	SIGSEGV, SEGV_MAPERR,	"page translation fault"	   },
-	{ do_bad,		SIGBUS,	 0,		"external abort on non-linefetch"  },
+	{ do_fixup_exception,	SIGBUS,	 0,		"external abort on non-linefetch"  },
 	{ do_bad,		SIGSEGV, SEGV_ACCERR,	"section domain fault"		   },
-	{ do_bad,		SIGBUS,	 0,		"external abort on non-linefetch"  },
+	{ do_fixup_exception,	SIGBUS,	 0,		"external abort on non-linefetch"  },
 	{ do_bad,		SIGSEGV, SEGV_ACCERR,	"page domain fault"		   },
 	{ do_bad,		SIGBUS,	 0,		"external abort on translation"	   },
 	{ do_sect_fault,	SIGSEGV, SEGV_ACCERR,	"section permission fault"	   },