diff mbox series

[v5] riscv: entry: set a0 = -ENOSYS only when syscall != -1

Message ID 20230801141607.435192-1-CoelacanthusHex@gmail.com (mailing list archive)
State Accepted
Commit 52449c17bdd1540940e21511612b58acebc49c06
Headers show
Series [v5] riscv: entry: set a0 = -ENOSYS only when syscall != -1 | expand

Checks

Context Check Description
conchuod/cover_letter success Single patches do not need cover letters
conchuod/tree_selection success Guessed tree name to be fixes at HEAD ab2dbc7acced
conchuod/fixes_present success Fixes tag present in non-next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 4 and now 4
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 9 this patch: 9
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 9 this patch: 9
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 3 this patch: 3
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning CHECK: From:/Signed-off-by: email comments mismatch: 'From: Celeste Liu <coelacanthushex@gmail.com>' != 'Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>' WARNING: Reported-by: should be immediately followed by Closes: with a URL to the report
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success Fixes tag looks correct
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Celeste Liu Aug. 1, 2023, 2:15 p.m. UTC
When we test seccomp with 6.4 kernel, we found errno has wrong value.
If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
entry: Save a0 prior syscall_enter_from_user_mode()").

After analysing code, we think that regs->a0 = -ENOSYS should only be
executed when syscall != -1. In __seccomp_filter, when seccomp rejected
this syscall with specified errno, they will set a0 to return number as
syscall ABI, and then return -1. This return number is finally pass as
return number of syscall_enter_from_user_mode, and then is compared with
NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
is always executed. It covered a0 set by seccomp, so we always get
ENOSYS when match seccomp RET_ERRNO rule.

Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
Reported-by: Felix Yan <felixonmars@archlinux.org>
Co-developed-by: Ruizhe Pan <c141028@gmail.com>
Signed-off-by: Ruizhe Pan <c141028@gmail.com>
Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
Tested-by: Felix Yan <felixonmars@archlinux.org>
Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
---

v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
v3 -> v4: use long instead of ulong to reduce type cast and avoid
          implementation-defined behavior, and make the judgment of syscall
          invalid more explicit
v2 -> v3: use if-statement instead of set default value,
          clarify the type of syscall
v1 -> v2: added explanation on why always got ENOSYS

 arch/riscv/kernel/traps.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

patchwork-bot+linux-riscv@kernel.org Aug. 17, 2023, 3:20 p.m. UTC | #1
Hello:

This patch was applied to riscv/linux.git (fixes)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Tue,  1 Aug 2023 22:15:16 +0800 you wrote:
> When we test seccomp with 6.4 kernel, we found errno has wrong value.
> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
> entry: Save a0 prior syscall_enter_from_user_mode()").
> 
> After analysing code, we think that regs->a0 = -ENOSYS should only be
> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
> this syscall with specified errno, they will set a0 to return number as
> syscall ABI, and then return -1. This return number is finally pass as
> return number of syscall_enter_from_user_mode, and then is compared with
> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
> is always executed. It covered a0 set by seccomp, so we always get
> ENOSYS when match seccomp RET_ERRNO rule.
> 
> [...]

Here is the summary with links:
  - [v5] riscv: entry: set a0 = -ENOSYS only when syscall != -1
    https://git.kernel.org/riscv/c/52449c17bdd1

You are awesome, thank you!
Dmitry V . Levin June 27, 2024, 7:14 a.m. UTC | #2
Hi,

On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
> When we test seccomp with 6.4 kernel, we found errno has wrong value.
> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
> entry: Save a0 prior syscall_enter_from_user_mode()").
> 
> After analysing code, we think that regs->a0 = -ENOSYS should only be
> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
> this syscall with specified errno, they will set a0 to return number as
> syscall ABI, and then return -1. This return number is finally pass as
> return number of syscall_enter_from_user_mode, and then is compared with
> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
> is always executed. It covered a0 set by seccomp, so we always get
> ENOSYS when match seccomp RET_ERRNO rule.
> 
> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
> Reported-by: Felix Yan <felixonmars@archlinux.org>
> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
> Tested-by: Felix Yan <felixonmars@archlinux.org>
> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
> Reviewed-by: Guo Ren <guoren@kernel.org>
> ---
> 
> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
> v3 -> v4: use long instead of ulong to reduce type cast and avoid
>           implementation-defined behavior, and make the judgment of syscall
>           invalid more explicit
> v2 -> v3: use if-statement instead of set default value,
>           clarify the type of syscall
> v1 -> v2: added explanation on why always got ENOSYS
> 
>  arch/riscv/kernel/traps.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index f910dfccbf5d2..729f79c97e2bf 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>  {
>  	if (user_mode(regs)) {
> -		ulong syscall = regs->a7;
> +		long syscall = regs->a7;
>  
>  		regs->epc += 4;
>  		regs->orig_a0 = regs->a0;
> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>  
>  		syscall = syscall_enter_from_user_mode(regs, syscall);
>  
> -		if (syscall < NR_syscalls)
> +		if (syscall >= 0 && syscall < NR_syscalls)
>  			syscall_handler(regs, syscall);
> -		else
> +		else if (syscall != -1)
>  			regs->a0 = -ENOSYS;
>  
>  		syscall_exit_to_user_mode(regs);

Unfortunately, this change introduced a regression: it broke strace
syscall tampering on riscv.  When the tracer changes syscall number to -1,
the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
return the error code of the failed syscall to userspace.

I wish you were running strace test suite before changing this part of the
kernel.  Now I'm going to apply a workaround [1] in strace, but please
note that riscv seems to be the only linux architecture where such a
workaround is currently required.

There was a similar kernel bug once on parisc, but it was fixed [2]
several years ago by commit b7dc5a071ddf.

[1] https://github.com/strace/strace/commit/c3ae2b27732952663a3600269884e363cb77a024
[2] https://git.kernel.org/torvalds/c/b7dc5a071ddf69c0350396b203cba32fe5bab510
Celeste Liu June 27, 2024, 7:47 a.m. UTC | #3
On 2024-06-27 15:14, Dmitry V. Levin wrote:

> Hi,
> 
> On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
>> When we test seccomp with 6.4 kernel, we found errno has wrong value.
>> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
>> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
>> entry: Save a0 prior syscall_enter_from_user_mode()").
>>
>> After analysing code, we think that regs->a0 = -ENOSYS should only be
>> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
>> this syscall with specified errno, they will set a0 to return number as
>> syscall ABI, and then return -1. This return number is finally pass as
>> return number of syscall_enter_from_user_mode, and then is compared with
>> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
>> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
>> is always executed. It covered a0 set by seccomp, so we always get
>> ENOSYS when match seccomp RET_ERRNO rule.
>>
>> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
>> Reported-by: Felix Yan <felixonmars@archlinux.org>
>> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
>> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
>> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
>> Tested-by: Felix Yan <felixonmars@archlinux.org>
>> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
>> Reviewed-by: Guo Ren <guoren@kernel.org>
>> ---
>>
>> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
>> v3 -> v4: use long instead of ulong to reduce type cast and avoid
>>           implementation-defined behavior, and make the judgment of syscall
>>           invalid more explicit
>> v2 -> v3: use if-statement instead of set default value,
>>           clarify the type of syscall
>> v1 -> v2: added explanation on why always got ENOSYS
>>
>>  arch/riscv/kernel/traps.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
>> index f910dfccbf5d2..729f79c97e2bf 100644
>> --- a/arch/riscv/kernel/traps.c
>> +++ b/arch/riscv/kernel/traps.c
>> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
>>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>  {
>>  	if (user_mode(regs)) {
>> -		ulong syscall = regs->a7;
>> +		long syscall = regs->a7;
>>  
>>  		regs->epc += 4;
>>  		regs->orig_a0 = regs->a0;
>> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>  
>>  		syscall = syscall_enter_from_user_mode(regs, syscall);
>>  
>> -		if (syscall < NR_syscalls)
>> +		if (syscall >= 0 && syscall < NR_syscalls)
>>  			syscall_handler(regs, syscall);
>> -		else
>> +		else if (syscall != -1)
>>  			regs->a0 = -ENOSYS;
>>  
>>  		syscall_exit_to_user_mode(regs);
> 
> Unfortunately, this change introduced a regression: it broke strace
> syscall tampering on riscv.  When the tracer changes syscall number to -1,
> the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
> return the error code of the failed syscall to userspace.

In the patch v2, we actually do the right thing. But as Björn Töpel's
suggestion and we found cast long to ulong is implementation-defined
behavior in C, so we change it to current form. So revert this patch and
apply patch v2 should fix this issue. Patch v2 uses ths same way with
other architectures.

[1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/

> 
> I wish you were running strace test suite before changing this part of the
> kernel.  Now I'm going to apply a workaround [1] in strace, but please
> note that riscv seems to be the only linux architecture where such a
> workaround is currently required.
> 
> There was a similar kernel bug once on parisc, but it was fixed [2]
> several years ago by commit b7dc5a071ddf.
> 
> [1] https://github.com/strace/strace/commit/c3ae2b27732952663a3600269884e363cb77a024
> [2] https://git.kernel.org/torvalds/c/b7dc5a071ddf69c0350396b203cba32fe5bab510
> 
>
Andreas Schwab June 27, 2024, 8:10 a.m. UTC | #4
On Jun 27 2024, Celeste Liu wrote:

> suggestion and we found cast long to ulong is implementation-defined
> behavior in C

There is nothing implementaton-defined in a long to ulong conversion.
Celeste Liu June 27, 2024, 9:35 a.m. UTC | #5
On 2024-06-27 16:10, Andreas Schwab wrote:

> On Jun 27 2024, Celeste Liu wrote:
> 
>> suggestion and we found cast long to ulong is implementation-defined
>> behavior in C
> 
> There is nothing implementaton-defined in a long to ulong conversion.
> 

Ok, I misremembered. long to ulong is modulo arithmetic but ulong to long
is implementation-defined behavior. There is no type cast issue.
Björn Töpel June 27, 2024, 9:43 a.m. UTC | #6
On Thu, Jun 27, 2024 at 9:47 AM Celeste Liu <coelacanthushex@gmail.com> wrote:
>
> On 2024-06-27 15:14, Dmitry V. Levin wrote:
>
> > Hi,
> >
> > On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
> >> When we test seccomp with 6.4 kernel, we found errno has wrong value.
> >> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
> >> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
> >> entry: Save a0 prior syscall_enter_from_user_mode()").
> >>
> >> After analysing code, we think that regs->a0 = -ENOSYS should only be
> >> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
> >> this syscall with specified errno, they will set a0 to return number as
> >> syscall ABI, and then return -1. This return number is finally pass as
> >> return number of syscall_enter_from_user_mode, and then is compared with
> >> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
> >> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
> >> is always executed. It covered a0 set by seccomp, so we always get
> >> ENOSYS when match seccomp RET_ERRNO rule.
> >>
> >> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
> >> Reported-by: Felix Yan <felixonmars@archlinux.org>
> >> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
> >> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
> >> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> >> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> >> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
> >> Tested-by: Felix Yan <felixonmars@archlinux.org>
> >> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> >> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
> >> Reviewed-by: Guo Ren <guoren@kernel.org>
> >> ---
> >>
> >> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
> >> v3 -> v4: use long instead of ulong to reduce type cast and avoid
> >>           implementation-defined behavior, and make the judgment of syscall
> >>           invalid more explicit
> >> v2 -> v3: use if-statement instead of set default value,
> >>           clarify the type of syscall
> >> v1 -> v2: added explanation on why always got ENOSYS
> >>
> >>  arch/riscv/kernel/traps.c | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> >> index f910dfccbf5d2..729f79c97e2bf 100644
> >> --- a/arch/riscv/kernel/traps.c
> >> +++ b/arch/riscv/kernel/traps.c
> >> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
> >>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> >>  {
> >>      if (user_mode(regs)) {
> >> -            ulong syscall = regs->a7;
> >> +            long syscall = regs->a7;
> >>
> >>              regs->epc += 4;
> >>              regs->orig_a0 = regs->a0;
> >> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> >>
> >>              syscall = syscall_enter_from_user_mode(regs, syscall);
> >>
> >> -            if (syscall < NR_syscalls)
> >> +            if (syscall >= 0 && syscall < NR_syscalls)
> >>                      syscall_handler(regs, syscall);
> >> -            else
> >> +            else if (syscall != -1)
> >>                      regs->a0 = -ENOSYS;
> >>
> >>              syscall_exit_to_user_mode(regs);
> >
> > Unfortunately, this change introduced a regression: it broke strace
> > syscall tampering on riscv.  When the tracer changes syscall number to -1,
> > the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
> > return the error code of the failed syscall to userspace.
>
> In the patch v2, we actually do the right thing. But as Björn Töpel's
> suggestion and we found cast long to ulong is implementation-defined
> behavior in C, so we change it to current form. So revert this patch and
> apply patch v2 should fix this issue. Patch v2 uses ths same way with
> other architectures.
>
> [1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/

Not reverting, but a fix to make sure that a0 is initialized to -ENOSYS, e.g.:

--8<--
diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 05a16b1f0aee..51ebfd23e007 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)

  regs->epc += 4;
  regs->orig_a0 = regs->a0;
+ regs->a0 = -ENOSYS;

  riscv_v_vstate_discard(regs);

@@ -328,8 +329,7 @@ void do_trap_ecall_u(struct pt_regs *regs)

  if (syscall >= 0 && syscall < NR_syscalls)
  syscall_handler(regs, syscall);
- else if (syscall != -1)
- regs->a0 = -ENOSYS;
+
  /*
  * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
  * so the maximum stack offset is 1k bytes (10 bits).
--8<--

Celeste, do you want to cook that fix properly?


Björn
Dmitry V . Levin June 27, 2024, 9:52 a.m. UTC | #7
On Thu, Jun 27, 2024 at 11:43:03AM +0200, Björn Töpel wrote:
> On Thu, Jun 27, 2024 at 9:47 AM Celeste Liu <coelacanthushex@gmail.com> wrote:
> > On 2024-06-27 15:14, Dmitry V. Levin wrote:
> >
> > > Hi,
> > >
> > > On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
> > >> When we test seccomp with 6.4 kernel, we found errno has wrong value.
> > >> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
> > >> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
> > >> entry: Save a0 prior syscall_enter_from_user_mode()").
> > >>
> > >> After analysing code, we think that regs->a0 = -ENOSYS should only be
> > >> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
> > >> this syscall with specified errno, they will set a0 to return number as
> > >> syscall ABI, and then return -1. This return number is finally pass as
> > >> return number of syscall_enter_from_user_mode, and then is compared with
> > >> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
> > >> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
> > >> is always executed. It covered a0 set by seccomp, so we always get
> > >> ENOSYS when match seccomp RET_ERRNO rule.
> > >>
> > >> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
> > >> Reported-by: Felix Yan <felixonmars@archlinux.org>
> > >> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
> > >> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
> > >> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> > >> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> > >> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
> > >> Tested-by: Felix Yan <felixonmars@archlinux.org>
> > >> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> > >> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
> > >> Reviewed-by: Guo Ren <guoren@kernel.org>
> > >> ---
> > >>
> > >> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
> > >> v3 -> v4: use long instead of ulong to reduce type cast and avoid
> > >>           implementation-defined behavior, and make the judgment of syscall
> > >>           invalid more explicit
> > >> v2 -> v3: use if-statement instead of set default value,
> > >>           clarify the type of syscall
> > >> v1 -> v2: added explanation on why always got ENOSYS
> > >>
> > >>  arch/riscv/kernel/traps.c | 6 +++---
> > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> > >> index f910dfccbf5d2..729f79c97e2bf 100644
> > >> --- a/arch/riscv/kernel/traps.c
> > >> +++ b/arch/riscv/kernel/traps.c
> > >> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
> > >>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> > >>  {
> > >>      if (user_mode(regs)) {
> > >> -            ulong syscall = regs->a7;
> > >> +            long syscall = regs->a7;
> > >>
> > >>              regs->epc += 4;
> > >>              regs->orig_a0 = regs->a0;
> > >> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> > >>
> > >>              syscall = syscall_enter_from_user_mode(regs, syscall);
> > >>
> > >> -            if (syscall < NR_syscalls)
> > >> +            if (syscall >= 0 && syscall < NR_syscalls)
> > >>                      syscall_handler(regs, syscall);
> > >> -            else
> > >> +            else if (syscall != -1)
> > >>                      regs->a0 = -ENOSYS;
> > >>
> > >>              syscall_exit_to_user_mode(regs);
> > >
> > > Unfortunately, this change introduced a regression: it broke strace
> > > syscall tampering on riscv.  When the tracer changes syscall number to -1,
> > > the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
> > > return the error code of the failed syscall to userspace.
> >
> > In the patch v2, we actually do the right thing. But as Björn Töpel's
> > suggestion and we found cast long to ulong is implementation-defined
> > behavior in C, so we change it to current form. So revert this patch and
> > apply patch v2 should fix this issue. Patch v2 uses ths same way with
> > other architectures.
> >
> > [1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/
> 
> Not reverting, but a fix to make sure that a0 is initialized to -ENOSYS, e.g.:
> 
> --8<--
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 05a16b1f0aee..51ebfd23e007 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> 
>   regs->epc += 4;
>   regs->orig_a0 = regs->a0;
> + regs->a0 = -ENOSYS;

Given that struct user_regs_struct doesn't have orig_a0, wouldn't this
clobber a0 too early so that the tracer will get -ENOSYS in place of the
first syscall argument?
Celeste Liu June 27, 2024, 10:11 a.m. UTC | #8
On 2024-06-27 17:43, Björn Töpel wrote:
> On Thu, Jun 27, 2024 at 9:47 AM Celeste Liu <coelacanthushex@gmail.com> wrote:
>>
>> On 2024-06-27 15:14, Dmitry V. Levin wrote:
>>
>>> Hi,
>>>
>>> On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
>>>> When we test seccomp with 6.4 kernel, we found errno has wrong value.
>>>> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
>>>> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
>>>> entry: Save a0 prior syscall_enter_from_user_mode()").
>>>>
>>>> After analysing code, we think that regs->a0 = -ENOSYS should only be
>>>> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
>>>> this syscall with specified errno, they will set a0 to return number as
>>>> syscall ABI, and then return -1. This return number is finally pass as
>>>> return number of syscall_enter_from_user_mode, and then is compared with
>>>> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
>>>> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
>>>> is always executed. It covered a0 set by seccomp, so we always get
>>>> ENOSYS when match seccomp RET_ERRNO rule.
>>>>
>>>> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
>>>> Reported-by: Felix Yan <felixonmars@archlinux.org>
>>>> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
>>>> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
>>>> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>>>> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>>>> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
>>>> Tested-by: Felix Yan <felixonmars@archlinux.org>
>>>> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>>>> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
>>>> Reviewed-by: Guo Ren <guoren@kernel.org>
>>>> ---
>>>>
>>>> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
>>>> v3 -> v4: use long instead of ulong to reduce type cast and avoid
>>>>           implementation-defined behavior, and make the judgment of syscall
>>>>           invalid more explicit
>>>> v2 -> v3: use if-statement instead of set default value,
>>>>           clarify the type of syscall
>>>> v1 -> v2: added explanation on why always got ENOSYS
>>>>
>>>>  arch/riscv/kernel/traps.c | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
>>>> index f910dfccbf5d2..729f79c97e2bf 100644
>>>> --- a/arch/riscv/kernel/traps.c
>>>> +++ b/arch/riscv/kernel/traps.c
>>>> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
>>>>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>>>  {
>>>>      if (user_mode(regs)) {
>>>> -            ulong syscall = regs->a7;
>>>> +            long syscall = regs->a7;
>>>>
>>>>              regs->epc += 4;
>>>>              regs->orig_a0 = regs->a0;
>>>> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>>>
>>>>              syscall = syscall_enter_from_user_mode(regs, syscall);
>>>>
>>>> -            if (syscall < NR_syscalls)
>>>> +            if (syscall >= 0 && syscall < NR_syscalls)
>>>>                      syscall_handler(regs, syscall);
>>>> -            else
>>>> +            else if (syscall != -1)
>>>>                      regs->a0 = -ENOSYS;
>>>>
>>>>              syscall_exit_to_user_mode(regs);
>>>
>>> Unfortunately, this change introduced a regression: it broke strace
>>> syscall tampering on riscv.  When the tracer changes syscall number to -1,
>>> the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
>>> return the error code of the failed syscall to userspace.
>>
>> In the patch v2, we actually do the right thing. But as Björn Töpel's
>> suggestion and we found cast long to ulong is implementation-defined
>> behavior in C, so we change it to current form. So revert this patch and
>> apply patch v2 should fix this issue. Patch v2 uses ths same way with
>> other architectures.
>>
>> [1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/
> 
> Not reverting, but a fix to make sure that a0 is initialized to -ENOSYS, e.g.:

Oh. I just want to describe what change we need, not to say actual 'git revert'.

> 
> --8<--
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 05a16b1f0aee..51ebfd23e007 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> 
>   regs->epc += 4;
>   regs->orig_a0 = regs->a0;
> + regs->a0 = -ENOSYS;
> 
>   riscv_v_vstate_discard(regs);
> 
> @@ -328,8 +329,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> 
>   if (syscall >= 0 && syscall < NR_syscalls)
>   syscall_handler(regs, syscall);
> - else if (syscall != -1)
> - regs->a0 = -ENOSYS;
> +
>   /*
>   * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
>   * so the maximum stack offset is 1k bytes (10 bits).
> --8<--

This is also what I think.

> Celeste, do you want to cook that fix properly?

Yeah. I will sent patch to mail list soon.

> 
> 
> Björn
Björn Töpel June 27, 2024, 10:23 a.m. UTC | #9
On Thu, Jun 27, 2024 at 11:52 AM Dmitry V. Levin <ldv@strace.io> wrote:
>
> On Thu, Jun 27, 2024 at 11:43:03AM +0200, Björn Töpel wrote:
> > On Thu, Jun 27, 2024 at 9:47 AM Celeste Liu <coelacanthushex@gmail.com> wrote:
> > > On 2024-06-27 15:14, Dmitry V. Levin wrote:
> > >
> > > > Hi,
> > > >
> > > > On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
> > > >> When we test seccomp with 6.4 kernel, we found errno has wrong value.
> > > >> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
> > > >> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
> > > >> entry: Save a0 prior syscall_enter_from_user_mode()").
> > > >>
> > > >> After analysing code, we think that regs->a0 = -ENOSYS should only be
> > > >> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
> > > >> this syscall with specified errno, they will set a0 to return number as
> > > >> syscall ABI, and then return -1. This return number is finally pass as
> > > >> return number of syscall_enter_from_user_mode, and then is compared with
> > > >> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
> > > >> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
> > > >> is always executed. It covered a0 set by seccomp, so we always get
> > > >> ENOSYS when match seccomp RET_ERRNO rule.
> > > >>
> > > >> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
> > > >> Reported-by: Felix Yan <felixonmars@archlinux.org>
> > > >> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
> > > >> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
> > > >> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> > > >> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
> > > >> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
> > > >> Tested-by: Felix Yan <felixonmars@archlinux.org>
> > > >> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> > > >> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
> > > >> Reviewed-by: Guo Ren <guoren@kernel.org>
> > > >> ---
> > > >>
> > > >> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
> > > >> v3 -> v4: use long instead of ulong to reduce type cast and avoid
> > > >>           implementation-defined behavior, and make the judgment of syscall
> > > >>           invalid more explicit
> > > >> v2 -> v3: use if-statement instead of set default value,
> > > >>           clarify the type of syscall
> > > >> v1 -> v2: added explanation on why always got ENOSYS
> > > >>
> > > >>  arch/riscv/kernel/traps.c | 6 +++---
> > > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >>
> > > >> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> > > >> index f910dfccbf5d2..729f79c97e2bf 100644
> > > >> --- a/arch/riscv/kernel/traps.c
> > > >> +++ b/arch/riscv/kernel/traps.c
> > > >> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
> > > >>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> > > >>  {
> > > >>      if (user_mode(regs)) {
> > > >> -            ulong syscall = regs->a7;
> > > >> +            long syscall = regs->a7;
> > > >>
> > > >>              regs->epc += 4;
> > > >>              regs->orig_a0 = regs->a0;
> > > >> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
> > > >>
> > > >>              syscall = syscall_enter_from_user_mode(regs, syscall);
> > > >>
> > > >> -            if (syscall < NR_syscalls)
> > > >> +            if (syscall >= 0 && syscall < NR_syscalls)
> > > >>                      syscall_handler(regs, syscall);
> > > >> -            else
> > > >> +            else if (syscall != -1)
> > > >>                      regs->a0 = -ENOSYS;
> > > >>
> > > >>              syscall_exit_to_user_mode(regs);
> > > >
> > > > Unfortunately, this change introduced a regression: it broke strace
> > > > syscall tampering on riscv.  When the tracer changes syscall number to -1,
> > > > the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
> > > > return the error code of the failed syscall to userspace.
> > >
> > > In the patch v2, we actually do the right thing. But as Björn Töpel's
> > > suggestion and we found cast long to ulong is implementation-defined
> > > behavior in C, so we change it to current form. So revert this patch and
> > > apply patch v2 should fix this issue. Patch v2 uses ths same way with
> > > other architectures.
> > >
> > > [1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/
> >
> > Not reverting, but a fix to make sure that a0 is initialized to -ENOSYS, e.g.:
> >
> > --8<--
> > diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> > index 05a16b1f0aee..51ebfd23e007 100644
> > --- a/arch/riscv/kernel/traps.c
> > +++ b/arch/riscv/kernel/traps.c
> > @@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> >
> >   regs->epc += 4;
> >   regs->orig_a0 = regs->a0;
> > + regs->a0 = -ENOSYS;
>
> Given that struct user_regs_struct doesn't have orig_a0, wouldn't this
> clobber a0 too early so that the tracer will get -ENOSYS in place of the
> first syscall argument?

No, that's ok. It's handled by various wrappers where the arguments
are pulled out.


Björn
Celeste Liu June 27, 2024, 10:38 a.m. UTC | #10
On 2024-06-27 17:43, Björn Töpel wrote:

> On Thu, Jun 27, 2024 at 9:47 AM Celeste Liu <coelacanthushex@gmail.com> wrote:
>>
>> On 2024-06-27 15:14, Dmitry V. Levin wrote:
>>
>>> Hi,
>>>
>>> On Tue, Aug 01, 2023 at 10:15:16PM +0800, Celeste Liu wrote:
>>>> When we test seccomp with 6.4 kernel, we found errno has wrong value.
>>>> If we deny NETLINK_AUDIT with EAFNOSUPPORT, after f0bddf50586d, we will
>>>> get ENOSYS instead. We got same result with commit 9c2598d43510 ("riscv:
>>>> entry: Save a0 prior syscall_enter_from_user_mode()").
>>>>
>>>> After analysing code, we think that regs->a0 = -ENOSYS should only be
>>>> executed when syscall != -1. In __seccomp_filter, when seccomp rejected
>>>> this syscall with specified errno, they will set a0 to return number as
>>>> syscall ABI, and then return -1. This return number is finally pass as
>>>> return number of syscall_enter_from_user_mode, and then is compared with
>>>> NR_syscalls after converted to ulong (so it will be ULONG_MAX). The
>>>> condition syscall < NR_syscalls will always be false, so regs->a0 = -ENOSYS
>>>> is always executed. It covered a0 set by seccomp, so we always get
>>>> ENOSYS when match seccomp RET_ERRNO rule.
>>>>
>>>> Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
>>>> Reported-by: Felix Yan <felixonmars@archlinux.org>
>>>> Co-developed-by: Ruizhe Pan <c141028@gmail.com>
>>>> Signed-off-by: Ruizhe Pan <c141028@gmail.com>
>>>> Co-developed-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>>>> Signed-off-by: Shiqi Zhang <shiqi@isrc.iscas.ac.cn>
>>>> Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com>
>>>> Tested-by: Felix Yan <felixonmars@archlinux.org>
>>>> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>>>> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
>>>> Reviewed-by: Guo Ren <guoren@kernel.org>
>>>> ---
>>>>
>>>> v4 -> v5: add Tested-by Emil Renner Berthing <emil.renner.berthing@canonical.com>
>>>> v3 -> v4: use long instead of ulong to reduce type cast and avoid
>>>>           implementation-defined behavior, and make the judgment of syscall
>>>>           invalid more explicit
>>>> v2 -> v3: use if-statement instead of set default value,
>>>>           clarify the type of syscall
>>>> v1 -> v2: added explanation on why always got ENOSYS
>>>>
>>>>  arch/riscv/kernel/traps.c | 6 +++---
>>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
>>>> index f910dfccbf5d2..729f79c97e2bf 100644
>>>> --- a/arch/riscv/kernel/traps.c
>>>> +++ b/arch/riscv/kernel/traps.c
>>>> @@ -297,7 +297,7 @@ asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
>>>>  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>>>  {
>>>>      if (user_mode(regs)) {
>>>> -            ulong syscall = regs->a7;
>>>> +            long syscall = regs->a7;
>>>>
>>>>              regs->epc += 4;
>>>>              regs->orig_a0 = regs->a0;
>>>> @@ -306,9 +306,9 @@ asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
>>>>
>>>>              syscall = syscall_enter_from_user_mode(regs, syscall);
>>>>
>>>> -            if (syscall < NR_syscalls)
>>>> +            if (syscall >= 0 && syscall < NR_syscalls)
>>>>                      syscall_handler(regs, syscall);
>>>> -            else
>>>> +            else if (syscall != -1)
>>>>                      regs->a0 = -ENOSYS;
>>>>
>>>>              syscall_exit_to_user_mode(regs);
>>>
>>> Unfortunately, this change introduced a regression: it broke strace
>>> syscall tampering on riscv.  When the tracer changes syscall number to -1,
>>> the kernel fails to initialize a0 with -ENOSYS and subsequently fails to
>>> return the error code of the failed syscall to userspace.
>>
>> In the patch v2, we actually do the right thing. But as Björn Töpel's
>> suggestion and we found cast long to ulong is implementation-defined
>> behavior in C, so we change it to current form. So revert this patch and
>> apply patch v2 should fix this issue. Patch v2 uses ths same way with
>> other architectures.
>>
>> [1]: https://lore.kernel.org/all/20230718162940.226118-1-CoelacanthusHex@gmail.com/
> 
> Not reverting, but a fix to make sure that a0 is initialized to -ENOSYS, e.g.:
> 
> --8<--
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index 05a16b1f0aee..51ebfd23e007 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -319,6 +319,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> 
>   regs->epc += 4;
>   regs->orig_a0 = regs->a0;
> + regs->a0 = -ENOSYS;
> 
>   riscv_v_vstate_discard(regs);
> 
> @@ -328,8 +329,7 @@ void do_trap_ecall_u(struct pt_regs *regs)
> 
>   if (syscall >= 0 && syscall < NR_syscalls)
>   syscall_handler(regs, syscall);
> - else if (syscall != -1)
> - regs->a0 = -ENOSYS;
> +
>   /*
>   * Ultimately, this value will get limited by KSTACK_OFFSET_MAX(),
>   * so the maximum stack offset is 1k bytes (10 bits).
> --8<--
> 
> Celeste, do you want to cook that fix properly?

Patch has been sent.

https://lore.kernel.org/all/20240627103205.27914-2-CoelacanthusHex@gmail.com/

And linux-riscv mail list problably has some issue: there are two copies
of patch, orginal one and another one with linux-riscv foot.

> 
> 
> Björn
diff mbox series

Patch

diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index f910dfccbf5d2..729f79c97e2bf 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -297,7 +297,7 @@  asmlinkage __visible __trap_section void do_trap_break(struct pt_regs *regs)
 asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
 {
 	if (user_mode(regs)) {
-		ulong syscall = regs->a7;
+		long syscall = regs->a7;
 
 		regs->epc += 4;
 		regs->orig_a0 = regs->a0;
@@ -306,9 +306,9 @@  asmlinkage __visible __trap_section void do_trap_ecall_u(struct pt_regs *regs)
 
 		syscall = syscall_enter_from_user_mode(regs, syscall);
 
-		if (syscall < NR_syscalls)
+		if (syscall >= 0 && syscall < NR_syscalls)
 			syscall_handler(regs, syscall);
-		else
+		else if (syscall != -1)
 			regs->a0 = -ENOSYS;
 
 		syscall_exit_to_user_mode(regs);