diff mbox series

signal: Extend exec_id to 64bits

Message ID 87zhbvlyq7.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series signal: Extend exec_id to 64bits | expand

Commit Message

Eric W. Biederman April 1, 2020, 8:47 p.m. UTC
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter.  With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent.  This
bypasses the signal sending checks if the parent changes their
credentials during exec.

The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days.  Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.

Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions.  Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value.  So with very lucky timing after this change this still
remains expoiltable.

I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.

Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Linus would you prefer to take this patch directly or I could put it in
a brach and send you a pull request.
 
 fs/exec.c             | 2 +-
 include/linux/sched.h | 4 ++--
 kernel/signal.c       | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

Comments

Linus Torvalds April 1, 2020, 8:55 p.m. UTC | #1
On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.

Ack, makes sense to me.

Just put it in your branch, this doesn't look urgent, considering that
it's something that has been around forever.

            Linus
Eric W. Biederman April 1, 2020, 9:03 p.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I have updated the update of exec_id on exec to use WRITE_ONCE
>> and the read of exec_id in do_notify_parent to use READ_ONCE
>> to make it clear that there is no locking between these two
>> locations.
>
> Ack, makes sense to me.
>
> Just put it in your branch, this doesn't look urgent, considering that
> it's something that has been around forever.

Done

Eric
Jann Horn April 1, 2020, 11:37 p.m. UTC | #3
+memory model folks because this seems like a pretty sharp corner

On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter.  With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent.  This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days.  Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
>
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions.  Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value.  So with very lucky timing after this change this still
> remains expoiltable.
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
>
> Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
> Fixes: 2.3.23pre2
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>
>  fs/exec.c             | 2 +-
>  include/linux/sched.h | 4 ++--
>  kernel/signal.c       | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>
>         /* An exec changes our domain. We are no longer part of the thread
>            group */
> -       current->self_exec_id++;
> +       WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

GCC will generate code for this without complaining, but I think it'll
probably generate a tearing store on 32-bit platforms:

$ cat volatile-8.c
typedef unsigned long long u64;
static volatile u64 n;
void blah(void) {
  n = n + 1;
}
$ gcc -O2 -m32 -c -o volatile-8.o volatile-8.c -Wall
$ objdump --disassemble=blah volatile-8.o
[...]
   b: 8b 81 00 00 00 00    mov    0x0(%ecx),%eax
  11: 8b 91 04 00 00 00    mov    0x4(%ecx),%edx
  17: 83 c0 01              add    $0x1,%eax
  1a: 83 d2 00              adc    $0x0,%edx
  1d: 89 81 00 00 00 00    mov    %eax,0x0(%ecx)
  23: 89 91 04 00 00 00    mov    %edx,0x4(%ecx)
[...]
$

You could maybe use atomic64_t to work around that? atomic64_read()
and atomic64_set() should be straightforward READ_ONCE() /
WRITE_ONCE() on 64-bit systems while compiling into something more
complicated on 32-bit.

>         flush_signal_handlers(current, 0);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
>         struct seccomp                  seccomp;
>
>         /* Thread group tracking: */
> -       u32                             parent_exec_id;
> -       u32                             self_exec_id;
> +       u64                             parent_exec_id;
> +       u64                             self_exec_id;
>
>         /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
>         spinlock_t                      alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>                  * This is only possible if parent == real_parent.
>                  * Check if it has changed security domain.
>                  */
> -               if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> +               if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
>                         sig = SIGCHLD;
>         }
>
> --
> 2.20.1
>
Linus Torvalds April 1, 2020, 11:51 p.m. UTC | #4
On Wed, Apr 1, 2020 at 4:37 PM Jann Horn <jannh@google.com> wrote:
>
> GCC will generate code for this without complaining, but I think it'll
> probably generate a tearing store on 32-bit platforms:

This is very much a "we don't care" case.

It's literally testing a sequence counter for equality. If you get
tearing in the high bits on the write (or the read), you'd still need
to have the low bits turn around 4G times to get a matching value.

So no. We're not doing atomics for the 32-bit case. That's insane.

               Linus
Linus Torvalds April 1, 2020, 11:55 p.m. UTC | #5
On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's literally testing a sequence counter for equality. If you get
> tearing in the high bits on the write (or the read), you'd still need
> to have the low bits turn around 4G times to get a matching value.

Put another way: first you'd have to work however many weeks to do 4
billion execve() calls, and then you need to hit basically a
single-instruction race to take advantage of it.

Good luck with that. If you have that kind of God-like capability,
whoever you're attacking stands no chance in the first place.

                  Linus
Jann Horn April 2, 2020, 1:35 a.m. UTC | #6
On Thu, Apr 2, 2020 at 1:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's literally testing a sequence counter for equality. If you get
> > tearing in the high bits on the write (or the read), you'd still need
> > to have the low bits turn around 4G times to get a matching value.
>
> Put another way: first you'd have to work however many weeks to do 4
> billion execve() calls, and then you need to hit basically a
> single-instruction race to take advantage of it.
>
> Good luck with that. If you have that kind of God-like capability,
> whoever you're attacking stands no chance in the first place.

I'm not really worried about someone being able to hit this in
practice, especially given that the only thing it lets you do is send
signals; I just think that the code is wrong in principle, and that we
should avoid having "technically wrong, but works in practice" code in
the kernel.

This kind of intentional race might also trip up testing tools like
the upcoming KCSAN instrumentation, unless it is specifically
annotated as an intentionally racing instruction. (For now, KCSAN is
64-bit only, so it won't actually be able to detect this though; and
the current KCSAN development branch actually incorrectly considers
WRITE_ONCE() to always be atomic; but still, in principle it's the
kind of issue KCSAN is supposed to detect, I think.)

But even without KCSAN, if we have tearing stores racing with loads, I
think that we ought to *at least* have a comment noting that we're
intentionally doing that so that people don't copy this kind of code
to a place where the high bits change more frequently and correctness
matters more.

Since the read is already protected by the tasklist_lock, an
alternative might be to let the execve path also take that lock to
protect the sequence number update, given that execve is not a
particularly hot path. Or we could hand-code the equality check and
increment operations to be properly race-free.
Linus Torvalds April 2, 2020, 2:05 a.m. UTC | #7
On Wed, Apr 1, 2020 at 6:36 PM Jann Horn <jannh@google.com> wrote:
>
> Since the read is already protected by the tasklist_lock, an
> alternative might be to let the execve path also take that lock to
> protect the sequence number update,

No.

tasklist_lock is aboue the hottest lock there is in all of the kernel.

We're not doing stupid things for theoretical issues.

Stop this crazy argument.

A comment - sure. 64-bit atomics or very expensive locks? Not a chance.

                   Linus
Jann Horn April 2, 2020, 4:46 a.m. UTC | #8
On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter.  With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent.  This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days.  Even my slower timing is in the uptime of a typical server.

FYI, if you actually optimize this, it's more like 12s to exec 1048576
times according to my test, which means ~14 hours for 2^32 executions
(on a single core). That's on an i7-4790 (a Haswell desktop processor
that was launched about six years ago, in 2014).

Here's my test code:

=============
$ grep 'model name' /proc/cpuinfo | head -n1
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
$ cat build.sh
#!/bin/sh
set -e
nasm -felf32 -o fast_execve.o fast_execve.asm
ld -m elf_i386 -o fast_execve fast_execve.o
gcc -o launch launch.c -Wall
gcc -o finish finish.c -Wall
$ cat fast_execve.asm
bits 32

section .text
global _start
_start:
; eax = argv[0]
; expected to be 8 hex digits, with 'a' meaning 0x0 and 'p' meaning 0xf
mov eax, [esp+4]

mov ebx, 0 ; loop counter
hex_digit_loop:
inc byte [eax+ebx]
cmp byte [eax+ebx], 'a'+16
jne next_exec
mov byte [eax+ebx], 'a'
inc ebx
cmp ebx, 5 ;;;;;;;;;;;;;;;;;; this is N, where iteration_count=pow(16,N)
jne hex_digit_loop


; reached pow(256,N) execs, get out

; first make the stack big again
mov eax, 75 ; setrlimit (32-bit ABI)
mov ebx, 3 ; RLIMIT_STACK
mov ecx, stacklim
int 0x80

; execute end helper
mov ebx, 4 ; dirfd = 4
jmp common_exec

next_exec:
mov ebx, 3 ; dirfd = 3

common_exec: ; execveat() with file descriptor passed in as ebx
mov ecx, nullval ; pathname = empty string
lea edx, [esp+4] ; argv
mov esi, 0 ; envp
mov edi, 0x1000 ; flags = AT_EMPTY_PATH
mov eax, 358 ; execveat (32-bit ABI)
int 0x80
int3

nullval:
dd 0
stacklim:
dd 0x02000000
dd 0xffffffff
$ cat launch.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <err.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/resource.h>
int main(void) {
  close(3);
  close(4);
  if (open("fast_execve", O_PATH) != 3)
    err(1, "open fast_execve");
  if (open("finish", O_PATH) != 4)
    err(1, "open finish");
  char *argv[] = { "aaaaaaaa", NULL };

  struct rlimit lim;
  if (getrlimit(RLIMIT_STACK, &lim))
    err(1, "getrlimit");
  lim.rlim_cur = 0x4000;
  if (setrlimit(RLIMIT_STACK, &lim))
    err(1, "setrlimit");

  syscall(__NR_execveat, 3, "", argv, NULL, AT_EMPTY_PATH);
}
$ cat finish.c
#include <stdlib.h>
int main(void) {
  exit(0);
}
$ ./build.sh
$ time ./launch

real 0m12,075s
user 0m0,905s
sys 0m11,026s
$
=============
Kees Cook April 2, 2020, 7:19 a.m. UTC | #9
On Wed, Apr 01, 2020 at 03:47:44PM -0500, Eric W. Biederman wrote:
> 
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter.  With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent.  This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
> 
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days.  Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
> 
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions.  Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value.  So with very lucky timing after this change this still
> remains expoiltable.
> 
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
> 
> Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
> Fixes: 2.3.23pre2
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Thanks for chasing this down. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
> 
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>  
>  fs/exec.c             | 2 +-
>  include/linux/sched.h | 4 ++--
>  kernel/signal.c       | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>  
>  	/* An exec changes our domain. We are no longer part of the thread
>  	   group */
> -	current->self_exec_id++;
> +	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
>  	flush_signal_handlers(current, 0);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
>  	struct seccomp			seccomp;
>  
>  	/* Thread group tracking: */
> -	u32				parent_exec_id;
> -	u32				self_exec_id;
> +	u64				parent_exec_id;
> +	u64				self_exec_id;
>  
>  	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
>  	spinlock_t			alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  		 * This is only possible if parent == real_parent.
>  		 * Check if it has changed security domain.
>  		 */
> -		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> +		if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
>  			sig = SIGCHLD;
>  	}
>  
> -- 
> 2.20.1
>
Bernd Edlinger April 2, 2020, 7:22 a.m. UTC | #10
On 4/1/20 10:47 PM, Eric W. Biederman wrote:
> 
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter.  With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent.  This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
> 
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days.  Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
> 
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions.  Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value.  So with very lucky timing after this change this still
> remains expoiltable.
> 
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
> 
> Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
> Fixes: 2.3.23pre2
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>


Thanks
Bernd.
> ---
> 
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>  
>  fs/exec.c             | 2 +-
>  include/linux/sched.h | 4 ++--
>  kernel/signal.c       | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>  
>  	/* An exec changes our domain. We are no longer part of the thread
>  	   group */
> -	current->self_exec_id++;
> +	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
>  	flush_signal_handlers(current, 0);
>  }
>  EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
>  	struct seccomp			seccomp;
>  
>  	/* Thread group tracking: */
> -	u32				parent_exec_id;
> -	u32				self_exec_id;
> +	u64				parent_exec_id;
> +	u64				self_exec_id;
>  
>  	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
>  	spinlock_t			alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
>  		 * This is only possible if parent == real_parent.
>  		 * Check if it has changed security domain.
>  		 */
> -		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> +		if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
>  			sig = SIGCHLD;
>  	}
>  
>
Eric W. Biederman April 2, 2020, 1:11 p.m. UTC | #11
Linus Torvalds <torvalds@linux-foundation.org> writes:

> tasklist_lock is aboue the hottest lock there is in all of the kernel.

Do you know code paths you see tasklist_lock being hot?

I am looking at some of the exec/signal/ptrace code paths because they
get subtle corner case wrong like a threaded exec deadlocking when
straced.

If the performance problems are in the same neighbourhood I might be
able to fix those problems while I am in the code.

Eric
Eric W. Biederman April 2, 2020, 2:14 p.m. UTC | #12
Jann Horn <jannh@google.com> writes:

> On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
>> to wrap the exec_id counter.  With care an attacker can cause exec_id
>> wrap and send arbitrary signals to a newly exec'd parent.  This
>> bypasses the signal sending checks if the parent changes their
>> credentials during exec.
>>
>> The severity of this problem can been seen that in my limited testing
>> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
>> Which means that it can take as little as 14 days to wrap a 32bit
>> exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
>> days.  Even my slower timing is in the uptime of a typical server.
>
> FYI, if you actually optimize this, it's more like 12s to exec 1048576
> times according to my test, which means ~14 hours for 2^32 executions
> (on a single core). That's on an i7-4790 (a Haswell desktop processor
> that was launched about six years ago, in 2014).

Half a day.  I am not at all surprised, but it is good to know it can
take so little time.

Eric
Linus Torvalds April 2, 2020, 6:06 p.m. UTC | #13
On Thu, Apr 2, 2020 at 6:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > tasklist_lock is aboue the hottest lock there is in all of the kernel.
>
> Do you know code paths you see tasklist_lock being hot?

It's generally not bad enough to show up on single-socket machines.

But the problem with tasklist_lock is that it's one of our remaining
completely global locks. So it scales like sh*t in some circumstances.

On single-socket machines, most of the truly nasty hot paths aren't a
huge problem, because they tend to be mostly readers. So you get the
cacheline bounce, but you don't (usually) get much busy looping. The
cacheline bounce is "almost free" on a single socket.

But because it's one of those completely global locks, on big
multi-socket machines people have reported it as a problem forever.
Even just readers can cause problems (because of the cacheline
bouncing even when you just do the reader increment), but you also end
up having more issues with writers scaling badly.

Don't get me wrong - you can get bad scaling on other locks too, even
when they aren't really global - we had that with just the reference
counter increment for the user signal accounting, after all. Neither
of the reference counts were actually global, but they were just
effectively single counters under that particular load (ie the count
was per-user, but the load ran as a single user).

The reason tasklist_lock probably doesn't come up very much is that
it's _always_ been expensive. It has also caused some fundamental
issues (I think it's the main reason we have that rule that
reader-writer locks are unfair to readers, because we have readers
from interrupt context too, but can't afford to make normal readers
disable interrupts).

A lot of the tasklist lock readers end up looping quite a bit inside
the lock (looping over threads etc), which is why it can then be a big
deal when the rare reader shows up.

We've improved a _lot_ of those loops. That has definitely helped for
the common cases. But we've never been able to really fix the lock
itself.

                 Linus
Adam Zabrocki April 3, 2020, 2:11 a.m. UTC | #14
On Thu, Apr 02, 2020 at 06:46:49AM +0200, Jann Horn wrote:
> On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> > to wrap the exec_id counter.  With care an attacker can cause exec_id
> > wrap and send arbitrary signals to a newly exec'd parent.  This
> > bypasses the signal sending checks if the parent changes their
> > credentials during exec.
> >
> > The severity of this problem can been seen that in my limited testing
> > of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> > Which means that it can take as little as 14 days to wrap a 32bit
> > exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> > days.  Even my slower timing is in the uptime of a typical server.
> 
> FYI, if you actually optimize this, it's more like 12s to exec 1048576
> times according to my test, which means ~14 hours for 2^32 executions
> (on a single core). That's on an i7-4790 (a Haswell desktop processor
> that was launched about six years ago, in 2014).
> 

Yep, there are a few ways of optimizing it and I believe I've pointed it out 
here:
https://www.openwall.com/lists/kernel-hardening/2020/03/31/11

Thanks for doing such tests :)

I've also modified your PoC to use 'sysenter' and 'syscall' instruction. Both 
cases gave me an extra 4% speed bump (including a test for 64-bits 
"fast_execve"). I've run it under Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz

As you've proven, it is possible to be done in a matter of hours.

Thanks,
Adam

> Here's my test code:
> 
> =============
> $ grep 'model name' /proc/cpuinfo | head -n1
> model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> $ cat build.sh
> #!/bin/sh
> set -e
> nasm -felf32 -o fast_execve.o fast_execve.asm
> ld -m elf_i386 -o fast_execve fast_execve.o
> gcc -o launch launch.c -Wall
> gcc -o finish finish.c -Wall
> $ cat fast_execve.asm
> bits 32
> 
> section .text
> global _start
> _start:
> ; eax = argv[0]
> ; expected to be 8 hex digits, with 'a' meaning 0x0 and 'p' meaning 0xf
> mov eax, [esp+4]
> 
> mov ebx, 0 ; loop counter
> hex_digit_loop:
> inc byte [eax+ebx]
> cmp byte [eax+ebx], 'a'+16
> jne next_exec
> mov byte [eax+ebx], 'a'
> inc ebx
> cmp ebx, 5 ;;;;;;;;;;;;;;;;;; this is N, where iteration_count=pow(16,N)
> jne hex_digit_loop
> 
> 
> ; reached pow(256,N) execs, get out
> 
> ; first make the stack big again
> mov eax, 75 ; setrlimit (32-bit ABI)
> mov ebx, 3 ; RLIMIT_STACK
> mov ecx, stacklim
> int 0x80
> 
> ; execute end helper
> mov ebx, 4 ; dirfd = 4
> jmp common_exec
> 
> next_exec:
> mov ebx, 3 ; dirfd = 3
> 
> common_exec: ; execveat() with file descriptor passed in as ebx
> mov ecx, nullval ; pathname = empty string
> lea edx, [esp+4] ; argv
> mov esi, 0 ; envp
> mov edi, 0x1000 ; flags = AT_EMPTY_PATH
> mov eax, 358 ; execveat (32-bit ABI)
> int 0x80
> int3
> 
> nullval:
> dd 0
> stacklim:
> dd 0x02000000
> dd 0xffffffff
> $ cat launch.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <err.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/resource.h>
> int main(void) {
>   close(3);
>   close(4);
>   if (open("fast_execve", O_PATH) != 3)
>     err(1, "open fast_execve");
>   if (open("finish", O_PATH) != 4)
>     err(1, "open finish");
>   char *argv[] = { "aaaaaaaa", NULL };
> 
>   struct rlimit lim;
>   if (getrlimit(RLIMIT_STACK, &lim))
>     err(1, "getrlimit");
>   lim.rlim_cur = 0x4000;
>   if (setrlimit(RLIMIT_STACK, &lim))
>     err(1, "setrlimit");
> 
>   syscall(__NR_execveat, 3, "", argv, NULL, AT_EMPTY_PATH);
> }
> $ cat finish.c
> #include <stdlib.h>
> int main(void) {
>   exit(0);
> }
> $ ./build.sh
> $ time ./launch
> 
> real 0m12,075s
> user 0m0,905s
> sys 0m11,026s
> $
> =============
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 0e46ec57fe0a..d55710a36056 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1413,7 +1413,7 @@  void setup_new_exec(struct linux_binprm * bprm)
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
-	current->self_exec_id++;
+	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
 	flush_signal_handlers(current, 0);
 }
 EXPORT_SYMBOL(setup_new_exec);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..0323e4f0982a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -939,8 +939,8 @@  struct task_struct {
 	struct seccomp			seccomp;
 
 	/* Thread group tracking: */
-	u32				parent_exec_id;
-	u32				self_exec_id;
+	u64				parent_exec_id;
+	u64				self_exec_id;
 
 	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
 	spinlock_t			alloc_lock;
diff --git a/kernel/signal.c b/kernel/signal.c
index 9ad8dea93dbb..5383b562df85 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1926,7 +1926,7 @@  bool do_notify_parent(struct task_struct *tsk, int sig)
 		 * This is only possible if parent == real_parent.
 		 * Check if it has changed security domain.
 		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+		if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
 			sig = SIGCHLD;
 	}