Message ID | 87zhbvlyq7.fsf_-_@x220.int.ebiederm.org (mailing list archive)
---|---
State | New, archived
Series | signal: Extend exec_id to 64bits
On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.

Ack, makes sense to me.

Just put it in your branch, this doesn't look urgent, considering that
it's something that has been around forever.

           Linus

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I have updated the update of exec_id on exec to use WRITE_ONCE
>> and the read of exec_id in do_notify_parent to use READ_ONCE
>> to make it clear that there is no locking between these two
>> locations.
>
> Ack, makes sense to me.
>
> Just put it in your branch, this doesn't look urgent, considering that
> it's something that has been around forever.

Done

Eric

+memory model folks because this seems like a pretty sharp corner

On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can be seen in that, in my limited
> testing of a 32bit exec_id, it can take as little as 19s to exec
> 65536 times. Which means that it can take as little as 14 days to
> wrap a 32bit exec_id. Adam Zabrocki has succeeded in wrapping the
> self_exec_id in 7 days. Even my slower timing is within the uptime
> of a typical server. Which means self_exec_id is simply a speed bump
> today, and if exec gets noticeably faster self_exec_id won't even be
> a speed bump.
>
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions. Which means that it is possible to hit a
> window where the read value of exec_id does not match the written
> value. So with very lucky timing, after this change this still
> remains exploitable.
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
>
> Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
> Fixes: 2.3.23pre2
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>
> Linus, would you prefer to take this patch directly, or I could put it
> in a branch and send you a pull request.
>
>  fs/exec.c             | 2 +-
>  include/linux/sched.h | 4 ++--
>  kernel/signal.c       | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>
>  	/* An exec changes our domain. We are no longer part of the thread
>  	   group */
> -	current->self_exec_id++;
> +	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

GCC will generate code for this without complaining, but I think it'll
probably generate a tearing store on 32-bit platforms:

$ cat volatile-8.c
typedef unsigned long long u64;
static volatile u64 n;
void blah(void) {
  n = n + 1;
}
$ gcc -O2 -m32 -c -o volatile-8.o volatile-8.c -Wall
$ objdump --disassemble=blah volatile-8.o
[...]
   b:	8b 81 00 00 00 00	mov    0x0(%ecx),%eax
  11:	8b 91 04 00 00 00	mov    0x4(%ecx),%edx
  17:	83 c0 01         	add    $0x1,%eax
  1a:	83 d2 00         	adc    $0x0,%edx
  1d:	89 81 00 00 00 00	mov    %eax,0x0(%ecx)
  23:	89 91 04 00 00 00	mov    %edx,0x4(%ecx)
[...]
$

You could maybe use atomic64_t to work around that? atomic64_read()
and atomic64_set() should be straightforward READ_ONCE() /
WRITE_ONCE() on 64-bit systems while compiling into something more
complicated on 32-bit.
> [...]

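[Editorial note: a minimal sketch of what Jann's atomic64_t suggestion
might look like; an assumption for illustration, not the approach that
was merged. atomic64_inc() and atomic64_read() are existing kernel
APIs; on 32-bit they fall back to the spinlock-based generic
implementation in lib/atomic64.c, which is the "something more
complicated" being weighed here.]

#include <linux/atomic.h>
#include <linux/sched.h>

/*
 * Hypothetical variant: assumes task_struct's self_exec_id were
 * changed to atomic64_t (it was not; the merged patch keeps u64).
 */
static inline void exec_id_bump(struct task_struct *tsk)
{
	/* Cannot tear, even on 32-bit architectures. */
	atomic64_inc(&tsk->self_exec_id);
}

static inline bool exec_id_matches(struct task_struct *tsk)
{
	/* READ_ONCE() on 64-bit; a locked, untorn read on 32-bit. */
	return tsk->parent_exec_id ==
	       atomic64_read(&tsk->parent->self_exec_id);
}
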
On Wed, Apr 1, 2020 at 4:37 PM Jann Horn <jannh@google.com> wrote:
>
> GCC will generate code for this without complaining, but I think it'll
> probably generate a tearing store on 32-bit platforms:

This is very much a "we don't care" case.

It's literally testing a sequence counter for equality. If you get
tearing in the high bits on the write (or the read), you'd still need
to have the low bits turn around 4G times to get a matching value.

So no. We're not doing atomics for the 32-bit case. That's insane.

           Linus

On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's literally testing a sequence counter for equality. If you get
> tearing in the high bits on the write (or the read), you'd still need
> to have the low bits turn around 4G times to get a matching value.

Put another way: first you'd have to work however many weeks to do 4
billion execve() calls, and then you need to hit basically a
single-instruction race to take advantage of it.

Good luck with that. If you have that kind of God-like capability,
whoever you're attacking stands no chance in the first place.

           Linus

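[Editorial note: a small userspace demonstration of Linus's point,
illustrative only, with invented variable names. Across a carry
boundary there are exactly two torn values a 32-bit reader can
observe, and a stale parent_exec_id can only equal either one if the
low 32 bits have wrapped all the way around again.]

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t before = 0xffffffffULL;  /* counter before the exec */
	uint64_t after = before + 1;      /* 0x100000000: both halves change */

	/* A torn read mixes one old half with one new half. */
	uint64_t torn_a = (before & 0xffffffff00000000ULL) | (uint32_t)after;
	uint64_t torn_b = (after & 0xffffffff00000000ULL) | (uint32_t)before;

	/* Prints 0 and 0x1ffffffff: neither was a real counter value
	 * anywhere near the race, so matching one requires the low
	 * word to turn around another ~4G times first. */
	printf("torn values: %#" PRIx64 " and %#" PRIx64 "\n", torn_a, torn_b);
	return 0;
}
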
On Thu, Apr 2, 2020 at 1:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It's literally testing a sequence counter for equality. If you get
> > tearing in the high bits on the write (or the read), you'd still need
> > to have the low bits turn around 4G times to get a matching value.
>
> Put another way: first you'd have to work however many weeks to do 4
> billion execve() calls, and then you need to hit basically a
> single-instruction race to take advantage of it.
>
> Good luck with that. If you have that kind of God-like capability,
> whoever you're attacking stands no chance in the first place.

I'm not really worried about someone being able to hit this in
practice, especially given that the only thing it lets you do is send
signals; I just think that the code is wrong in principle, and that we
should avoid having "technically wrong, but works in practice" code in
the kernel.

This kind of intentional race might also trip up testing tools like
the upcoming KCSAN instrumentation, unless it is specifically
annotated as an intentionally racing instruction. (For now, KCSAN is
64-bit only, so it won't actually be able to detect this though; and
the current KCSAN development branch actually incorrectly considers
WRITE_ONCE() to always be atomic; but still, in principle it's the
kind of issue KCSAN is supposed to detect, I think.)

But even without KCSAN, if we have tearing stores racing with loads, I
think that we ought to *at least* have a comment noting that we're
intentionally doing that, so that people don't copy this kind of code
to a place where the high bits change more frequently and correctness
matters more.

Since the read is already protected by the tasklist_lock, an
alternative might be to let the execve path also take that lock to
protect the sequence number update, given that execve is not a
particularly hot path. Or we could hand-code the equality check and
increment operations to be properly race-free.

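[Editorial note: a sketch of the "hand-coded, properly race-free"
option Jann mentions, using the kernel's existing seqcount primitives.
The per-task seqcount_t is hypothetical; none of this is what was
merged.]

#include <linux/seqlock.h>
#include <linux/sched.h>

/* Hypothetical: a seqcount guarding self_exec_id (per task in reality). */
static seqcount_t exec_id_seq;

/* Writer (exec path).  Only the task itself writes its own
 * self_exec_id, so the write side needs no further serialization. */
static void exec_id_increment(struct task_struct *tsk)
{
	write_seqcount_begin(&exec_id_seq);
	tsk->self_exec_id++;
	write_seqcount_end(&exec_id_seq);
}

/* Reader (do_notify_parent path): retry until an untorn value is seen. */
static u64 exec_id_read(struct task_struct *tsk)
{
	unsigned int seq;
	u64 id;

	do {
		seq = read_seqcount_begin(&exec_id_seq);
		id = tsk->self_exec_id;
	} while (read_seqcount_retry(&exec_id_seq, seq));

	return id;
}
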
On Wed, Apr 1, 2020 at 6:36 PM Jann Horn <jannh@google.com> wrote:
>
> Since the read is already protected by the tasklist_lock, an
> alternative might be to let the execve path also take that lock to
> protect the sequence number update,

No. tasklist_lock is about the hottest lock there is in all of the
kernel. We're not doing stupid things for theoretical issues.

Stop this crazy argument. A comment - sure. 64-bit atomics or very
expensive locks? Not a chance.

           Linus

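[Editorial note: a sketch of the kind of comment Linus is agreeing to,
with hypothetical wording, placed over the store from the patch.]

	/*
	 * An exec changes our domain.  We are no longer part of the
	 * thread group.
	 *
	 * The 64-bit store below may tear on 32-bit architectures,
	 * racing with the lockless READ_ONCE() in do_notify_parent().
	 * A torn value can only match a recorded parent_exec_id after
	 * the low 32 bits have wrapped again, so the race is tolerated
	 * deliberately.
	 */
	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
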
On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can be seen in that, in my limited
> testing of a 32bit exec_id, it can take as little as 19s to exec
> 65536 times. Which means that it can take as little as 14 days to
> wrap a 32bit exec_id. Adam Zabrocki has succeeded in wrapping the
> self_exec_id in 7 days. Even my slower timing is within the uptime
> of a typical server.

FYI, if you actually optimize this, it's more like 12s to exec 1048576
times according to my test, which means ~14 hours for 2^32 executions
(on a single core). That's on an i7-4790 (a Haswell desktop processor
that was launched about six years ago, in 2014).

Here's my test code:

=============
$ grep 'model name' /proc/cpuinfo | head -n1
model name	: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
$ cat build.sh
#!/bin/sh
set -e
nasm -felf32 -o fast_execve.o fast_execve.asm
ld -m elf_i386 -o fast_execve fast_execve.o
gcc -o launch launch.c -Wall
gcc -o finish finish.c -Wall
$ cat fast_execve.asm
bits 32

section .text
global _start
_start:
  ; eax = argv[0]
  ; expected to be 8 hex digits, with 'a' meaning 0x0 and 'p' meaning 0xf
  mov eax, [esp+4]

  mov ebx, 0 ; loop counter
hex_digit_loop:
  inc byte [eax+ebx]
  cmp byte [eax+ebx], 'a'+16
  jne next_exec
  mov byte [eax+ebx], 'a'
  inc ebx
  cmp ebx, 5 ;;;;;;;;;;;;;;;;;; this is N, where iteration_count=pow(16,N)
  jne hex_digit_loop

  ; reached pow(256,N) execs, get out

  ; first make the stack big again
  mov eax, 75 ; setrlimit (32-bit ABI)
  mov ebx, 3 ; RLIMIT_STACK
  mov ecx, stacklim
  int 0x80

  ; execute end helper
  mov ebx, 4 ; dirfd = 4
  jmp common_exec

next_exec:
  mov ebx, 3 ; dirfd = 3

common_exec: ; execveat() with file descriptor passed in as ebx
  mov ecx, nullval ; pathname = empty string
  lea edx, [esp+4] ; argv
  mov esi, 0 ; envp
  mov edi, 0x1000 ; flags = AT_EMPTY_PATH
  mov eax, 358 ; execveat (32-bit ABI)
  int 0x80
  int3

nullval:
  dd 0
stacklim:
  dd 0x02000000
  dd 0xffffffff
$ cat launch.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <err.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/resource.h>
int main(void) {
  close(3);
  close(4);
  if (open("fast_execve", O_PATH) != 3)
    err(1, "open fast_execve");
  if (open("finish", O_PATH) != 4)
    err(1, "open finish");
  char *argv[] = { "aaaaaaaa", NULL };

  struct rlimit lim;
  if (getrlimit(RLIMIT_STACK, &lim))
    err(1, "getrlimit");
  lim.rlim_cur = 0x4000;
  if (setrlimit(RLIMIT_STACK, &lim))
    err(1, "setrlimit");

  syscall(__NR_execveat, 3, "", argv, NULL, AT_EMPTY_PATH);
}
$ cat finish.c
#include <stdlib.h>
int main(void) {
  exit(0);
}
$ ./build.sh
$ time ./launch

real	0m12,075s
user	0m0,905s
sys	0m11,026s
$
=============

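[Editorial note, checking the quoted figures: 12 s per 2^20 execs is
about 11.4 us per exec, so 2^32 execs take 2^12 x 12 s = 49,152 s,
roughly 13.7 hours, matching the "~14 hours" above. Eric's slower
figure of 19 s per 2^16 execs gives 2^16 x 19 s = 1,245,184 s, about
14.4 days.]
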
On Wed, Apr 01, 2020 at 03:47:44PM -0500, Eric W. Biederman wrote:
>
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
> [...]
> Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
> Fixes: 2.3.23pre2
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Thanks for chasing this down. :)

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> [...]

On 4/1/20 10:47 PM, Eric W. Biederman wrote:
>
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
> [...]
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Bernd Edlinger <bernd.edlinger@hotmail.de>

Thanks
Bernd.

> [...]

Linus Torvalds <torvalds@linux-foundation.org> writes:
> tasklist_lock is about the hottest lock there is in all of the kernel.
Do you know in which code paths you see tasklist_lock being hot?
I am looking at some of the exec/signal/ptrace code paths because they
get subtle corner cases wrong, like a threaded exec deadlocking when
straced.
If the performance problems are in the same neighbourhood I might be
able to fix those problems while I am in the code.
Eric
Jann Horn <jannh@google.com> writes:

> On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
>> to wrap the exec_id counter. With care an attacker can cause exec_id
>> wrap and send arbitrary signals to a newly exec'd parent. This
>> bypasses the signal sending checks if the parent changes their
>> credentials during exec.
>>
>> The severity of this problem can be seen in that, in my limited
>> testing of a 32bit exec_id, it can take as little as 19s to exec
>> 65536 times. Which means that it can take as little as 14 days to
>> wrap a 32bit exec_id. Adam Zabrocki has succeeded in wrapping the
>> self_exec_id in 7 days. Even my slower timing is within the uptime
>> of a typical server.
>
> FYI, if you actually optimize this, it's more like 12s to exec 1048576
> times according to my test, which means ~14 hours for 2^32 executions
> (on a single core). That's on an i7-4790 (a Haswell desktop processor
> that was launched about six years ago, in 2014).

Half a day.  I am not at all surprised, but it is good to know it can
take so little time.

Eric

On Thu, Apr 2, 2020 at 6:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > tasklist_lock is about the hottest lock there is in all of the kernel.
>
> Do you know in which code paths you see tasklist_lock being hot?

It's generally not bad enough to show up on single-socket machines.

But the problem with tasklist_lock is that it's one of our remaining
completely global locks. So it scales like sh*t in some circumstances.

On single-socket machines, most of the truly nasty hot paths aren't a
huge problem, because they tend to be mostly readers. So you get the
cacheline bounce, but you don't (usually) get much busy looping. The
cacheline bounce is "almost free" on a single socket.

But because it's one of those completely global locks, on big
multi-socket machines people have reported it as a problem forever.
Even just readers can cause problems (because of the cacheline
bouncing even when you just do the reader increment), but you also
end up having more issues with writers scaling badly.

Don't get me wrong - you can get bad scaling on other locks too, even
when they aren't really global - we had that with just the reference
counter increment for the user signal accounting, after all. Neither
of the reference counts were actually global, but they were just
effectively single counters under that particular load (ie the count
was per-user, but the load ran as a single user).

The reason tasklist_lock probably doesn't come up very much is that
it's _always_ been expensive. It has also caused some fundamental
issues (I think it's the main reason we have that rule that
reader-writer locks are unfair to readers, because we have readers
from interrupt context too, but can't afford to make normal readers
disable interrupts).

A lot of the tasklist_lock readers end up looping quite a bit inside
the lock (looping over threads etc), which is why it can then be a
big deal when the rare writer shows up. We've improved a _lot_ of
those loops. That has definitely helped for the common cases. But
we've never been able to really fix the lock itself.

           Linus

On Thu, Apr 02, 2020 at 06:46:49AM +0200, Jann Horn wrote:
> On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> > to wrap the exec_id counter. With care an attacker can cause exec_id
> > wrap and send arbitrary signals to a newly exec'd parent. This
> > bypasses the signal sending checks if the parent changes their
> > credentials during exec.
> > [...]
>
> FYI, if you actually optimize this, it's more like 12s to exec 1048576
> times according to my test, which means ~14 hours for 2^32 executions
> (on a single core). That's on an i7-4790 (a Haswell desktop processor
> that was launched about six years ago, in 2014).

Yep, there are a few ways of optimizing it, and I believe I've pointed
it out here:
https://www.openwall.com/lists/kernel-hardening/2020/03/31/11

Thanks for doing such tests :)

I've also modified your PoC to use the 'sysenter' and 'syscall'
instructions. Both cases gave me an extra 4% speed bump (including a
test for a 64-bit "fast_execve"). I've run it under an Intel(R)
Xeon(R) E-2176G CPU @ 3.70GHz.

As you've proven, it is possible to be done in a matter of hours.

Thanks,
Adam

> Here's my test code:
> [...]

diff --git a/fs/exec.c b/fs/exec.c
index 0e46ec57fe0a..d55710a36056 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
 
 	/* An exec changes our domain. We are no longer part of the thread
 	   group */
-	current->self_exec_id++;
+	WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
 	flush_signal_handlers(current, 0);
 }
 EXPORT_SYMBOL(setup_new_exec);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..0323e4f0982a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -939,8 +939,8 @@ struct task_struct {
 	struct seccomp seccomp;
 
 	/* Thread group tracking: */
-	u32 parent_exec_id;
-	u32 self_exec_id;
+	u64 parent_exec_id;
+	u64 self_exec_id;
 
 	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
 	spinlock_t alloc_lock;
diff --git a/kernel/signal.c b/kernel/signal.c
index 9ad8dea93dbb..5383b562df85 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 		 * This is only possible if parent == real_parent.
 		 * Check if it has changed security domain.
 		 */
-		if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+		if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
 			sig = SIGCHLD;
 	}
 
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.

The severity of this problem can be seen in that, in my limited
testing of a 32bit exec_id, it can take as little as 19s to exec
65536 times. Which means that it can take as little as 14 days to
wrap a 32bit exec_id. Adam Zabrocki has succeeded in wrapping the
self_exec_id in 7 days. Even my slower timing is within the uptime of
a typical server. Which means self_exec_id is simply a speed bump
today, and if exec gets noticeably faster self_exec_id won't even be
a speed bump.

Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions. Which means that it is possible to hit a
window where the read value of exec_id does not match the written
value. So with very lucky timing, after this change this still
remains exploitable.

I have updated the update of exec_id on exec to use WRITE_ONCE and
the read of exec_id in do_notify_parent to use READ_ONCE to make it
clear that there is no locking between these two locations.

Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Linus, would you prefer to take this patch directly, or I could put it
in a branch and send you a pull request.

 fs/exec.c             | 2 +-
 include/linux/sched.h | 4 ++--
 kernel/signal.c       | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)