Message ID | 20201117211836.54acaef2@oasis.local.home (mailing list archive)
---|---
State | Superseded
Series | [v2] tracepoint: Do not fail unregistering a probe due to memory allocation

Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch
On Tue, Nov 17, 2020 at 6:18 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
>
> The list of tracepoint callbacks is managed by an array that is protected
> by RCU. To update this array, a new array is allocated, the updates are
> copied over to the new array, and then the list of functions for the
> tracepoint is switched over to the new array. After a completion of an RCU
> grace period, the old array is freed.
>
> This process happens for both adding a callback as well as removing one.
> But on removing a callback, if the new array fails to be allocated, the
> callback is not removed, and may be used after it is freed by the clients
> of the tracepoint.
>
> There's really no reason to fail if the allocation for a new array fails
> when removing a function. Instead, the function can simply be replaced by a
> stub that will be ignored in the callback loop, and it will be cleaned up
> on the next modification of the array.
>
> Link: https://lore.kernel.org/r/20201115055256.65625-1-mmullins@mmlx.us
> Link: https://lkml.kernel.org/r/20201116175107.02db396d@gandalf.local.home
>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Song Liu <songliubraving@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Andrii Nakryiko <andriin@fb.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: KP Singh <kpsingh@chromium.org>
> Cc: netdev <netdev@vger.kernel.org>
> Cc: bpf <bpf@vger.kernel.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: stable@vger.kernel.org
> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints")
> Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com
> Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com
> Reported-by: Matt Mullins <mmullins@mmlx.us>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> ---
> Changes since v1:
>   Use 1L value for stub function, and ignore calling it.
>
>  include/linux/tracepoint.h |  9 ++++-
>  kernel/tracepoint.c        | 80 +++++++++++++++++++++++++++++---------
>  2 files changed, 69 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 0f21617f1a66..2e06e05b9d2a 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -33,6 +33,8 @@ struct trace_eval_map {
>
>  #define TRACEPOINT_DEFAULT_PRIO 10
>
> +#define TRACEPOINT_STUB ((void *)0x1L)
> +
>  extern struct srcu_struct tracepoint_srcu;
>
>  extern int
> @@ -310,7 +312,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  		do {						\
>  			it_func = (it_func_ptr)->func;		\
>  			__data = (it_func_ptr)->data;		\
> -			((void(*)(void *, proto))(it_func))(__data, args); \
> +			/*					\
> +			 * Removed functions that couldn't be allocated \
> +			 * are replaced with TRACEPOINT_STUB.	\
> +			 */					\
> +			if (likely(it_func != TRACEPOINT_STUB))	\
> +				((void(*)(void *, proto))(it_func))(__data, args); \

I think you're overreacting to the problem.
Adding run-time check to extremely unlikely problem seems wasteful.
99.9% of the time allocate_probes() will do kmalloc from slab of small
objects.
If that slab is out of memory it means it cannot allocate a single page.
In such case so many things will be failing to alloc that system
is unlikely operational. oom should have triggered long ago.
Imo Matt's approach to add __GFP_NOFAIL to allocate_probes()
when it's called from func_remove() is much better.
The error was reported by syzbot that was using
memory fault injections. ENOMEM in allocate_probes() was
never seen in real life and highly unlikely will ever be seen.
On Tue, 17 Nov 2020 20:54:24 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> >  extern int
> > @@ -310,7 +312,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> >  		do {						\
> >  			it_func = (it_func_ptr)->func;		\
> >  			__data = (it_func_ptr)->data;		\
> > -			((void(*)(void *, proto))(it_func))(__data, args); \
> > +			/*					\
> > +			 * Removed functions that couldn't be allocated \
> > +			 * are replaced with TRACEPOINT_STUB.	\
> > +			 */					\
> > +			if (likely(it_func != TRACEPOINT_STUB))	\
> > +				((void(*)(void *, proto))(it_func))(__data, args); \
>
> I think you're overreacting to the problem.

I will disagree with that.

> Adding run-time check to extremely unlikely problem seems wasteful.

Show me a real benchmark that you can notice a problem here. I bet that
check is even within the noise of calling an indirect function. Especially
on a machine with retpolines.

> 99.9% of the time allocate_probes() will do kmalloc from slab of small
> objects.
> If that slab is out of memory it means it cannot allocate a single page.
> In such case so many things will be failing to alloc that system
> is unlikely operational. oom should have triggered long ago.
> Imo Matt's approach to add __GFP_NOFAIL to allocate_probes()

Looking at the GFP_NOFAIL comment:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.
 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.

I realized I made a mistake in my patch for using it, as my patch is a
failure policy. It looks like something we want to avoid in general.

Thanks, I'll send a v3 that removes it.

> when it's called from func_remove() is much better.
> The error was reported by syzbot that was using
> memory fault injections. ENOMEM in allocate_probes() was
> never seen in real life and highly unlikely will ever be seen.

And the biggest thing you are missing here, is that if you are running on a
machine that has static calls, this code is never hit unless you have more
than one callback on a single tracepoint. That's because when there's only
one callback, it gets called directly, and this loop is not involved.

-- Steve
On 18/11/2020 23:46, Steven Rostedt wrote:
> On Tue, 17 Nov 2020 20:54:24 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>>>  extern int
>>> @@ -310,7 +312,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>>>  		do {						\
>>>  			it_func = (it_func_ptr)->func;		\
>>>  			__data = (it_func_ptr)->data;		\
>>> -			((void(*)(void *, proto))(it_func))(__data, args); \
>>> +			/*					\
>>> +			 * Removed functions that couldn't be allocated \
>>> +			 * are replaced with TRACEPOINT_STUB.	\
>>> +			 */					\
>>> +			if (likely(it_func != TRACEPOINT_STUB))	\
>>> +				((void(*)(void *, proto))(it_func))(__data, args); \
>>
>> I think you're overreacting to the problem.
>
> I will disagree with that.
>
>> Adding run-time check to extremely unlikely problem seems wasteful.
>
> Show me a real benchmark that you can notice a problem here. I bet that
> check is even within the noise of calling an indirect function. Especially
> on a machine with retpolines.
>
>> 99.9% of the time allocate_probes() will do kmalloc from slab of small
>> objects.
>> If that slab is out of memory it means it cannot allocate a single page.
>> In such case so many things will be failing to alloc that system
>> is unlikely operational. oom should have triggered long ago.
>> Imo Matt's approach to add __GFP_NOFAIL to allocate_probes()
>
> Looking at the GFP_NOFAIL comment:
>
>  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. The allocation could block
>  * indefinitely but will never return with failure. Testing for
>  * failure is pointless.
>  * New users should be evaluated carefully (and the flag should be
>  * used only when there is no reasonable failure policy) but it is
>  * definitely preferable to use the flag rather than opencode endless
>  * loop around allocator.
>
> I realized I made a mistake in my patch for using it, as my patch is a
> failure policy. It looks like something we want to avoid in general.
>
> Thanks, I'll send a v3 that removes it.
>
>> when it's called from func_remove() is much better.
>> The error was reported by syzbot that was using
>> memory fault injections. ENOMEM in allocate_probes() was
>> never seen in real life and highly unlikely will ever be seen.
>
> And the biggest thing you are missing here, is that if you are running on a
> machine that has static calls, this code is never hit unless you have more
> than one callback on a single tracepoint. That's because when there's only
> one callback, it gets called directly, and this loop is not involved.

I am running syzkaller and the kernel keeps crashing in
__traceiter_##_name. This patch makes these crashes happen lot less
often (and so did the v1) but the kernel still crashes (examples below
but the common thing is that they crash in tracepoints). Disasm points
to __DO_TRACE_CALL(name) and this fixes it:

========================
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -313,6 +313,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 								\
 		it_func_ptr =					\
 			rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
+		if (it_func_ptr)				\
 		do {						\
 			it_func = (it_func_ptr)->func;		\
========================

I am running on a powerpc box which does not have CONFIG_HAVE_STATIC_CALL.

I wonder if it is still the same problem which mentioned v3 might fix or
it is something different?

Thanks,

[  285.072538] Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
[  285.073657] BUG: Kernel NULL pointer dereference on read at 0x00000000
[  285.075129] Faulting instruction address: 0xc0000000002edf48
cpu 0xd: Vector: 300 (Data Access) at [c0000000115db530]
    pc: c0000000002edf48: lock_acquire+0x2e8/0x5d0
    lr: c0000000006ee450: step_into+0x940/0xc20
    sp: c0000000115db7d0
   msr: 8000000000009033
   dar: 0
 dsisr: 40000000
  current = 0xc0000000115af280
  paca    = 0xc00000005ff9fe00   irqmask: 0x03   irq_happened: 0x01
    pid   = 182, comm = systemd-journal
Linux version 5.11.0-rc5-le_syzkaller_a+fstn1 (aik@fstn1-p1) (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #65 SMP Wed Jan 27 16:46:46 AEDT 2021
enter ? for help
[c0000000115db8c0] c0000000006ee450 step_into+0x940/0xc20
[c0000000115db950] c0000000006efddc walk_component+0xbc/0x340
[c0000000115db9d0] c0000000006f0418 link_path_walk.part.29+0x3b8/0x5b0
[c0000000115dbaa0] c0000000006f0b1c path_openat+0x11c/0x1190
[c0000000115dbb40] c0000000006f4334 do_filp_open+0xb4/0x180
[c0000000115dbc80] c0000000006c83cc do_sys_openat2+0x48c/0x610
[c0000000115dbd20] c0000000006caf9c do_sys_open+0xcc/0x140
[c0000000115dbdb0] c00000000004ba48 system_call_exception+0x178/0x2b0
[c0000000115dbe10] c00000000000e060 system_call_common+0xf0/0x27c
--- Exception: c00 (System Call) at 00007fff7fb3e758
SP (7ffff28e2900) is in userspace

[   92.747130] FAT-fs (loop7): bogus number of reserved sectors
[   92.747193] Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
[   92.748393] FAT-fs (loop7): Can't find a valid FAT filesystem
[   92.750579] BUG: Kernel NULL pointer dereference on read at 0x00000000
[   92.751855] Faulting instruction address: 0xc0000000002ed928
cpu 0xd: Vector: 300 (Data Access) at [c00000001138f5c0]
    pc: c0000000002ed928: lock_release+0x138/0x470
    lr: c0000000002e0084: up_write+0x34/0x1e0
    sp: c00000001138f860
   msr: 8000000000009033
   dar: 0
 dsisr: 40000000
  current = 0xc00000004fe7b480
  paca    = 0xc00000005ff9fe00   irqmask: 0x03   irq_happened: 0x01
    pid   = 10670, comm = syz-executor.3
Linux version 5.11.0-rc5-le_syzkaller_a+fstn1 (aik@fstn1-p1) (gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #65 SMP Wed Jan 27 16:46:46 AEDT 2021
enter ? for help
[c00000001138f910] c0000000002e0084 up_write+0x34/0x1e0
[c00000001138f980] c0000000006151fc anon_vma_clone+0x1ec/0x370
[c00000001138f9f0] c00000000060180c __split_vma+0x11c/0x340
[c00000001138fa40] c000000000601cb0 __do_munmap+0x1c0/0x920
[c00000001138fad0] c000000000604d20 mmap_region+0x3b0/0xae0
[c00000001138fbd0] c0000000006059d4 do_mmap+0x584/0x830
[c00000001138fc60] c0000000005b0f90 vm_mmap_pgoff+0x170/0x260
[c00000001138fcf0] c000000000600818 ksys_mmap_pgoff+0x198/0x3a0
[c00000001138fd60] c0000000000155ec sys_mmap+0xcc/0x150
[c00000001138fdb0] c00000000004ba48 system_call_exception+0x178/0x2b0
[c00000001138fe10] c00000000000e060 system_call_common+0xf0/0x27c
--- Exception: c00 (System Call) at 0000000010058ad0
SP (7fffae45e1c0) is in userspace
On Wed, 27 Jan 2021 18:08:34 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> I am running syzkaller and the kernel keeps crashing in
> __traceiter_##_name. This patch makes these crashes happen lot less

I have another solution to the above issue. But I'm now concerned with
what you write below.

> often (and so did the v1) but the kernel still crashes (examples below
> but the common thing is that they crash in tracepoints). Disasm points
> to __DO_TRACE_CALL(name) and this fixes it:
>
> ========================
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -313,6 +313,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
>  								\
>  		it_func_ptr =					\
>  			rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
> +		if (it_func_ptr)				\

Looking at v2 of the patch, I found a bug that could make this happen.

I'm looking at doing something else that doesn't affect the fast path nor
does it bloat the kernel more than necessary.

I'll see if I can get that patch out today.

Thanks for the report.

-- Steve
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0f21617f1a66..2e06e05b9d2a 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -33,6 +33,8 @@ struct trace_eval_map {
 
 #define TRACEPOINT_DEFAULT_PRIO	10
 
+#define TRACEPOINT_STUB		((void *)0x1L)
+
 extern struct srcu_struct tracepoint_srcu;
 
 extern int
@@ -310,7 +312,12 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		do {						\
 			it_func = (it_func_ptr)->func;		\
 			__data = (it_func_ptr)->data;		\
-			((void(*)(void *, proto))(it_func))(__data, args); \
+			/*					\
+			 * Removed functions that couldn't be allocated \
+			 * are replaced with TRACEPOINT_STUB.	\
+			 */					\
+			if (likely(it_func != TRACEPOINT_STUB))	\
+				((void(*)(void *, proto))(it_func))(__data, args); \
 		} while ((++it_func_ptr)->func);		\
 		return 0;					\
 	}							\
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 3f659f855074..20ce78b3f578 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -53,10 +53,10 @@ struct tp_probes {
 	struct tracepoint_func probes[];
 };
 
-static inline void *allocate_probes(int count)
+static inline void *allocate_probes(int count, gfp_t extra_flags)
 {
 	struct tp_probes *p = kmalloc(struct_size(p, probes, count),
-				      GFP_KERNEL);
+				      GFP_KERNEL | extra_flags);
 	return p == NULL ? NULL : p->probes;
 }
 
@@ -131,6 +131,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 {
 	struct tracepoint_func *old, *new;
 	int nr_probes = 0;
+	int stub_funcs = 0;
 	int pos = -1;
 
 	if (WARN_ON(!tp_func->func))
@@ -147,14 +148,34 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
 			if (old[nr_probes].func == tp_func->func &&
 			    old[nr_probes].data == tp_func->data)
 				return ERR_PTR(-EEXIST);
+			if (old[nr_probes].func == TRACEPOINT_STUB)
+				stub_funcs++;
 		}
 	}
-	/* + 2 : one for new probe, one for NULL func */
-	new = allocate_probes(nr_probes + 2);
+	/* + 2 : one for new probe, one for NULL func - stub functions */
+	new = allocate_probes(nr_probes + 2 - stub_funcs, 0);
 	if (new == NULL)
 		return ERR_PTR(-ENOMEM);
 	if (old) {
-		if (pos < 0) {
+		if (stub_funcs) {
+			/* Need to copy one at a time to remove stubs */
+			int probes = 0;
+
+			pos = -1;
+			for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
+				if (old[nr_probes].func == TRACEPOINT_STUB)
+					continue;
+				if (pos < 0 && old[nr_probes].prio < prio)
+					pos = probes++;
+				new[probes++] = old[nr_probes];
+			}
+			nr_probes = probes;
+			if (pos < 0)
+				pos = probes;
+			else
+				nr_probes--; /* Account for insertion */
+
+		} else if (pos < 0) {
 			pos = nr_probes;
 			memcpy(new, old, nr_probes * sizeof(struct tracepoint_func));
 		} else {
@@ -188,8 +209,9 @@ static void *func_remove(struct tracepoint_func **funcs,
 	/* (N -> M), (N > 1, M >= 0) probes */
 	if (tp_func->func) {
 		for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
-			if (old[nr_probes].func == tp_func->func &&
-			    old[nr_probes].data == tp_func->data)
+			if ((old[nr_probes].func == tp_func->func &&
+			     old[nr_probes].data == tp_func->data) ||
+			    old[nr_probes].func == TRACEPOINT_STUB)
 				nr_del++;
 		}
 	}
@@ -207,15 +229,33 @@ static void *func_remove(struct tracepoint_func **funcs,
 		int j = 0;
 		/* N -> M, (N > 1, M > 0) */
 		/* + 1 for NULL */
-		new = allocate_probes(nr_probes - nr_del + 1);
-		if (new == NULL)
-			return ERR_PTR(-ENOMEM);
-		for (i = 0; old[i].func; i++)
-			if (old[i].func != tp_func->func
-			    || old[i].data != tp_func->data)
-				new[j++] = old[i];
-		new[nr_probes - nr_del].func = NULL;
-		*funcs = new;
+		new = allocate_probes(nr_probes - nr_del + 1, __GFP_NOFAIL);
+		if (new) {
+			for (i = 0; old[i].func; i++)
+				if ((old[i].func != tp_func->func
+				     || old[i].data != tp_func->data)
+				    && old[i].func != TRACEPOINT_STUB)
+					new[j++] = old[i];
+			new[nr_probes - nr_del].func = NULL;
+			*funcs = new;
+		} else {
+			/*
+			 * Failed to allocate, replace the old function
+			 * with TRACEPOINT_STUB.
+			 */
+			for (i = 0; old[i].func; i++)
+				if (old[i].func == tp_func->func &&
+				    old[i].data == tp_func->data) {
+					old[i].func = TRACEPOINT_STUB;
+					/* Set the prio to the next event. */
+					if (old[i + 1].func)
+						old[i].prio =
+							old[i + 1].prio;
+					else
+						old[i].prio = -1;
+				}
+			*funcs = old;
+		}
 	}
 	debug_print_probes(*funcs);
 	return old;
@@ -295,10 +335,12 @@ static int tracepoint_remove_func(struct tracepoint *tp,
 	tp_funcs = rcu_dereference_protected(tp->funcs,
 			lockdep_is_held(&tracepoints_mutex));
 	old = func_remove(&tp_funcs, func);
-	if (IS_ERR(old)) {
-		WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
+	if (WARN_ON_ONCE(IS_ERR(old)))
 		return PTR_ERR(old);
-	}
+
+	if (tp_funcs == old)
+		/* Failed allocating new tp_funcs, replaced func with stub */
+		return 0;
 
 	if (!tp_funcs) {
 		/* Removed last function */