[bpf-next,00/13] uprobes: Add support to optimize usdt probes on x86_64

Message ID 20241211133403.208920-1-jolsa@kernel.org (mailing list archive)

Message

Jiri Olsa Dec. 11, 2024, 1:33 p.m. UTC
hi,
this patchset adds support for optimizing usdt probes placed on top of
a 5-byte nop instruction.

The generic approach (optimizing all uprobes) is hard, because it may
require emulating multiple original instructions, with all the related
issues. The usdt case, which stores a 5-byte nop, seems much easier, so
I'm starting with that.

The basic idea is to replace the breakpoint exception with a syscall,
which is faster on x86_64. For more details please see the changelog of
patch 8.
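
A minimal userspace sketch of the byte-level idea, not code from the
series: it assumes the probe site holds the canonical 5-byte nop and
that optimizing means overwriting it with a 5-byte "call rel32" into a
trampoline. The name encode_call and the addresses used with it are
hypothetical; 0xE8 is the x86 near-call opcode that CALL_INSN_OPCODE
refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOP5_LEN    5
#define CALL_OPCODE 0xE8 /* x86 near call with a rel32 operand */

/* Canonical x86_64 5-byte nop: nopl 0x0(%rax,%rax,1) */
static const uint8_t nop5[NOP5_LEN] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Encode "call tramp" for an instruction located at address 'site'.
 * The displacement is relative to the end of the 5-byte instruction.
 * Returns 0 on success, -1 if the trampoline is not reachable with a
 * signed 32-bit displacement.
 */
static int encode_call(uint64_t site, uint64_t tramp, uint8_t out[NOP5_LEN])
{
    int64_t rel = (int64_t)(tramp - (site + NOP5_LEN));
    uint32_t u;
    int i;

    assert(out != NULL);
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;

    u = (uint32_t)(int32_t)rel;
    out[0] = CALL_OPCODE;
    /* rel32 is stored little-endian, byte by byte */
    for (i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(u >> (8 * i));
    return 0;
}
```

The range check illustrates why the trampoline has to be mapped within
+/-2GB of the probe site: anything further away cannot be expressed in
the 4 displacement bytes that fit in place of the nop.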

The run_bench_uprobes.sh benchmark triggers a uprobe (on top of
different original instructions) in a loop and counts how many hits
happen per second (the unit below is millions of loops per second).

There's a big speedup when comparing the current usdt implementation
(uprobe-nop) with the proposed one (uprobe-nop5):

  # ./benchs/run_bench_uprobes.sh 

      usermode-count :  233.831 ± 0.257M/s
      syscall-count  :   12.107 ± 0.038M/s
  --> uprobe-nop     :    3.246 ± 0.004M/s
      uprobe-push    :    3.057 ± 0.000M/s
      uprobe-ret     :    1.113 ± 0.003M/s
  --> uprobe-nop5    :    6.751 ± 0.037M/s
      uretprobe-nop  :    1.740 ± 0.015M/s
      uretprobe-push :    1.677 ± 0.018M/s
      uretprobe-ret  :    0.852 ± 0.005M/s
      uretprobe-nop5 :    6.769 ± 0.040M/s
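
To put the two marked rows in perspective, here is a small sketch that
turns M/s throughput into a speedup factor and an average cost per
loop. The helper names are made up for illustration; the only inputs
are the uprobe-nop and uprobe-nop5 values from the table above.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput ratio of the optimized probe over the baseline probe. */
static double speedup(double base_mops, double opt_mops)
{
    assert(base_mops > 0.0);
    return opt_mops / base_mops;
}

/* Average cost of one loop in nanoseconds, given millions of loops/s. */
static double ns_per_hit(double mops)
{
    assert(mops > 0.0);
    return 1000.0 / mops; /* 1e9 ns per second / (mops * 1e6 loops) */
}

/* Print the comparison for the two rows marked with --> in the table. */
static void report(double nop_mops, double nop5_mops)
{
    printf("uprobe-nop : %.1f ns/loop\n", ns_per_hit(nop_mops));
    printf("uprobe-nop5: %.1f ns/loop\n", ns_per_hit(nop5_mops));
    printf("speedup    : %.2fx\n", speedup(nop_mops, nop5_mops));
}
```

Calling report(3.246, 6.751) gives roughly a 2x speedup, i.e. about
308 ns per loop going down to about 148 ns. The loop cost includes the
benchmark's own overhead, so these are upper bounds on the probe cost
rather than exact figures.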


v1 changes:
- rebased on top of bpf-next/master
- couple of function/variable renames [Andrii]
- added nop5 emulation [Andrii]
- added checks to arch_uprobe_verify_opcode [Andrii]
- fixed arch_uprobe_is_callable/find_nearest_page [Andrii]
- used CALL_INSN_OPCODE [Masami]
- added uprobe-nop5 benchmark [Andrii]
- using atomic64_t in tramp_area [Andrii]
- using single page for all uprobe trampoline mappings

thanks,
jirka


---
Jiri Olsa (13):
      uprobes: Rename arch_uretprobe_trampoline function
      uprobes: Make copy_from_page global
      uprobes: Add nbytes argument to uprobe_write_opcode
      uprobes: Add arch_uprobe_verify_opcode function
      uprobes: Add mapping for optimized uprobe trampolines
      uprobes/x86: Add uprobe syscall to speed up uprobe
      uprobes/x86: Add support to emulate nop5 instruction
      uprobes/x86: Add support to optimize uprobes
      selftests/bpf: Use 5-byte nop for x86 usdt probes
      selftests/bpf: Add uprobe/usdt optimized test
      selftests/bpf: Add hit/attach/detach race optimized uprobe test
      selftests/bpf: Add uprobe syscall sigill signal test
      selftests/bpf: Add 5-byte nop uprobe trigger bench

 arch/x86/entry/syscalls/syscall_64.tbl                  |   1 +
 arch/x86/include/asm/uprobes.h                          |   7 +++
 arch/x86/kernel/uprobes.c                               | 255 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/syscalls.h                                |   2 +
 include/linux/uprobes.h                                 |  25 +++++++-
 kernel/events/uprobes.c                                 | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/fork.c                                           |   1 +
 kernel/sys_ni.c                                         |   1 +
 tools/testing/selftests/bpf/bench.c                     |  12 ++++
 tools/testing/selftests/bpf/benchs/bench_trigger.c      |  42 +++++++++++++
 tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh |   2 +-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 326 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/uprobe_optimized.c    |  29 +++++++++
 tools/testing/selftests/bpf/sdt.h                       |   9 ++-
 14 files changed, 880 insertions(+), 23 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_optimized.c

Comments

Andrii Nakryiko Dec. 13, 2024, 12:43 a.m. UTC | #1
On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
>
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
>
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.
>
> The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> original instructions) in a loop and counts how many of those happened
> per second (the unit below is million loops).
>
> There's big speed up if you consider current usdt implementation
> (uprobe-nop) compared to proposed usdt (uprobe-nop5):
>
>   # ./benchs/run_bench_uprobes.sh
>
>       usermode-count :  233.831 ± 0.257M/s
>       syscall-count  :   12.107 ± 0.038M/s
>   --> uprobe-nop     :    3.246 ± 0.004M/s
>       uprobe-push    :    3.057 ± 0.000M/s
>       uprobe-ret     :    1.113 ± 0.003M/s
>   --> uprobe-nop5    :    6.751 ± 0.037M/s
>       uretprobe-nop  :    1.740 ± 0.015M/s
>       uretprobe-push :    1.677 ± 0.018M/s
>       uretprobe-ret  :    0.852 ± 0.005M/s
>       uretprobe-nop5 :    6.769 ± 0.040M/s

uretprobe-nop5 throughput is the same as uprobe-nop5?..


Jiri Olsa Dec. 13, 2024, 9:46 a.m. UTC | #2
On Thu, Dec 12, 2024 at 04:43:02PM -0800, Andrii Nakryiko wrote:
> On Wed, Dec 11, 2024 at 5:34 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> >
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> >
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 8.
> >
> > The run_bench_uprobes.sh benchmark triggers uprobe (on top of different
> > original instructions) in a loop and counts how many of those happened
> > per second (the unit below is million loops).
> >
> > There's big speed up if you consider current usdt implementation
> > (uprobe-nop) compared to proposed usdt (uprobe-nop5):
> >
> >   # ./benchs/run_bench_uprobes.sh
> >
> >       usermode-count :  233.831 ± 0.257M/s
> >       syscall-count  :   12.107 ± 0.038M/s
> >   --> uprobe-nop     :    3.246 ± 0.004M/s
> >       uprobe-push    :    3.057 ± 0.000M/s
> >       uprobe-ret     :    1.113 ± 0.003M/s
> >   --> uprobe-nop5    :    6.751 ± 0.037M/s
> >       uretprobe-nop  :    1.740 ± 0.015M/s
> >       uretprobe-push :    1.677 ± 0.018M/s
> >       uretprobe-ret  :    0.852 ± 0.005M/s
> >       uretprobe-nop5 :    6.769 ± 0.040M/s
> 
> uretprobe-nop5 throughput is the same as uprobe-nop5?..

ok, there's a bug in the uretprobe bench setup, so that number is wrong,
sorry, will send new numbers

jirka

Peter Zijlstra Dec. 13, 2024, 10:51 a.m. UTC | #3
On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds support to optimize usdt probes on top of 5-byte
> nop instruction.
> 
> The generic approach (optimize all uprobes) is hard due to emulating
> possible multiple original instructions and its related issues. The
> usdt case, which stores 5-byte nop seems much easier, so starting
> with that.
> 
> The basic idea is to replace breakpoint exception with syscall which
> is faster on x86_64. For more details please see changelog of patch 8.

So ideally we'd put a check in the syscall, which verifies it comes from
one of our trampolines and reject any and all other usage.

The reason to do this is that we can then delete all this code the
moment it becomes irrelevant without having to worry userspace might be
'creative' somewhere.
Jiri Olsa Dec. 13, 2024, 1:07 p.m. UTC | #4
On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds support to optimize usdt probes on top of 5-byte
> > nop instruction.
> > 
> > The generic approach (optimize all uprobes) is hard due to emulating
> > possible multiple original instructions and its related issues. The
> > usdt case, which stores 5-byte nop seems much easier, so starting
> > with that.
> > 
> > The basic idea is to replace breakpoint exception with syscall which
> > is faster on x86_64. For more details please see changelog of patch 8.
> 
> So ideally we'd put a check in the syscall, which verifies it comes from
> one of our trampolines and reject any and all other usage.
> 
> The reason to do this is that we can then delete all this code the
> moment it becomes irrelevant without having to worry userspace might be
> 'creative' somewhere.

yes, we do that already in SYSCALL_DEFINE0(uprobe):

        /* Allow execution only from uprobe trampolines. */
        vma = vma_lookup(current->mm, regs->ip);
        if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
                force_sig(SIGILL);
                return -1;
        }

jirka
Peter Zijlstra Dec. 13, 2024, 1:54 p.m. UTC | #5
On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > nop instruction.
> > > 
> > > The generic approach (optimize all uprobes) is hard due to emulating
> > > possible multiple original instructions and its related issues. The
> > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > with that.
> > > 
> > > The basic idea is to replace breakpoint exception with syscall which
> > > is faster on x86_64. For more details please see changelog of patch 8.
> > 
> > So ideally we'd put a check in the syscall, which verifies it comes from
> > one of our trampolines and reject any and all other usage.
> > 
> > The reason to do this is that we can then delete all this code the
> > moment it becomes irrelevant without having to worry userspace might be
> > 'creative' somewhere.
> 
> yes, we do that already in SYSCALL_DEFINE0(uprobe):
> 
>         /* Allow execution only from uprobe trampolines. */
>         vma = vma_lookup(current->mm, regs->ip);
>         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
>                 force_sig(SIGILL);
>                 return -1;
>         }

Ah, right I missed that. Doesn't that need more locking though? The
moment vma_lookup() returns that vma can go bad.
Jiri Olsa Dec. 13, 2024, 2:05 p.m. UTC | #6
On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > nop instruction.
> > > > 
> > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > possible multiple original instructions and its related issues. The
> > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > with that.
> > > > 
> > > > The basic idea is to replace breakpoint exception with syscall which
> > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > 
> > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > one of our trampolines and reject any and all other usage.
> > > 
> > > The reason to do this is that we can then delete all this code the
> > > moment it becomes irrelevant without having to worry userspace might be
> > > 'creative' somewhere.
> > 
> > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > 
> >         /* Allow execution only from uprobe trampolines. */
> >         vma = vma_lookup(current->mm, regs->ip);
> >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> >                 force_sig(SIGILL);
> >                 return -1;
> >         }
> 
> Ah, right I missed that. Doesn't that need more locking though? The
> moment vma_lookup() returns that vma can go bad.

ugh yes.. I guess mmap_read_lock(current->mm) should do, will check

thanks,
jirka
Peter Zijlstra Dec. 13, 2024, 6:39 p.m. UTC | #7
On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > hi,
> > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > nop instruction.
> > > > > 
> > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > possible multiple original instructions and its related issues. The
> > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > with that.
> > > > > 
> > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > 
> > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > one of our trampolines and reject any and all other usage.
> > > > 
> > > > The reason to do this is that we can then delete all this code the
> > > > moment it becomes irrelevant without having to worry userspace might be
> > > > 'creative' somewhere.
> > > 
> > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > 
> > >         /* Allow execution only from uprobe trampolines. */
> > >         vma = vma_lookup(current->mm, regs->ip);
> > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > >                 force_sig(SIGILL);
> > >                 return -1;
> > >         }
> > 
> > Ah, right I missed that. Doesn't that need more locking though? The
> > moment vma_lookup() returns that vma can go bad.
> 
> ugh yes.. I guess mmap_read_lock(current->mm) should do, will check

If you check
tip/perf/core:kernel/events/uprobes.c:find_active_uprobe_speculative()
you'll find means of doing it locklessly using RCU.
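
As a rough illustration of what a lockless check can look like, here is
a single-mapping userspace model, assuming the approach boils down to a
sequence-counter-validated read with a fallback when a writer
interferes. This is not the kernel code: struct model_mm, tramp_tag and
all function names are hypothetical, and C11 atomics stand in for the
kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an address space with a single mapping. */
struct model_mm {
    atomic_uint seq;          /* odd while a writer is changing the mapping */
    _Atomic uintptr_t start;  /* mapping covers [start, end) */
    _Atomic uintptr_t end;
    _Atomic(const void *) tag; /* what kind of mapping this is */
};

static const char tramp_tag = 0; /* its address marks a "trampoline" mapping */

static void model_init(struct model_mm *mm)
{
    atomic_init(&mm->seq, 0);
    atomic_init(&mm->start, 0);
    atomic_init(&mm->end, 0);
    atomic_init(&mm->tag, (const void *)0);
}

/* Writer side: make seq odd before touching the mapping. */
static void model_write_begin(struct model_mm *mm)
{
    unsigned int s = atomic_load_explicit(&mm->seq, memory_order_relaxed);

    assert((s & 1) == 0); /* the model allows only one writer at a time */
    atomic_store_explicit(&mm->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void model_set(struct model_mm *mm, uintptr_t start, uintptr_t end,
                      const void *tag)
{
    atomic_store_explicit(&mm->start, start, memory_order_relaxed);
    atomic_store_explicit(&mm->end, end, memory_order_relaxed);
    atomic_store_explicit(&mm->tag, tag, memory_order_relaxed);
}

/* Writer side: make seq even again once the mapping is consistent. */
static void model_write_end(struct model_mm *mm)
{
    unsigned int s = atomic_load_explicit(&mm->seq, memory_order_relaxed);

    atomic_store_explicit(&mm->seq, s + 1, memory_order_release);
}

/*
 * Lockless check whether 'ip' lies inside a trampoline mapping.
 * Returns true if the speculation was valid and *in_tramp holds the answer;
 * returns false if a writer interfered and the caller must fall back to a
 * slower, locked path.
 */
static bool ip_in_tramp_speculative(struct model_mm *mm, uintptr_t ip,
                                    bool *in_tramp)
{
    unsigned int s1 = atomic_load_explicit(&mm->seq, memory_order_acquire);
    uintptr_t start, end;
    const void *tag;

    if (s1 & 1)
        return false; /* writer in progress */

    start = atomic_load_explicit(&mm->start, memory_order_relaxed);
    end = atomic_load_explicit(&mm->end, memory_order_relaxed);
    tag = atomic_load_explicit(&mm->tag, memory_order_relaxed);

    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&mm->seq, memory_order_relaxed) != s1)
        return false; /* mapping changed while we were reading it */

    *in_tramp = ip >= start && ip < end && tag == (const void *)&tramp_tag;
    return true;
}
```

The point of the pattern is that the common case takes no lock at all:
the reader only pays for two loads of the counter, and the slow path is
reserved for the rare case where the address space changes underneath.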
Jiri Olsa Dec. 13, 2024, 9:52 p.m. UTC | #8
On Fri, Dec 13, 2024 at 07:39:54PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> > On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > > hi,
> > > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > > nop instruction.
> > > > > > 
> > > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > > possible multiple original instructions and its related issues. The
> > > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > > with that.
> > > > > > 
> > > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > > 
> > > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > > one of our trampolines and reject any and all other usage.
> > > > > 
> > > > > The reason to do this is that we can then delete all this code the
> > > > > moment it becomes irrelevant without having to worry userspace might be
> > > > > 'creative' somewhere.
> > > > 
> > > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > > 
> > > >         /* Allow execution only from uprobe trampolines. */
> > > >         vma = vma_lookup(current->mm, regs->ip);
> > > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > > >                 force_sig(SIGILL);
> > > >                 return -1;
> > > >         }
> > > 
> > > Ah, right I missed that. Doesn't that need more locking though? The
> > > moment vma_lookup() returns that vma can go bad.
> > 
> > ugh yes.. I guess mmap_read_lock(current->mm) should do, will check
> 
> If you check
> tip/perf/core:kernel/events/uprobes.c:find_active_uprobe_speculative()
> you'll find means of doing it locklessly using RCU.

right, will use that

thanks,
jirka
Andrii Nakryiko Dec. 13, 2024, 9:59 p.m. UTC | #9
On Fri, Dec 13, 2024 at 1:52 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Dec 13, 2024 at 07:39:54PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 13, 2024 at 03:05:54PM +0100, Jiri Olsa wrote:
> > > On Fri, Dec 13, 2024 at 02:54:33PM +0100, Peter Zijlstra wrote:
> > > > On Fri, Dec 13, 2024 at 02:07:54PM +0100, Jiri Olsa wrote:
> > > > > On Fri, Dec 13, 2024 at 11:51:05AM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Dec 11, 2024 at 02:33:49PM +0100, Jiri Olsa wrote:
> > > > > > > hi,
> > > > > > > this patchset adds support to optimize usdt probes on top of 5-byte
> > > > > > > nop instruction.
> > > > > > >
> > > > > > > The generic approach (optimize all uprobes) is hard due to emulating
> > > > > > > possible multiple original instructions and its related issues. The
> > > > > > > usdt case, which stores 5-byte nop seems much easier, so starting
> > > > > > > with that.
> > > > > > >
> > > > > > > The basic idea is to replace breakpoint exception with syscall which
> > > > > > > is faster on x86_64. For more details please see changelog of patch 8.
> > > > > >
> > > > > > So ideally we'd put a check in the syscall, which verifies it comes from
> > > > > > one of our trampolines and reject any and all other usage.
> > > > > >
> > > > > > The reason to do this is that we can then delete all this code the
> > > > > > moment it becomes irrelevant without having to worry userspace might be
> > > > > > 'creative' somewhere.
> > > > >
> > > > > yes, we do that already in SYSCALL_DEFINE0(uprobe):
> > > > >
> > > > >         /* Allow execution only from uprobe trampolines. */
> > > > >         vma = vma_lookup(current->mm, regs->ip);
> > > > >         if (!vma || vma->vm_private_data != (void *) &tramp_mapping) {
> > > > >                 force_sig(SIGILL);
> > > > >                 return -1;
> > > > >         }
> > > >
> > > > Ah, right I missed that. Doesn't that need more locking though? The
> > > > moment vma_lookup() returns that vma can go bad.
> > >
> > > ugh yes.. I guess mmap_read_lock(current->mm) should do, will check
> >
> > If you check
> > tip/perf/core:kernel/events/uprobes.c:find_active_uprobe_speculative()
> > you'll find means of doing it locklessly using RCU.
>
> right, will use that

phew, yep, came here to ask not to add mmap_read_lock() into the hot
path again :)
