Message ID | 20241211133403.208920-9-jolsa@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | uprobes: Add support to optimize usdt probes on x86_64 | expand |
On Wed, Dec 11, 2024 at 02:33:57PM +0100, Jiri Olsa wrote: > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c > index cdea97f8cd39..b2420eeee23a 100644 > --- a/arch/x86/kernel/uprobes.c > +++ b/arch/x86/kernel/uprobes.c > @@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx, > else > return regs->sp <= ret->stack; > } > + > +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page, > + unsigned long vaddr, uprobe_opcode_t *new_opcode, > + int nbytes) > +{ > + uprobe_opcode_t old_opcode[5]; > + bool is_call, is_swbp, is_nop5; > + > + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) > + return uprobe_verify_opcode(page, vaddr, new_opcode); > + > + /* > + * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following > + * 5 bytes read won't cross the page boundary. > + */ > + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); > + is_call = is_call_insn((uprobe_opcode_t *) &old_opcode); > + is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode); > + is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode); > + > + /* > + * We allow following trasitions for optimized uprobes: > + * > + * nop5 -> swbp -> call > + * || | | > + * |'--<---' | > + * '---<-----------' > + * > + * We return 1 to ack the write, 0 to do nothing, -1 to fail write. > + * > + * If the current opcode (old_opcode) has already desired value, > + * we do nothing, because we are racing with another thread doing > + * the update. > + */ > + switch (nbytes) { > + case 5: > + if (is_call_insn(new_opcode)) { > + if (is_swbp) > + return 1; > + if (is_call && !memcmp(new_opcode, &old_opcode, 5)) > + return 0; > + } else { > + if (is_call || is_swbp) > + return 1; > + if (is_nop5) > + return 0; > + } > + break; > + case 1: > + if (is_swbp_insn(new_opcode)) { > + if (is_nop5) > + return 1; > + if (is_swbp || is_call) > + return 0; > + } else { > + if (is_swbp || is_call) > + return 1; > + if (is_nop5) > + return 0; > + } > + } > + return -1; > +} > + > +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes) > +{ > + return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn); > +} > + > +static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm, > + unsigned long vaddr) > +{ > + struct uprobe_trampoline *tramp; > + char call[5]; > + > + tramp = uprobe_trampoline_get(vaddr); > + if (!tramp) > + goto fail; > + > + relative_call(call, (void *) vaddr, (void *) tramp->vaddr); > + if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5)) > + goto fail; > + > + set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags); > + return; > + > +fail: > + /* Once we fail we never try again. */ > + clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags); > + uprobe_trampoline_put(tramp); > +} > + > +static bool should_optimize(struct arch_uprobe *auprobe) > +{ > + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) > + return false; > + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) > + return false; > + return true; > +} > + > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr) > +{ > + struct mm_struct *mm = current->mm; > + > + if (!should_optimize(auprobe)) > + return; > + > + mmap_write_lock(mm); > + if (should_optimize(auprobe)) > + __arch_uprobe_optimize(auprobe, mm, vaddr); > + mmap_write_unlock(mm); > +} > + > +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) > +{ > + uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn; > + > + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) > + return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5); > + > + return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE); > +} > + > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > +{ > + long delta = (long)(vaddr + 5 - vtramp); > + return delta >= INT_MIN && delta <= INT_MAX; > +} All this code is useless on 32bit, right?
On Fri, Dec 13, 2024 at 11:49:07AM +0100, Peter Zijlstra wrote: > On Wed, Dec 11, 2024 at 02:33:57PM +0100, Jiri Olsa wrote: > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c > > index cdea97f8cd39..b2420eeee23a 100644 > > --- a/arch/x86/kernel/uprobes.c > > +++ b/arch/x86/kernel/uprobes.c > > > @@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx, > > else > > return regs->sp <= ret->stack; > > } > > + > > +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page, > > + unsigned long vaddr, uprobe_opcode_t *new_opcode, > > + int nbytes) > > +{ > > + uprobe_opcode_t old_opcode[5]; > > + bool is_call, is_swbp, is_nop5; > > + > > + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) > > + return uprobe_verify_opcode(page, vaddr, new_opcode); > > + > > + /* > > + * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following > > + * 5 bytes read won't cross the page boundary. > > + */ > > + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); > > + is_call = is_call_insn((uprobe_opcode_t *) &old_opcode); > > + is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode); > > + is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode); > > + > > + /* > > + * We allow following trasitions for optimized uprobes: > > + * > > + * nop5 -> swbp -> call > > + * || | | > > + * |'--<---' | > > + * '---<-----------' > > + * > > + * We return 1 to ack the write, 0 to do nothing, -1 to fail write. > > + * > > + * If the current opcode (old_opcode) has already desired value, > > + * we do nothing, because we are racing with another thread doing > > + * the update. > > + */ > > + switch (nbytes) { > > + case 5: > > + if (is_call_insn(new_opcode)) { > > + if (is_swbp) > > + return 1; > > + if (is_call && !memcmp(new_opcode, &old_opcode, 5)) > > + return 0; > > + } else { > > + if (is_call || is_swbp) > > + return 1; > > + if (is_nop5) > > + return 0; > > + } > > + break; > > + case 1: > > + if (is_swbp_insn(new_opcode)) { > > + if (is_nop5) > > + return 1; > > + if (is_swbp || is_call) > > + return 0; > > + } else { > > + if (is_swbp || is_call) > > + return 1; > > + if (is_nop5) > > + return 0; > > + } > > + } > > + return -1; > > +} > > + > > +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes) > > +{ > > + return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn); > > +} > > + > > +static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm, > > + unsigned long vaddr) > > +{ > > + struct uprobe_trampoline *tramp; > > + char call[5]; > > + > > + tramp = uprobe_trampoline_get(vaddr); > > + if (!tramp) > > + goto fail; > > + > > + relative_call(call, (void *) vaddr, (void *) tramp->vaddr); > > + if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5)) > > + goto fail; > > + > > + set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags); > > + return; > > + > > +fail: > > + /* Once we fail we never try again. */ > > + clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags); > > + uprobe_trampoline_put(tramp); > > +} > > + > > +static bool should_optimize(struct arch_uprobe *auprobe) > > +{ > > + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) > > + return false; > > + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) > > + return false; > > + return true; > > +} > > + > > +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr) > > +{ > > + struct mm_struct *mm = current->mm; > > + > > + if (!should_optimize(auprobe)) > > + return; > > + > > + mmap_write_lock(mm); > > + if (should_optimize(auprobe)) > > + __arch_uprobe_optimize(auprobe, mm, vaddr); > > + mmap_write_unlock(mm); > > +} > > + > > +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) > > +{ > > + uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn; > > + > > + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) > > + return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5); > > + > > + return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE); > > +} > > + > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > > +{ > > + long delta = (long)(vaddr + 5 - vtramp); > > + return delta >= INT_MIN && delta <= INT_MAX; > > +} > > All this code is useless on 32bit, right? yes, there's user_64bit_mode check in arch_uprobe_trampoline_mapping when getting the trampoline, so above won't get executed in practise.. but I think we should make it more obvious and put the check directly to arch_uprobe_optimize jirka
On Wed, Dec 11, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > Putting together all the previously added pieces to support optimized > uprobes on top of 5-byte nop instruction. > > The current uprobe execution goes through following: > - installs breakpoint instruction over original instruction > - exception handler hit and calls related uprobe consumers > - and either simulates original instruction or does out of line single step > execution of it > - returns to user space > > The optimized uprobe path > > - checks the original instruction is 5-byte nop (plus other checks) > - adds (or uses existing) user space trampoline and overwrites original > instruction (5-byte nop) with call to user space trampoline > - the user space trampoline executes uprobe syscall that calls related uprobe > consumers > - trampoline returns back to next instruction > > This approach won't speed up all uprobes as it's limited to using nop5 as > original instruction, but we could use nop5 as USDT probe instruction (which > uses single byte nop ATM) and speed up the USDT probes. > > This patch overloads related arch functions in uprobe_write_opcode and > set_orig_insn so they can install call instruction if needed. > > The arch_uprobe_optimize triggers the uprobe optimization and is called after > first uprobe hit. I originally had it called on uprobe installation but then > it clashed with elf loader, because the user space trampoline was added in a > place where loader might need to put elf segments, so I decided to do it after > first uprobe hit when loading is done. > > We do not unmap and release uprobe trampoline when it's no longer needed, > because there's no easy way to make sure none of the threads is still > inside the trampoline. But we do not waste memory, because there's just > single page for all the uprobe trampoline mappings. > > We do waste frmae on page mapping for every 4GB by keeping the uprobe > trampoline page mapped, but that seems ok. > > Attaching the speed up from benchs/run_bench_uprobes.sh script: > > current: > > uprobe-nop : 3.281 ± 0.003M/s > uprobe-push : 3.085 ± 0.003M/s > uprobe-ret : 1.130 ± 0.000M/s > --> uprobe-nop5 : 3.276 ± 0.007M/s > uretprobe-nop : 1.716 ± 0.016M/s > uretprobe-push : 1.651 ± 0.017M/s > uretprobe-ret : 0.846 ± 0.006M/s > --> uretprobe-nop5 : 3.279 ± 0.002M/s > > after the change: > > uprobe-nop : 3.246 ± 0.004M/s > uprobe-push : 3.057 ± 0.000M/s > uprobe-ret : 1.113 ± 0.003M/s > --> uprobe-nop5 : 6.751 ± 0.037M/s > uretprobe-nop : 1.740 ± 0.015M/s > uretprobe-push : 1.677 ± 0.018M/s > uretprobe-ret : 0.852 ± 0.005M/s > --> uretprobe-nop5 : 6.769 ± 0.040M/s > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > --- > arch/x86/include/asm/uprobes.h | 7 ++ > arch/x86/kernel/uprobes.c | 168 ++++++++++++++++++++++++++++++++- > include/linux/uprobes.h | 1 + > kernel/events/uprobes.c | 8 ++ > 4 files changed, 181 insertions(+), 3 deletions(-) > [...] > + > +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page, > + unsigned long vaddr, uprobe_opcode_t *new_opcode, > + int nbytes) > +{ > + uprobe_opcode_t old_opcode[5]; > + bool is_call, is_swbp, is_nop5; > + > + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) > + return uprobe_verify_opcode(page, vaddr, new_opcode); > + > + /* > + * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following > + * 5 bytes read won't cross the page boundary. > + */ > + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); > + is_call = is_call_insn((uprobe_opcode_t *) &old_opcode); > + is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode); > + is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode); > + > + /* > + * We allow following trasitions for optimized uprobes: > + * > + * nop5 -> swbp -> call > + * || | | > + * |'--<---' | > + * '---<-----------' > + * > + * We return 1 to ack the write, 0 to do nothing, -1 to fail write. > + * > + * If the current opcode (old_opcode) has already desired value, > + * we do nothing, because we are racing with another thread doing > + * the update. > + */ > + switch (nbytes) { > + case 5: > + if (is_call_insn(new_opcode)) { > + if (is_swbp) > + return 1; > + if (is_call && !memcmp(new_opcode, &old_opcode, 5)) > + return 0; > + } else { > + if (is_call || is_swbp) > + return 1; > + if (is_nop5) > + return 0; > + } > + break; > + case 1: > + if (is_swbp_insn(new_opcode)) { > + if (is_nop5) > + return 1; > + if (is_swbp || is_call) > + return 0; > + } else { > + if (is_swbp || is_call) > + return 1; > + if (is_nop5) > + return 0; > + } > + } > + return -1; nit: -EINVAL? > +} > + > +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes) > +{ > + return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn); > +} > + [...]
From: Jiri Olsa > Sent: 11 December 2024 13:34 > > Putting together all the previously added pieces to support optimized > uprobes on top of 5-byte nop instruction. > > The current uprobe execution goes through following: > - installs breakpoint instruction over original instruction > - exception handler hit and calls related uprobe consumers > - and either simulates original instruction or does out of line single step > execution of it > - returns to user space > > The optimized uprobe path > > - checks the original instruction is 5-byte nop (plus other checks) > - adds (or uses existing) user space trampoline and overwrites original > instruction (5-byte nop) with call to user space trampoline > - the user space trampoline executes uprobe syscall that calls related uprobe > consumers > - trampoline returns back to next instruction ... How on earth can you safely overwrite a randomly aligned 5 byte instruction that might be being prefetched and executed by another thread of the same process. If the instruction doesn't cross a cache line boundary then you might manage to convince people that an 8-byte write will always be atomic wrt other cpu reading instructions. But you can't guarantee the alignment. You might manage with the 7 byte sequence: br .+7; call addr and then update 'addr' before changing the branch offset from 05 to 00. But even that may not be safe if 'addr' crosses a cache line boundary. You could replace a one byte nop (0x90) with a breakpoint (0xcc) and then return to the instruction after the breakpoint. That would save having to emulate or single stap the overwritten instruction. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 12/15, David Laight wrote: > > From: Jiri Olsa > > The optimized uprobe path > > > > - checks the original instruction is 5-byte nop (plus other checks) > > - adds (or uses existing) user space trampoline and overwrites original > > instruction (5-byte nop) with call to user space trampoline > > - the user space trampoline executes uprobe syscall that calls related uprobe > > consumers > > - trampoline returns back to next instruction > ... > > How on earth can you safely overwrite a randomly aligned 5 byte instruction > that might be being prefetched and executed by another thread of the > same process. uprobe_write_opcode() doesn't overwrite the instruction in place. It creates the new page with the same content, overwrites the probed insn in that page, then calls __replace_page(). Oleg.
On Sun, Dec 15, 2024 at 03:14:13PM +0100, Oleg Nesterov wrote: > On 12/15, David Laight wrote: > > > > From: Jiri Olsa > > > The optimized uprobe path > > > > > > - checks the original instruction is 5-byte nop (plus other checks) > > > - adds (or uses existing) user space trampoline and overwrites original > > > instruction (5-byte nop) with call to user space trampoline > > > - the user space trampoline executes uprobe syscall that calls related uprobe > > > consumers > > > - trampoline returns back to next instruction > > ... > > > > How on earth can you safely overwrite a randomly aligned 5 byte instruction > > that might be being prefetched and executed by another thread of the > > same process. > > uprobe_write_opcode() doesn't overwrite the instruction in place. > > It creates the new page with the same content, overwrites the probed insn in > that page, then calls __replace_page(). tbh I wasn't completely sure about that as well, I added selftest in patch #11 trying to hit the issue you described and it seems to work ok jirka
From: Jiri Olsa > Sent: 16 December 2024 08:09 > > On Sun, Dec 15, 2024 at 03:14:13PM +0100, Oleg Nesterov wrote: > > On 12/15, David Laight wrote: > > > > > > From: Jiri Olsa > > > > The optimized uprobe path > > > > > > > > - checks the original instruction is 5-byte nop (plus other checks) > > > > - adds (or uses existing) user space trampoline and overwrites original > > > > instruction (5-byte nop) with call to user space trampoline > > > > - the user space trampoline executes uprobe syscall that calls related uprobe > > > > consumers > > > > - trampoline returns back to next instruction > > > ... > > > > > > How on earth can you safely overwrite a randomly aligned 5 byte instruction > > > that might be being prefetched and executed by another thread of the > > > same process. > > > > uprobe_write_opcode() doesn't overwrite the instruction in place. > > > > It creates the new page with the same content, overwrites the probed insn in > > that page, then calls __replace_page(). > > tbh I wasn't completely sure about that as well, I added selftest > in patch #11 trying to hit the issue you described and it seems to > work ok Actually hitting the timing window is hard. So 'seems to work ok' doesn't really mean much :-) It all depends on how hard __replace_page() tries to be atomic. The page has to change from one backed by the executable to a private one backed by swap - otherwise you can't write to it. But the problems arise when the instruction prefetch unit has read part of the 5-byte instruction (it might even only read half a cache line at a time). I'm not sure how long the pipeline can sit in that state - but I can do a memory read of a PCIe address that takes ~3000 clocks. (And a misaligned AVX-512 read is probably eight 8-byte transfers.) So I think you need to force an interrupt while the PTE is invalid. And that need to be simultaneous on all cpu running that process. Stopping the process using ptrace would do it. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
David, let me say first that my understanding of this magic is very limited, please correct me. On 12/16, David Laight wrote: > > It all depends on how hard __replace_page() tries to be atomic. > The page has to change from one backed by the executable to a private > one backed by swap - otherwise you can't write to it. This is what uprobe_write_opcode() does, > But the problems arise when the instruction prefetch unit has read > part of the 5-byte instruction (it might even only read half a cache > line at a time). > I'm not sure how long the pipeline can sit in that state - but I > can do a memory read of a PCIe address that takes ~3000 clocks. > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > So I think you need to force an interrupt while the PTE is invalid. > And that need to be simultaneous on all cpu running that process. __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). That's not enough? > Stopping the process using ptrace would do it. Not an option :/ Oleg.
From: Oleg Nesterov > Sent: 16 December 2024 10:13 > > David, > > let me say first that my understanding of this magic is very limited, > please correct me. I only (half) understand what the 'magic' has to accomplish and some of the pitfalls. I've copied linux-mm - someone there might know more. > On 12/16, David Laight wrote: > > > > It all depends on how hard __replace_page() tries to be atomic. > > The page has to change from one backed by the executable to a private > > one backed by swap - otherwise you can't write to it. > > This is what uprobe_write_opcode() does, And will be enough for single byte changes - they'll be picked up at some point after the change. > > But the problems arise when the instruction prefetch unit has read > > part of the 5-byte instruction (it might even only read half a cache > > line at a time). > > I'm not sure how long the pipeline can sit in that state - but I > > can do a memory read of a PCIe address that takes ~3000 clocks. > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > > > So I think you need to force an interrupt while the PTE is invalid. > > And that need to be simultaneous on all cpu running that process. > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). > > That's not enough? I doubt it. As I understand it. The hardware page tables will be shared by all the threads of a process. So unless you hard synchronise all the cpu (and flush the TLB) while the PTE is being changed there is always the possibility of a cpu picking up the new PTE before the IPI that (I presume) flush_tlb_page() generates is processed. If that happens when the instruction you are patching is part-read into the instruction decode buffer then you'll execute a mismatch of the two instructions. I can't remember the outcome of discussions about live-patching kernel code - and I'm sure that was aligned 32bit writes. > > > Stopping the process using ptrace would do it. > > Not an option :/ Thought you'd say that. David > > Oleg. - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
OK, thanks, I am starting to share your concerns... Oleg. On 12/16, David Laight wrote: > > From: Oleg Nesterov > > Sent: 16 December 2024 10:13 > > > > David, > > > > let me say first that my understanding of this magic is very limited, > > please correct me. > > I only (half) understand what the 'magic' has to accomplish and > some of the pitfalls. > > I've copied linux-mm - someone there might know more. > > > On 12/16, David Laight wrote: > > > > > > It all depends on how hard __replace_page() tries to be atomic. > > > The page has to change from one backed by the executable to a private > > > one backed by swap - otherwise you can't write to it. > > > > This is what uprobe_write_opcode() does, > > And will be enough for single byte changes - they'll be picked up > at some point after the change. > > > > But the problems arise when the instruction prefetch unit has read > > > part of the 5-byte instruction (it might even only read half a cache > > > line at a time). > > > I'm not sure how long the pipeline can sit in that state - but I > > > can do a memory read of a PCIe address that takes ~3000 clocks. > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > > > > > So I think you need to force an interrupt while the PTE is invalid. > > > And that need to be simultaneous on all cpu running that process. > > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). > > > > That's not enough? > > I doubt it. As I understand it. > The hardware page tables will be shared by all the threads of a process. > So unless you hard synchronise all the cpu (and flush the TLB) while the > PTE is being changed there is always the possibility of a cpu picking up > the new PTE before the IPI that (I presume) flush_tlb_page() generates > is processed. > If that happens when the instruction you are patching is part-read into > the instruction decode buffer then you'll execute a mismatch of the two > instructions. > > I can't remember the outcome of discussions about live-patching kernel > code - and I'm sure that was aligned 32bit writes. > > > > > > Stopping the process using ptrace would do it. > > > > Not an option :/ > > Thought you'd say that. > > David > > > > > Oleg. > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote: > OK, thanks, I am starting to share your concerns... > > Oleg. > > On 12/16, David Laight wrote: > > > > From: Oleg Nesterov > > > Sent: 16 December 2024 10:13 > > > > > > David, > > > > > > let me say first that my understanding of this magic is very limited, > > > please correct me. > > > > I only (half) understand what the 'magic' has to accomplish and > > some of the pitfalls. > > > > I've copied linux-mm - someone there might know more. > > > > > On 12/16, David Laight wrote: > > > > > > > > It all depends on how hard __replace_page() tries to be atomic. > > > > The page has to change from one backed by the executable to a private > > > > one backed by swap - otherwise you can't write to it. > > > > > > This is what uprobe_write_opcode() does, > > > > And will be enough for single byte changes - they'll be picked up > > at some point after the change. > > > > > > But the problems arise when the instruction prefetch unit has read > > > > part of the 5-byte instruction (it might even only read half a cache > > > > line at a time). > > > > I'm not sure how long the pipeline can sit in that state - but I > > > > can do a memory read of a PCIe address that takes ~3000 clocks. > > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > > > > > > > So I think you need to force an interrupt while the PTE is invalid. > > > > And that need to be simultaneous on all cpu running that process. > > > > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). > > > > > > That's not enough? > > > > I doubt it. As I understand it. > > The hardware page tables will be shared by all the threads of a process. > > So unless you hard synchronise all the cpu (and flush the TLB) while the > > PTE is being changed there is always the possibility of a cpu picking up > > the new PTE before the IPI that (I presume) flush_tlb_page() generates > > is processed. > > If that happens when the instruction you are patching is part-read into > > the instruction decode buffer then you'll execute a mismatch of the two > > instructions. if 5 byte update would be a problem, I guess we could workaround that through partial updates using int3 like we do in text_poke_bp_batch? - changing nop5 instruction to 'call xxx' - write int3 to first byte of nop5 instruction - have poke_int3_handler to emulate nop5 if int3 is triggered - write rest of the call instruction to nop5 last 4 bytes - overwrite first byte of nop5 with call opcode similar update from 'call xxx' -> 'nop5' thanks, jirka > > > > I can't remember the outcome of discussions about live-patching kernel > > code - and I'm sure that was aligned 32bit writes. > > > > > > > > > Stopping the process using ptrace would do it. > > > > > > Not an option :/ > > > > Thought you'd say that. > > > > David > > > > > > > > Oleg. > > > > - > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > > Registration No: 1397386 (Wales) > > >
From: Jiri Olsa > Sent: 16 December 2024 12:50 > > On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote: > > OK, thanks, I am starting to share your concerns... > > > > Oleg. > > > > On 12/16, David Laight wrote: > > > > > > From: Oleg Nesterov > > > > Sent: 16 December 2024 10:13 > > > > > > > > David, > > > > > > > > let me say first that my understanding of this magic is very limited, > > > > please correct me. > > > > > > I only (half) understand what the 'magic' has to accomplish and > > > some of the pitfalls. > > > > > > I've copied linux-mm - someone there might know more. > > > > > > > On 12/16, David Laight wrote: > > > > > > > > > > It all depends on how hard __replace_page() tries to be atomic. > > > > > The page has to change from one backed by the executable to a private > > > > > one backed by swap - otherwise you can't write to it. > > > > > > > > This is what uprobe_write_opcode() does, > > > > > > And will be enough for single byte changes - they'll be picked up > > > at some point after the change. > > > > > > > > But the problems arise when the instruction prefetch unit has read > > > > > part of the 5-byte instruction (it might even only read half a cache > > > > > line at a time). > > > > > I'm not sure how long the pipeline can sit in that state - but I > > > > > can do a memory read of a PCIe address that takes ~3000 clocks. > > > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > > > > > > > > > So I think you need to force an interrupt while the PTE is invalid. > > > > > And that need to be simultaneous on all cpu running that process. > > > > > > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). > > > > > > > > That's not enough? > > > > > > I doubt it. As I understand it. > > > The hardware page tables will be shared by all the threads of a process. > > > So unless you hard synchronise all the cpu (and flush the TLB) while the > > > PTE is being changed there is always the possibility of a cpu picking up > > > the new PTE before the IPI that (I presume) flush_tlb_page() generates > > > is processed. > > > If that happens when the instruction you are patching is part-read into > > > the instruction decode buffer then you'll execute a mismatch of the two > > > instructions. > > if 5 byte update would be a problem, I guess we could workaround that through > partial updates using int3 like we do in text_poke_bp_batch? > > - changing nop5 instruction to 'call xxx' > - write int3 to first byte of nop5 instruction > - have poke_int3_handler to emulate nop5 if int3 is triggered > - write rest of the call instruction to nop5 last 4 bytes > - overwrite first byte of nop5 with call opcode That might work provided there are IPI (to flush the decode pipeline) after the write of the 'int3' and one before the write of the 'call'. You'll need to ensure the I-cache gets invalidated as well. And if the sequence crosses a page boundary.... David > > similar update from 'call xxx' -> 'nop5' > > thanks, > jirka > > > > > > > I can't remember the outcome of discussions about live-patching kernel > > > code - and I'm sure that was aligned 32bit writes. > > > > > > > > > > > > Stopping the process using ptrace would do it. > > > > > > > > Not an option :/ > > > > > > Thought you'd say that. > > > > > > David > > > > > > > > > > > Oleg. > > > > > > - > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > > > Registration No: 1397386 (Wales) > > > > > - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, Dec 16, 2024 at 03:08:14PM +0000, David Laight wrote: > From: Jiri Olsa > > Sent: 16 December 2024 12:50 > > > > On Mon, Dec 16, 2024 at 01:22:05PM +0100, Oleg Nesterov wrote: > > > OK, thanks, I am starting to share your concerns... > > > > > > Oleg. > > > > > > On 12/16, David Laight wrote: > > > > > > > > From: Oleg Nesterov > > > > > Sent: 16 December 2024 10:13 > > > > > > > > > > David, > > > > > > > > > > let me say first that my understanding of this magic is very limited, > > > > > please correct me. > > > > > > > > I only (half) understand what the 'magic' has to accomplish and > > > > some of the pitfalls. > > > > > > > > I've copied linux-mm - someone there might know more. > > > > > > > > > On 12/16, David Laight wrote: > > > > > > > > > > > > It all depends on how hard __replace_page() tries to be atomic. > > > > > > The page has to change from one backed by the executable to a private > > > > > > one backed by swap - otherwise you can't write to it. > > > > > > > > > > This is what uprobe_write_opcode() does, > > > > > > > > And will be enough for single byte changes - they'll be picked up > > > > at some point after the change. > > > > > > > > > > But the problems arise when the instruction prefetch unit has read > > > > > > part of the 5-byte instruction (it might even only read half a cache > > > > > > line at a time). > > > > > > I'm not sure how long the pipeline can sit in that state - but I > > > > > > can do a memory read of a PCIe address that takes ~3000 clocks. > > > > > > (And a misaligned AVX-512 read is probably eight 8-byte transfers.) > > > > > > > > > > > > So I think you need to force an interrupt while the PTE is invalid. > > > > > > And that need to be simultaneous on all cpu running that process. > > > > > > > > > > __replace_page() does ptep_get_and_clear(old_pte) + flush_tlb_page(). > > > > > > > > > > That's not enough? > > > > > > > > I doubt it. As I understand it. > > > > The hardware page tables will be shared by all the threads of a process. > > > > So unless you hard synchronise all the cpu (and flush the TLB) while the > > > > PTE is being changed there is always the possibility of a cpu picking up > > > > the new PTE before the IPI that (I presume) flush_tlb_page() generates > > > > is processed. > > > > If that happens when the instruction you are patching is part-read into > > > > the instruction decode buffer then you'll execute a mismatch of the two > > > > instructions. > > > > if 5 byte update would be a problem, I guess we could workaround that through > > partial updates using int3 like we do in text_poke_bp_batch? > > > > - changing nop5 instruction to 'call xxx' > > - write int3 to first byte of nop5 instruction > > - have poke_int3_handler to emulate nop5 if int3 is triggered > > - write rest of the call instruction to nop5 last 4 bytes > > - overwrite first byte of nop5 with call opcode > > That might work provided there are IPI (to flush the decode pipeline) > after the write of the 'int3' and one before the write of the 'call'. > You'll need to ensure the I-cache gets invalidated as well. ok, seems to be done by text_poke_sync > > And if the sequence crosses a page boundary.... that was already limitation for the current change jirka
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h index 678fb546f0a7..84a75ed748f0 100644 --- a/arch/x86/include/asm/uprobes.h +++ b/arch/x86/include/asm/uprobes.h @@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t; #define UPROBE_SWBP_INSN 0xcc #define UPROBE_SWBP_INSN_SIZE 1 +enum { + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0, + ARCH_UPROBE_FLAG_OPTIMIZED = 1, +}; + struct uprobe_xol_ops; struct arch_uprobe { @@ -45,6 +50,8 @@ struct arch_uprobe { u8 ilen; } push; }; + + unsigned long flags; }; struct arch_uprobe_task { diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c index cdea97f8cd39..b2420eeee23a 100644 --- a/arch/x86/kernel/uprobes.c +++ b/arch/x86/kernel/uprobes.c @@ -18,6 +18,7 @@ #include <asm/processor.h> #include <asm/insn.h> #include <asm/mmu_context.h> +#include <asm/nops.h> /* Post-execution fixups. */ @@ -914,8 +915,37 @@ static int is_nop5_insn(uprobe_opcode_t *insn) return !memcmp(insn, x86_nops[5], 5); } +static int is_call_insn(uprobe_opcode_t *insn) +{ + return *insn == CALL_INSN_OPCODE; +} + +static void relative_insn(void *dest, void *from, void *to, u8 op) +{ + struct __arch_relative_insn { + u8 op; + s32 raddr; + } __packed *insn; + + insn = (struct __arch_relative_insn *)dest; + insn->raddr = (s32)((long)(to) - ((long)(from) + 5)); + insn->op = op; +} + +static void relative_call(void *dest, void *from, void *to) +{ + relative_insn(dest, from, to, CALL_INSN_OPCODE); +} + +static bool can_optimize_vaddr(unsigned long vaddr) +{ + /* We can't do cross page atomic writes yet. */ + return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5; +} + /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */ -static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn) +static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn, + unsigned long vaddr) { u8 opc1 = OPCODE1(insn); insn_byte_t p; @@ -933,8 +963,11 @@ static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn) break; case 0x0f: - if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn)) + if (is_nop5_insn((uprobe_opcode_t *) &auprobe->insn)) { + if (can_optimize_vaddr(vaddr)) + set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags); goto setup; + } if (insn->opcode.nbytes != 2) return -ENOSYS; /* @@ -1065,7 +1098,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, if (ret) return ret; - ret = branch_setup_xol_ops(auprobe, &insn); + ret = branch_setup_xol_ops(auprobe, &insn, addr); if (ret != -ENOSYS) return ret; @@ -1306,3 +1339,132 @@ bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx, else return regs->sp <= ret->stack; } + +int arch_uprobe_verify_opcode(struct arch_uprobe *auprobe, struct page *page, + unsigned long vaddr, uprobe_opcode_t *new_opcode, + int nbytes) +{ + uprobe_opcode_t old_opcode[5]; + bool is_call, is_swbp, is_nop5; + + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) + return uprobe_verify_opcode(page, vaddr, new_opcode); + + /* + * The ARCH_UPROBE_FLAG_CAN_OPTIMIZE flag guarantees the following + * 5 bytes read won't cross the page boundary. + */ + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); + is_call = is_call_insn((uprobe_opcode_t *) &old_opcode); + is_swbp = is_swbp_insn((uprobe_opcode_t *) &old_opcode); + is_nop5 = is_nop5_insn((uprobe_opcode_t *) &old_opcode); + + /* + * We allow following trasitions for optimized uprobes: + * + * nop5 -> swbp -> call + * || | | + * |'--<---' | + * '---<-----------' + * + * We return 1 to ack the write, 0 to do nothing, -1 to fail write. + * + * If the current opcode (old_opcode) has already desired value, + * we do nothing, because we are racing with another thread doing + * the update. + */ + switch (nbytes) { + case 5: + if (is_call_insn(new_opcode)) { + if (is_swbp) + return 1; + if (is_call && !memcmp(new_opcode, &old_opcode, 5)) + return 0; + } else { + if (is_call || is_swbp) + return 1; + if (is_nop5) + return 0; + } + break; + case 1: + if (is_swbp_insn(new_opcode)) { + if (is_nop5) + return 1; + if (is_swbp || is_call) + return 0; + } else { + if (is_swbp || is_call) + return 1; + if (is_nop5) + return 0; + } + } + return -1; +} + +bool arch_uprobe_is_register(uprobe_opcode_t *insn, int nbytes) +{ + return nbytes == 5 ? is_call_insn(insn) : is_swbp_insn(insn); +} + +static void __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm, + unsigned long vaddr) +{ + struct uprobe_trampoline *tramp; + char call[5]; + + tramp = uprobe_trampoline_get(vaddr); + if (!tramp) + goto fail; + + relative_call(call, (void *) vaddr, (void *) tramp->vaddr); + if (uprobe_write_opcode(auprobe, mm, vaddr, call, 5)) + goto fail; + + set_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags); + return; + +fail: + /* Once we fail we never try again. */ + clear_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags); + uprobe_trampoline_put(tramp); +} + +static bool should_optimize(struct arch_uprobe *auprobe) +{ + if (!test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) + return false; + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) + return false; + return true; +} + +void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr) +{ + struct mm_struct *mm = current->mm; + + if (!should_optimize(auprobe)) + return; + + mmap_write_lock(mm); + if (should_optimize(auprobe)) + __arch_uprobe_optimize(auprobe, mm, vaddr); + mmap_write_unlock(mm); +} + +int set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) +{ + uprobe_opcode_t *insn = (uprobe_opcode_t *) auprobe->insn; + + if (test_bit(ARCH_UPROBE_FLAG_OPTIMIZED, &auprobe->flags)) + return uprobe_write_opcode(auprobe, mm, vaddr, insn, 5); + + return uprobe_write_opcode(auprobe, mm, vaddr, insn, UPROBE_SWBP_INSN_SIZE); +} + +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) +{ + long delta = (long)(vaddr + 5 - vtramp); + return delta >= INT_MIN && delta <= INT_MAX; +} diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 5e9a33bfb747..1b14b9f2f8d0 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -233,6 +233,7 @@ extern void uprobe_trampoline_put(struct uprobe_trampoline *area); extern bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr); extern const struct vm_special_mapping *arch_uprobe_trampoline_mapping(void); extern void handle_syscall_uprobe(struct pt_regs *regs, unsigned long bp_vaddr); +extern void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 52f38d1ef276..a7a3eeec9e51 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2654,6 +2654,11 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c return true; } +void __weak arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr) +{ + return; +} + /* * Run handler and ask thread to singlestep. * Ensure all non-fatal signals cannot interrupt thread while it singlesteps. @@ -2718,6 +2723,9 @@ static void handle_swbp(struct pt_regs *regs) handler_chain(uprobe, regs); + /* Try to optimize after first hit. */ + arch_uprobe_optimize(&uprobe->arch, bp_vaddr); + if (arch_uprobe_skip_sstep(&uprobe->arch, regs)) goto out;
Putting together all the previously added pieces to support optimized uprobes on top of 5-byte nop instruction. The current uprobe execution goes through following: - installs breakpoint instruction over original instruction - exception handler hit and calls related uprobe consumers - and either simulates original instruction or does out of line single step execution of it - returns to user space The optimized uprobe path - checks the original instruction is 5-byte nop (plus other checks) - adds (or uses existing) user space trampoline and overwrites original instruction (5-byte nop) with call to user space trampoline - the user space trampoline executes uprobe syscall that calls related uprobe consumers - trampoline returns back to next instruction This approach won't speed up all uprobes as it's limited to using nop5 as original instruction, but we could use nop5 as USDT probe instruction (which uses single byte nop ATM) and speed up the USDT probes. This patch overloads related arch functions in uprobe_write_opcode and set_orig_insn so they can install call instruction if needed. The arch_uprobe_optimize triggers the uprobe optimization and is called after first uprobe hit. I originally had it called on uprobe installation but then it clashed with elf loader, because the user space trampoline was added in a place where loader might need to put elf segments, so I decided to do it after first uprobe hit when loading is done. We do not unmap and release uprobe trampoline when it's no longer needed, because there's no easy way to make sure none of the threads is still inside the trampoline. But we do not waste memory, because there's just single page for all the uprobe trampoline mappings. We do waste frmae on page mapping for every 4GB by keeping the uprobe trampoline page mapped, but that seems ok. Attaching the speed up from benchs/run_bench_uprobes.sh script: current: uprobe-nop : 3.281 ± 0.003M/s uprobe-push : 3.085 ± 0.003M/s uprobe-ret : 1.130 ± 0.000M/s --> uprobe-nop5 : 3.276 ± 0.007M/s uretprobe-nop : 1.716 ± 0.016M/s uretprobe-push : 1.651 ± 0.017M/s uretprobe-ret : 0.846 ± 0.006M/s --> uretprobe-nop5 : 3.279 ± 0.002M/s after the change: uprobe-nop : 3.246 ± 0.004M/s uprobe-push : 3.057 ± 0.000M/s uprobe-ret : 1.113 ± 0.003M/s --> uprobe-nop5 : 6.751 ± 0.037M/s uretprobe-nop : 1.740 ± 0.015M/s uretprobe-push : 1.677 ± 0.018M/s uretprobe-ret : 0.852 ± 0.005M/s --> uretprobe-nop5 : 6.769 ± 0.040M/s Signed-off-by: Jiri Olsa <jolsa@kernel.org> --- arch/x86/include/asm/uprobes.h | 7 ++ arch/x86/kernel/uprobes.c | 168 ++++++++++++++++++++++++++++++++- include/linux/uprobes.h | 1 + kernel/events/uprobes.c | 8 ++ 4 files changed, 181 insertions(+), 3 deletions(-)