Message ID | 20230908132740.718103-5-hbathini@linux.ibm.com (mailing list archive)
---|---
State | Changes Requested
Delegated to | BPF
Series | powerpc/bpf: use BPF prog pack allocator
Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch, async |
bpf/vmtest-bpf-next-PR | success | PR summary |
bpf/vmtest-bpf-next-VM_Test-0 | success | Logs for ShellCheck |
bpf/vmtest-bpf-next-VM_Test-5 | success | Logs for set-matrix |
bpf/vmtest-bpf-next-VM_Test-1 | success | Logs for build for aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-3 | success | Logs for build for x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-4 | success | Logs for build for x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-2 | success | Logs for build for s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-6 | success | Logs for test_maps on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-14 | success | Logs for test_progs_no_alu32 on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-18 | success | Logs for test_progs_no_alu32_parallel on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-21 | success | Logs for test_progs_parallel on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-24 | success | Logs for test_verifier on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-8 | fail | Logs for test_maps on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-9 | success | Logs for test_maps on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-10 | success | Logs for test_progs on aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-12 | success | Logs for test_progs on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-13 | success | Logs for test_progs on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-16 | success | Logs for test_progs_no_alu32 on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-17 | success | Logs for test_progs_no_alu32 on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-20 | success | Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-19 | success | Logs for test_progs_no_alu32_parallel on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-22 | success | Logs for test_progs_parallel on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-23 | success | Logs for test_progs_parallel on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-26 | success | Logs for test_verifier on x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-27 | success | Logs for test_verifier on x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-28 | success | Logs for veristat |
bpf/vmtest-bpf-next-VM_Test-25 | success | Logs for test_verifier on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-15 | success | Logs for test_progs_no_alu32 on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-11 | success | Logs for test_progs on s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-7 | success | Logs for test_maps on s390x with gcc |
On Fri, Sep 8, 2023 at 6:28 AM Hari Bathini <hbathini@linux.ibm.com> wrote:
>
> patch_instruction() entails setting up pte, patching the instruction,
> clearing the pte and flushing the tlb. If multiple instructions need
> to be patched, every instruction would have to go through the above
> drill unnecessarily. Instead, introduce function patch_instructions()
> that sets up the pte, clears the pte and flushes the tlb only once per
> page range of instructions to be patched. This adds a slight overhead
> to patch_instruction() call while improving the patching time for
> scenarios where more than one instruction needs to be patched.
>
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>

I didn't see this one when I reviewed 1/5. Please ignore that comment.

[...]

> @@ -307,11 +312,22 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
>
>         orig_mm = start_using_temp_mm(patching_mm);
>
> -       err = __patch_instruction(addr, instr, patch_addr);
> +       while (len > 0) {
> +               instr = ppc_inst_read(code);
> +               ilen = ppc_inst_len(instr);
> +               err = __patch_instruction(addr, instr, patch_addr);

It appears we are still repeating a lot of work here. For example, with
fill_insn == true, we don't need to repeat ppc_inst_read().

Can we do this with a memcpy or memset like functions?

> +               /* hwsync performed by __patch_instruction (sync) if successful */
> +               if (err) {
> +                       mb(); /* sync */
> +                       break;
> +               }
>
> -       /* hwsync performed by __patch_instruction (sync) if successful */
> -       if (err)
> -               mb(); /* sync */
> +               len -= ilen;
> +               patch_addr = patch_addr + ilen;
> +               addr = (void *)addr + ilen;
> +               if (!fill_insn)
> +                       code = code + ilen;

It took me a while to figure out what "fill_insn" means. Maybe call it
"repeat_input" or something?

Thanks,
Song

> +       }
>
>         /* context synchronisation performed by __patch_instruction (isync or exception) */
>         stop_using_temp_mm(patching_mm, orig_mm);
> @@ -328,16 +344,21 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
>
>         return err;
> }
>
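To make the suggestion concrete: the bulk-copy idea amounts to replacing the per-instruction __patch_instruction() loop with a single copy into the writable alias of the page. The sketch below is only an illustration of that idea, not code from this series: the helper name is hypothetical, the barrier choice is an assumption, and 64-bit prefixed instructions are ignored for simplicity.

```c
#include <linux/types.h>
#include <linux/string.h>
#include <asm/barrier.h>

/*
 * Hypothetical bulk-copy variant of the inner patching step.
 * 'patch_addr' is the writable alias of the text being patched.
 */
static int __patch_memcpy(void *patch_addr, void *code, size_t len, bool repeat_instr)
{
	if (repeat_instr)
		/* replicate one 4-byte instruction across the whole range */
		memset32(patch_addr, *(u32 *)code, len / sizeof(u32));
	else
		memcpy(patch_addr, code, len);

	/* order the stores before the flush/unmap that follows */
	mb();
	return 0;
}
```

With something along these lines, the repeat-instruction case degenerates to a memset32() of a single opcode, so ppc_inst_read() is no longer called once per instruction.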
Le 26/09/2023 à 00:50, Song Liu a écrit :
> On Fri, Sep 8, 2023 at 6:28 AM Hari Bathini <hbathini@linux.ibm.com> wrote:
>>
>> patch_instruction() entails setting up pte, patching the instruction,
>> clearing the pte and flushing the tlb. If multiple instructions need
>> to be patched, every instruction would have to go through the above
>> drill unnecessarily. Instead, introduce function patch_instructions()
>> that sets up the pte, clears the pte and flushes the tlb only once per
>> page range of instructions to be patched. This adds a slight overhead
>> to patch_instruction() call while improving the patching time for
>> scenarios where more than one instruction needs to be patched.
>>
>> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
>
> I didn't see this one when I reviewed 1/5. Please ignore that comment.

If I remember correctly, patch 1 introduces a huge performance
degradation, which then gets improved with this patch.

As I said before, I'd expect patch 4 to go first then get
bpf_arch_text_copy() be implemented with patch_instructions() directly.

Christophe

>
> [...]
>
>> @@ -307,11 +312,22 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
>>
>>         orig_mm = start_using_temp_mm(patching_mm);
>>
>> -       err = __patch_instruction(addr, instr, patch_addr);
>> +       while (len > 0) {
>> +               instr = ppc_inst_read(code);
>> +               ilen = ppc_inst_len(instr);
>> +               err = __patch_instruction(addr, instr, patch_addr);
>
> It appears we are still repeating a lot of work here. For example, with
> fill_insn == true, we don't need to repeat ppc_inst_read().
>
> Can we do this with a memcpy or memset like functions?
>
>> +               /* hwsync performed by __patch_instruction (sync) if successful */
>> +               if (err) {
>> +                       mb(); /* sync */
>> +                       break;
>> +               }
>>
>> -       /* hwsync performed by __patch_instruction (sync) if successful */
>> -       if (err)
>> -               mb(); /* sync */
>> +               len -= ilen;
>> +               patch_addr = patch_addr + ilen;
>> +               addr = (void *)addr + ilen;
>> +               if (!fill_insn)
>> +                       code = code + ilen;
>
> It took me a while to figure out what "fill_insn" means. Maybe call it
> "repeat_input" or something?
>
> Thanks,
> Song
>
>> +       }
>>
>>         /* context synchronisation performed by __patch_instruction (isync or exception) */
>>         stop_using_temp_mm(patching_mm, orig_mm);
>> @@ -328,16 +344,21 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
>>
>>         return err;
>> }
>>
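For reference, reordering the series so that patch_instructions() lands first would let the BPF side be wired up roughly as below. This is a sketch based on the generic weak bpf_arch_text_copy(); the text_mutex locking and error convention here are assumptions, not the committed powerpc version.

```c
#include <linux/bpf.h>
#include <linux/err.h>
#include <linux/memory.h>
#include <asm/code-patching.h>

/* Sketch: bpf_arch_text_copy() as a thin wrapper around patch_instructions(). */
void *bpf_arch_text_copy(void *dst, void *src, size_t len)
{
	int err;

	mutex_lock(&text_mutex);
	/* copy 'len' bytes of instructions from 'src' into the RO image at 'dst' */
	err = patch_instructions(dst, src, len, false);
	mutex_unlock(&text_mutex);

	return err ? ERR_PTR(err) : dst;
}
```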
On 26/09/23 12:21 pm, Christophe Leroy wrote:
>
>
> Le 26/09/2023 à 00:50, Song Liu a écrit :
>> On Fri, Sep 8, 2023 at 6:28 AM Hari Bathini <hbathini@linux.ibm.com> wrote:
>>>
>>> patch_instruction() entails setting up pte, patching the instruction,
>>> clearing the pte and flushing the tlb. If multiple instructions need
>>> to be patched, every instruction would have to go through the above
>>> drill unnecessarily. Instead, introduce function patch_instructions()
>>> that sets up the pte, clears the pte and flushes the tlb only once per
>>> page range of instructions to be patched. This adds a slight overhead
>>> to patch_instruction() call while improving the patching time for
>>> scenarios where more than one instruction needs to be patched.
>>>
>>> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
>>
>> I didn't see this one when I reviewed 1/5. Please ignore that comment.
>
> If I remember correctly, patch 1 introduces a huge performance
> degradation, which then gets improved with this patch.
>
> As I said before, I'd expect patch 4 to go first then get
> bpf_arch_text_copy() be implemented with patch_instructions() directly.
>

Thanks for the reviews, Christophe & Song.
Posted v5 based on your inputs.

- Hari
diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 3f881548fb61..4f5f0c091586 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -74,6 +74,7 @@ int create_cond_branch(ppc_inst_t *instr, const u32 *addr,
 int patch_branch(u32 *addr, unsigned long target, int flags);
 int patch_instruction(u32 *addr, ppc_inst_t instr);
 int raw_patch_instruction(u32 *addr, ppc_inst_t instr);
+int patch_instructions(void *addr, void *code, size_t len, bool fill_insn);

 static inline unsigned long patch_site_addr(s32 *site)
 {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index b00112d7ad46..60d446e16b42 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -278,20 +278,25 @@ static void unmap_patch_area(unsigned long addr)
         flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 }

-static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
+/*
+ * A page is mapped and instructions that fit the page are patched.
+ * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
+ */
+static int __do_patch_instructions_mm(u32 *addr, void *code, size_t len, bool fill_insn)
 {
-        int err;
-        u32 *patch_addr;
         unsigned long text_poke_addr;
         pte_t *pte;
         unsigned long pfn = get_patch_pfn(addr);
         struct mm_struct *patching_mm;
         struct mm_struct *orig_mm;
+        ppc_inst_t instr;
+        void *patch_addr;
         spinlock_t *ptl;
+        int ilen, err;

         patching_mm = __this_cpu_read(cpu_patching_context.mm);
         text_poke_addr = __this_cpu_read(cpu_patching_context.addr);
-        patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+        patch_addr = (void *)(text_poke_addr + offset_in_page(addr));

         pte = get_locked_pte(patching_mm, text_poke_addr, &ptl);
         if (!pte)
@@ -307,11 +312,22 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)

         orig_mm = start_using_temp_mm(patching_mm);

-        err = __patch_instruction(addr, instr, patch_addr);
+        while (len > 0) {
+                instr = ppc_inst_read(code);
+                ilen = ppc_inst_len(instr);
+                err = __patch_instruction(addr, instr, patch_addr);
+                /* hwsync performed by __patch_instruction (sync) if successful */
+                if (err) {
+                        mb(); /* sync */
+                        break;
+                }

-        /* hwsync performed by __patch_instruction (sync) if successful */
-        if (err)
-                mb(); /* sync */
+                len -= ilen;
+                patch_addr = patch_addr + ilen;
+                addr = (void *)addr + ilen;
+                if (!fill_insn)
+                        code = code + ilen;
+        }

         /* context synchronisation performed by __patch_instruction (isync or exception) */
         stop_using_temp_mm(patching_mm, orig_mm);
@@ -328,16 +344,21 @@ static int __do_patch_instruction_mm(u32 *addr, ppc_inst_t instr)
         return err;
 }

-static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
+/*
+ * A page is mapped and instructions that fit the page are patched.
+ * Assumes 'len' to be (PAGE_SIZE - offset_in_page(addr)) or below.
+ */
+static int __do_patch_instructions(u32 *addr, void *code, size_t len, bool fill_insn)
 {
-        int err;
-        u32 *patch_addr;
         unsigned long text_poke_addr;
         pte_t *pte;
         unsigned long pfn = get_patch_pfn(addr);
+        void *patch_addr;
+        ppc_inst_t instr;
+        int ilen, err;

         text_poke_addr = (unsigned long)__this_cpu_read(cpu_patching_context.addr) & PAGE_MASK;
-        patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
+        patch_addr = (void *)(text_poke_addr + offset_in_page(addr));

         pte = __this_cpu_read(cpu_patching_context.pte);
         __set_pte_at(&init_mm, text_poke_addr, pte, pfn_pte(pfn, PAGE_KERNEL), 0);
@@ -345,7 +366,19 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
         if (radix_enabled())
                 asm volatile("ptesync": : :"memory");

-        err = __patch_instruction(addr, instr, patch_addr);
+        while (len > 0) {
+                instr = ppc_inst_read(code);
+                ilen = ppc_inst_len(instr);
+                err = __patch_instruction(addr, instr, patch_addr);
+                if (err)
+                        break;
+
+                len -= ilen;
+                patch_addr = patch_addr + ilen;
+                addr = (void *)addr + ilen;
+                if (!fill_insn)
+                        code = code + ilen;
+        }

         pte_clear(&init_mm, text_poke_addr, pte);
         flush_tlb_kernel_range(text_poke_addr, text_poke_addr + PAGE_SIZE);
@@ -369,15 +402,46 @@ int patch_instruction(u32 *addr, ppc_inst_t instr)

         local_irq_save(flags);
         if (mm_patch_enabled())
-                err = __do_patch_instruction_mm(addr, instr);
+                err = __do_patch_instructions_mm(addr, &instr, ppc_inst_len(instr), false);
         else
-                err = __do_patch_instruction(addr, instr);
+                err = __do_patch_instructions(addr, &instr, ppc_inst_len(instr), false);
         local_irq_restore(flags);

         return err;
 }
 NOKPROBE_SYMBOL(patch_instruction);

+/*
+ * Patch 'addr' with 'len' bytes of instructions from 'code'.
+ */
+int patch_instructions(void *addr, void *code, size_t len, bool fill_insn)
+{
+        unsigned long flags;
+        size_t plen;
+        int err;
+
+        while (len > 0) {
+                plen = min_t(size_t, PAGE_SIZE - offset_in_page(addr), len);
+
+                local_irq_save(flags);
+                if (mm_patch_enabled())
+                        err = __do_patch_instructions_mm(addr, code, plen, fill_insn);
+                else
+                        err = __do_patch_instructions(addr, code, plen, fill_insn);
+                local_irq_restore(flags);
+                if (err)
+                        break;
+
+                len -= plen;
+                addr = addr + plen;
+                if (!fill_insn)
+                        code = code + plen;
+        }
+
+        return err;
+}
+NOKPROBE_SYMBOL(patch_instructions);
+
 int patch_branch(u32 *addr, unsigned long target, int flags)
 {
         ppc_inst_t instr;
patch_instruction() entails setting up pte, patching the instruction,
clearing the pte and flushing the tlb. If multiple instructions need
to be patched, every instruction would have to go through the above
drill unnecessarily. Instead, introduce function patch_instructions()
that sets up the pte, clears the pte and flushes the tlb only once per
page range of instructions to be patched. This adds a slight overhead
to patch_instruction() call while improving the patching time for
scenarios where more than one instruction needs to be patched.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
---
 arch/powerpc/include/asm/code-patching.h |  1 +
 arch/powerpc/lib/code-patching.c         | 94 ++++++++++++++++++++----
 2 files changed, 80 insertions(+), 15 deletions(-)
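As a usage illustration of the new interface, a caller that wants to stamp a single instruction across a whole region (for example, poisoning a freed BPF image with traps) would pass fill_insn == true so that 'code' is read once and replicated. The function body below is a hedged sketch built on the patch_instructions() signature added by this patch; only PPC_RAW_TRAP() and the bpf_arch_text_invalidate() prototype are taken from the existing kernel, the rest is illustrative.

```c
#include <linux/bpf.h>
#include <asm/code-patching.h>
#include <asm/ppc-opcode.h>

/* Sketch: fill 'len' bytes at 'dst' with copies of a single trap instruction. */
int bpf_arch_text_invalidate(void *dst, size_t len)
{
	u32 insn = PPC_RAW_TRAP();

	/* 'code' points at one instruction; fill_insn repeats it over the range */
	return patch_instructions(dst, &insn, len, true);
}
```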