Message ID | 20250114140237.3506624-1-jolsa@kernel.org (mailing list archive)
---|---
State | RFC
Series | [RFC] x86/alternatives: Merge first and second step in text_poke_bp_batch
Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch
On Tue, Jan 14, 2025 at 03:02:37PM +0100, Jiri Olsa wrote:
> hi,
> while checking on similar code for uprobes I was wondering if we
> can merge first 2 steps of instruction update in text_poke_bp_batch
> function.
>
> Basically the first step now would be to write int3 byte together
> with the rest of the bytes of the new instruction instead of doing
> that separately. And the second step would be to overwrite int3
> byte with first byte of the new instruction.
>
> Would that work or do I miss some x86 detail that could lead to crash?

I *think* it will work on most modern systems, but I'm very sure I
don't have all the details.

IIRC this is the magic recipe blessed by both Intel and AMD, and if
we're going to be changing this I would want both vendors to sign off
on that.

> I tried to hack it together in attached patch and it speeds up a bit
> text_poke_bp_batch as shown below.

Why do we care about performance here?
On Tue, Jan 14, 2025 at 03:17:23PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 14, 2025 at 03:02:37PM +0100, Jiri Olsa wrote:
> > hi,
> > while checking on similar code for uprobes I was wondering if we
> > can merge first 2 steps of instruction update in text_poke_bp_batch
> > function.
> >
> > Basically the first step now would be to write int3 byte together
> > with the rest of the bytes of the new instruction instead of doing
> > that separately. And the second step would be to overwrite int3
> > byte with first byte of the new instruction.
> >
> > Would that work or do I miss some x86 detail that could lead to crash?
>
> I *think* it will work on most modern systems, but I'm very sure I
> don't have all the details.
>
> IIRC this is the magic recipe blessed by both Intel and AMD, and
> if we're going to be changing this I would want both vendors to sign off
> on that.

ok

> > I tried to hack it together in attached patch and it speeds up a bit
> > text_poke_bp_batch as shown below.
>
> Why do we care about performance here?

just a benefit of doing that change.. but mainly I was just curious
on why those first steps are separated

thanks,
jirka
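For reference, the sequence being discussed, reduced to a single patch
site. This is only a sketch: text_poke(), text_poke_sync(),
INT3_INSN_OPCODE, INT3_INSN_SIZE and POKE_MAX_OPCODE_SIZE are the
kernel's own names, while the two helper functions and their arguments
are made up purely for illustration.

	/* Current upstream sequence in text_poke_bp_batch(), per site: */
	static void poke_site_current(void *addr, const u8 *new_insn, int len)
	{
		u8 int3 = INT3_INSN_OPCODE;

		/* 1: trap out the first byte, so nobody decodes a torn insn */
		text_poke(addr, &int3, INT3_INSN_SIZE);
		text_poke_sync();

		/* 2: the tail bytes are now unreachable; rewrite them */
		text_poke(addr + INT3_INSN_SIZE, new_insn + INT3_INSN_SIZE,
			  len - INT3_INSN_SIZE);
		text_poke_sync();

		/* 3: drop the int3 by writing the new first byte */
		text_poke(addr, new_insn, INT3_INSN_SIZE);
		text_poke_sync();
	}

	/* Proposed merged variant: steps 1 and 2 become one write */
	static void poke_site_merged(void *addr, const u8 *new_insn, int len)
	{
		u8 tmp[POKE_MAX_OPCODE_SIZE];

		memcpy(tmp, new_insn, len);
		tmp[0] = INT3_INSN_OPCODE;

		/* 1: int3 plus the new tail bytes in a single text_poke() */
		text_poke(addr, tmp, len);
		text_poke_sync();

		/* 2: replace the int3 with the new first byte */
		text_poke(addr, new_insn, INT3_INSN_SIZE);
		text_poke_sync();
	}

The question in the thread is whether collapsing steps 1 and 2, as in
the second helper, is architecturally safe.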
From: Jiri Olsa
> Sent: 14 January 2025 14:03
>
> hi,
> while checking on similar code for uprobes I was wondering if we
> can merge first 2 steps of instruction update in text_poke_bp_batch
> function.
>
> Basically the first step now would be to write int3 byte together
> with the rest of the bytes of the new instruction instead of doing
> that separately. And the second step would be to overwrite int3
> byte with first byte of the new instruction.
>
> Would that work or do I miss some x86 detail that could lead to crash?

I suspect it will 'crash and burn'.

Consider what happens if there is a cache-line boundary in the middle
of an instruction. (Actually an instruction fetch boundary will do.)

cpu0: reads the old instructions from the old cache line.
cpu0: pipeline busy (or similar) so doesn't read the next cache line.
cpu1: writes the new instructions.
cpu0: reads the second cache line.

cpu0 now has a mix of the old and new instruction bytes.

Writing the int3 is safe - provided they don't return until all the
patching is over.

But between writing the int3 (over the first opcode byte) and updating
anything else I suspect you need something that does a complete
synchronise on each cpu, one that discards any bytes in the decode
pipeline as well as flushing the I-cache (etc). I suspect that
requires an acked IPI.

Very long cpu stalls are easy to generate. Any read from PCIe will be
slow (I've an fpga target that takes ~1us). You'd need to be unlucky
to be patching an instruction while one was pending, but a DMA access
might just be enough to cause grief.

	David
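Laid out as a timeline, the failure mode described above might look
like this. The 5-byte instruction and its placement are made up for
illustration: assume its first byte is the last byte of one 64-byte
cache line, so bytes 1-4 sit on the next line.

	/*
	 * cpu0 (executing)                  cpu1 (patching, merged variant)
	 * ----------------                  -------------------------------
	 * fetches line A: old byte 0
	 * stalls (e.g. slow PCIe read)
	 *                                   writes int3 + new bytes 1-4
	 * fetches line B: new bytes 1-4
	 * decodes old byte 0 together
	 * with new bytes 1-4 -> bogus
	 * instruction
	 *
	 * With the int3 written first and synced on its own, cpu0 can
	 * only ever decode byte 0 as the old opcode or as int3; the
	 * tail bytes change only after every cpu has acknowledged the
	 * sync IPI.
	 */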
On Tue, 14 Jan 2025 15:31:14 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> > IIRC this is the magic recipe blessed by both Intel and AMD, and
> > if we're going to be changing this I would want both vendors to sign off
> > on that.
>
> ok

Right. In fact Intel wouldn't sign off on this recipe for a few years.
We actually added it to the kernel before they gave their full
blessing. I got a "wink, it should work" from them, but they wouldn't
officially say so ;-)

But a lot of it has to do with all the magic of the CPU. They have
always allowed writing the one byte int3. I figured, if I could write
that one byte int3 and then run a sync on all CPUs, so that all CPUs
see that change, then nothing should ever care about the other 4 bytes
after that int3 (a sync was already done). Then change the 4 bytes and
sync again.

I doubt the int3 plus the 4 byte change would work. As was mentioned,
if the other 4 bytes were on another cache line, another CPU could
read the first set of bytes without the int3 and the second set of
bytes with the update, and go boom!

This dance was to make sure everything sees everything properly.

I gave a talk about this at Kernel-Recipes in 2019:

https://www.slideshare.net/slideshow/kernel-recipes-2019-ftrace-where-modifying-a-running-kernel-all-started/177509633#44

-- Steve
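The invariant being described can be written out as the states another
CPU may observe at each point. This is illustrative only; "sync" here
stands for text_poke_sync(), the serializing IPI round, and a 5-byte
instruction is assumed.

	/*
	 * state 0:  old[0] old[1..4]  - the old instruction
	 *   write int3 to byte 0 (a single-byte store, always atomic)
	 * state 1:  byte 0 is old[0] or int3, tail is old[1..4]
	 *                             - either decodes to something valid
	 *   sync: every cpu now sees the int3
	 * state 2:  int3 old[1..4]    - tail bytes are dead; nobody
	 *                               decodes past the trap
	 *   write new tail bytes 1..4
	 * state 3:  int3 new[1..4]    - still just a trap
	 *   sync: every cpu now sees the new tail
	 *   write new byte 0
	 * state 4:  new[0] new[1..4]  - the new instruction
	 *   sync (if needed)
	 */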
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 243843e44e89..99830e9cd641 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2125,9 +2125,9 @@ struct text_poke_loc {
 	s32 disp;
 	u8 len;
 	u8 opcode;
-	const u8 text[POKE_MAX_OPCODE_SIZE];
+	u8 text[POKE_MAX_OPCODE_SIZE];
 	/* see text_poke_bp_batch() */
-	u8 old;
+	u8 first;
 };
 
 struct bp_patching_desc {
@@ -2282,7 +2282,6 @@ static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
-	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
 
@@ -2313,29 +2312,20 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 */
 	smp_wmb();
 
-	/*
-	 * First step: add a int3 trap to the address that will be patched.
-	 */
-	for (i = 0; i < nr_entries; i++) {
-		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
-		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
-	}
-
-	text_poke_sync();
-
 	/*
 	 * Second step: update all but the first byte of the patched range.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
-		u8 old[POKE_MAX_OPCODE_SIZE+1] = { tp[i].old, };
+		u8 old[POKE_MAX_OPCODE_SIZE+1];
 		u8 _new[POKE_MAX_OPCODE_SIZE+1];
 		const u8 *new = tp[i].text;
 		int len = tp[i].len;
 
+		tp[i].first = tp[i].text[0];
+		tp[i].text[0] = (u8) INT3_INSN_OPCODE;
+
 		if (len - INT3_INSN_SIZE > 0) {
-			memcpy(old + INT3_INSN_SIZE,
-			       text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
-			       len - INT3_INSN_SIZE);
+			memcpy(old, text_poke_addr(&tp[i]), len);
 
 			if (len == 6) {
 				_new[0] = 0x0f;
@@ -2343,9 +2333,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 			new = _new;
 		}
 
-		text_poke(text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
-			  new + INT3_INSN_SIZE,
-			  len - INT3_INSN_SIZE);
+		text_poke(text_poke_addr(&tp[i]), new, len);
 		do_sync++;
 	}
 
@@ -2391,7 +2379,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * replacing opcode.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
-		u8 byte = tp[i].text[0];
+		u8 byte = tp[i].first;
 
 		if (tp[i].len == 6)
 			byte = 0x0f;