diff mbox series

[RFC] x86/alternatives: Merge first and second step in text_poke_bp_batch

Message ID 20250114140237.3506624-1-jolsa@kernel.org (mailing list archive)
State RFC
Headers show
Series [RFC] x86/alternatives: Merge first and second step in text_poke_bp_batch | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Jiri Olsa Jan. 14, 2025, 2:02 p.m. UTC
hi,
while checking on similar code for uprobes I was wondering if we
can merge first 2 steps of instruction update in text_poke_bp_batch
function.

Basically the first step now would be to write int3 byte together
with the rest of the bytes of the new instruction instead of doing
that separately. And the second step would be to overwrite int3
byte with first byte of the new instruction.

Would that work or do I miss some x86 detail that could lead to crash?

I tried to hack it together in attached patch and it speeds up a bit
text_poke_bp_batch as shown below.

thoughts? thanks,
jirka


---
  # cat ~/test.sh
  for i in `seq 1 50`; do echo function > current_tracer ; echo nop > current_tracer ; done

current code:

  # perf stat -e cycles:k,instructions:k -a --  ~jolsa/test.sh

   Performance counter stats for 'system wide':

     158,872,470,494      cycles:k
      61,529,351,096      instructions:k                   #    0.39  insn per cycle

        12.847083584 seconds time elapsed

with the change below:

  # perf stat -e cycles:k,instructions:k -a --  ~jolsa/test.sh

   Performance counter stats for 'system wide':

     105,687,644,963      cycles:k
      45,981,957,996      instructions:k                   #    0.44  insn per cycle

        10.011825294 seconds time elapsed
---
 arch/x86/kernel/alternative.c | 30 +++++++++---------------------
 1 file changed, 9 insertions(+), 21 deletions(-)

Comments

Peter Zijlstra Jan. 14, 2025, 2:17 p.m. UTC | #1
On Tue, Jan 14, 2025 at 03:02:37PM +0100, Jiri Olsa wrote:
> hi,
> while checking on similar code for uprobes I was wondering if we
> can merge first 2 steps of instruction update in text_poke_bp_batch
> function.
> 
> Basically the first step now would be to write int3 byte together
> with the rest of the bytes of the new instruction instead of doing
> that separately. And the second step would be to overwrite int3
> byte with first byte of the new instruction.
> 
> Would that work or do I miss some x86 detail that could lead to crash?

I *think* it will work on most modern systems, but I'm very sure I don't
have all the details.

IIRC this is the magic recipe blessed by both Intel and AMD, and
if we're going to be changing this I would want both vendors to sign off
on that.

> I tried to hack it together in attached patch and it speeds up a bit
> text_poke_bp_batch as shown below.

Why do we care about performance here?
Jiri Olsa Jan. 14, 2025, 2:31 p.m. UTC | #2
On Tue, Jan 14, 2025 at 03:17:23PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 14, 2025 at 03:02:37PM +0100, Jiri Olsa wrote:
> > hi,
> > while checking on similar code for uprobes I was wondering if we
> > can merge first 2 steps of instruction update in text_poke_bp_batch
> > function.
> > 
> > Basically the first step now would be to write int3 byte together
> > with the rest of the bytes of the new instruction instead of doing
> > that separately. And the second step would be to overwrite int3
> > byte with first byte of the new instruction.
> > 
> > Would that work or do I miss some x86 detail that could lead to crash?
> 
> I *think* it will work on most modern systems, but I'm very sure I don't
> have all the details.
> 
> IIRC this is the magic recipe blessed by both Intel and AMD, and
> if we're going to be changing this I would want both vendors to sign off
> on that.

ok

> 
> > I tried to hack it together in attached patch and it speeds up a bit
> > text_poke_bp_batch as shown below.
> 
> Why do we care about performance here?

just a benefit of doing that change.. but mainly I was just curious
on why those first steps are separated

thanks,
jirka
David Laight Jan. 14, 2025, 2:38 p.m. UTC | #3
From: Jiri Olsa
> Sent: 14 January 2025 14:03
> 
> hi,
> while checking on similar code for uprobes I was wondering if we
> can merge first 2 steps of instruction update in text_poke_bp_batch
> function.
> 
> Basically the first step now would be to write int3 byte together
> with the rest of the bytes of the new instruction instead of doing
> that separately. And the second step would be to overwrite int3
> byte with first byte of the new instruction.
> 
> Would that work or do I miss some x86 detail that could lead to crash?

I suspect it will 'crash and burn'.

Consider what happens if there is a cache-line boundary in the
middle of an instruction.
(Actually an instruction fetch boundary will do.)

cpu0: reads the old instructions from the old cache line.
cpu0: pipeline busy (or similar) so doesn't read the next cache line.
cpu1: writes the new instructions.
cpu0: reads the second cache line.

cpu0 now has a mix of the old and new instruction bytes.

Writing the int3 is safe - provided they don't return until
all the patching is over.

But between writing the int3 (over the first opcode byte) and
updating anything else I suspect you need something that does
a complete synchronise between the cpu that discards any bytes
in the decode pipeline as well as flushing the I-cache (etc).
I suspect that requires an acked IPI.

Very long cpu stalls are easy to generate.
Any read from PCIe will be slow (I've at fpga target that takes ~1us).
You'd need to be unlucky to be patching an instruction while one
was pending, but a DMA access might just be enough to cause grief.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Steven Rostedt Jan. 14, 2025, 3:36 p.m. UTC | #4
On Tue, 14 Jan 2025 15:31:14 +0100
Jiri Olsa <olsajiri@gmail.com> wrote:

> > IIRC this is the magic recipe blessed by both Intel and AMD, and
> > if we're going to be changing this I would want both vendors to sign off
> > on that.  
> 
> ok

Right. In fact Intel wouldn't sign off on this recipe for a few years. We
actually added to the kernel before they gave their full blessing. I got a
"wink, it should work" from them but they wouldn't officially say so ;-)

But a lot of it has to do with all the magic of the CPU. They have always
allowed writing the one byte int3. I figured, if I could write that one
byte int3 then run a sync on all CPUs where all CPUs see that change, then
nothing should ever care about the other 4 bytes after that int3 (a sync
was already done). Then change the 4 bytes and sync again.

I doubt the int3 plus the 4 byte change would work, as was mentioned if the
other 4 bytes were on another cache line, another CPU could read the first
set of bytes without the int3 and the second set of bytes with the update
and go boom!

This dance was to make sure everything sees everything properly. I gave a
talk about this at Kernel-Recipes in 2019:

  https://www.slideshare.net/slideshow/kernel-recipes-2019-ftrace-where-modifying-a-running-kernel-all-started/177509633#44

-- Steve
Jiri Olsa Jan. 15, 2025, 6:26 p.m. UTC | #5
On Tue, Jan 14, 2025 at 10:36:04AM -0500, Steven Rostedt wrote:
> On Tue, 14 Jan 2025 15:31:14 +0100
> Jiri Olsa <olsajiri@gmail.com> wrote:
> 
> > > IIRC this is the magic recipe blessed by both Intel and AMD, and
> > > if we're going to be changing this I would want both vendors to sign off
> > > on that.  
> > 
> > ok
> 
> Right. In fact Intel wouldn't sign off on this recipe for a few years. We
> actually added to the kernel before they gave their full blessing. I got a
> "wink, it should work" from them but they wouldn't officially say so ;-)
> 
> But a lot of it has to do with all the magic of the CPU. They have always
> allowed writing the one byte int3. I figured, if I could write that one
> byte int3 then run a sync on all CPUs where all CPUs see that change, then
> nothing should ever care about the other 4 bytes after that int3 (a sync
> was already done). Then change the 4 bytes and sync again.
> 
> I doubt the int3 plus the 4 byte change would work, as was mentioned if the
> other 4 bytes were on another cache line, another CPU could read the first
> set of bytes without the int3 and the second set of bytes with the update
> and go boom!
> 
> This dance was to make sure everything sees everything properly. I gave a
> talk about this at Kernel-Recipes in 2019:
> 
>   https://www.slideshare.net/slideshow/kernel-recipes-2019-ftrace-where-modifying-a-running-kernel-all-started/177509633#44

nice! thanks for all the details,
jirka
Masami Hiramatsu (Google) Jan. 16, 2025, 5:57 a.m. UTC | #6
On Tue, 14 Jan 2025 15:02:37 +0100
Jiri Olsa <jolsa@kernel.org> wrote:

> hi,
> while checking on similar code for uprobes I was wondering if we
> can merge first 2 steps of instruction update in text_poke_bp_batch
> function.
> 
> Basically the first step now would be to write int3 byte together
> with the rest of the bytes of the new instruction instead of doing
> that separately. And the second step would be to overwrite int3
> byte with first byte of the new instruction.
> 
> Would that work or do I miss some x86 detail that could lead to crash?

I agree with Peterz and David. My original idea is that the putting
int3 is safe anyway because it is just 1 byte. Then we can update
following bytes (only after we ensure no one executing(e.g. interrupted)
that part). The another good point of int3 is that can avoid writing
over cache-line boundary because it is 1 byte.

Without this int3 detour, it is possible to see half-way updated
instruction from some other CPU cores :( 

Thank you,
Jiri Olsa Jan. 16, 2025, 11:48 a.m. UTC | #7
On Tue, Jan 14, 2025 at 02:38:42PM +0000, David Laight wrote:
> From: Jiri Olsa
> > Sent: 14 January 2025 14:03
> > 
> > hi,
> > while checking on similar code for uprobes I was wondering if we
> > can merge first 2 steps of instruction update in text_poke_bp_batch
> > function.
> > 
> > Basically the first step now would be to write int3 byte together
> > with the rest of the bytes of the new instruction instead of doing
> > that separately. And the second step would be to overwrite int3
> > byte with first byte of the new instruction.
> > 
> > Would that work or do I miss some x86 detail that could lead to crash?
> 
> I suspect it will 'crash and burn'.
> 
> Consider what happens if there is a cache-line boundary in the
> middle of an instruction.
> (Actually an instruction fetch boundary will do.)
> 
> cpu0: reads the old instructions from the old cache line.
> cpu0: pipeline busy (or similar) so doesn't read the next cache line.
> cpu1: writes the new instructions.
> cpu0: reads the second cache line.
> 
> cpu0 now has a mix of the old and new instruction bytes.
> 
> Writing the int3 is safe - provided they don't return until
> all the patching is over.
> 
> But between writing the int3 (over the first opcode byte) and
> updating anything else I suspect you need something that does
> a complete synchronise between the cpu that discards any bytes
> in the decode pipeline as well as flushing the I-cache (etc).
> I suspect that requires an acked IPI.
> 
> Very long cpu stalls are easy to generate.
> Any read from PCIe will be slow (I've at fpga target that takes ~1us).
> You'd need to be unlucky to be patching an instruction while one
> was pending, but a DMA access might just be enough to cause grief.

ok, thanks for all the details,

jirka
diff mbox series

Patch

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 243843e44e89..99830e9cd641 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2125,9 +2125,9 @@  struct text_poke_loc {
 	s32 disp;
 	u8 len;
 	u8 opcode;
-	const u8 text[POKE_MAX_OPCODE_SIZE];
+	u8 text[POKE_MAX_OPCODE_SIZE];
 	/* see text_poke_bp_batch() */
-	u8 old;
+	u8 first;
 };
 
 struct bp_patching_desc {
@@ -2282,7 +2282,6 @@  static int tp_vec_nr;
  */
 static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
-	unsigned char int3 = INT3_INSN_OPCODE;
 	unsigned int i;
 	int do_sync;
 
@@ -2313,29 +2312,20 @@  static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 */
 	smp_wmb();
 
-	/*
-	 * First step: add a int3 trap to the address that will be patched.
-	 */
-	for (i = 0; i < nr_entries; i++) {
-		tp[i].old = *(u8 *)text_poke_addr(&tp[i]);
-		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
-	}
-
-	text_poke_sync();
-
 	/*
 	 * Second step: update all but the first byte of the patched range.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
-		u8 old[POKE_MAX_OPCODE_SIZE+1] = { tp[i].old, };
+		u8 old[POKE_MAX_OPCODE_SIZE+1];
 		u8 _new[POKE_MAX_OPCODE_SIZE+1];
 		const u8 *new = tp[i].text;
 		int len = tp[i].len;
 
+		tp[i].first = tp[i].text[0];
+		tp[i].text[0] = (u8) INT3_INSN_OPCODE;
+
 		if (len - INT3_INSN_SIZE > 0) {
-			memcpy(old + INT3_INSN_SIZE,
-			       text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
-			       len - INT3_INSN_SIZE);
+			memcpy(old, text_poke_addr(&tp[i]), len);
 
 			if (len == 6) {
 				_new[0] = 0x0f;
@@ -2343,9 +2333,7 @@  static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 				new = _new;
 			}
 
-			text_poke(text_poke_addr(&tp[i]) + INT3_INSN_SIZE,
-				  new + INT3_INSN_SIZE,
-				  len - INT3_INSN_SIZE);
+			text_poke(text_poke_addr(&tp[i]), new, len);
 
 			do_sync++;
 		}
@@ -2391,7 +2379,7 @@  static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * replacing opcode.
 	 */
 	for (do_sync = 0, i = 0; i < nr_entries; i++) {
-		u8 byte = tp[i].text[0];
+		u8 byte = tp[i].first;
 
 		if (tp[i].len == 6)
 			byte = 0x0f;