
riscv: Fix text patching when icache flushes use IPIs

Message ID 20240206204607.527195-1-alexghiti@rivosinc.com (mailing list archive)
State New
Series riscv: Fix text patching when icache flushes use IPIs

Checks

Context Check Description
conchuod/vmtest-for-next-PR success PR summary
conchuod/patch-1-test-1 success .github/scripts/patches/tests/build_rv32_defconfig.sh
conchuod/patch-1-test-2 success .github/scripts/patches/tests/build_rv64_clang_allmodconfig.sh
conchuod/patch-1-test-3 success .github/scripts/patches/tests/build_rv64_gcc_allmodconfig.sh
conchuod/patch-1-test-4 success .github/scripts/patches/tests/build_rv64_nommu_k210_defconfig.sh
conchuod/patch-1-test-5 success .github/scripts/patches/tests/build_rv64_nommu_virt_defconfig.sh
conchuod/patch-1-test-6 warning .github/scripts/patches/tests/checkpatch.sh
conchuod/patch-1-test-7 success .github/scripts/patches/tests/dtb_warn_rv64.sh
conchuod/patch-1-test-8 success .github/scripts/patches/tests/header_inline.sh
conchuod/patch-1-test-9 success .github/scripts/patches/tests/kdoc.sh
conchuod/patch-1-test-10 success .github/scripts/patches/tests/module_param.sh
conchuod/patch-1-test-11 success .github/scripts/patches/tests/verify_fixes.sh
conchuod/patch-1-test-12 success .github/scripts/patches/tests/verify_signedoff.sh

Commit Message

Alexandre Ghiti Feb. 6, 2024, 8:46 p.m. UTC
For now, we use stop_machine() to patch the text, and when IPIs are used
for remote icache flushes, the system hangs since irqs are disabled on all
cpus.

So instead, make sure every cpu executes the stop_machine() patching
function and emits a local icache flush itself, which avoids the use of
IPIs.

Co-developed-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 arch/riscv/include/asm/patch.h |  1 +
 arch/riscv/kernel/ftrace.c     | 38 ++++++++++++++++++++++++++++++----
 arch/riscv/kernel/patch.c      | 13 ++++++-------
 3 files changed, 41 insertions(+), 11 deletions(-)

Comments

Björn Töpel Feb. 7, 2024, 7:30 a.m. UTC | #1
Alexandre Ghiti <alexghiti@rivosinc.com> writes:

> For now, we use stop_machine() to patch the text, and when IPIs are used
> for remote icache flushes, the system hangs since irqs are disabled on all
> cpus.
>
> So instead, make sure every cpu executes the stop_machine() patching
> function and emits a local icache flush itself, which avoids the use of
> IPIs.
>
> Co-developed-by: Björn Töpel <bjorn@rivosinc.com>
> Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>

FWIW, the BPF selftests pass nicely with this (especially the
fentry/fexit tests ;-)). I don't know if it's worth much saying that
your own stuff was tested, but here goes:

Tested-by: Björn Töpel <bjorn@rivosinc.com>
Andrea Parri Feb. 8, 2024, 11:42 a.m. UTC | #2
> +static int __ftrace_modify_code(void *data)
> +{
> +	struct ftrace_modify_param *param = data;
> +
> +	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> +		ftrace_modify_all_code(param->command);
> +		atomic_inc(&param->cpu_count);

I stared at ftrace_modify_all_code() for a bit but honestly I don't see
what prevents the ->cpu_count increment from being reordered before the
insn write(s) (architecturally) now that you have removed the IPI dance:
perhaps add an smp_wmb() right before the atomic_inc() (or promote this
latter to a (void)atomic_inc_return_release()) and/or an inline comment
saying why such reordering is not possible?


> +	} else {
> +		while (atomic_read(&param->cpu_count) <= num_online_cpus())
> +			cpu_relax();
> +		smp_mb();

I see that you've lifted/copied the memory barrier from patch_text_cb():
what's its point?  AFAIU, the barrier has no ordering effect on program
order later insn fetches; perhaps the code was based on some legacy/old
version of Zifencei?  IAC, comments, comments, ... or maybe just remove
that memory barrier?


> +	}
> +
> +	local_flush_icache_all();
> +
> +	return 0;
> +}

[...]


> @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
>  	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
>  		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
>  			len = GET_INSN_LENGTH(patch->insns[i]);
> -			ret = patch_text_nosync(patch->addr + i * len,
> -						&patch->insns[i], len);
> +			ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
>  		}
>  		atomic_inc(&patch->cpu_count);
>  	} else {
> @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
>  		smp_mb();
>  	}
>  
> +	local_flush_icache_all();
> +
>  	return ret;
>  }
>  NOKPROBE_SYMBOL(patch_text_cb);

My above remarks/questions also apply to this function.


On a last topic, although somewhat orthogonal to the scope of this patch,
I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
correct: I can see why we may want (need) to do the local TLB flush
before returning from patch_{map,unmap}(), but does a local flush suffice?
For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
sequence in their unmapping stage (and apparently relying on "no caching
of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
(riscv's) TLB invalidations will necessarily introduce some complexity...

Thoughts?

  Andrea
Alexandre Ghiti Feb. 8, 2024, 1:23 p.m. UTC | #3
Hi Andrea,

On Thu, Feb 8, 2024 at 12:42 PM Andrea Parri <parri.andrea@gmail.com> wrote:
>
> > +static int __ftrace_modify_code(void *data)
> > +{
> > +     struct ftrace_modify_param *param = data;
> > +
> > +     if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
> > +             ftrace_modify_all_code(param->command);
> > +             atomic_inc(&param->cpu_count);
>
> I stared at ftrace_modify_all_code() for a bit but honestly I don't see
> what prevents the ->cpu_count increment from being reordered before the
> insn write(s) (architecturally) now that you have removed the IPI dance:
> perhaps add an smp_wmb() right before the atomic_inc() (or promote this
> latter to a (void)atomic_inc_return_release()) and/or an inline comment
> saying why such reordering is not possible?

I did not even think of that, and it actually makes sense so I'll go
with what you propose: I'll replace atomic_inc() with
atomic_inc_return_release(). And I'll add the following comment if
that's ok with you:

"Make sure the patching store is effective *before* we increment the
counter which releases all waiting cpus"
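Something like this, then (just a sketch to make sure we agree, not what I
posted; I'm leaving out the waiter-side smp_mb() here since that's your
next point below):

/* Sketch only: same structure as the posted patch, with the release increment. */
static int __ftrace_modify_code(void *data)
{
	struct ftrace_modify_param *param = data;

	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
		ftrace_modify_all_code(param->command);
		/*
		 * Make sure the patching store is effective *before* we
		 * increment the counter which releases all waiting cpus.
		 */
		(void)atomic_inc_return_release(&param->cpu_count);
	} else {
		/* Wait for the patching cpu's releasing increment. */
		while (atomic_read(&param->cpu_count) <= num_online_cpus())
			cpu_relax();
	}

	/* Every cpu runs this, so no IPI-based remote icache flush is needed. */
	local_flush_icache_all();

	return 0;
}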

>
>
> > +     } else {
> > +             while (atomic_read(&param->cpu_count) <= num_online_cpus())
> > +                     cpu_relax();
> > +             smp_mb();
>
> I see that you've lifted/copied the memory barrier from patch_text_cb():
> what's its point?  AFAIU, the barrier has no ordering effect on program
> order later insn fetches; perhaps the code was based on some legacy/old
> version of Zifencei?  IAC, comments, comments, ... or maybe just remove
> that memory barrier?

Honestly, I looked at it one minute, did not understand its purpose
and said to myself "ok that can't hurt anyway, I may be missing
something".

FWIW,  I see that arm64 uses isb() here. If you don't see its purpose,
I'll remove it (here and where I copied it).

>
>
> > +     }
> > +
> > +     local_flush_icache_all();
> > +
> > +     return 0;
> > +}
>
> [...]
>
>
> > @@ -232,8 +230,7 @@ static int patch_text_cb(void *data)
> >       if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
> >               for (i = 0; ret == 0 && i < patch->ninsns; i++) {
> >                       len = GET_INSN_LENGTH(patch->insns[i]);
> > -                     ret = patch_text_nosync(patch->addr + i * len,
> > -                                             &patch->insns[i], len);
> > +                     ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
> >               }
> >               atomic_inc(&patch->cpu_count);
> >       } else {
> > @@ -242,6 +239,8 @@ static int patch_text_cb(void *data)
> >               smp_mb();
> >       }
> >
> > +     local_flush_icache_all();
> > +
> >       return ret;
> >  }
> >  NOKPROBE_SYMBOL(patch_text_cb);
>
> My above remarks/questions also apply to this function.
>
>
> On a last topic, although somewhat orthogonal to the scope of this patch,
> I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> correct: I can see why we may want (need) to do the local TLB flush
> before returning from patch_{map,unmap}(), but does a local flush suffice?
> For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> sequence in their unmapping stage (and apparently relying on "no caching
> of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> (riscv's) TLB invalidations will necessarily introduce some complexity...
>
> Thoughts?

To avoid remote TLBI, could we simply disable preemption before
the first patch_map()? arm64 disables irqs, which seems
overkill to me, but maybe I'm missing something again?
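
Roughly something like the below (only a sketch of the idea, not tested: it
reuses the existing patch_map()/patch_unmap() helpers and the FIX_TEXT_POKE0
fixmap slot from patch.c, and it assumes the patched range fits in a single
page, whereas the real patch_insn_write() also handles the FIX_TEXT_POKE1
case):

/* Sketch only: single-page case, the real code also uses FIX_TEXT_POKE1. */
static int patch_insn_write(void *addr, const void *insn, size_t len)
{
	void *waddr;
	int ret;

	/*
	 * Stay on the same hart while the fixmap mapping is live so that
	 * the local TLB flush performed by patch_unmap() is guaranteed to
	 * run on the hart that may have cached the mapping.
	 */
	preempt_disable();
	waddr = patch_map(addr, FIX_TEXT_POKE0);
	ret = copy_to_kernel_nofault(waddr, insn, len);
	patch_unmap(FIX_TEXT_POKE0);
	preempt_enable();

	return ret;
}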

Thanks for your comments Andrea,

Alex

>
>   Andrea
Andrea Parri Feb. 8, 2024, 2:14 p.m. UTC | #4
> I did not even think of that, and it actually makes sense so I'll go
> with what you propose: I'll replace atomic_inc() with
> atomic_inc_return_release(). And I'll add the following comment if
> that's ok with you:
> 
> "Make sure the patching store is effective *before* we increment the
> counter which releases all waiting cpus"

Yes, this sounds good to me.


> Honestly, I looked at it one minute, did not understand its purpose
> and said to myself "ok that can't hurt anyway, I may be missing
> something".
> 
> FWIW,  I see that arm64 uses isb() here. If you don't see its purpose,
> I'll remove it (here and where I copied it).

Removing the smp_mb() (and keeping the local_flush_icache_all()) seems
fine to me; thanks for the confirmation.


> > On a last topic, although somewhat orthogonal to the scope of this patch,
> > I'm not sure the patch_{map,unmap}() dance in our patch_insn_write() is
> > correct: I can see why we may want (need) to do the local TLB flush
> > before returning from patch_{map,unmap}(), but does a local flush suffice?
> > For comparison, arm64 seems to go through a complete dsb-tlbi-dsb(-isb)
> > sequence in their unmapping stage (and apparently relying on "no caching
> > of invalid ptes" in their mapping stage).  Of course, "broadcasting" our
> > (riscv's) TLB invalidations will necessarily introduce some complexity...
> >
> > Thoughts?
> 
> To avoid remote TLBI, could we simply disable preemption before
> the first patch_map()? arm64 disables irqs, which seems
> overkill to me, but maybe I'm missing something again?

Mmh, I'm afraid this will require more thinking/probing on my end (not
really "the expert" of the codebase at stake...).  Maybe the ftrace
reviewers will provide further ideas/suggestions for us to brainstorm.

  Andrea

Patch

diff --git a/arch/riscv/include/asm/patch.h b/arch/riscv/include/asm/patch.h
index e88b52d39eac..9f5d6e14c405 100644
--- a/arch/riscv/include/asm/patch.h
+++ b/arch/riscv/include/asm/patch.h
@@ -6,6 +6,7 @@ 
 #ifndef _ASM_RISCV_PATCH_H
 #define _ASM_RISCV_PATCH_H
 
+int patch_insn_write(void *addr, const void *insn, size_t len);
 int patch_text_nosync(void *addr, const void *insns, size_t len);
 int patch_text_set_nosync(void *addr, u8 c, size_t len);
 int patch_text(void *addr, u32 *insns, int ninsns);
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f5aa24d9e1c1..1694a1861d1e 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -8,6 +8,7 @@ 
 #include <linux/ftrace.h>
 #include <linux/uaccess.h>
 #include <linux/memory.h>
+#include <linux/stop_machine.h>
 #include <asm/cacheflush.h>
 #include <asm/patch.h>
 
@@ -75,8 +76,7 @@  static int __ftrace_modify_call(unsigned long hook_pos, unsigned long target,
 		make_call_t0(hook_pos, target, call);
 
 	/* Replace the auipc-jalr pair at once. Return -EPERM on write error. */
-	if (patch_text_nosync
-	    ((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
+	if (patch_insn_write((void *)hook_pos, enable ? call : nops, MCOUNT_INSN_SIZE))
 		return -EPERM;
 
 	return 0;
@@ -88,7 +88,7 @@  int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
 
 	make_call_t0(rec->ip, addr, call);
 
-	if (patch_text_nosync((void *)rec->ip, call, MCOUNT_INSN_SIZE))
+	if (patch_insn_write((void *)rec->ip, call, MCOUNT_INSN_SIZE))
 		return -EPERM;
 
 	return 0;
@@ -99,7 +99,7 @@  int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
 {
 	unsigned int nops[2] = {NOP4, NOP4};
 
-	if (patch_text_nosync((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
+	if (patch_insn_write((void *)rec->ip, nops, MCOUNT_INSN_SIZE))
 		return -EPERM;
 
 	return 0;
@@ -134,6 +134,36 @@  int ftrace_update_ftrace_func(ftrace_func_t func)
 
 	return ret;
 }
+
+struct ftrace_modify_param {
+	int command;
+	atomic_t cpu_count;
+};
+
+static int __ftrace_modify_code(void *data)
+{
+	struct ftrace_modify_param *param = data;
+
+	if (atomic_inc_return(&param->cpu_count) == num_online_cpus()) {
+		ftrace_modify_all_code(param->command);
+		atomic_inc(&param->cpu_count);
+	} else {
+		while (atomic_read(&param->cpu_count) <= num_online_cpus())
+			cpu_relax();
+		smp_mb();
+	}
+
+	local_flush_icache_all();
+
+	return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+	struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
+
+	stop_machine(__ftrace_modify_code, &param, cpu_online_mask);
+}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 37e87fdcf6a0..ec7760a4d6cd 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -188,7 +188,7 @@  int patch_text_set_nosync(void *addr, u8 c, size_t len)
 }
 NOKPROBE_SYMBOL(patch_text_set_nosync);
 
-static int patch_insn_write(void *addr, const void *insn, size_t len)
+int patch_insn_write(void *addr, const void *insn, size_t len)
 {
 	size_t patched = 0;
 	size_t size;
@@ -211,11 +211,9 @@  NOKPROBE_SYMBOL(patch_insn_write);
 
 int patch_text_nosync(void *addr, const void *insns, size_t len)
 {
-	u32 *tp = addr;
 	int ret;
 
-	ret = patch_insn_write(tp, insns, len);
-
+	ret = patch_insn_write(addr, insns, len);
 	if (!ret)
-		flush_icache_range((uintptr_t) tp, (uintptr_t) tp + len);
+		flush_icache_range((uintptr_t) addr, (uintptr_t) addr + len);
 
@@ -232,8 +230,7 @@  static int patch_text_cb(void *data)
 	if (atomic_inc_return(&patch->cpu_count) == num_online_cpus()) {
 		for (i = 0; ret == 0 && i < patch->ninsns; i++) {
 			len = GET_INSN_LENGTH(patch->insns[i]);
-			ret = patch_text_nosync(patch->addr + i * len,
-						&patch->insns[i], len);
+			ret = patch_insn_write(patch->addr + i * len, &patch->insns[i], len);
 		}
 		atomic_inc(&patch->cpu_count);
 	} else {
@@ -242,6 +239,8 @@  static int patch_text_cb(void *data)
 		smp_mb();
 	}
 
+	local_flush_icache_all();
+
 	return ret;
 }
 NOKPROBE_SYMBOL(patch_text_cb);