
[bpf-next,v3,2/5] bpf: implement sleepable uprobes by chaining gps

Message ID 1b9c462226d2d7b97293e19ed2d578eb573a4544.1652404870.git.delyank@fb.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series sleepable uprobe support

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success
netdev/cover_letter success Series has a cover letter
netdev/patch_count success
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 1469 this patch: 1470
netdev/cc_maintainers warning 8 maintainers not CCed: netdev@vger.kernel.org kafai@fb.com john.fastabend@gmail.com yhs@fb.com kpsingh@kernel.org songliubraving@fb.com rostedt@goodmis.org mingo@redhat.com
netdev/build_clang success Errors and warnings before: 174 this patch: 174
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 1476 this patch: 1477
netdev/checkpatch warning WARNING: line length of 89 exceeds 80 columns
netdev/kdoc fail Errors and warnings before: 0 this patch: 4
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest + selftests
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on z15 + selftests

Commit Message

Delyan Kratunov May 13, 2022, 1:22 a.m. UTC
uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault()/might_sleep() context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.
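
For orientation, a condensed sketch of that path, based on
kernel/events/uprobes.c; the utask/single-step handling and the
MMF_HAS_UPROBES checks are elided:

/* Breakpoint notifier: runs from the trap handler, in interrupt
 * context. It only marks the task and defers the real work.
 */
int uprobe_pre_sstep_notifier(struct pt_regs *regs)
{
	if (!current->mm)
		return 0;

	set_thread_flag(TIF_UPROBE);
	return 1;
}

/* Runs on the way back to userspace, in full task context, so the
 * consumer handlers (including bpf programs) may fault and sleep here.
 */
void uprobe_notify_resume(struct pt_regs *regs)
{
	clear_thread_flag(TIF_UPROBE);
	handle_swbp(regs);	/* invokes the uprobe consumer handlers */
}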

Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).
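
On the read side, that protection looks like this (excerpted from the
new helper in the patch below):

	rcu_read_lock_trace();
	array = rcu_dereference_check(array_rcu, rcu_read_lock_trace_held());
	...
	rcu_read_unlock_trace();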

Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.
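
The shape of that chain, condensed from the kernel/bpf/core.c hunk
below:

static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	/* The tasks_trace grace period has elapsed; reuse the same
	 * rcu_head to chain a normal rcu grace period before the free.
	 */
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
}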

Users who iterate under a tasks_trace read section are safe, as are
users who iterate under normal rcu read sections (from non-sleepable
locations): the array is not freed until both grace periods have
elapsed.

The downside is that the tasks_trace grace-period latency now affects
the freeing of all perf_event-attached bpf programs (not just uprobe
ones). This is deemed acceptable given the expected attach/detach rates
for kprobe/uprobe/tp programs.

Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_prog_run_array_sleepable now conditionally
takes an rcu read section, in addition to the overarching tasks_trace
section.
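
Concretely (the full function is in the include/linux/bpf.h hunk
below), each non-sleepable program is bracketed by a normal rcu read
section nested inside the overarching tasks_trace one:

	while ((prog = READ_ONCE(item->prog))) {
		if (!prog->aux->sleepable)
			rcu_read_lock();

		run_ctx.bpf_cookie = item->bpf_cookie;
		ret &= run_prog(prog, ctx);
		item++;

		if (!prog->aux->sleepable)
			rcu_read_unlock();
	}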

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 include/linux/bpf.h         | 53 +++++++++++++++++++++++++++++++++++++
 kernel/bpf/core.c           | 15 +++++++++++
 kernel/trace/bpf_trace.c    |  4 +--
 kernel/trace/trace_uprobe.c |  5 ++--
 4 files changed, 72 insertions(+), 5 deletions(-)

Comments

Daniel Borkmann May 13, 2022, 4 p.m. UTC | #1
On 5/13/22 3:22 AM, Delyan Kratunov wrote:
[...]
>   struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
>   void bpf_prog_array_free(struct bpf_prog_array *progs);
> +/* Use when traversal over the bpf_prog_array uses tasks_trace rcu */
> +void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs);
>   int bpf_prog_array_length(struct bpf_prog_array *progs);
>   bool bpf_prog_array_is_empty(struct bpf_prog_array *array);
>   int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
> @@ -1451,6 +1454,56 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>   	return ret;
>   }
>   
> +/**
> + * Notes on RCU design for bpf_prog_arrays containing sleepable programs:
> + *
> + * We use the tasks_trace rcu flavor read section to protect the bpf_prog_array
> + * overall. As a result, we must use the bpf_prog_array_free_sleepable
> + * in order to use the tasks_trace rcu grace period.
> + *
> + * When a non-sleepable program is inside the array, we take the rcu read
> + * section and disable preemption for that program alone, so it can access
> + * rcu-protected dynamically sized maps.
> + */

Btw, there are a number of kdoc warnings around your series, pls make sure to
fix or use 'regular' comment:

https://patchwork.hopto.org/static/nipa/641204/12848281/kdoc/stderr
https://patchwork.hopto.org/static/nipa/641204/12848282/kdoc/stderr

I presume otherwise kbuild bot will yell at some point.

Thanks,
Daniel
Delyan Kratunov May 13, 2022, 5:05 p.m. UTC | #2
On Fri, 2022-05-13 at 18:00 +0200, Daniel Borkmann wrote:
> On 5/13/22 3:22 AM, Delyan Kratunov wrote:
> [...]
> >   struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
> >   void bpf_prog_array_free(struct bpf_prog_array *progs);
> > +/* Use when traversal over the bpf_prog_array uses tasks_trace rcu */
> > +void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs);
> >   int bpf_prog_array_length(struct bpf_prog_array *progs);
> >   bool bpf_prog_array_is_empty(struct bpf_prog_array *array);
> >   int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
> > @@ -1451,6 +1454,56 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
> >   	return ret;
> >   }
> >   
> > +/**
> > + * Notes on RCU design for bpf_prog_arrays containing sleepable programs:
> > + *
> > + * We use the tasks_trace rcu flavor read section to protect the bpf_prog_array
> > + * overall. As a result, we must use the bpf_prog_array_free_sleepable
> > + * in order to use the tasks_trace rcu grace period.
> > + *
> > + * When a non-sleepable program is inside the array, we take the rcu read
> > + * section and disable preemption for that program alone, so it can access
> > + * rcu-protected dynamically sized maps.
> > + */
> 
> Btw, there are a number of kdoc warnings around your series, pls make sure to
> fix or use 'regular' comment:
> 
> https://patchwork.hopto.org/static/nipa/641204/12848281/kdoc/stderr 
> https://patchwork.hopto.org/static/nipa/641204/12848282/kdoc/stderr 

Yeah, I just saw these too, I'll take care of them before the next reroll. Let's see
if there are any other high-level comments first.

-- Delyan
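
(For readers following along: the kdoc warnings come from
scripts/kernel-doc. A comment opened with "/**" is treated as
kernel-doc and is then expected to follow the "name - description"
format. Since the note above bpf_prog_run_array_sleepable() is a
free-form design note, one straightforward fix is to open it as a
regular comment instead:)

-/**
+/*
  * Notes on RCU design for bpf_prog_arrays containing sleepable programs: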

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b67893b47da4..77ba90d654e7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -26,6 +26,7 @@ 
 #include <linux/stddef.h>
 #include <linux/bpfptr.h>
 #include <linux/btf.h>
+#include <linux/rcupdate_trace.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -1360,6 +1361,8 @@  extern struct bpf_empty_prog_array bpf_empty_prog_array;
 
 struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags);
 void bpf_prog_array_free(struct bpf_prog_array *progs);
+/* Use when traversal over the bpf_prog_array uses tasks_trace rcu */
+void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs);
 int bpf_prog_array_length(struct bpf_prog_array *progs);
 bool bpf_prog_array_is_empty(struct bpf_prog_array *array);
 int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs,
@@ -1451,6 +1454,56 @@  bpf_prog_run_array(const struct bpf_prog_array *array,
 	return ret;
 }
 
+/**
+ * Notes on RCU design for bpf_prog_arrays containing sleepable programs:
+ *
+ * We use the tasks_trace rcu flavor read section to protect the bpf_prog_array
+ * overall. As a result, we must use the bpf_prog_array_free_sleepable
+ * in order to use the tasks_trace rcu grace period.
+ *
+ * When a non-sleepable program is inside the array, we take the rcu read
+ * section and disable preemption for that program alone, so it can access
+ * rcu-protected dynamically sized maps.
+ */
+static __always_inline u32
+bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
+			     const void *ctx, bpf_prog_run_fn run_prog)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	const struct bpf_prog_array *array;
+	struct bpf_run_ctx *old_run_ctx;
+	struct bpf_trace_run_ctx run_ctx;
+	u32 ret = 1;
+
+	might_fault();
+
+	rcu_read_lock_trace();
+	migrate_disable();
+
+	array = rcu_dereference_check(array_rcu, rcu_read_lock_trace_held());
+	if (unlikely(!array))
+		goto out;
+	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
+	item = &array->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		if (!prog->aux->sleepable)
+			rcu_read_lock();
+
+		run_ctx.bpf_cookie = item->bpf_cookie;
+		ret &= run_prog(prog, ctx);
+		item++;
+
+		if (!prog->aux->sleepable)
+			rcu_read_unlock();
+	}
+	bpf_reset_run_ctx(old_run_ctx);
+out:
+	migrate_enable();
+	rcu_read_unlock_trace();
+	return ret;
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 extern struct mutex bpf_stats_enabled_mutex;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 76f68d0a7ae8..9c2175b06b38 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2268,6 +2268,21 @@  void bpf_prog_array_free(struct bpf_prog_array *progs)
 	kfree_rcu(progs, rcu);
 }
 
+static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
+{
+	struct bpf_prog_array *progs;
+
+	progs = container_of(rcu, struct bpf_prog_array, rcu);
+	kfree_rcu(progs, rcu);
+}
+
+void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
+{
+	if (!progs || progs == &bpf_empty_prog_array.hdr)
+		return;
+	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
+}
+
 int bpf_prog_array_length(struct bpf_prog_array *array)
 {
 	struct bpf_prog_array_item *item;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7141ca8a1c2d..f74c53dba64e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1934,7 +1934,7 @@  int perf_event_attach_bpf_prog(struct perf_event *event,
 	event->prog = prog;
 	event->bpf_cookie = bpf_cookie;
 	rcu_assign_pointer(event->tp_event->prog_array, new_array);
-	bpf_prog_array_free(old_array);
+	bpf_prog_array_free_sleepable(old_array);
 
 unlock:
 	mutex_unlock(&bpf_event_mutex);
@@ -1960,7 +1960,7 @@  void perf_event_detach_bpf_prog(struct perf_event *event)
 		bpf_prog_array_delete_safe(old_array, event->prog);
 	} else {
 		rcu_assign_pointer(event->tp_event->prog_array, new_array);
-		bpf_prog_array_free(old_array);
+		bpf_prog_array_free_sleepable(old_array);
 	}
 
 	bpf_prog_put(event->prog);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9711589273cd..0282c119b1b2 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -16,6 +16,7 @@ 
 #include <linux/namei.h>
 #include <linux/string.h>
 #include <linux/rculist.h>
+#include <linux/filter.h>
 
 #include "trace_dynevent.h"
 #include "trace_probe.h"
@@ -1346,9 +1347,7 @@  static void __uprobe_perf_func(struct trace_uprobe *tu,
 	if (bpf_prog_array_valid(call)) {
 		u32 ret;
 
-		preempt_disable();
-		ret = trace_call_bpf(call, regs);
-		preempt_enable();
+		ret = bpf_prog_run_array_sleepable(call->prog_array, regs, bpf_prog_run);
 		if (!ret)
 			return;
 	}