Message ID | 972caeb1e9338721bb719b118e0e40705f860f50.1651103126.git.delyank@fb.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | sleepable uprobe support |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Clearly marked for bpf-next, async |
netdev/apply | fail | Patch does not apply to bpf-next |
bpf/vmtest-bpf-next-PR | fail | merge-conflict |
On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
>
> uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
>
> Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
>
> One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
>
> Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.

One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.

So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?

Otherwise all looks good and surprisingly straightforward thanks to the fact that the uprobe is already running in a sleepable context, awesome!

> The other interesting implication is wrt non-sleepable uprobe programs. Because they need access to dynamically sized rcu-protected maps, we conditionally disable preemption and take an rcu read section around them, in addition to the overarching tasks_trace section.
>
> Signed-off-by: Delyan Kratunov <delyank@fb.com>
> ---
>  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
>  include/linux/trace_events.h |  1 +
>  kernel/bpf/core.c            | 10 +++++-
>  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
>  kernel/trace/trace_uprobe.c  |  4 +--
>  5 files changed, 94 insertions(+), 4 deletions(-)
>
[...]
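For concreteness, the separate variant Andrii describes could look like the following. This is a minimal sketch, not the posted patch: it reuses the chained-grace-period logic already in this patch's free path (call_rcu_tasks_trace() followed by kfree_rcu() on the same rcu_head), but keeps it out of bpf_prog_array_free() so non-sleepable users such as cgroup effective arrays keep the plain kfree_rcu() path. The callback name is illustrative.

```c
/* Sketch only, not the posted patch: a standalone sleepable-aware free,
 * with bpf_prog_array_free() itself left as plain kfree_rcu().
 */
static void __bpf_prog_array_free_sleepable_cb(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	/* Second grace period: normal RCU, covering non-sleepable iteration. */
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free_sleepable(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	/* First grace period: RCU tasks trace, covering sleepable iteration. */
	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_sleepable_cb);
}
```

Only the uprobe attach/detach paths in kernel/trace/bpf_trace.c would call this variant; every other bpf_prog_array user keeps paying a single, normal RCU grace period.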
On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
> >
> > Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
> >
> > One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
> >
> > Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.
>
> One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.
>
> So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?

Honestly, I don't like the idea of having two different APIs, where if you use the wrong one, the program would only fail in rare and undebuggable circumstances.

If we need specialization (and I'm not convinced we do - what's the rate of cgroup creation that we can sustain?), we should track that in the bpf_prog_array itself. We can have the allocation path set a flag and branch on it in free() to determine the necessary grace periods.

> Otherwise all looks good and surprisingly straightforward thanks to the fact that the uprobe is already running in a sleepable context, awesome!
>
> > The other interesting implication is wrt non-sleepable uprobe programs. Because they need access to dynamically sized rcu-protected maps, we conditionally disable preemption and take an rcu read section around them, in addition to the overarching tasks_trace section.
> >
> > Signed-off-by: Delyan Kratunov <delyank@fb.com>
> > ---
> >  include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
> >  include/linux/trace_events.h |  1 +
> >  kernel/bpf/core.c            | 10 +++++-
> >  kernel/trace/bpf_trace.c     | 23 ++++++++++++++
> >  kernel/trace/trace_uprobe.c  |  4 +--
> >  5 files changed, 94 insertions(+), 4 deletions(-)
> >
[...]
On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
> > >
> > > Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
> > >
> > > One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
> > >
> > > Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.
> >
> > One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.
> >
> > So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?
>
> Honestly, I don't like the idea of having two different APIs, where if you use the wrong one, the program would only fail in rare and undebuggable circumstances.
>
> If we need specialization (and I'm not convinced we do - what's the rate of cgroup creation that we can sustain?), we should track that in the bpf_prog_array itself. We can have the allocation path set a flag and branch on it in free() to determine the necessary grace periods.

I think what Andrii is proposing is to leave bpf_prog_array_free() as-is and introduce new bpf_prog_array_free_sleepable() (the way it is in this patch) and call it from 2 places in kernel/trace/bpf_trace.c only. This way cgroup won't be affected.

The rcu_trace protection here applies to prog_array itself. Normal progs are still rcu, sleepable progs are rcu_trace. Regardless whether they're called via trampoline or this new prog_array.
On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > >
> > > > uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
> > > >
> > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
> > > >
> > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
> > > >
> > > > Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.
> > >
> > > One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.
> > >
> > > So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?
> >
> > Honestly, I don't like the idea of having two different APIs, where if you use the wrong one, the program would only fail in rare and undebuggable circumstances.
> >
> > If we need specialization (and I'm not convinced we do - what's the rate of cgroup creation that we can sustain?), we should track that in the bpf_prog_array itself. We can have the allocation path set a flag and branch on it in free() to determine the necessary grace periods.
>
> I think what Andrii is proposing is to leave bpf_prog_array_free() as-is and introduce new bpf_prog_array_free_sleepable() (the way it is in this patch) and call it from 2 places in kernel/trace/bpf_trace.c only. This way cgroup won't be affected.
>
> The rcu_trace protection here applies to prog_array itself. Normal progs are still rcu, sleepable progs are rcu_trace. Regardless whether they're called via trampoline or this new prog_array.

Right, I understand the proposal. My objection is that if tomorrow someone mistakenly keeps using bpf_prog_array_free when they actually need bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and reason about. I can make it correct in this patch series but *keeping* things correct going forward will be harder when there are two free variants.

Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace periods inside bpf_prog_array_free. That way we eliminate potential future confusion and instead of introducing subtle rcu bugs, at worst you can incur a performance penalty by using the flag where you don't need it. If we spend the time to one-way flip the flag only when you actually insert a sleepable program into the array, even that penalty is eliminated.

The cost of this tradeoff in maintainability is 4 bytes more per array object (less if we pack it, but that's a worse idea performance-wise). Given how much we already allocate, I think that's fair, but I'm happy to discuss other less foot-gun-y ideas if the memory usage is a constraint.
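To make the flag idea concrete, here is a rough sketch of the shape Delyan is describing; the struct layout, flag value, and helper names are hypothetical and not part of the posted patch. The point is that all callers keep using the single bpf_prog_array_free(), and a flag flipped once (when a sleepable program is inserted) decides whether a tasks_trace grace period gets chained in front of the normal one.

```c
/* Hypothetical sketch of the ARRAY_USE_TRACE_RCU idea; field, flag, and
 * helper names are illustrative, not from the posted patch.
 */
#define BPF_PROG_ARRAY_USE_TRACE_RCU	BIT(0)

struct bpf_prog_array {
	struct rcu_head rcu;
	u32 flags;	/* one-way flipped when a sleepable prog is inserted */
	struct bpf_prog_array_item items[];
};

static void __bpf_prog_array_free_trace_rcu(struct rcu_head *rcu)
{
	struct bpf_prog_array *progs;

	progs = container_of(rcu, struct bpf_prog_array, rcu);
	/* The tasks_trace grace period has elapsed; chain a normal one. */
	kfree_rcu(progs, rcu);
}

void bpf_prog_array_free(struct bpf_prog_array *progs)
{
	if (!progs || progs == &bpf_empty_prog_array.hdr)
		return;
	if (progs->flags & BPF_PROG_ARRAY_USE_TRACE_RCU)
		call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_trace_rcu);
	else
		kfree_rcu(progs, rcu);	/* non-sleepable arrays: status quo */
}
```

As the rest of the thread weighs up, the trade-off is roughly 4 bytes per array plus reasoning about an rcu-flavor transition on a live array, versus the risk of someone calling the wrong free variant; Delyan's closing message below opts for the separate bpf_prog_array_free_sleepable() in v2 instead.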
On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@fb.com> wrote:
>
> On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> > >
> > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > >
> > > > > uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
> > > > >
> > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > >
> > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
> > > > >
> > > > > Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.
> > > >
> > > > One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.
> > > >
> > > > So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?
> > >
> > > Honestly, I don't like the idea of having two different APIs, where if you use the wrong one, the program would only fail in rare and undebuggable circumstances.
> > >
> > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup creation that we can sustain?), we should track that in the bpf_prog_array itself. We can have the allocation path set a flag and branch on it in free() to determine the necessary grace periods.
> >
> > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is and introduce new bpf_prog_array_free_sleepable() (the way it is in this patch) and call it from 2 places in kernel/trace/bpf_trace.c only. This way cgroup won't be affected.
> >
> > The rcu_trace protection here applies to prog_array itself. Normal progs are still rcu, sleepable progs are rcu_trace. Regardless whether they're called via trampoline or this new prog_array.
>
> Right, I understand the proposal. My objection is that if tomorrow someone mistakenly keeps using bpf_prog_array_free when they actually need bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and reason about. I can make it correct in this patch series but *keeping* things correct going forward will be harder when there are two free variants.

This kind of mistake is what code reviews are supposed to catch.

> Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace periods inside bpf_prog_array_free. That way we eliminate potential future confusion and instead of introducing subtle rcu bugs, at worst you can incur a performance penalty by using the flag where you don't need it. If we spend the time to one-way flip the flag only when you actually insert a sleepable program into the array, even that penalty is eliminated.

Are you suggesting to add such a flag automatically when sleepable bpf progs are added to prog_array? That would not be correct. Presence of sleepable progs has nothing to do with prog_array itself. The bpf_prog_array_free[_sleepable]() frees that array. Not the progs inside it.

If you mean adding this flag when prog_array is allocated for uprobe attachment, then I don't see how it helps code reviews. Nothing automatic here. It's one place and a single-purpose flag. Instead of dynamically checking it in free(), we can directly call the correct function.
On Thu, 2022-04-28 at 15:52 -0700, Alexei Starovoitov wrote:
> On Thu, Apr 28, 2022 at 2:35 PM Delyan Kratunov <delyank@fb.com> wrote:
> >
> > On Thu, 2022-04-28 at 13:58 -0700, Alexei Starovoitov wrote:
> > > On Thu, Apr 28, 2022 at 12:15 PM Delyan Kratunov <delyank@fb.com> wrote:
> > > >
> > > > On Thu, 2022-04-28 at 11:19 -0700, Andrii Nakryiko wrote:
> > > > > On Thu, Apr 28, 2022 at 9:54 AM Delyan Kratunov <delyank@fb.com> wrote:
> > > > > >
> > > > > > uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.
> > > > > >
> > > > > > Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).
> > > > > >
> > > > > > One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.
> > > > > >
> > > > > > Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.
> > > > >
> > > > > One example where this actually can become a problem is cgroup BPF programs. There you can make a single attachment to the root cgroup, but it will create one "effective" prog_array for each descendant (and will keep destroying and creating them as child cgroups are created). So there is this invisible multiplier which we can't really control.
> > > > >
> > > > > So paying the (however small) price of call_rcu_tasks_trace() in bpf_prog_array_free(), which is used for purely non-sleepable cases, seems unfortunate. But I think an alternative to tracking this "has sleepable" bit on a per-bpf_prog_array basis is to have a bpf_prog_array_free_sleepable() implementation independent of bpf_prog_array_free() itself and call that sleepable variant from the uprobe detach handler, limiting the impact to things that actually might be running as sleepable and which most likely won't churn through a huge amount of arrays. WDYT?
> > > >
> > > > Honestly, I don't like the idea of having two different APIs, where if you use the wrong one, the program would only fail in rare and undebuggable circumstances.
> > > >
> > > > If we need specialization (and I'm not convinced we do - what's the rate of cgroup creation that we can sustain?), we should track that in the bpf_prog_array itself. We can have the allocation path set a flag and branch on it in free() to determine the necessary grace periods.
> > >
> > > I think what Andrii is proposing is to leave bpf_prog_array_free() as-is and introduce new bpf_prog_array_free_sleepable() (the way it is in this patch) and call it from 2 places in kernel/trace/bpf_trace.c only. This way cgroup won't be affected.
> > >
> > > The rcu_trace protection here applies to prog_array itself. Normal progs are still rcu, sleepable progs are rcu_trace. Regardless whether they're called via trampoline or this new prog_array.
> >
> > Right, I understand the proposal. My objection is that if tomorrow someone mistakenly keeps using bpf_prog_array_free when they actually need bpf_prog_array_free_sleepable, the resulting bug is really difficult to find and reason about. I can make it correct in this patch series but *keeping* things correct going forward will be harder when there are two free variants.
>
> This kind of mistake is what code reviews are supposed to catch.

You definitely trust code review more than I do! :)

> > Instead, we can have an ARRAY_USE_TRACE_RCU flag which automatically chains the grace periods inside bpf_prog_array_free. That way we eliminate potential future confusion and instead of introducing subtle rcu bugs, at worst you can incur a performance penalty by using the flag where you don't need it. If we spend the time to one-way flip the flag only when you actually insert a sleepable program into the array, even that penalty is eliminated.
>
> Are you suggesting to add such a flag automatically when sleepable bpf progs are added to prog_array? That would not be correct. Presence of sleepable progs has nothing to do with prog_array itself. The bpf_prog_array_free[_sleepable]() frees that array. Not the progs inside it.

The only reason the array is under the tasks_trace rcu is the presence of sleepable programs inside, so I'd say it is related. I'm reasonably certain we could orchestrate the transition from one rcu flavor to the other safely. Alternatively, we can have two rcu_heads and only take the tasks_trace one conditionally.

> If you mean adding this flag when prog_array is allocated for uprobe attachment, then I don't see how it helps code reviews. Nothing automatic here. It's one place and a single-purpose flag. Instead of dynamically checking it in free(), we can directly call the correct function.

Sure, I'll send a v2 with free_sleepable later today.
```
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7d7f4806f5fb..d8692d7176ce 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -25,6 +25,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/stddef.h>
 #include <linux/bpfptr.h>
+#include <linux/rcupdate_trace.h>

 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -1379,6 +1380,65 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
 	return ret;
 }

+/**
+ * Notes on RCU design for bpf_prog_arrays containing sleepable programs:
+ *
+ * We use the trace RCU flavor read section to protect the bpf_prog_array overall.
+ * Because there are users of bpf_prog_array that only use the normal
+ * rcu flavor, to avoid tracking a used-for-sleepable-programs bit in the
+ * array, we chain call_rcu_tasks_trace and kfree_rcu on the free path.
+ * This is suboptimal if the array is never used for sleepable programs
+ * but not a cause of concern under expected workloads.
+ *
+ * In the case where a non-sleepable program is inside the array,
+ * we take the rcu read section and disable preemption for that program
+ * alone, so it can access rcu-protected dynamically sized maps.
+ */
+static __always_inline u32
+bpf_prog_run_array_sleepable(const struct bpf_prog_array __rcu *array_rcu,
+			     const void *ctx, bpf_prog_run_fn run_prog)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	const struct bpf_prog_array *array;
+	struct bpf_run_ctx *old_run_ctx;
+	struct bpf_trace_run_ctx run_ctx;
+	u32 ret = 1;
+
+	might_fault();
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "shouldn't be holding rcu lock");
+
+	migrate_disable();
+	rcu_read_lock_trace();
+
+	array = rcu_dereference(array_rcu);
+	if (unlikely(!array))
+		goto out;
+	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
+	item = &array->items[0];
+	while ((prog = READ_ONCE(item->prog))) {
+		if (!prog->aux->sleepable) {
+			preempt_disable();
+			rcu_read_lock();
+		}
+
+		run_ctx.bpf_cookie = item->bpf_cookie;
+		ret &= run_prog(prog, ctx);
+		item++;
+
+		if (!prog->aux->sleepable) {
+			rcu_read_unlock();
+			preempt_enable();
+		}
+	}
+	bpf_reset_run_ctx(old_run_ctx);
+out:
+	rcu_read_unlock_trace();
+	migrate_enable();
+	return ret;
+}
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 extern struct mutex bpf_stats_enabled_mutex;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index e6e95a9f07a5..d45889f1210d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -736,6 +736,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file)

 #ifdef CONFIG_BPF_EVENTS
 unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
+unsigned int uprobe_call_bpf(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..a4c301ef2ba1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2261,11 +2261,19 @@ struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags)
 	return &bpf_empty_prog_array.hdr;
 }

+static void __bpf_prog_array_free_rcu(struct rcu_head *rcu)
+{
+	struct bpf_prog_array *progs;
+
+	progs = container_of(rcu, struct bpf_prog_array, rcu);
+	kfree_rcu(progs, rcu);
+}
+
 void bpf_prog_array_free(struct bpf_prog_array *progs)
 {
 	if (!progs || progs == &bpf_empty_prog_array.hdr)
 		return;
-	kfree_rcu(progs, rcu);
+	call_rcu_tasks_trace(&progs->rcu, __bpf_prog_array_free_rcu);
 }

 int bpf_prog_array_length(struct bpf_prog_array *array)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f15b826f9899..6a2291d4b5a0 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -140,6 +140,29 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 	return ret;
 }

+unsigned int uprobe_call_bpf(struct trace_event_call *call, void *ctx)
+{
+	unsigned int ret;
+
+	/*
+	 * Instead of moving rcu_read_lock/rcu_dereference/rcu_read_unlock
+	 * to all call sites, we did a bpf_prog_array_valid() there to check
+	 * whether call->prog_array is empty or not, which is
+	 * a heuristic to speed up execution.
+	 *
+	 * If bpf_prog_array_valid() fetched prog_array was
+	 * non-NULL, we go into uprobe_call_bpf() and do the actual
+	 * proper rcu_dereference() under RCU trace lock.
+	 * If it turns out that prog_array is NULL then, we bail out.
+	 * For the opposite, if the bpf_prog_array_valid() fetched pointer
+	 * was NULL, you'll skip the prog_array with the risk of missing
+	 * out of events when it was updated in between this and the
+	 * rcu_dereference() which is accepted risk.
+	 */
+	ret = bpf_prog_run_array_sleepable(call->prog_array, ctx, bpf_prog_run);
+	return ret;
+}
+
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9711589273cd..3eb48897d15b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1346,9 +1346,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	if (bpf_prog_array_valid(call)) {
 		u32 ret;

-		preempt_disable();
-		ret = trace_call_bpf(call, regs);
-		preempt_enable();
+		ret = uprobe_call_bpf(call, regs);
 		if (!ret)
 			return;
 	}
```
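The kernel changes above only make the prog_array iteration and free paths safe under tasks_trace RCU; the payoff is on the BPF side, where a uprobe program can now call sleepable helpers such as bpf_copy_from_user(). Below is an illustrative user-side program, assuming the libbpf "uprobe.s" (sleepable uprobe) section support that accompanies this series; the target binary and function are hypothetical.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative only, not part of the posted patch: a sleepable uprobe
 * program reading a user-space string argument with bpf_copy_from_user().
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("uprobe.s//usr/bin/myapp:process_path")	/* hypothetical binary:function */
int BPF_KPROBE(handle_process_path, const char *path)
{
	char buf[64] = {};

	/* bpf_copy_from_user() may fault the page in and sleep, which is
	 * exactly what a non-sleepable uprobe program could never do.
	 */
	if (bpf_copy_from_user(buf, sizeof(buf) - 1, path))
		return 0;

	bpf_printk("process_path: %s", buf);
	return 0;
}
```

The rest of the series (the attach-side changes accepting BPF_F_SLEEPABLE for uprobe programs plus the libbpf support) is what actually allows loading such a program; this patch only makes the uprobe dispatch path and the prog_array lifetime safe for it.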
uprobes work by raising a trap, setting a task flag from within the interrupt handler, and processing the actual work for the uprobe on the way back to userspace. As a result, uprobe handlers already execute in a user context. The primary obstacle to sleepable bpf uprobe programs is therefore on the bpf side.

Namely, the bpf_prog_array attached to the uprobe is protected by normal rcu and runs with disabled preemption. In order for uprobe bpf programs to become actually sleepable, we need it to be protected by the tasks_trace rcu flavor instead (and kfree() called after a corresponding grace period).

One way to achieve this is by tracking an array-has-contained-sleepable-prog flag in bpf_prog_array and switching rcu flavors based on it. However, this is deemed somewhat unwieldy and the rcu flavor transition would be hard to reason about.

Instead, based on Alexei's proposal, we change the free path for bpf_prog_array to chain a tasks_trace and normal grace periods one after the other. Users who iterate under a tasks_trace read section would be safe, as would users who iterate under normal read sections (from non-sleepable locations). The downside is that we take the tasks_trace latency unconditionally, but that's deemed acceptable under expected workloads.

The other interesting implication is wrt non-sleepable uprobe programs. Because they need access to dynamically sized rcu-protected maps, we conditionally disable preemption and take an rcu read section around them, in addition to the overarching tasks_trace section.

Signed-off-by: Delyan Kratunov <delyank@fb.com>
---
 include/linux/bpf.h          | 60 ++++++++++++++++++++++++++++++++++++
 include/linux/trace_events.h |  1 +
 kernel/bpf/core.c            | 10 +++++-
 kernel/trace/bpf_trace.c     | 23 ++++++++++++++
 kernel/trace/trace_uprobe.c  |  4 +--
 5 files changed, 94 insertions(+), 4 deletions(-)

--
2.35.1