Message ID | 20221120051004.3605026-4-void@manifault.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 90660309b0c76c564a31a21f3a81d6641a9acaa0 |
Delegated to: | BPF |
Series | Support storing struct task_struct objects as kptrs |
Hello David,

Your idea behind this patch is cool, but I'm afraid that the
implementation is incorrect.

As you can see, the task_struct:rcu_users member shares the same memory
area with task_struct:rcu (the head of an RCU callback). Consequence:
the invariant that the reference counter remains zero after reaching
zero is *violated*! Once the count reaches zero, the task_struct:rcu
head is set, so further attempts to access task_struct:rcu_users may
observe a non-zero value.

For more information see
https://lore.kernel.org/lkml/CAHk-=wjT6LG6sDaZtfeT80B9RaMP-y7RNRM4F5CX2v2Z=o8e=A@mail.gmail.com/

In my opinion, placing task_struct:rcu and task_struct:rcu_users in a
union was a very bad decision, but you should probably discuss
separating those members with the authors of the current implementation.

For now, in our project, we use the following approach:

1) Get a reference to a valid task within an RCU read-side critical
   section with get_task_struct().
2) In the release function:
   2.1) Enter an RCU read-side critical section.
   2.2) If the task state is not TASK_DEAD: use put_task_struct().
        Note: In the case of a race with an exiting task it's OK to
        call put_task_struct(), because the task_struct will be freed
        *after* the current RCU grace period thanks to the
        task_struct:rcu_users mechanism.
   2.3) Otherwise, if test_and_set(my_cb_flag): call_rcu(my_cb).
        Note 1: With respect to the RCU callback API you must guarantee
        that your callback is installed only once within a given RCU
        grace period. For that purpose we use my_cb_flag.
        Note 2: This code will race with the task_struct:rcu_users
        mechanism [delayed_put_task_struct()], but that's OK. Either
        delayed_put_task_struct() or my_cb() can be the last to call
        the final put_task_struct() after the current RCU grace period.
   2.4) Otherwise: call put_task_struct().
        Note: my_cb() is already installed, so within the current RCU
        grace period we can invoke put_task_struct() and the reference
        counter of the task_struct will not reach zero.
   2.5) Leave the RCU read-side critical section.
3) The RCU callback my_cb() should set my_cb_flag to False and call
   put_task_struct().

If the release function is called within an RCU read-side critical
section, the task_struct is guaranteed to remain valid until the end of
the current RCU grace period.

Good luck,
mY
On Mon, Dec 05, 2022 at 11:11:47AM +0100, Matus Jokay wrote:
> Hello David,

Hi Matus,

> Your idea behind this patch is cool, but I'm afraid that the
> implementation is incorrect.
>
> As you can see, the task_struct:rcu_users member shares the same memory
> area with the task_struct:rcu (the head of an RCU CB).
> Consequence: *violated invariant* that the reference counter will
> remain zero after reaching zero!
> After reaching zero the task_struct:rcu head is set, so further attempts
> to access the task_struct:rcu_users may lead to a non-zero value.

Yes, you're right. Thanks for explaining this and pointing out the
oversight.

> For more information see
> https://lore.kernel.org/lkml/CAHk-=wjT6LG6sDaZtfeT80B9RaMP-y7RNRM4F5CX2v2Z=o8e=A@mail.gmail.com/
> In my opinion, the decision about task_struct:rcu and
> task_struct:rcu_users union is very bad, but you should probably consult
> the memory separation with authors of the actual implementation.

I expect the reason it's like that is that prior to this change, as
Linus pointed out, nothing ever incremented the refcount. The exception
is commit 912616f142bf ("exit: Guarantee make_task_dead leaks the tsk
when calling do_task_exit"), which similarly increments before the
reference could ever have gone to 0, so I think that's fine. The union
therefore let us save a few bytes of memory in struct task_struct; Eric
mentioned this explicitly in the commit summary for commit 3fbd7ee285b2
("tasks: Add a count of task RCU users"). Now that the refcount is
actually being used as a proper refcount by this commit, that space
saving is no longer an option (unless we rip out my changes, of course).

+cc Eric and Oleg -- would you guys be OK with separating rcu and
rcu_users out of that union? I guess the alternative would be to check
for p->flags & PF_EXITING in the helper, but using p->rcu_users feels
more natural.
> For now, in our project, we use the following approach:
>
> 1) get a reference to a valid task within RCU read-side with
>    get_task_struct()
> 2) in the release function:
>    2.1) enter RCU read-side
>    2.2) if the task state is not TASK_DEAD: use put_task_struct()
>         Note: In the case of a race with an exiting task it's OK to
>         call put_task_struct(), because task_struct will be freed
>         *after* the current RCU GP thanks to the task_struct:rcu_users
>         mechanism.
>    2.3) otherwise if test_and_set(my_cb_flag): call_rcu(my_cb)
>         Note1: With respect to the RCU CB API you should guarantee that
>         your CB will be installed only once within a given RCU GP. For
>         that purpose we use my_cb_flag.
>         Note2: This code will race with the task_struct:rcu_users
>         mechanism [delayed_put_task_struct()], but it's OK. Either the
>         delayed_put_task_struct() or my_cb() can be the last to call
>         final put_task_struct() after the current RCU GP.

I think this idea would work, but in order for us to do this, I believe
we'd have to add _another_ struct rcu_head to struct task_struct. If we
did that, I don't think there's any reason not to just separate rcu and
rcu_users out of the union where they live today, as it's only like
that for space-saving reasons.

>    2.4) otherwise: call put_task_struct()
>         Note: The my_cb() is already installed, so within the current
>         RCU GP we can invoke put_task_struct() and the ref counter of
>         the task_struct will not reach zero.
>    2.5) release the RCU read-side
> 3) The RCU CB my_cb() should set the my_cb_flag to False and call
>    put_task_struct().
>
> If the release function is called within RCU read-side, the task_struct
> is guaranteed to remain valid until the end of the current RCU GP.
>
> Good luck,
> mY
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 212e791d7452..89a95f3d854c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1824,6 +1824,63 @@ struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
 	return __bpf_list_del(head, true);
 }
 
+/**
+ * bpf_task_acquire - Acquire a reference to a task. A task acquired by this
+ * kfunc which is not stored in a map as a kptr, must be released by calling
+ * bpf_task_release().
+ * @p: The task on which a reference is being acquired.
+ */
+struct task_struct *bpf_task_acquire(struct task_struct *p)
+{
+	refcount_inc(&p->rcu_users);
+	return p;
+}
+
+/**
+ * bpf_task_kptr_get - Acquire a reference on a struct task_struct kptr. A task
+ * kptr acquired by this kfunc which is not subsequently stored in a map, must
+ * be released by calling bpf_task_release().
+ * @pp: A pointer to a task kptr on which a reference is being acquired.
+ */
+struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
+{
+	struct task_struct *p;
+
+	rcu_read_lock();
+	p = READ_ONCE(*pp);
+
+	/* Another context could remove the task from the map and release it at
+	 * any time, including after we've done the lookup above. This is safe
+	 * because we're in an RCU read region, so the task is guaranteed to
+	 * remain valid until at least the rcu_read_unlock() below.
+	 */
+	if (p && !refcount_inc_not_zero(&p->rcu_users))
+		/* If the task had been removed from the map and freed as
+		 * described above, refcount_inc_not_zero() will return false.
+		 * The task will be freed at some point after the current RCU
+		 * gp has ended, so just return NULL to the user.
+		 */
+		p = NULL;
+	rcu_read_unlock();
+
+	return p;
+}
+
+/**
+ * bpf_task_release - Release the reference acquired on a struct task_struct *.
+ * If this kfunc is invoked in an RCU read region, the task_struct is
+ * guaranteed to not be freed until the current grace period has ended, even if
+ * its refcount drops to 0.
+ * @p: The task on which a reference is being released.
+ */
+void bpf_task_release(struct task_struct *p)
+{
+	if (!p)
+		return;
+
+	put_task_struct_rcu_user(p);
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -1836,6 +1893,9 @@ BTF_ID_FLAGS(func, bpf_list_push_front)
 BTF_ID_FLAGS(func, bpf_list_push_back)
 BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
@@ -1843,14 +1903,26 @@ static const struct btf_kfunc_id_set generic_kfunc_set = {
 	.set = &generic_btf_ids,
 };
 
+BTF_ID_LIST(generic_dtor_ids)
+BTF_ID(struct, task_struct)
+BTF_ID(func, bpf_task_release)
+
 static int __init kfunc_init(void)
 {
 	int ret;
+	const struct btf_id_dtor_kfunc generic_dtors[] = {
+		{
+			.btf_id = generic_dtor_ids[0],
+			.kfunc_btf_id = generic_dtor_ids[1]
+		},
+	};
 
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
-	if (ret)
-		return ret;
-
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	return ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
+						  ARRAY_SIZE(generic_dtors),
+						  THIS_MODULE);
 }
 
 late_initcall(kfunc_init);
Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs.

The possible use cases for doing this are plentiful. During tracing,
for example, it would be useful to be able to collect some tasks that
performed a certain operation, and then periodically summarize who they
are, which cgroup they're in, how much CPU time they've utilized, etc.

In order to enable this, this patch adds three new kfuncs:

struct task_struct *bpf_task_acquire(struct task_struct *p);
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
void bpf_task_release(struct task_struct *p);

A follow-on patch will add selftests validating these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
---
 kernel/bpf/helpers.c | 78 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 3 deletions(-)
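A hypothetical BPF program using these kfuncs might look like the sketch below (this is not part of the patch: the map layout, the `__kptr_ref` annotation, and the `tp_btf/task_newtask` attach point reflect selftest conventions of this era and are assumptions here). It stashes a newly created task in a map value as a referenced kptr:

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct task_map_value {
	struct task_struct __kptr_ref *task;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct task_map_value);
} task_map SEC(".maps");

/* Declarations of the kfuncs added by this patch. */
struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

SEC("tp_btf/task_newtask")
int BPF_PROG(store_task, struct task_struct *task, u64 clone_flags)
{
	struct task_map_value *v;
	struct task_struct *old;
	int key = 0;

	v = bpf_map_lookup_elem(&task_map, &key);
	if (!v)
		return 0;

	/* Take a reference, then move it into the map. */
	old = bpf_kptr_xchg(&v->task, bpf_task_acquire(task));
	if (old)
		/* A previously stored task was swapped out; release it. */
		bpf_task_release(old);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Because the patch also registers bpf_task_release() as the kptr destructor for task_struct, any task still stored in the map when the map is destroyed is released automatically.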