Message ID | 20230310043812.3087672-2-kuifeng@meta.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | Transit between BPF TCP congestion controls. |
On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
> We have replaced kvalue->refcnt with synchronize_rcu() to wait for an
> RCU grace period.
>
> Maintenance of kvalue->refcnt was a complicated task, as we had to
> simultaneously keep track of two reference counts: kvalue->refcnt
> itself and the reference count of bpf_map. When the kvalue->refcnt
> reaches zero, we also have to reduce the reference count on bpf_map -
> yet these steps are not performed in an atomic manner and require us
> to be vigilant when managing them. By eliminating kvalue->refcnt, we
> can make our maintenance more straightforward as the refcount of
> bpf_map is now the only one to manage.
>
> To prevent the trampoline image of a struct_ops from being released
> while it is still in use, we wait for an RCU grace period. The
> setsockopt(TCP_CONGESTION, "...") command allows you to change your
> socket's congestion control algorithm and can result in releasing the
> old struct_ops implementation.

If the setsockopt() above is referring to the syscall setsockopt(),
then the old struct_ops is fine. The old struct_ops is protected by the
struct_ops map's refcnt (or the current kvalue->refcnt). The sk in
setsockopt(sk, ...) will no longer use the old struct_ops before the
refcnt is decremented. This part should be the same as the tcp-cc
kernel module.

> Moreover, since this function is
> exposed through bpf_setsockopt(), it may be accessed by BPF programs
> as well. To ensure that the trampoline image belonging to struct_ops
> can be safely called while its method is in use, struct_ops is
> safeguarded with rcu_read_lock(). Doing so prevents any destruction of
> the associated images before returning from a trampoline and requires
> us to wait for an RCU grace period.

bpf_setsockopt(TCP_CONGESTION) is the reason that the trampoline image
needs a grace period, but I noticed that an RCU grace period by itself
is not enough for the trampoline image; more on this later.

Another reason the struct_ops map needs an RCU grace period is the
bpf_try_module_get() (in tcp_set_default_congestion_control(), for
example).

> ---
>  include/linux/bpf.h         |  1 +
>  kernel/bpf/bpf_struct_ops.c | 68 ++++++++++++++++++++-----------------
>  kernel/bpf/syscall.c        |  6 ++--
>  3 files changed, 42 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e64ff1e89fb2..00ca92ea6f2e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1938,6 +1938,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>  struct bpf_map *__bpf_map_get(struct fd f);
>  void bpf_map_inc(struct bpf_map *map);
>  void bpf_map_inc_with_uref(struct bpf_map *map);
> +struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
>  struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
>  void bpf_map_put_with_uref(struct bpf_map *map);
>  void bpf_map_put(struct bpf_map *map);
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 38903fb52f98..ab7811a4c1dd 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -58,6 +58,11 @@ struct bpf_struct_ops_map {
>  	struct bpf_struct_ops_value kvalue;
>  };
>
> +struct bpf_struct_ops_link {
> +	struct bpf_link link;
> +	struct bpf_map __rcu *map;
> +};

Compared with v5, this has moved from patch 3 to patch 1. It is not
used here, so it belongs in patch 3.

> @@ -574,6 +585,19 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>  {
>  	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
>
> +	/* The struct_ops's function may switch to another struct_ops.
> +	 *
> +	 * For example, bpf_tcp_cc_x->init() may switch to
> +	 * another tcp_cc_y by calling
> +	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> +	 * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
> +	 * and its refcount may reach 0 which then free its
> +	 * trampoline image while tcp_cc_x is still running.
> +	 *
> +	 * Thus, a rcu grace period is needed here.
> +	 */
> +	synchronize_rcu();

After the trampoline image has finished running a struct_ops's "prog",
it still has a few insns to execute in the trampoline image, so it also
needs to wait for synchronize_rcu_tasks/call_rcu_tasks. This is an old
issue that only happens when the struct_ops prog calls
bpf_setsockopt(TCP_CONGESTION) with CONFIG_PREEMPT, and it is unlikely
that other upcoming struct_ops subsystems will need this. Please help
to do a follow-up fix (separate from this set) to also wait for the
rcu_tasks gp.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e64ff1e89fb2..00ca92ea6f2e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1938,6 +1938,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_inc(struct bpf_map *map);
 void bpf_map_inc_with_uref(struct bpf_map *map);
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 38903fb52f98..ab7811a4c1dd 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -58,6 +58,11 @@ struct bpf_struct_ops_map {
 	struct bpf_struct_ops_value kvalue;
 };
 
+struct bpf_struct_ops_link {
+	struct bpf_link link;
+	struct bpf_map __rcu *map;
+};
+
 #define VALUE_PREFIX "bpf_struct_ops_"
 #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
 
@@ -249,6 +254,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
 	struct bpf_struct_ops_value *uvalue, *kvalue;
 	enum bpf_struct_ops_state state;
+	s64 refcnt;
 
 	if (unlikely(*(u32 *)key != 0))
 		return -ENOENT;
@@ -267,7 +273,14 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
 	uvalue = value;
 	memcpy(uvalue, st_map->uvalue, map->value_size);
 	uvalue->state = state;
-	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	/* This value offers the user space a general estimate of how
+	 * many sockets are still utilizing this struct_ops for TCP
+	 * congestion control. The number might not be exact, but it
+	 * should sufficiently meet our present goals.
+	 */
+	refcnt = atomic64_read(&map->refcnt) - atomic64_read(&map->usercnt);
+	refcount_set(&uvalue->refcnt, max_t(s64, refcnt, 0));
 
 	return 0;
 }
@@ -491,7 +504,6 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		*(unsigned long *)(udata + moff) = prog->aux->id;
 	}
 
-	refcount_set(&kvalue->refcnt, 1);
 	bpf_map_inc(map);
 
 	set_memory_rox((long)st_map->image, 1);
@@ -536,8 +548,7 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
 	switch (prev_state) {
 	case BPF_STRUCT_OPS_STATE_INUSE:
 		st_map->st_ops->unreg(&st_map->kvalue.data);
-		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
-			bpf_map_put(map);
+		bpf_map_put(map);
 		return 0;
 	case BPF_STRUCT_OPS_STATE_TOBEFREE:
 		return -EINPROGRESS;
@@ -574,6 +585,19 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 {
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
 
+	/* The struct_ops's function may switch to another struct_ops.
+	 *
+	 * For example, bpf_tcp_cc_x->init() may switch to
+	 * another tcp_cc_y by calling
+	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+	 * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
+	 * and its refcount may reach 0 which then free its
+	 * trampoline image while tcp_cc_x is still running.
+	 *
+	 * Thus, a rcu grace period is needed here.
+	 */
+	synchronize_rcu();
+
 	if (st_map->links)
 		bpf_struct_ops_map_put_progs(st_map);
 	bpf_map_area_free(st_map->links);
@@ -676,41 +700,23 @@ const struct bpf_map_ops bpf_struct_ops_map_ops = {
 bool bpf_struct_ops_get(const void *kdata)
 {
 	struct bpf_struct_ops_value *kvalue;
+	struct bpf_struct_ops_map *st_map;
+	struct bpf_map *map;
 
 	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	st_map = container_of(kvalue, struct bpf_struct_ops_map, kvalue);
 
-	return refcount_inc_not_zero(&kvalue->refcnt);
-}
-
-static void bpf_struct_ops_put_rcu(struct rcu_head *head)
-{
-	struct bpf_struct_ops_map *st_map;
-
-	st_map = container_of(head, struct bpf_struct_ops_map, rcu);
-	bpf_map_put(&st_map->map);
+	map = __bpf_map_inc_not_zero(&st_map->map, false);
+	return !IS_ERR(map);
 }
 
 void bpf_struct_ops_put(const void *kdata)
 {
 	struct bpf_struct_ops_value *kvalue;
+	struct bpf_struct_ops_map *st_map;
 
 	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
-	if (refcount_dec_and_test(&kvalue->refcnt)) {
-		struct bpf_struct_ops_map *st_map;
-
-		st_map = container_of(kvalue, struct bpf_struct_ops_map,
-				      kvalue);
-		/* The struct_ops's function may switch to another struct_ops.
-		 *
-		 * For example, bpf_tcp_cc_x->init() may switch to
-		 * another tcp_cc_y by calling
-		 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
-		 * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
-		 * and its map->refcnt may reach 0 which then free its
-		 * trampoline image while tcp_cc_x is still running.
-		 *
-		 * Thus, a rcu grace period is needed here.
-		 */
-		call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
-	}
+	st_map = container_of(kvalue, struct bpf_struct_ops_map, kvalue);
+
+	bpf_map_put(&st_map->map);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f406dfa13792..ec03f9e450ad 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1287,8 +1287,10 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
 	return map;
 }
 
-/* map_idr_lock should have been held */
-static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
+/* map_idr_lock should have been held or the map should have been
+ * protected by rcu read lock.
+ */
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
 {
 	int refold;
 
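The lookup hunk above reports a user-visible refcnt derived from map->refcnt and map->usercnt, clamped at zero. As a rough userspace illustration of that arithmetic only (struct map_counts and estimate_users() are hypothetical names, and plain int64_t fields stand in for the kernel's atomic64_t counters), it might look like:

```c
#include <stdint.h>

/* Hypothetical stand-in for the two counters the patch reads from
 * struct bpf_map; the real fields are atomic64_t in the kernel.
 */
struct map_counts {
	int64_t refcnt;		/* all references to the map */
	int64_t usercnt;	/* references counted as user references */
};

/* Mirrors max_t(s64, refcnt - usercnt, 0) from the patch: the
 * difference is clamped so that a transiently negative value is
 * never reported to user space.
 */
static int64_t estimate_users(const struct map_counts *m)
{
	int64_t diff = m->refcnt - m->usercnt;

	return diff > 0 ? diff : 0;
}
```

With refcnt = 5 and usercnt = 2 the sketch yields 3. In the kernel the two counters are read separately rather than as one snapshot, which is presumably one reason the patch comment calls the value an estimate.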
We have replaced kvalue->refcnt with synchronize_rcu() to wait for an
RCU grace period.

Maintenance of kvalue->refcnt was a complicated task, as we had to
simultaneously keep track of two reference counts: kvalue->refcnt
itself and the reference count of bpf_map. When the kvalue->refcnt
reaches zero, we also have to reduce the reference count on bpf_map -
yet these steps are not performed in an atomic manner and require us
to be vigilant when managing them. By eliminating kvalue->refcnt, we
can make our maintenance more straightforward as the refcount of
bpf_map is now the only one to manage.

To prevent the trampoline image of a struct_ops from being released
while it is still in use, we wait for an RCU grace period. The
setsockopt(TCP_CONGESTION, "...") command allows you to change your
socket's congestion control algorithm and can result in releasing the
old struct_ops implementation. Moreover, since this function is
exposed through bpf_setsockopt(), it may be accessed by BPF programs
as well. To ensure that the trampoline image belonging to struct_ops
can be safely called while its method is in use, struct_ops is
safeguarded with rcu_read_lock(). Doing so prevents any destruction of
the associated images before returning from a trampoline and requires
us to wait for an RCU grace period.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h         |  1 +
 kernel/bpf/bpf_struct_ops.c | 68 ++++++++++++++++++++-----------------
 kernel/bpf/syscall.c        |  6 ++--
 3 files changed, 42 insertions(+), 33 deletions(-)
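The single-refcount scheme described in the commit message can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions: struct demo_map, demo_map_inc_not_zero() and demo_map_put() are hypothetical names, and the sketch assumes the kernel helper __bpf_map_inc_not_zero() behaves like an "increment unless zero" operation. It does not model the RCU grace period that the patch adds in bpf_struct_ops_map_free().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted map; not the kernel's
 * struct bpf_map.
 */
struct demo_map {
	atomic_long refcnt;
};

/* Take a reference only if the count has not already dropped to
 * zero, so that an object being torn down is never revived.
 */
static bool demo_map_inc_not_zero(struct demo_map *map)
{
	long old = atomic_load(&map->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&map->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller released the last
 * one and is therefore responsible for freeing the object.
 */
static bool demo_map_put(struct demo_map *map)
{
	return atomic_fetch_sub(&map->refcnt, 1) == 1;
}
```

With only one counter, "get" and "put" are each a single atomic operation, which avoids the two-step, non-atomic update that the commit message describes for kvalue->refcnt plus the bpf_map refcount.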