Message ID | 20231124113033.503338-5-houtao@huaweicloud.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | bpf: Fix the release of inner map | expand |
On 11/24/23 6:30 AM, Hou Tao wrote: > From: Hou Tao <houtao1@huawei.com> > > When removing the inner map from the outer map, the inner map will be > freed after one RCU grace period and one RCU tasks trace grace > period, so it is certain that the bpf program, which may access the > inner map, has exited before the inner map is freed. > > However there is unnecessary to wait for any RCU grace period, one RCU > grace period or one RCU tasks trace grace period if the outer map is > only accessed by userspace, sleepable program or non-sleepable program. > So recording the sleepable attributes of the owned bpf programs when > adding the outer map into env->used_maps, copying the recorded > attributes to inner map atomically when removing inner map from the > outer map and using the recorded attributes in the inner map to decide > which, and how many, RCU grace periods are needed when freeing the > inner map. > > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > include/linux/bpf.h | 8 +++++++- > kernel/bpf/map_in_map.c | 19 ++++++++++++++----- > kernel/bpf/syscall.c | 15 +++++++++++++-- > kernel/bpf/verifier.c | 4 ++++ > 4 files changed, 38 insertions(+), 8 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 15a6bb951b70..c5b549f352d7 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -245,6 +245,11 @@ struct bpf_list_node_kern { > void *owner; > } __attribute__((aligned(8))); > > +enum { > + BPF_MAP_RCU_GP = BIT(0), > + BPF_MAP_RCU_TT_GP = BIT(1), > +}; > + > struct bpf_map { > /* The first two cachelines with read-mostly members of which some > * are also accessed in fast-path (e.g. ops, max_entries). > @@ -296,7 +301,8 @@ struct bpf_map { > } owner; > bool bypass_spec_v1; > bool frozen; /* write-once; write-protected by freeze_mutex */ > - bool free_after_mult_rcu_gp; > + atomic_t used_in_rcu_gp; > + atomic_t free_by_rcu_gp; > s64 __percpu *elem_count; > }; > > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c > index cf3363065566..d044ee677107 100644 > --- a/kernel/bpf/map_in_map.c > +++ b/kernel/bpf/map_in_map.c > @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred) > { > struct bpf_map *inner_map = ptr; > > - /* The inner map may still be used by both non-sleepable and sleepable > - * bpf program, so free it after one RCU grace period and one tasks > - * trace RCU grace period. > + /* Defer the freeing of inner map according to the attribute of bpf > + * program which owns the outer map, so unnecessary multiple RCU GP > + * waitings can be avoided. > */ > - if (deferred) > - WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true); > + if (deferred) { > + /* used_in_rcu_gp may be updated concurrently by new bpf > + * program, so add smp_mb() to guarantee the order between > + * used_in_rcu_gp and lookup/deletion operation of inner map. > + * If a new bpf program finds the inner map before it is > + * removed from outer map, reading used_in_rcu_gp below will > + * return the newly-set bit set by the new bpf program. > + */ > + smp_mb(); smp_mb__before_atomic()? > + atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp); > + } > bpf_map_put(inner_map); > } > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 88882cb58121..014a8cd55a41 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -734,7 +734,10 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu) > > static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) > { > - if (rcu_trace_implies_rcu_gp()) > + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu); > + > + if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) || > + rcu_trace_implies_rcu_gp()) > bpf_map_free_rcu_gp(rcu); > else > call_rcu(rcu, bpf_map_free_rcu_gp); > @@ -746,11 +749,16 @@ static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) > void bpf_map_put(struct bpf_map *map) > { > if (atomic64_dec_and_test(&map->refcnt)) { > + int free_by_rcu_gp; > + > /* bpf_map_free_id() must be called first */ > bpf_map_free_id(map); > btf_put(map->btf); > > - if (READ_ONCE(map->free_after_mult_rcu_gp)) > + free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp); > + if (free_by_rcu_gp == BPF_MAP_RCU_GP) > + call_rcu(&map->rcu, bpf_map_free_rcu_gp); > + else if (free_by_rcu_gp) > call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp); > else > bpf_map_free_in_work(map); > @@ -5343,6 +5351,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr) > goto out_unlock; > } > > + /* No need to update used_in_rcu_gp, because the bpf program doesn't > + * access the map. > + */ > memcpy(used_maps_new, used_maps_old, > sizeof(used_maps_old[0]) * prog->aux->used_map_cnt); > used_maps_new[prog->aux->used_map_cnt] = map; > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 6da370a047fe..3b86c02077f1 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -18051,6 +18051,10 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env) > return -E2BIG; > } > > + atomic_or(env->prog->aux->sleepable ? BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP, > + &map->used_in_rcu_gp); > + /* Pairs with smp_mb() in bpf_map_fd_put_ptr() */ > + smp_mb__before_atomic(); smp_mb__after_atomic()? Just curious, are two smp_mb*() memory barriers in this patch truely necessary or just want to be cautious? > /* hold the map. If the program is rejected by verifier, > * the map will be released by release_maps() or it > * will be used by the valid program until it's unloaded
Hi Yonghong, On 11/26/2023 3:13 PM, Yonghong Song wrote: > > On 11/24/23 6:30 AM, Hou Tao wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> When removing the inner map from the outer map, the inner map will be >> freed after one RCU grace period and one RCU tasks trace grace >> period, so it is certain that the bpf program, which may access the >> inner map, has exited before the inner map is freed. >> >> However there is unnecessary to wait for any RCU grace period, one RCU >> grace period or one RCU tasks trace grace period if the outer map is >> only accessed by userspace, sleepable program or non-sleepable program. >> So recording the sleepable attributes of the owned bpf programs when >> adding the outer map into env->used_maps, copying the recorded >> attributes to inner map atomically when removing inner map from the >> outer map and using the recorded attributes in the inner map to decide >> which, and how many, RCU grace periods are needed when freeing the >> inner map. >> >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> --- >> include/linux/bpf.h | 8 +++++++- >> kernel/bpf/map_in_map.c | 19 ++++++++++++++----- >> kernel/bpf/syscall.c | 15 +++++++++++++-- >> kernel/bpf/verifier.c | 4 ++++ >> 4 files changed, 38 insertions(+), 8 deletions(-) >> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h >> index 15a6bb951b70..c5b549f352d7 100644 >> --- a/include/linux/bpf.h >> +++ b/include/linux/bpf.h >> @@ -245,6 +245,11 @@ struct bpf_list_node_kern { >> void *owner; >> } __attribute__((aligned(8))); >> +enum { >> + BPF_MAP_RCU_GP = BIT(0), >> + BPF_MAP_RCU_TT_GP = BIT(1), >> +}; >> + >> struct bpf_map { >> /* The first two cachelines with read-mostly members of which some >> * are also accessed in fast-path (e.g. ops, max_entries). >> @@ -296,7 +301,8 @@ struct bpf_map { >> } owner; >> bool bypass_spec_v1; >> bool frozen; /* write-once; write-protected by freeze_mutex */ >> - bool free_after_mult_rcu_gp; >> + atomic_t used_in_rcu_gp; >> + atomic_t free_by_rcu_gp; >> s64 __percpu *elem_count; >> }; >> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c >> index cf3363065566..d044ee677107 100644 >> --- a/kernel/bpf/map_in_map.c >> +++ b/kernel/bpf/map_in_map.c >> @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, >> void *ptr, bool deferred) >> { >> struct bpf_map *inner_map = ptr; >> - /* The inner map may still be used by both non-sleepable and >> sleepable >> - * bpf program, so free it after one RCU grace period and one tasks >> - * trace RCU grace period. >> + /* Defer the freeing of inner map according to the attribute of bpf >> + * program which owns the outer map, so unnecessary multiple RCU GP >> + * waitings can be avoided. >> */ >> - if (deferred) >> - WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true); >> + if (deferred) { >> + /* used_in_rcu_gp may be updated concurrently by new bpf >> + * program, so add smp_mb() to guarantee the order between >> + * used_in_rcu_gp and lookup/deletion operation of inner map. >> + * If a new bpf program finds the inner map before it is >> + * removed from outer map, reading used_in_rcu_gp below will >> + * return the newly-set bit set by the new bpf program. >> + */ >> + smp_mb(); > > smp_mb__before_atomic()? The memory barrier is used for atomic_read() instead of atomic_or(), so I think smp_mb() is appropriate. >> + atomic_or(atomic_read(&map->used_in_rcu_gp), >> &inner_map->free_by_rcu_gp); >> + } >> bpf_map_put(inner_map); >> } >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c >> index 88882cb58121..014a8cd55a41 100644 >> --- a/kernel/bpf/syscall.c >> +++ b/kernel/bpf/syscall.c >> @@ -734,7 +734,10 @@ static void bpf_map_free_rcu_gp(struct rcu_head >> *rcu) >> static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) >> { >> - if (rcu_trace_implies_rcu_gp()) >> + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu); >> + >> + if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) || >> + rcu_trace_implies_rcu_gp()) >> bpf_map_free_rcu_gp(rcu); >> else >> call_rcu(rcu, bpf_map_free_rcu_gp); >> @@ -746,11 +749,16 @@ static void bpf_map_free_mult_rcu_gp(struct >> rcu_head *rcu) >> void bpf_map_put(struct bpf_map *map) >> { >> if (atomic64_dec_and_test(&map->refcnt)) { >> + int free_by_rcu_gp; >> + >> /* bpf_map_free_id() must be called first */ >> bpf_map_free_id(map); >> btf_put(map->btf); >> - if (READ_ONCE(map->free_after_mult_rcu_gp)) >> + free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp); >> + if (free_by_rcu_gp == BPF_MAP_RCU_GP) >> + call_rcu(&map->rcu, bpf_map_free_rcu_gp); >> + else if (free_by_rcu_gp) >> call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp); >> else >> bpf_map_free_in_work(map); >> @@ -5343,6 +5351,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr) >> goto out_unlock; >> } >> + /* No need to update used_in_rcu_gp, because the bpf program >> doesn't >> + * access the map. >> + */ >> memcpy(used_maps_new, used_maps_old, >> sizeof(used_maps_old[0]) * prog->aux->used_map_cnt); >> used_maps_new[prog->aux->used_map_cnt] = map; >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 6da370a047fe..3b86c02077f1 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -18051,6 +18051,10 @@ static int resolve_pseudo_ldimm64(struct >> bpf_verifier_env *env) >> return -E2BIG; >> } >> + atomic_or(env->prog->aux->sleepable ? >> BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP, >> + &map->used_in_rcu_gp); >> + /* Pairs with smp_mb() in bpf_map_fd_put_ptr() */ >> + smp_mb__before_atomic(); > > smp_mb__after_atomic()? smp_mb__after_atomic() is better because it doesn't depend on the implementation of bpf_map_inc() below. Will use it in next version. > > Just curious, are two smp_mb*() memory barriers in this patch truely > necessary or just > want to be cautious? Martin had asked me the same question in [1]. The reason for these two memory barrier is just want to be cautious. [1]: https://lore.kernel.org/bpf/467cd7b0-9b41-4db5-9646-9b044db14bf0@linux.dev/ > >> /* hold the map. If the program is rejected by verifier, >> * the map will be released by release_maps() or it >> * will be used by the valid program until it's unloaded
On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote: > > From: Hou Tao <houtao1@huawei.com> > > When removing the inner map from the outer map, the inner map will be > freed after one RCU grace period and one RCU tasks trace grace > period, so it is certain that the bpf program, which may access the > inner map, has exited before the inner map is freed. > > However there is unnecessary to wait for any RCU grace period, one RCU > grace period or one RCU tasks trace grace period if the outer map is > only accessed by userspace, sleepable program or non-sleepable program. > So recording the sleepable attributes of the owned bpf programs when > adding the outer map into env->used_maps, copying the recorded > attributes to inner map atomically when removing inner map from the > outer map and using the recorded attributes in the inner map to decide > which, and how many, RCU grace periods are needed when freeing the > inner map. > > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > include/linux/bpf.h | 8 +++++++- > kernel/bpf/map_in_map.c | 19 ++++++++++++++----- > kernel/bpf/syscall.c | 15 +++++++++++++-- > kernel/bpf/verifier.c | 4 ++++ > 4 files changed, 38 insertions(+), 8 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 15a6bb951b70..c5b549f352d7 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -245,6 +245,11 @@ struct bpf_list_node_kern { > void *owner; > } __attribute__((aligned(8))); > > +enum { > + BPF_MAP_RCU_GP = BIT(0), > + BPF_MAP_RCU_TT_GP = BIT(1), > +}; > + > struct bpf_map { > /* The first two cachelines with read-mostly members of which some > * are also accessed in fast-path (e.g. ops, max_entries). > @@ -296,7 +301,8 @@ struct bpf_map { > } owner; > bool bypass_spec_v1; > bool frozen; /* write-once; write-protected by freeze_mutex */ > - bool free_after_mult_rcu_gp; > + atomic_t used_in_rcu_gp; > + atomic_t free_by_rcu_gp; > s64 __percpu *elem_count; > }; > > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c > index cf3363065566..d044ee677107 100644 > --- a/kernel/bpf/map_in_map.c > +++ b/kernel/bpf/map_in_map.c > @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred) > { > struct bpf_map *inner_map = ptr; > > - /* The inner map may still be used by both non-sleepable and sleepable > - * bpf program, so free it after one RCU grace period and one tasks > - * trace RCU grace period. > + /* Defer the freeing of inner map according to the attribute of bpf > + * program which owns the outer map, so unnecessary multiple RCU GP > + * waitings can be avoided. > */ > - if (deferred) > - WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true); > + if (deferred) { > + /* used_in_rcu_gp may be updated concurrently by new bpf > + * program, so add smp_mb() to guarantee the order between > + * used_in_rcu_gp and lookup/deletion operation of inner map. > + * If a new bpf program finds the inner map before it is > + * removed from outer map, reading used_in_rcu_gp below will > + * return the newly-set bit set by the new bpf program. > + */ > + smp_mb(); > + atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp); You resent the patches before I had time to reply to the previous thread... > I think the main reason is that there is four possible case for the free > of inner map: > (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace() > It is true when the outer map is only being accessed in user space. > (2) only call synchronize_rcu() > the outer map is only being accessed by non-sleepable bpf program > (3) only call synchronize_rcu_tasks_trace > the outer map is only being accessed by sleepable bpf program > (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace() > > Only using sleepable_refcnt can not express 4 possible cases and we also > need to atomically copy the states from outer map to inner map, because > one inner map may be used concurrently by multiple outer map, so atomic > or mask are chosen. We don't care about optimizing 1, since it's rare case. We also don't care about optimizing 3, since sync_rcu time is negligible when we need to wait for sync_rcu_tasks_trace and also because rcu_trace_implies_rcu_gp() for foreseeable future. > need to atomically we do NOT have such need. There is zero need to do this atomic games and barries "just want to be cautious". The code should not have any code at all "to be cautious". Every barrier has to be a real reason behind it. Please remove them. The concurent access to refcnt and sleepable_refcnt can be serialized with simple spin_lock. I also don't like > + BPF_MAP_RCU_GP = BIT(0), > + BPF_MAP_RCU_TT_GP = BIT(1), the names should not describe what action should be taken. The names should describe what the bits do. I still think sleepable_refcnt and refcnt is more than enough to optimize patch 3.
Hi Alexei, On 11/27/2023 9:49 AM, Alexei Starovoitov wrote: > On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> When removing the inner map from the outer map, the inner map will be >> freed after one RCU grace period and one RCU tasks trace grace >> period, so it is certain that the bpf program, which may access the >> inner map, has exited before the inner map is freed. >> >> However there is unnecessary to wait for any RCU grace period, one RCU >> grace period or one RCU tasks trace grace period if the outer map is >> only accessed by userspace, sleepable program or non-sleepable program. >> So recording the sleepable attributes of the owned bpf programs when >> adding the outer map into env->used_maps, copying the recorded >> attributes to inner map atomically when removing inner map from the >> outer map and using the recorded attributes in the inner map to decide >> which, and how many, RCU grace periods are needed when freeing the >> inner map. >> >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> >> + if (deferred) { >> + /* used_in_rcu_gp may be updated concurrently by new bpf >> + * program, so add smp_mb() to guarantee the order between >> + * used_in_rcu_gp and lookup/deletion operation of inner map. >> + * If a new bpf program finds the inner map before it is >> + * removed from outer map, reading used_in_rcu_gp below will >> + * return the newly-set bit set by the new bpf program. >> + */ >> + smp_mb(); >> + atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp); > You resent the patches before I had time to reply to the previous thread... I didn't receive the reply from last Thursday, so I though there is no new suggestions from you and sent out v3 which incorporate suggestions from Martin. I will ping next time before sending out new version if there is still pending discussion about the posted patch-set. > >> I think the main reason is that there is four possible case for the free >> of inner map: >> (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace() >> It is true when the outer map is only being accessed in user space. >> (2) only call synchronize_rcu() >> the outer map is only being accessed by non-sleepable bpf program >> (3) only call synchronize_rcu_tasks_trace >> the outer map is only being accessed by sleepable bpf program >> (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace() >> >> Only using sleepable_refcnt can not express 4 possible cases and we also >> need to atomically copy the states from outer map to inner map, because >> one inner map may be used concurrently by multiple outer map, so atomic >> or mask are chosen. > We don't care about optimizing 1, since it's rare case. > We also don't care about optimizing 3, since sync_rcu time is negligible > when we need to wait for sync_rcu_tasks_trace and also > because rcu_trace_implies_rcu_gp() for foreseeable future. I see. > >> need to atomically > we do NOT have such need. Here the atomicity means that multiple updates to inner_map->free_in_rcu_gp should not overwrite each other and it has nothing to do with the memory barrier. And using spin_lock will serve the same purpose. > There is zero need to do this atomic games and barries "just want to > be cautious". The code should not have any code at all "to be > cautious". > Every barrier has to be a real reason behind it. > Please remove them. Will remove the memory barrier. I think we can add it later if it is needed. > The concurent access to refcnt and sleepable_refcnt can be serialized > with simple spin_lock. OK. > > I also don't like >> + BPF_MAP_RCU_GP = BIT(0), >> + BPF_MAP_RCU_TT_GP = BIT(1), > the names should not describe what action should be taken. > The names should describe what the bits do. Understood. > I still think sleepable_refcnt and refcnt is more than enough to > optimize patch 3. Before posting v4, do the following code-snippets match your suggestions ? resolve_pseudo_ldimm64() if (env->prog->aux->sleepable) /* The max number of program is INT_MAX - 1, so atomic_t will be enough */ atomic_inc(&map->sleepable_refcnt); bpf_map_fd_put_ptr() if (deferred) { if (atomic_read(&map->sleepable_refcnt)) WRITE_ONCE(map->free_after_tt_rcu_gp, true); else WRITE_ONCE(map->free_after_rcu_gp, true); } bpf_map_put() if (READ_ONCE(map->free_after_tt_rcu_gp)) call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp); else if (READ_ONCE(map->free_after_rcu_gp)) call_rcu(&map->rcu, bpf_map_free_rcu_gp); else bpf_map_free_in_work(map); And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of synchronize_rcu()/synchronize_rcu_mult() ?
Hi On 11/27/2023 10:47 AM, Hou Tao wrote: > Hi Alexei, > > On 11/27/2023 9:49 AM, Alexei Starovoitov wrote: >> On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote: >>> From: Hou Tao <houtao1@huawei.com> >>> >>> When removing the inner map from the outer map, the inner map will be >>> freed after one RCU grace period and one RCU tasks trace grace >>> period, so it is certain that the bpf program, which may access the >>> inner map, has exited before the inner map is freed. >>> >>> However there is unnecessary to wait for any RCU grace period, one RCU >>> grace period or one RCU tasks trace grace period if the outer map is >>> only accessed by userspace, sleepable program or non-sleepable program. >>> So recording the sleepable attributes of the owned bpf programs when >>> adding the outer map into env->used_maps, copying the recorded >>> attributes to inner map atomically when removing inner map from the >>> outer map and using the recorded attributes in the inner map to decide >>> which, and how many, RCU grace periods are needed when freeing the >>> inner map. >>> >>> Signed-off-by: Hou Tao <houtao1@huawei.com> SNIP >> I still think sleepable_refcnt and refcnt is more than enough to >> optimize patch 3. > Before posting v4, do the following code-snippets match your suggestions ? > > resolve_pseudo_ldimm64() > if (env->prog->aux->sleepable) > /* The max number of program is INT_MAX - > 1, so atomic_t will be enough */ > atomic_inc(&map->sleepable_refcnt); > > bpf_map_fd_put_ptr() > if (deferred) { > if (atomic_read(&map->sleepable_refcnt)) > WRITE_ONCE(map->free_after_tt_rcu_gp, true); > else > WRITE_ONCE(map->free_after_rcu_gp, true); > } > > bpf_map_put() > if (READ_ONCE(map->free_after_tt_rcu_gp)) > call_rcu_tasks_trace(&map->rcu, > bpf_map_free_mult_rcu_gp); > else if (READ_ONCE(map->free_after_rcu_gp)) > call_rcu(&map->rcu, bpf_map_free_rcu_gp); > else > bpf_map_free_in_work(map); > Just find out the above code-snippet misses the corresponding atomic_dec() for sleepable_refcnt. Could we replace sleepable_refcnt by a bool (e.g, access_in_sleepable), so the check can be simplified further ? > And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of > synchronize_rcu()/synchronize_rcu_mult() ? > > .
On Sun, Nov 26, 2023 at 6:47 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > Before posting v4, do the following code-snippets match your suggestions ? > > resolve_pseudo_ldimm64() > if (env->prog->aux->sleepable) > /* The max number of program is INT_MAX - > 1, so atomic_t will be enough */ > atomic_inc(&map->sleepable_refcnt); > > bpf_map_fd_put_ptr() > if (deferred) { > if (atomic_read(&map->sleepable_refcnt)) > WRITE_ONCE(map->free_after_tt_rcu_gp, true); > else > WRITE_ONCE(map->free_after_rcu_gp, true); > } Yes. That was an idea and I hope you see that in this form it's much easier to understand, right? and regarding your other question: > Could we replace sleepable_refcnt by > a bool (e.g, access_in_sleepable), so the check can be simplified further ? I think you're trying to optimize too soon. How would that bool access_in_sleepable work? Something needs to set it and the last sleepable to unset it. You might need a refcnt to do that. > > bpf_map_put() > if (READ_ONCE(map->free_after_tt_rcu_gp)) > call_rcu_tasks_trace(&map->rcu, > bpf_map_free_mult_rcu_gp); > else if (READ_ONCE(map->free_after_rcu_gp)) > call_rcu(&map->rcu, bpf_map_free_rcu_gp); > else > bpf_map_free_in_work(map); > > And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of > synchronize_rcu()/synchronize_rcu_mult() ? Of course. Not sure what you meant by that. bpf_map_put() cannot call sync_rcu. Are you asking about doing queue_work first and then sync_rcu* inside? I think it will not scale compared to call_rcu approach.
Hi, On 11/27/2023 11:20 AM, Alexei Starovoitov wrote: > On Sun, Nov 26, 2023 at 6:47 PM Hou Tao <houtao@huaweicloud.com> wrote: >> >> Before posting v4, do the following code-snippets match your suggestions ? >> >> resolve_pseudo_ldimm64() >> if (env->prog->aux->sleepable) >> /* The max number of program is INT_MAX - >> 1, so atomic_t will be enough */ >> atomic_inc(&map->sleepable_refcnt); >> >> bpf_map_fd_put_ptr() >> if (deferred) { >> if (atomic_read(&map->sleepable_refcnt)) >> WRITE_ONCE(map->free_after_tt_rcu_gp, true); >> else >> WRITE_ONCE(map->free_after_rcu_gp, true); >> } > Yes. That was an idea and I hope you see that in this form it's > much easier to understand, right? Yes. > > and regarding your other question: > >> Could we replace sleepable_refcnt by >> a bool (e.g, access_in_sleepable), so the check can be simplified further ? > I think you're trying to optimize too soon. > How would that bool access_in_sleepable work? > Something needs to set it and the last sleepable to unset it. > You might need a refcnt to do that. Yes, a premature optimization. I only consider the case that one bpf map will be only accessed by one bpf program and bpf program will always exit after the inner map is deleted from outer map (so sleepable_refcnt is still greater than zero when calling bpf_map_fd_put_ptr()). Will drop the idea. > >> bpf_map_put() >> if (READ_ONCE(map->free_after_tt_rcu_gp)) >> call_rcu_tasks_trace(&map->rcu, >> bpf_map_free_mult_rcu_gp); >> else if (READ_ONCE(map->free_after_rcu_gp)) >> call_rcu(&map->rcu, bpf_map_free_rcu_gp); >> else >> bpf_map_free_in_work(map); >> >> And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of >> synchronize_rcu()/synchronize_rcu_mult() ? > Of course. Not sure what you meant by that. > bpf_map_put() cannot call sync_rcu. > Are you asking about doing queue_work first and then sync_rcu* inside? > I think it will not scale compared to call_rcu approach. > . Yes, I mean the deferment implementation in v2 which calls sync_rcu()* in kworker. Will still use call_rcu()* in v4. Thanks for all these suggestions.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 15a6bb951b70..c5b549f352d7 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -245,6 +245,11 @@ struct bpf_list_node_kern { void *owner; } __attribute__((aligned(8))); +enum { + BPF_MAP_RCU_GP = BIT(0), + BPF_MAP_RCU_TT_GP = BIT(1), +}; + struct bpf_map { /* The first two cachelines with read-mostly members of which some * are also accessed in fast-path (e.g. ops, max_entries). @@ -296,7 +301,8 @@ struct bpf_map { } owner; bool bypass_spec_v1; bool frozen; /* write-once; write-protected by freeze_mutex */ - bool free_after_mult_rcu_gp; + atomic_t used_in_rcu_gp; + atomic_t free_by_rcu_gp; s64 __percpu *elem_count; }; diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index cf3363065566..d044ee677107 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred) { struct bpf_map *inner_map = ptr; - /* The inner map may still be used by both non-sleepable and sleepable - * bpf program, so free it after one RCU grace period and one tasks - * trace RCU grace period. + /* Defer the freeing of inner map according to the attribute of bpf + * program which owns the outer map, so unnecessary multiple RCU GP + * waitings can be avoided. */ - if (deferred) - WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true); + if (deferred) { + /* used_in_rcu_gp may be updated concurrently by new bpf + * program, so add smp_mb() to guarantee the order between + * used_in_rcu_gp and lookup/deletion operation of inner map. + * If a new bpf program finds the inner map before it is + * removed from outer map, reading used_in_rcu_gp below will + * return the newly-set bit set by the new bpf program. + */ + smp_mb(); + atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp); + } bpf_map_put(inner_map); } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 88882cb58121..014a8cd55a41 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -734,7 +734,10 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu) static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) { - if (rcu_trace_implies_rcu_gp()) + struct bpf_map *map = container_of(rcu, struct bpf_map, rcu); + + if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) || + rcu_trace_implies_rcu_gp()) bpf_map_free_rcu_gp(rcu); else call_rcu(rcu, bpf_map_free_rcu_gp); @@ -746,11 +749,16 @@ static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu) void bpf_map_put(struct bpf_map *map) { if (atomic64_dec_and_test(&map->refcnt)) { + int free_by_rcu_gp; + /* bpf_map_free_id() must be called first */ bpf_map_free_id(map); btf_put(map->btf); - if (READ_ONCE(map->free_after_mult_rcu_gp)) + free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp); + if (free_by_rcu_gp == BPF_MAP_RCU_GP) + call_rcu(&map->rcu, bpf_map_free_rcu_gp); + else if (free_by_rcu_gp) call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp); else bpf_map_free_in_work(map); @@ -5343,6 +5351,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr) goto out_unlock; } + /* No need to update used_in_rcu_gp, because the bpf program doesn't + * access the map. + */ memcpy(used_maps_new, used_maps_old, sizeof(used_maps_old[0]) * prog->aux->used_map_cnt); used_maps_new[prog->aux->used_map_cnt] = map; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6da370a047fe..3b86c02077f1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -18051,6 +18051,10 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env) return -E2BIG; } + atomic_or(env->prog->aux->sleepable ? BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP, + &map->used_in_rcu_gp); + /* Pairs with smp_mb() in bpf_map_fd_put_ptr() */ + smp_mb__before_atomic(); /* hold the map. If the program is rejected by verifier, * the map will be released by release_maps() or it * will be used by the valid program until it's unloaded