diff mbox series

[bpf,v3,4/6] bpf: Optimize the free of inner map

Message ID 20231124113033.503338-5-houtao@huaweicloud.com (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series bpf: Fix the release of inner map | expand

Checks

Context Check Description
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-5 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
netdev/series_format success Posting correctly formatted
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for bpf, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2656 this patch: 2656
netdev/cc_maintainers success CCed 12 of 12 maintainers
netdev/build_clang success Errors and warnings before: 1276 this patch: 1276
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2733 this patch: 2733
netdev/checkpatch warning WARNING: line length of 89 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hou Tao Nov. 24, 2023, 11:30 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

When removing an inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace
period, so it is certain that any bpf program which may access the
inner map has exited before the inner map is freed.

However, it is unnecessary to wait for any RCU grace period if the outer
map is only accessed by userspace, for the RCU grace period if it is only
accessed by sleepable programs, or for the RCU tasks trace grace period if
it is only accessed by non-sleepable programs. So record the sleepable
attribute of the owning bpf programs when adding the outer map into
env->used_maps, copy the recorded attributes to the inner map atomically
when removing the inner map from the outer map, and use the recorded
attributes in the inner map to decide which, and how many, RCU grace
periods are needed when freeing the inner map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h     |  8 +++++++-
 kernel/bpf/map_in_map.c | 19 ++++++++++++++-----
 kernel/bpf/syscall.c    | 15 +++++++++++++--
 kernel/bpf/verifier.c   |  4 ++++
 4 files changed, 38 insertions(+), 8 deletions(-)
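
In short, the freeing path looks at the grace-period bits recorded in
free_by_rcu_gp to decide how to defer the free. A condensed sketch of that
decision follows; the helper name bpf_map_defer_free() is only illustrative,
in the patch this logic is split between bpf_map_put() and
bpf_map_free_mult_rcu_gp():

	/* Illustrative sketch only, not part of the patch. */
	static void bpf_map_defer_free(struct bpf_map *map)
	{
		int gp = atomic_read(&map->free_by_rcu_gp);

		if (gp == BPF_MAP_RCU_GP)
			/* only non-sleepable programs: one RCU GP is enough */
			call_rcu(&map->rcu, bpf_map_free_rcu_gp);
		else if (gp)
			/* sleepable programs involved: wait for a tasks trace GP,
			 * then chain an extra RCU GP only if BPF_MAP_RCU_GP is
			 * also set and the tasks trace GP does not imply it
			 */
			call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
		else
			/* only accessed by userspace: no RCU GP is needed */
			bpf_map_free_in_work(map);
	}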

Comments

Yonghong Song Nov. 26, 2023, 7:13 a.m. UTC | #1
On 11/24/23 6:30 AM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When removing the inner map from the outer map, the inner map will be
> freed after one RCU grace period and one RCU tasks trace grace
> period, so it is certain that the bpf program, which may access the
> inner map, has exited before the inner map is freed.
>
> However there is unnecessary to wait for any RCU grace period, one RCU
> grace period or one RCU tasks trace grace period if the outer map is
> only accessed by userspace, sleepable program or non-sleepable program.
> So recording the sleepable attributes of the owned bpf programs when
> adding the outer map into env->used_maps, copying the recorded
> attributes to inner map atomically when removing inner map from the
> outer map and using the recorded attributes in the inner map to decide
> which, and how many, RCU grace periods are needed when freeing the
> inner map.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>   include/linux/bpf.h     |  8 +++++++-
>   kernel/bpf/map_in_map.c | 19 ++++++++++++++-----
>   kernel/bpf/syscall.c    | 15 +++++++++++++--
>   kernel/bpf/verifier.c   |  4 ++++
>   4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 15a6bb951b70..c5b549f352d7 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -245,6 +245,11 @@ struct bpf_list_node_kern {
>   	void *owner;
>   } __attribute__((aligned(8)));
>   
> +enum {
> +	BPF_MAP_RCU_GP = BIT(0),
> +	BPF_MAP_RCU_TT_GP = BIT(1),
> +};
> +
>   struct bpf_map {
>   	/* The first two cachelines with read-mostly members of which some
>   	 * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -296,7 +301,8 @@ struct bpf_map {
>   	} owner;
>   	bool bypass_spec_v1;
>   	bool frozen; /* write-once; write-protected by freeze_mutex */
> -	bool free_after_mult_rcu_gp;
> +	atomic_t used_in_rcu_gp;
> +	atomic_t free_by_rcu_gp;
>   	s64 __percpu *elem_count;
>   };
>   
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index cf3363065566..d044ee677107 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
>   {
>   	struct bpf_map *inner_map = ptr;
>   
> -	/* The inner map may still be used by both non-sleepable and sleepable
> -	 * bpf program, so free it after one RCU grace period and one tasks
> -	 * trace RCU grace period.
> +	/* Defer the freeing of inner map according to the attribute of bpf
> +	 * program which owns the outer map, so unnecessary multiple RCU GP
> +	 * waitings can be avoided.
>   	 */
> -	if (deferred)
> -		WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
> +	if (deferred) {
> +		/* used_in_rcu_gp may be updated concurrently by new bpf
> +		 * program, so add smp_mb() to guarantee the order between
> +		 * used_in_rcu_gp and lookup/deletion operation of inner map.
> +		 * If a new bpf program finds the inner map before it is
> +		 * removed from outer map, reading used_in_rcu_gp below will
> +		 * return the newly-set bit set by the new bpf program.
> +		 */
> +		smp_mb();

smp_mb__before_atomic()?

> +		atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp);
> +	}
>   	bpf_map_put(inner_map);
>   }
>   
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 88882cb58121..014a8cd55a41 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -734,7 +734,10 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
>   
>   static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
>   {
> -	if (rcu_trace_implies_rcu_gp())
> +	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
> +
> +	if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) ||
> +	    rcu_trace_implies_rcu_gp())
>   		bpf_map_free_rcu_gp(rcu);
>   	else
>   		call_rcu(rcu, bpf_map_free_rcu_gp);
> @@ -746,11 +749,16 @@ static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
>   void bpf_map_put(struct bpf_map *map)
>   {
>   	if (atomic64_dec_and_test(&map->refcnt)) {
> +		int free_by_rcu_gp;
> +
>   		/* bpf_map_free_id() must be called first */
>   		bpf_map_free_id(map);
>   		btf_put(map->btf);
>   
> -		if (READ_ONCE(map->free_after_mult_rcu_gp))
> +		free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp);
> +		if (free_by_rcu_gp == BPF_MAP_RCU_GP)
> +			call_rcu(&map->rcu, bpf_map_free_rcu_gp);
> +		else if (free_by_rcu_gp)
>   			call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
>   		else
>   			bpf_map_free_in_work(map);
> @@ -5343,6 +5351,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
>   		goto out_unlock;
>   	}
>   
> +	/* No need to update used_in_rcu_gp, because the bpf program doesn't
> +	 * access the map.
> +	 */
>   	memcpy(used_maps_new, used_maps_old,
>   	       sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
>   	used_maps_new[prog->aux->used_map_cnt] = map;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6da370a047fe..3b86c02077f1 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18051,6 +18051,10 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
>   				return -E2BIG;
>   			}
>   
> +			atomic_or(env->prog->aux->sleepable ? BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP,
> +				  &map->used_in_rcu_gp);
> +			/* Pairs with smp_mb() in bpf_map_fd_put_ptr() */
> +			smp_mb__before_atomic();

smp_mb__after_atomic()?

Just curious, are the two smp_mb*() memory barriers in this patch truly
necessary, or do you just want to be cautious?

>   			/* hold the map. If the program is rejected by verifier,
>   			 * the map will be released by release_maps() or it
>   			 * will be used by the valid program until it's unloaded
Hou Tao Nov. 27, 2023, 1:24 a.m. UTC | #2
Hi Yonghong,

On 11/26/2023 3:13 PM, Yonghong Song wrote:
>
> On 11/24/23 6:30 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When removing the inner map from the outer map, the inner map will be
>> freed after one RCU grace period and one RCU tasks trace grace
>> period, so it is certain that the bpf program, which may access the
>> inner map, has exited before the inner map is freed.
>>
>> However there is unnecessary to wait for any RCU grace period, one RCU
>> grace period or one RCU tasks trace grace period if the outer map is
>> only accessed by userspace, sleepable program or non-sleepable program.
>> So recording the sleepable attributes of the owned bpf programs when
>> adding the outer map into env->used_maps, copying the recorded
>> attributes to inner map atomically when removing inner map from the
>> outer map and using the recorded attributes in the inner map to decide
>> which, and how many, RCU grace periods are needed when freeing the
>> inner map.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>   include/linux/bpf.h     |  8 +++++++-
>>   kernel/bpf/map_in_map.c | 19 ++++++++++++++-----
>>   kernel/bpf/syscall.c    | 15 +++++++++++++--
>>   kernel/bpf/verifier.c   |  4 ++++
>>   4 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 15a6bb951b70..c5b549f352d7 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -245,6 +245,11 @@ struct bpf_list_node_kern {
>>       void *owner;
>>   } __attribute__((aligned(8)));
>>   +enum {
>> +    BPF_MAP_RCU_GP = BIT(0),
>> +    BPF_MAP_RCU_TT_GP = BIT(1),
>> +};
>> +
>>   struct bpf_map {
>>       /* The first two cachelines with read-mostly members of which some
>>        * are also accessed in fast-path (e.g. ops, max_entries).
>> @@ -296,7 +301,8 @@ struct bpf_map {
>>       } owner;
>>       bool bypass_spec_v1;
>>       bool frozen; /* write-once; write-protected by freeze_mutex */
>> -    bool free_after_mult_rcu_gp;
>> +    atomic_t used_in_rcu_gp;
>> +    atomic_t free_by_rcu_gp;
>>       s64 __percpu *elem_count;
>>   };
>>   diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
>> index cf3363065566..d044ee677107 100644
>> --- a/kernel/bpf/map_in_map.c
>> +++ b/kernel/bpf/map_in_map.c
>> @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map,
>> void *ptr, bool deferred)
>>   {
>>       struct bpf_map *inner_map = ptr;
>>   -    /* The inner map may still be used by both non-sleepable and
>> sleepable
>> -     * bpf program, so free it after one RCU grace period and one tasks
>> -     * trace RCU grace period.
>> +    /* Defer the freeing of inner map according to the attribute of bpf
>> +     * program which owns the outer map, so unnecessary multiple RCU GP
>> +     * waitings can be avoided.
>>        */
>> -    if (deferred)
>> -        WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
>> +    if (deferred) {
>> +        /* used_in_rcu_gp may be updated concurrently by new bpf
>> +         * program, so add smp_mb() to guarantee the order between
>> +         * used_in_rcu_gp and lookup/deletion operation of inner map.
>> +         * If a new bpf program finds the inner map before it is
>> +         * removed from outer map, reading used_in_rcu_gp below will
>> +         * return the newly-set bit set by the new bpf program.
>> +         */
>> +        smp_mb();
>
> smp_mb__before_atomic()?

The memory barrier is used for atomic_read() instead of atomic_or(), so
I think smp_mb() is appropriate.

>> +        atomic_or(atomic_read(&map->used_in_rcu_gp),
>> &inner_map->free_by_rcu_gp);
>> +    }
>>       bpf_map_put(inner_map);
>>   }
>>   diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 88882cb58121..014a8cd55a41 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -734,7 +734,10 @@ static void bpf_map_free_rcu_gp(struct rcu_head
>> *rcu)
>>     static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
>>   {
>> -    if (rcu_trace_implies_rcu_gp())
>> +    struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
>> +
>> +    if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) ||
>> +        rcu_trace_implies_rcu_gp())
>>           bpf_map_free_rcu_gp(rcu);
>>       else
>>           call_rcu(rcu, bpf_map_free_rcu_gp);
>> @@ -746,11 +749,16 @@ static void bpf_map_free_mult_rcu_gp(struct
>> rcu_head *rcu)
>>   void bpf_map_put(struct bpf_map *map)
>>   {
>>       if (atomic64_dec_and_test(&map->refcnt)) {
>> +        int free_by_rcu_gp;
>> +
>>           /* bpf_map_free_id() must be called first */
>>           bpf_map_free_id(map);
>>           btf_put(map->btf);
>>   -        if (READ_ONCE(map->free_after_mult_rcu_gp))
>> +        free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp);
>> +        if (free_by_rcu_gp == BPF_MAP_RCU_GP)
>> +            call_rcu(&map->rcu, bpf_map_free_rcu_gp);
>> +        else if (free_by_rcu_gp)
>>               call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
>>           else
>>               bpf_map_free_in_work(map);
>> @@ -5343,6 +5351,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
>>           goto out_unlock;
>>       }
>>   +    /* No need to update used_in_rcu_gp, because the bpf program
>> doesn't
>> +     * access the map.
>> +     */
>>       memcpy(used_maps_new, used_maps_old,
>>              sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
>>       used_maps_new[prog->aux->used_map_cnt] = map;
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 6da370a047fe..3b86c02077f1 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -18051,6 +18051,10 @@ static int resolve_pseudo_ldimm64(struct
>> bpf_verifier_env *env)
>>                   return -E2BIG;
>>               }
>>   +            atomic_or(env->prog->aux->sleepable ?
>> BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP,
>> +                  &map->used_in_rcu_gp);
>> +            /* Pairs with smp_mb() in bpf_map_fd_put_ptr() */
>> +            smp_mb__before_atomic();
>
> smp_mb__after_atomic()?

smp_mb__after_atomic() is better because it doesn't depend on the
implementation of bpf_map_inc() below. Will use it in the next version.
>
> Just curious, are two smp_mb*() memory barriers in this patch truely
> necessary or just
> want to be cautious?

Martin had asked me the same question in [1]. The reason for these two
memory barriers is just to be cautious.

[1]:
https://lore.kernel.org/bpf/467cd7b0-9b41-4db5-9646-9b044db14bf0@linux.dev/
>
>>               /* hold the map. If the program is rejected by verifier,
>>                * the map will be released by release_maps() or it
>>                * will be used by the valid program until it's unloaded
Alexei Starovoitov Nov. 27, 2023, 1:49 a.m. UTC | #3
On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> When removing the inner map from the outer map, the inner map will be
> freed after one RCU grace period and one RCU tasks trace grace
> period, so it is certain that the bpf program, which may access the
> inner map, has exited before the inner map is freed.
>
> However there is unnecessary to wait for any RCU grace period, one RCU
> grace period or one RCU tasks trace grace period if the outer map is
> only accessed by userspace, sleepable program or non-sleepable program.
> So recording the sleepable attributes of the owned bpf programs when
> adding the outer map into env->used_maps, copying the recorded
> attributes to inner map atomically when removing inner map from the
> outer map and using the recorded attributes in the inner map to decide
> which, and how many, RCU grace periods are needed when freeing the
> inner map.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/linux/bpf.h     |  8 +++++++-
>  kernel/bpf/map_in_map.c | 19 ++++++++++++++-----
>  kernel/bpf/syscall.c    | 15 +++++++++++++--
>  kernel/bpf/verifier.c   |  4 ++++
>  4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 15a6bb951b70..c5b549f352d7 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -245,6 +245,11 @@ struct bpf_list_node_kern {
>         void *owner;
>  } __attribute__((aligned(8)));
>
> +enum {
> +       BPF_MAP_RCU_GP = BIT(0),
> +       BPF_MAP_RCU_TT_GP = BIT(1),
> +};
> +
>  struct bpf_map {
>         /* The first two cachelines with read-mostly members of which some
>          * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -296,7 +301,8 @@ struct bpf_map {
>         } owner;
>         bool bypass_spec_v1;
>         bool frozen; /* write-once; write-protected by freeze_mutex */
> -       bool free_after_mult_rcu_gp;
> +       atomic_t used_in_rcu_gp;
> +       atomic_t free_by_rcu_gp;
>         s64 __percpu *elem_count;
>  };
>
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index cf3363065566..d044ee677107 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
>  {
>         struct bpf_map *inner_map = ptr;
>
> -       /* The inner map may still be used by both non-sleepable and sleepable
> -        * bpf program, so free it after one RCU grace period and one tasks
> -        * trace RCU grace period.
> +       /* Defer the freeing of inner map according to the attribute of bpf
> +        * program which owns the outer map, so unnecessary multiple RCU GP
> +        * waitings can be avoided.
>          */
> -       if (deferred)
> -               WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
> +       if (deferred) {
> +               /* used_in_rcu_gp may be updated concurrently by new bpf
> +                * program, so add smp_mb() to guarantee the order between
> +                * used_in_rcu_gp and lookup/deletion operation of inner map.
> +                * If a new bpf program finds the inner map before it is
> +                * removed from outer map, reading used_in_rcu_gp below will
> +                * return the newly-set bit set by the new bpf program.
> +                */
> +               smp_mb();
> +               atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp);

You resent the patches before I had time to reply to the previous thread...

> I think the main reason is that there are four possible cases for the free
> of the inner map:
> (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace()
> It is true when the outer map is only being accessed in user space.
> (2) only call synchronize_rcu()
> the outer map is only being accessed by non-sleepable bpf programs
> (3) only call synchronize_rcu_tasks_trace()
> the outer map is only being accessed by sleepable bpf programs
> (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace()
>
> Only using sleepable_refcnt cannot express the four possible cases, and we
> also need to atomically copy the state from the outer map to the inner map,
> because one inner map may be used concurrently by multiple outer maps, so
> atomic_or() and a bit mask were chosen.

We don't care about optimizing 1, since it's a rare case.
We also don't care about optimizing 3, since sync_rcu time is negligible
when we need to wait for sync_rcu_tasks_trace, and also
because rcu_trace_implies_rcu_gp() holds for the foreseeable future.

> need to atomically

we do NOT have such a need.
There is zero need for these atomic games and barriers "just to
be cautious". The code should not have any code at all "to be
cautious".
Every barrier has to have a real reason behind it.
Please remove them.
The concurrent access to refcnt and sleepable_refcnt can be serialized
with a simple spin_lock.

I also don't like
> +       BPF_MAP_RCU_GP = BIT(0),
> +       BPF_MAP_RCU_TT_GP = BIT(1),

the names should not describe what action should be taken.
The names should describe what the bits do.
I still think sleepable_refcnt and refcnt are more than enough to
optimize patch 3.
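
To make the suggestion concrete, here is a rough sketch of how sleepable_refcnt
could be maintained under plain spin_lock serialization; the lock and the
helper shown here are hypothetical and only illustrate the idea, they are not
taken from the posted patch (the snippet proposed later in this thread still
lacks the decrement side):

	/* Hypothetical sketch: keep sleepable_refcnt in sync with the sleepable
	 * programs that still hold the map, serialized by a simple spinlock.
	 */
	static void bpf_map_update_sleepable_refcnt(struct bpf_map *map, bool sleepable, bool get)
	{
		if (!sleepable)
			return;

		spin_lock(&map->sleepable_lock);	/* hypothetical lock */
		if (get)
			map->sleepable_refcnt++;	/* e.g. from resolve_pseudo_ldimm64() */
		else
			map->sleepable_refcnt--;	/* e.g. when the program's maps are released */
		spin_unlock(&map->sleepable_lock);
	}

	/* bpf_map_fd_put_ptr() would then only need to check whether
	 * sleepable_refcnt is non-zero to pick the tasks trace RCU grace period.
	 */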
Hou Tao Nov. 27, 2023, 2:47 a.m. UTC | #4
Hi Alexei,

On 11/27/2023 9:49 AM, Alexei Starovoitov wrote:
> On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When removing the inner map from the outer map, the inner map will be
>> freed after one RCU grace period and one RCU tasks trace grace
>> period, so it is certain that the bpf program, which may access the
>> inner map, has exited before the inner map is freed.
>>
>> However there is unnecessary to wait for any RCU grace period, one RCU
>> grace period or one RCU tasks trace grace period if the outer map is
>> only accessed by userspace, sleepable program or non-sleepable program.
>> So recording the sleepable attributes of the owned bpf programs when
>> adding the outer map into env->used_maps, copying the recorded
>> attributes to inner map atomically when removing inner map from the
>> outer map and using the recorded attributes in the inner map to decide
>> which, and how many, RCU grace periods are needed when freeing the
>> inner map.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>
>> +       if (deferred) {
>> +               /* used_in_rcu_gp may be updated concurrently by new bpf
>> +                * program, so add smp_mb() to guarantee the order between
>> +                * used_in_rcu_gp and lookup/deletion operation of inner map.
>> +                * If a new bpf program finds the inner map before it is
>> +                * removed from outer map, reading used_in_rcu_gp below will
>> +                * return the newly-set bit set by the new bpf program.
>> +                */
>> +               smp_mb();
>> +               atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp);
> You resent the patches before I had time to reply to the previous thread...

I hadn't received a reply since last Thursday, so I thought there were no
new suggestions from you and sent out v3, which incorporates the suggestions
from Martin. I will ping next time before sending out a new version if
there is still a pending discussion about the posted patch set.
>
>> I think the main reason is that there is four possible case for the free
>> of inner map:
>> (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace()
>> It is true when the outer map is only being accessed in user space.
>> (2) only call synchronize_rcu()
>> the outer map is only being accessed by non-sleepable bpf program
>> (3) only call synchronize_rcu_tasks_trace
>> the outer map is only being accessed by sleepable bpf program
>> (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace()
>>
>> Only using sleepable_refcnt can not express 4 possible cases and we also
>> need to atomically copy the states from outer map to inner map, because
>> one inner map may be used concurrently by multiple outer map, so atomic
>> or mask are chosen.
> We don't care about optimizing 1, since it's rare case.
> We also don't care about optimizing 3, since sync_rcu time is negligible
> when we need to wait for sync_rcu_tasks_trace and also
> because rcu_trace_implies_rcu_gp() for foreseeable future.

I see.
>
>> need to atomically
> we do NOT have such need.

Here the atomicity means that multiple updates to
inner_map->free_by_rcu_gp should not overwrite each other; it has
nothing to do with the memory barrier. Using a spin_lock will serve
the same purpose.
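
In other words (an illustrative interleaving, not code from the patch): if two
outer maps drop the same inner map concurrently, a plain read-modify-write of
free_by_rcu_gp could lose one of the bits, which atomic_or() prevents:

	/*   CPU A (non-sleepable users)          CPU B (sleepable users)
	 *   old = inner->free_by_rcu_gp;  // 0
	 *                                        old = inner->free_by_rcu_gp;  // 0
	 *   inner->free_by_rcu_gp =
	 *           old | BPF_MAP_RCU_GP;
	 *                                        inner->free_by_rcu_gp =
	 *                                                old | BPF_MAP_RCU_TT_GP;
	 *                                        // the BPF_MAP_RCU_GP bit is lost
	 *
	 * atomic_or() (or a counter update under a spinlock) cannot lose either
	 * bit, which is the "atomicity" meant above.
	 */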
> There is zero need to do this atomic games and barries "just want to
> be cautious". The code should not have any code at all "to be
> cautious".
> Every barrier has to be a real reason behind it.
> Please remove them.

Will remove the memory barrier. I think we can add it later if it is needed.
> The concurent access to refcnt and sleepable_refcnt can be serialized
> with simple spin_lock.

OK.
>
> I also don't like
>> +       BPF_MAP_RCU_GP = BIT(0),
>> +       BPF_MAP_RCU_TT_GP = BIT(1),
> the names should not describe what action should be taken.
> The names should describe what the bits do.

Understood.
> I still think sleepable_refcnt and refcnt is more than enough to
> optimize patch 3.

Before posting v4, do the following code snippets match your suggestions?

resolve_pseudo_ldimm64()
                        if (env->prog->aux->sleepable)
                                /* The max number of programs is INT_MAX - 1, so atomic_t will be enough */
                                atomic_inc(&map->sleepable_refcnt);

bpf_map_fd_put_ptr()
        if (deferred) {
                if (atomic_read(&map->sleepable_refcnt))
                        WRITE_ONCE(map->free_after_tt_rcu_gp, true);
                else
                        WRITE_ONCE(map->free_after_rcu_gp, true);
        }

bpf_map_put()
                if (READ_ONCE(map->free_after_tt_rcu_gp))
                        call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
                else if (READ_ONCE(map->free_after_rcu_gp))
                        call_rcu(&map->rcu, bpf_map_free_rcu_gp);
                else
                        bpf_map_free_in_work(map);

And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of
synchronize_rcu()/synchronize_rcu_mult()?
Hou Tao Nov. 27, 2023, 3:07 a.m. UTC | #5
Hi

On 11/27/2023 10:47 AM, Hou Tao wrote:
> Hi Alexei,
>
> On 11/27/2023 9:49 AM, Alexei Starovoitov wrote:
>> On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> When removing the inner map from the outer map, the inner map will be
>>> freed after one RCU grace period and one RCU tasks trace grace
>>> period, so it is certain that the bpf program, which may access the
>>> inner map, has exited before the inner map is freed.
>>>
>>> However there is unnecessary to wait for any RCU grace period, one RCU
>>> grace period or one RCU tasks trace grace period if the outer map is
>>> only accessed by userspace, sleepable program or non-sleepable program.
>>> So recording the sleepable attributes of the owned bpf programs when
>>> adding the outer map into env->used_maps, copying the recorded
>>> attributes to inner map atomically when removing inner map from the
>>> outer map and using the recorded attributes in the inner map to decide
>>> which, and how many, RCU grace periods are needed when freeing the
>>> inner map.
>>>
>>> Signed-off-by: Hou Tao <houtao1@huawei.com>

SNIP
>> I still think sleepable_refcnt and refcnt is more than enough to
>> optimize patch 3.
> Before posting v4,  do the following code-snippets match your suggestions ?
>
> resolve_pseudo_ldimm64()
>                         if (env->prog->aux->sleepable)
>                                /* The max number of program is INT_MAX -
> 1, so atomic_t will be enough */
>                                 atomic_inc(&map->sleepable_refcnt);
>
> bpf_map_fd_put_ptr()
>         if (deferred) {
>                 if (atomic_read(&map->sleepable_refcnt))
>                         WRITE_ONCE(map->free_after_tt_rcu_gp, true);
>                 else
>                         WRITE_ONCE(map->free_after_rcu_gp, true);
>         }
>
> bpf_map_put()
>                 if (READ_ONCE(map->free_after_tt_rcu_gp))
>                         call_rcu_tasks_trace(&map->rcu,
> bpf_map_free_mult_rcu_gp);
>                 else if (READ_ONCE(map->free_after_rcu_gp))
>                         call_rcu(&map->rcu, bpf_map_free_rcu_gp);
>                 else
>                         bpf_map_free_in_work(map);
>

Just found out that the above code snippet is missing the corresponding
atomic_dec() for sleepable_refcnt. Could we replace sleepable_refcnt with
a bool (e.g., access_in_sleepable), so the check can be simplified further?
> And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of
> synchronize_rcu()/synchronize_rcu_mult() ?
>
> .
Alexei Starovoitov Nov. 27, 2023, 3:20 a.m. UTC | #6
On Sun, Nov 26, 2023 at 6:47 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
> Before posting v4,  do the following code-snippets match your suggestions ?
>
> resolve_pseudo_ldimm64()
>                         if (env->prog->aux->sleepable)
>                                /* The max number of program is INT_MAX -
> 1, so atomic_t will be enough */
>                                 atomic_inc(&map->sleepable_refcnt);
>
> bpf_map_fd_put_ptr()
>         if (deferred) {
>                 if (atomic_read(&map->sleepable_refcnt))
>                         WRITE_ONCE(map->free_after_tt_rcu_gp, true);
>                 else
>                         WRITE_ONCE(map->free_after_rcu_gp, true);
>         }

Yes. That was an idea and I hope you see that in this form it's
much easier to understand, right?

and regarding your other question:

> Could we replace sleepable_refcnt by
> a bool (e.g, access_in_sleepable), so the check can be simplified further ?

I think you're trying to optimize too soon.
How would that bool access_in_sleepable work?
Something needs to set it, and the last sleepable program needs to unset it.
You might need a refcnt to do that.

>
> bpf_map_put()
>                 if (READ_ONCE(map->free_after_tt_rcu_gp))
>                         call_rcu_tasks_trace(&map->rcu,
> bpf_map_free_mult_rcu_gp);
>                 else if (READ_ONCE(map->free_after_rcu_gp))
>                         call_rcu(&map->rcu, bpf_map_free_rcu_gp);
>                 else
>                         bpf_map_free_in_work(map);
>
> And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of
> synchronize_rcu()/synchronize_rcu_mult() ?

Of course. Not sure what you meant by that.
bpf_map_put() cannot call sync_rcu.
Are you asking about doing queue_work first and then sync_rcu* inside?
I think it will not scale compared to the call_rcu approach.
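
For context, a rough sketch of the synchronous variant being referred to,
where the free is pushed to a workqueue and the worker blocks in
synchronize_rcu*() before freeing; the function below is an assumption based
on the v2 discussion, not actual code from that series. Every pending free
ties up a kworker for the whole grace period, which is why it scales worse
than queueing call_rcu_tasks_trace()/call_rcu() callbacks:

	/* Hypothetical sketch of the v2-style synchronous deferment. */
	static void bpf_map_free_deferred_sync(struct work_struct *work)
	{
		struct bpf_map *map = container_of(work, struct bpf_map, work);

		if (READ_ONCE(map->free_after_tt_rcu_gp))
			/* wait for both a tasks trace GP and a regular RCU GP */
			synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
		else if (READ_ONCE(map->free_after_rcu_gp))
			synchronize_rcu();

		map->ops->map_free(map);
	}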
Hou Tao Nov. 27, 2023, 3:54 a.m. UTC | #7
Hi,

On 11/27/2023 11:20 AM, Alexei Starovoitov wrote:
> On Sun, Nov 26, 2023 at 6:47 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> Before posting v4,  do the following code-snippets match your suggestions ?
>>
>> resolve_pseudo_ldimm64()
>>                         if (env->prog->aux->sleepable)
>>                                /* The max number of program is INT_MAX -
>> 1, so atomic_t will be enough */
>>                                 atomic_inc(&map->sleepable_refcnt);
>>
>> bpf_map_fd_put_ptr()
>>         if (deferred) {
>>                 if (atomic_read(&map->sleepable_refcnt))
>>                         WRITE_ONCE(map->free_after_tt_rcu_gp, true);
>>                 else
>>                         WRITE_ONCE(map->free_after_rcu_gp, true);
>>         }
> Yes. That was an idea and I hope you see that in this form it's
> much easier to understand, right?

Yes.
>
> and regarding your other question:
>
>> Could we replace sleepable_refcnt by
>> a bool (e.g, access_in_sleepable), so the check can be simplified further ?
> I think you're trying to optimize too soon.
> How would that bool access_in_sleepable work?
> Something needs to set it and the last sleepable to unset it.
> You might need a refcnt to do that.

Yes, a premature optimization. I only considered the case where one bpf map
is accessed by only one bpf program and the bpf program always
exits after the inner map is deleted from the outer map (so sleepable_refcnt
is still greater than zero when calling bpf_map_fd_put_ptr()). Will drop
the idea.
>
>> bpf_map_put()
>>                 if (READ_ONCE(map->free_after_tt_rcu_gp))
>>                         call_rcu_tasks_trace(&map->rcu,
>> bpf_map_free_mult_rcu_gp);
>>                 else if (READ_ONCE(map->free_after_rcu_gp))
>>                         call_rcu(&map->rcu, bpf_map_free_rcu_gp);
>>                 else
>>                         bpf_map_free_in_work(map);
>>
>> And are you OK with using call_rcu()/call_rcu_tasks_trace() instead of
>> synchronize_rcu()/synchronize_rcu_mult() ?
> Of course. Not sure what you meant by that.
> bpf_map_put() cannot call sync_rcu.
> Are you asking about doing queue_work first and then sync_rcu* inside?
> I think it will not scale compared to call_rcu approach.
> .

Yes, I meant the deferment implementation in v2, which calls sync_rcu*()
in a kworker. Will still use call_rcu*() in v4. Thanks for all these
suggestions.
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 15a6bb951b70..c5b549f352d7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -245,6 +245,11 @@  struct bpf_list_node_kern {
 	void *owner;
 } __attribute__((aligned(8)));
 
+enum {
+	BPF_MAP_RCU_GP = BIT(0),
+	BPF_MAP_RCU_TT_GP = BIT(1),
+};
+
 struct bpf_map {
 	/* The first two cachelines with read-mostly members of which some
 	 * are also accessed in fast-path (e.g. ops, max_entries).
@@ -296,7 +301,8 @@  struct bpf_map {
 	} owner;
 	bool bypass_spec_v1;
 	bool frozen; /* write-once; write-protected by freeze_mutex */
-	bool free_after_mult_rcu_gp;
+	atomic_t used_in_rcu_gp;
+	atomic_t free_by_rcu_gp;
 	s64 __percpu *elem_count;
 };
 
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index cf3363065566..d044ee677107 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -131,12 +131,21 @@  void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
 {
 	struct bpf_map *inner_map = ptr;
 
-	/* The inner map may still be used by both non-sleepable and sleepable
-	 * bpf program, so free it after one RCU grace period and one tasks
-	 * trace RCU grace period.
+	/* Defer the freeing of inner map according to the attribute of bpf
+	 * program which owns the outer map, so unnecessary multiple RCU GP
+	 * waitings can be avoided.
 	 */
-	if (deferred)
-		WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
+	if (deferred) {
+		/* used_in_rcu_gp may be updated concurrently by new bpf
+		 * program, so add smp_mb() to guarantee the order between
+		 * used_in_rcu_gp and lookup/deletion operation of inner map.
+		 * If a new bpf program finds the inner map before it is
+		 * removed from outer map, reading used_in_rcu_gp below will
+		 * return the newly-set bit set by the new bpf program.
+		 */
+		smp_mb();
+		atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp);
+	}
 	bpf_map_put(inner_map);
 }
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 88882cb58121..014a8cd55a41 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -734,7 +734,10 @@  static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
 
 static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
 {
-	if (rcu_trace_implies_rcu_gp())
+	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
+
+	if (!(atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP) ||
+	    rcu_trace_implies_rcu_gp())
 		bpf_map_free_rcu_gp(rcu);
 	else
 		call_rcu(rcu, bpf_map_free_rcu_gp);
@@ -746,11 +749,16 @@  static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
 void bpf_map_put(struct bpf_map *map)
 {
 	if (atomic64_dec_and_test(&map->refcnt)) {
+		int free_by_rcu_gp;
+
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map);
 		btf_put(map->btf);
 
-		if (READ_ONCE(map->free_after_mult_rcu_gp))
+		free_by_rcu_gp = atomic_read(&map->free_by_rcu_gp);
+		if (free_by_rcu_gp == BPF_MAP_RCU_GP)
+			call_rcu(&map->rcu, bpf_map_free_rcu_gp);
+		else if (free_by_rcu_gp)
 			call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);
 		else
 			bpf_map_free_in_work(map);
@@ -5343,6 +5351,9 @@  static int bpf_prog_bind_map(union bpf_attr *attr)
 		goto out_unlock;
 	}
 
+	/* No need to update used_in_rcu_gp, because the bpf program doesn't
+	 * access the map.
+	 */
 	memcpy(used_maps_new, used_maps_old,
 	       sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
 	used_maps_new[prog->aux->used_map_cnt] = map;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6da370a047fe..3b86c02077f1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18051,6 +18051,10 @@  static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
 				return -E2BIG;
 			}
 
+			atomic_or(env->prog->aux->sleepable ? BPF_MAP_RCU_TT_GP : BPF_MAP_RCU_GP,
+				  &map->used_in_rcu_gp);
+			/* Pairs with smp_mb() in bpf_map_fd_put_ptr() */
+			smp_mb__before_atomic();
 			/* hold the map. If the program is rejected by verifier,
 			 * the map will be released by release_maps() or it
 			 * will be used by the valid program until it's unloaded