[bpf-next,v3,1/2] bpf: Reduce the scope of rcu_read_lock when updating fd map

Message ID 20231214043010.3458072-2-houtao@huaweicloud.com (mailing list archive)
State Accepted
Commit 8f82583f9527b3be9d70d9a5d1f33435e29d0480
Delegated to: BPF
Series bpf: Use GFP_KERNEL in bpf_event_entry_gen()

Commit Message

Hou Tao Dec. 14, 2023, 4:30 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
callbacks.

For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
rcu-read-lock because array->ptrs must still be allocated. For
bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
rcu_read_lock() during the invocation of htab_map_update_elem().
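
For context, the assertion in question is, roughly, the lockdep-style
check at the top of htab_map_update_elem():

    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
                 !rcu_read_lock_bh_held());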

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 6 ++++++
 kernel/bpf/syscall.c | 4 ----
 2 files changed, 6 insertions(+), 4 deletions(-)

Comments

John Fastabend Dec. 14, 2023, 6:22 a.m. UTC | #1
Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
> callbacks.
> 
> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
> rcu-read-lock because array->ptrs must still be allocated. For
> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
> rcu_read_lock() during the invocation of htab_map_update_elem().
> 
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/hashtab.c | 6 ++++++
>  kernel/bpf/syscall.c | 4 ----
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 5b9146fa825f..ec3bdcc6a3cf 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
>  	if (IS_ERR(ptr))
>  		return PTR_ERR(ptr);
>  
> +	/* The htab bucket lock is always held during update operations in fd
> +	 * htab map, and the following rcu_read_lock() is only used to avoid
> +	 * the WARN_ON_ONCE in htab_map_update_elem().
> +	 */
> +	rcu_read_lock();
>  	ret = htab_map_update_elem(map, key, &ptr, map_flags);
> +	rcu_read_unlock();

Did we consider dropping the WARN_ON_ONCE in htab_map_update_elem()? It
looks like there are two ways to get to htab_map_update_elem() either
through a syscall and the path here (bpf_fd_htab_map_update_elem) or
through a BPF program calling bpf_update_elem()? In the BPF_CALL
case bpf_map_update_elem() already has,

   WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held())

The htab_map_update_elem() has an additional check for
rcu_read_lock_trace_held(), but not sure where this is coming from
at the moment. Can that be added to the BPF caller side if needed?

Did I miss some caller path?
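
For reference, that BPF_CALL wrapper in kernel/bpf/helpers.c is roughly:

    BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
               void *, value, u64, flags)
    {
            WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
            return map->ops->map_update_elem(map, key, value, flags);
    }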

 

>  	if (ret)
>  		map->ops->map_fd_put_ptr(map, ptr, false);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d63c1ed42412..3fcf7741146a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -184,15 +184,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>  		err = bpf_percpu_cgroup_storage_update(map, key, value,
>  						       flags);
>  	} else if (IS_FD_ARRAY(map)) {
> -		rcu_read_lock();
>  		err = bpf_fd_array_map_update_elem(map, map_file, key, value,
>  						   flags);
> -		rcu_read_unlock();
>  	} else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
> -		rcu_read_lock();
>  		err = bpf_fd_htab_map_update_elem(map, map_file, key, value,
>  						  flags);
> -		rcu_read_unlock();
>  	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
>  		/* rcu_read_lock() is not needed */
>  		err = bpf_fd_reuseport_array_update_elem(map, key, value,

Any reason to leave the last rcu_read_lock() in the 'else{}' case? If
the rule is that we have a reference to the map through the file fdget(),
then any concurrent runners need some locking (xchg) to handle the
update; a rcu_read_lock() won't help there.

I didn't audit all the update flows tonight though.
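
For reference, the tail of bpf_map_update_value() in question is roughly:

    } else {
            rcu_read_lock();
            err = map->ops->map_update_elem(map, key, value, flags);
            rcu_read_unlock();
    }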


> -- 
> 2.29.2
> 
>
Hou Tao Dec. 14, 2023, 7:31 a.m. UTC | #2
Hi,

On 12/14/2023 2:22 PM, John Fastabend wrote:
> Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
>> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
>> callbacks.
>>
>> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
>> rcu-read-lock because array->ptrs must still be allocated. For
>> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
>> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
>> rcu_read_lock() during the invocation of htab_map_update_elem().
>>
>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  kernel/bpf/hashtab.c | 6 ++++++
>>  kernel/bpf/syscall.c | 4 ----
>>  2 files changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 5b9146fa825f..ec3bdcc6a3cf 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
>>  	if (IS_ERR(ptr))
>>  		return PTR_ERR(ptr);
>>  
>> +	/* The htab bucket lock is always held during update operations in fd
>> +	 * htab map, and the following rcu_read_lock() is only used to avoid
>> +	 * the WARN_ON_ONCE in htab_map_update_elem().
>> +	 */
>> +	rcu_read_lock();
>>  	ret = htab_map_update_elem(map, key, &ptr, map_flags);
>> +	rcu_read_unlock();
> Did we consider dropping the WARN_ON_ONCE in htab_map_update_elem()? It
> looks like there are two ways to get to htab_map_update_elem() either
> through a syscall and the path here (bpf_fd_htab_map_update_elem) or
> > through a BPF program calling bpf_update_elem()? In the BPF_CALL
> case bpf_map_update_elem() already has,
>
>    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held())
>
> The htab_map_update_elem() has an additional check for
> rcu_read_lock_trace_held(), but not sure where this is coming from
> at the moment. Can that be added to the BPF caller side if needed?
>
> Did I miss some caller path?

No. But I think the main reason for the extra WARN in
bpf_map_update_elem() is that bpf_map_update_elem() may be inlined by
the verifier in do_misc_fixups(), so the WARN_ON_ONCE in
bpf_map_update_elem() will not be invoked ever. For
rcu_read_lock_trace_held(), I have added the assertion in
bpf_map_delete_elem() recently in commit 169410eba271 ("bpf: Check
rcu_read_lock_trace_held() before calling bpf map helpers").
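
Roughly, after that commit the delete helper asserts all three read-side
flavors:

    BPF_CALL_2(bpf_map_delete_elem, struct bpf_map *, map, void *, key)
    {
            WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
                         !rcu_read_lock_bh_held());
            return map->ops->map_delete_elem(map, key);
    }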
>  
>
>>  	if (ret)
>>  		map->ops->map_fd_put_ptr(map, ptr, false);
>>  
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index d63c1ed42412..3fcf7741146a 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -184,15 +184,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>>  		err = bpf_percpu_cgroup_storage_update(map, key, value,
>>  						       flags);
>>  	} else if (IS_FD_ARRAY(map)) {
>> -		rcu_read_lock();
>>  		err = bpf_fd_array_map_update_elem(map, map_file, key, value,
>>  						   flags);
>> -		rcu_read_unlock();
>>  	} else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
>> -		rcu_read_lock();
>>  		err = bpf_fd_htab_map_update_elem(map, map_file, key, value,
>>  						  flags);
>> -		rcu_read_unlock();
>>  	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
>>  		/* rcu_read_lock() is not needed */
>>  		err = bpf_fd_reuseport_array_update_elem(map, key, value,
> Any reason to leave the last rcu_read_lock() in the 'else{}' case? If
> the rule is that we have a reference to the map through the file fdget(),
> then any concurrent runners need some locking (xchg) to handle the
> update; a rcu_read_lock() won't help there.
>
> I didn't audit all the update flows tonight though.

It seems it is still necessary for htab and local storage. For a normal
htab, the update may be done without taking the bucket lock (in-place
replace), so an RCU read-side critical section is needed to guarantee the
bucket iteration is still safe. And local storage (e.g. cgrp local
storage) may also do an in-place update through a lookup followed by an
update. We could fold the rcu_read_lock() call into .map_update_elem()
if necessary.
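
For example, the BPF_F_LOCK path of htab_map_update_elem() does the
bucket walk before taking any bucket lock (abridged sketch, not the
exact code):

    if (unlikely(map_flags & BPF_F_LOCK)) {
            /* lockless bucket walk: the RCU read-side critical section
             * is what keeps this iteration safe
             */
            l_old = lookup_nulls_elem_raw(head, hash, key, key_size,
                                          htab->n_buckets);
            if (l_old) {
                    /* in-place update under the element's bpf_spin_lock,
                     * no bucket lock is taken
                     */
                    copy_map_value_locked(map,
                                          l_old->key + round_up(key_size, 8),
                                          value, false);
                    return 0;
            }
    }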
>
>
>> -- 
>> 2.29.2
>>
>>
Alexei Starovoitov Dec. 14, 2023, 1:55 p.m. UTC | #3
On Wed, Dec 13, 2023 at 11:31 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 12/14/2023 2:22 PM, John Fastabend wrote:
> > Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
> >> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
> >> callbacks.
> >>
> >> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
> >> rcu-read-lock because array->ptrs must still be allocated. For
> >> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
> >> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
> >> rcu_read_lock() during the invocation of htab_map_update_elem().
> >>
> >> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >>  kernel/bpf/hashtab.c | 6 ++++++
> >>  kernel/bpf/syscall.c | 4 ----
> >>  2 files changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >> index 5b9146fa825f..ec3bdcc6a3cf 100644
> >> --- a/kernel/bpf/hashtab.c
> >> +++ b/kernel/bpf/hashtab.c
> >> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
> >>      if (IS_ERR(ptr))
> >>              return PTR_ERR(ptr);
> >>
> >> +    /* The htab bucket lock is always held during update operations in fd
> >> +     * htab map, and the following rcu_read_lock() is only used to avoid
> >> +     * the WARN_ON_ONCE in htab_map_update_elem().
> >> +     */
> >> +    rcu_read_lock();
> >>      ret = htab_map_update_elem(map, key, &ptr, map_flags);
> >> +    rcu_read_unlock();
> > Did we consider dropping the WARN_ON_ONCE in htab_map_update_elem()? It
> > looks like there are two ways to get to htab_map_update_elem() either
> > through a syscall and the path here (bpf_fd_htab_map_update_elem) or
> > through a BPF program calling bpf_update_elem()? In the BPF_CALL
> > case bpf_map_update_elem() already has,
> >
> >    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held())
> >
> > The htab_map_update_elem() has an additional check for
> > rcu_read_lock_trace_held(), but not sure where this is coming from
> > at the moment. Can that be added to the BPF caller side if needed?
> >
> > Did I miss some caller path?
>
> No. But I think the main reason for the extra WARN in
> bpf_map_update_elem() is that bpf_map_update_elem() may be inlined by
> the verifier in do_misc_fixups(), so the WARN_ON_ONCE in
> bpf_map_update_elem() will not be invoked ever. For
> rcu_read_lock_trace_held(), I have added the assertion in
> bpf_map_delete_elem() recently in commit 169410eba271 ("bpf: Check
> rcu_read_lock_trace_held() before calling bpf map helpers").

Yep.
We should probably remove WARN_ONs from
bpf_map_update_elem() and others in kernel/bpf/helpers.c
since they are inlined by the verifier with 99% probability
and the WARNs are never called even in DEBUG kernels.
And confusing developers. As this thread shows.

We can replace them with a comment that explains this inlining logic
and where the real WARNs are.
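
E.g. a hypothetical sketch of such a replacement comment:

    /* This helper is normally inlined by the verifier in
     * do_misc_fixups(), so the body below rarely runs; the real RCU
     * assertions live in the per-map implementations, e.g.
     * htab_map_update_elem().
     */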
John Fastabend Dec. 14, 2023, 7:15 p.m. UTC | #4
Alexei Starovoitov wrote:
> On Wed, Dec 13, 2023 at 11:31 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 12/14/2023 2:22 PM, John Fastabend wrote:
> > > Hou Tao wrote:
> > >> From: Hou Tao <houtao1@huawei.com>
> > >>
> > >> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
> > >> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
> > >> callbacks.
> > >>
> > >> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
> > >> rcu-read-lock because array->ptrs must still be allocated. For
> > >> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
> > >> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
> > >> rcu_read_lock() during the invocation of htab_map_update_elem().
> > >>
> > >> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> > >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >> ---
> > >>  kernel/bpf/hashtab.c | 6 ++++++
> > >>  kernel/bpf/syscall.c | 4 ----
> > >>  2 files changed, 6 insertions(+), 4 deletions(-)
> > >>
> > >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > >> index 5b9146fa825f..ec3bdcc6a3cf 100644
> > >> --- a/kernel/bpf/hashtab.c
> > >> +++ b/kernel/bpf/hashtab.c
> > >> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
> > >>      if (IS_ERR(ptr))
> > >>              return PTR_ERR(ptr);
> > >>
> > >> +    /* The htab bucket lock is always held during update operations in fd
> > >> +     * htab map, and the following rcu_read_lock() is only used to avoid
> > >> +     * the WARN_ON_ONCE in htab_map_update_elem().
> > >> +     */

Ah ok but isn't this comment wrong because you do need rcu read lock to do
the walk with lookup_nulls_elem_raw where there is no lock being held? And
then the subsequent copy in place is fine because you do have a lock.

So it's not just to appease the WARN_ON_ONCE here; it has an actual real
need?

> > >> +    rcu_read_lock();
> > >>      ret = htab_map_update_elem(map, key, &ptr, map_flags);
> > >> +    rcu_read_unlock();
> > > Did we consider dropping the WARN_ON_ONCE in htab_map_update_elem()? It
> > > looks like there are two ways to get to htab_map_update_elem() either
> > > through a syscall and the path here (bpf_fd_htab_map_update_elem) or
> > > through a BPF program calling bpf_update_elem()? In the BPF_CALL
> > > case bpf_map_update_elem() already has,
> > >
> > >    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held())
> > >
> > > The htab_map_update_elem() has an additional check for
> > > rcu_read_lock_trace_held(), but not sure where this is coming from
> > > at the moment. Can that be added to the BPF caller side if needed?
> > >
> > > Did I miss some caller path?
> >
> > No. But I think the main reason for the extra WARN in
> > bpf_map_update_elem() is that bpf_map_update_elem() may be inlined by
> > the verifier in do_misc_fixups(), so the WARN_ON_ONCE in
> > bpf_map_update_elem() will not be invoked ever. For
> > rcu_read_lock_trace_held(), I have added the assertion in
> > bpf_map_delete_elem() recently in commit 169410eba271 ("bpf: Check
> > rcu_read_lock_trace_held() before calling bpf map helpers").
> 
> Yep.
> We should probably remove WARN_ONs from
> bpf_map_update_elem() and others in kernel/bpf/helpers.c
> since they are inlined by the verifier with 99% probability
> and the WARNs are never called even in DEBUG kernels.
> And confusing developers. As this thread shows.

Agree. The rcu_read_lock() needs to be as close as possible to where it's
actually needed, and the WARN_ON_ONCE should be dropped if it's going to
be inlined.

> 
> We can replace them with a comment that explains this inlining logic
> and where the real WARNs are.
Alexei Starovoitov Dec. 15, 2023, 3:23 a.m. UTC | #5
On Thu, Dec 14, 2023 at 11:15 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Alexei Starovoitov wrote:
> > On Wed, Dec 13, 2023 at 11:31 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 12/14/2023 2:22 PM, John Fastabend wrote:
> > > > Hou Tao wrote:
> > > >> From: Hou Tao <houtao1@huawei.com>
> > > >>
> > > >> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
> > > >> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
> > > >> callbacks.
> > > >>
> > > >> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
> > > >> rcu-read-lock because array->ptrs must still be allocated. For
> > > >> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
> > > >> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
> > > >> rcu_read_lock() during the invocation of htab_map_update_elem().
> > > >>
> > > >> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> > > >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > > >> ---
> > > >>  kernel/bpf/hashtab.c | 6 ++++++
> > > >>  kernel/bpf/syscall.c | 4 ----
> > > >>  2 files changed, 6 insertions(+), 4 deletions(-)
> > > >>
> > > >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > >> index 5b9146fa825f..ec3bdcc6a3cf 100644
> > > >> --- a/kernel/bpf/hashtab.c
> > > >> +++ b/kernel/bpf/hashtab.c
> > > >> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
> > > >>      if (IS_ERR(ptr))
> > > >>              return PTR_ERR(ptr);
> > > >>
> > > >> +    /* The htab bucket lock is always held during update operations in fd
> > > >> +     * htab map, and the following rcu_read_lock() is only used to avoid
> > > >> +     * the WARN_ON_ONCE in htab_map_update_elem().
> > > >> +     */
>
> Ah ok but isn't this comment wrong because you do need rcu read lock to do
> the walk with lookup_nulls_elem_raw where there is no lock being held? And
> then the subsequent copy in place is fine because you do have a lock.

Ohh. You're correct.
Not sure what I was thinking.

Hou,
could you please send a follow up to undo my braino.
Hou Tao Dec. 15, 2023, 3:39 a.m. UTC | #6
Hi,

On 12/15/2023 11:23 AM, Alexei Starovoitov wrote:
> On Thu, Dec 14, 2023 at 11:15 AM John Fastabend
> <john.fastabend@gmail.com> wrote:
>> Alexei Starovoitov wrote:
>>> On Wed, Dec 13, 2023 at 11:31 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi,
>>>>
>>>> On 12/14/2023 2:22 PM, John Fastabend wrote:
>>>>> Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
>>>>>> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
>>>>>> callbacks.
>>>>>>
>>>>>> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
>>>>>> rcu-read-lock because array->ptrs must still be allocated. For
>>>>>> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
>>>>>> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
>>>>>> rcu_read_lock() during the invocation of htab_map_update_elem().
>>>>>>
>>>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>>> ---
>>>>>>  kernel/bpf/hashtab.c | 6 ++++++
>>>>>>  kernel/bpf/syscall.c | 4 ----
>>>>>>  2 files changed, 6 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>> index 5b9146fa825f..ec3bdcc6a3cf 100644
>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
>>>>>>      if (IS_ERR(ptr))
>>>>>>              return PTR_ERR(ptr);
>>>>>>
>>>>>> +    /* The htab bucket lock is always held during update operations in fd
>>>>>> +     * htab map, and the following rcu_read_lock() is only used to avoid
>>>>>> +     * the WARN_ON_ONCE in htab_map_update_elem().
>>>>>> +     */
>> Ah ok but isn't this comment wrong because you do need rcu read lock to do
>> the walk with lookup_nulls_elem_raw where there is no lock being held? And
>> then the subsequent copy in place is fine because you do have a lock.
> Ohh. You're correct.
> Not sure what I was thinking.
>
> Hou,
> could you please send a follow up to undo my braino.
Er, I didn't follow. There is no spin-lock (BPF_F_LOCK) support in the fd
htab map, so htab_map_update_elem() won't call lookup_nulls_elem_raw();
instead it will lock the bucket and invoke lookup_elem_raw(). So I don't
think rcu_read_lock() is really needed for the invocation of
htab_map_update_elem(), except to make the WARN_ON_ONCE() happy.
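
I.e. for the fd htab the update always goes through the locked path,
roughly:

    b = __select_bucket(htab, hash);
    head = &b->head;

    ret = htab_lock_bucket(htab, b, hash, &flags);
    if (ret)
            return ret;

    /* the bucket lock is held, so no RCU is needed for this walk */
    l_old = lookup_elem_raw(head, hash, key, key_size);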
Hou Tao Dec. 15, 2023, 8:18 a.m. UTC | #7
Hi,

On 12/15/2023 3:15 AM, John Fastabend wrote:
> Alexei Starovoitov wrote:
>> On Wed, Dec 13, 2023 at 11:31 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 12/14/2023 2:22 PM, John Fastabend wrote:
>>>> Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> There is no rcu-read-lock requirement for ops->map_fd_get_ptr() or
>>>>> ops->map_fd_put_ptr(), so don't use rcu-read-lock for these two
>>>>> callbacks.
>>>>>
>>>>> For bpf_fd_array_map_update_elem(), accessing array->ptrs doesn't need
>>>>> rcu-read-lock because array->ptrs must still be allocated. For
>>>>> bpf_fd_htab_map_update_elem(), htab_map_update_elem() only requires
>>>>> rcu-read-lock to be held to avoid the WARN_ON_ONCE(), so only use
>>>>> rcu_read_lock() during the invocation of htab_map_update_elem().
>>>>>
>>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>> ---
>>>>>  kernel/bpf/hashtab.c | 6 ++++++
>>>>>  kernel/bpf/syscall.c | 4 ----
>>>>>  2 files changed, 6 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index 5b9146fa825f..ec3bdcc6a3cf 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -2523,7 +2523,13 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
>>>>>      if (IS_ERR(ptr))
>>>>>              return PTR_ERR(ptr);
>>>>>
>>>>> +    /* The htab bucket lock is always held during update operations in fd
>>>>> +     * htab map, and the following rcu_read_lock() is only used to avoid
>>>>> +     * the WARN_ON_ONCE in htab_map_update_elem().
>>>>> +     */
> Ah ok but isn't this comment wrong because you do need rcu read lock to do
> the walk with lookup_nulls_elem_raw where there is no lock being held? And
> then the subsequent copy in place is fine because you do have a lock.
>
> So it's not just to appease the WARN_ON_ONCE here; it has an actual real
> need?
>
>>>>> +    rcu_read_lock();
>>>>>      ret = htab_map_update_elem(map, key, &ptr, map_flags);
>>>>> +    rcu_read_unlock();
>>>> Did we consider dropping the WARN_ON_ONCE in htab_map_update_elem()? It
>>>> looks like there are two ways to get to htab_map_update_elem() either
>>>> through a syscall and the path here (bpf_fd_htab_map_update_elem) or
>>>> through a BPF program calling bpf_update_elem()? In the BPF_CALL
>>>> case bpf_map_update_elem() already has,
>>>>
>>>>    WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_bh_held())
>>>>
>>>> The htab_map_update_elem() has an additional check for
>>>> rcu_read_lock_trace_held(), but not sure where this is coming from
>>>> at the moment. Can that be added to the BPF caller side if needed?
>>>>
>>>> Did I miss some caller path?
>>> No. But I think the main reason for the extra WARN in
>>> bpf_map_update_elem() is that bpf_map_update_elem() may be inlined by
>>> the verifier in do_misc_fixups(), so the WARN_ON_ONCE in
>>> bpf_map_update_elem() will not be invoked ever. For
>>> rcu_read_lock_trace_held(), I have added the assertion in
>>> bpf_map_delete_elem() recently in commit 169410eba271 ("bpf: Check
>>> rcu_read_lock_trace_held() before calling bpf map helpers").
>> Yep.
>> We should probably remove WARN_ONs from
>> bpf_map_update_elem() and others in kernel/bpf/helpers.c
>> since they are inlined by the verifier with 99% probability
>> and the WARNs are never called even in DEBUG kernels.
>> And confusing developers. As this thread shows.
> Agree. The rcu_read_lock() needs to be as close as possible to where it's
> actually needed, and the WARN_ON_ONCE should be dropped if it's going to
> be inlined.

I did some investigation on these bpf map helpers and their
implementations in the various kinds of bpf maps. It seems most
implementations (besides dev_map_hash_ops) have already added proper RCU
lock assertions, so I think it is indeed OK to remove the WARN_ON_ONCE()
from these bpf map helpers after fixing the assertion in
dev_map_hash_ops. The details follow:

1. bpf_map_lookup_elem helper
(a) hash/lru_hash/percpu_hash/lru_percpu_hash
with !rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held() in __htab_map_lookup_elem()

(b) array/percpu_array
no deletion, so no RCU

(c) lpm_trie
with rcu_read_lock_bh_held() in trie_lookup_elem()

(d) htab_of_maps
with !rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held() in __htab_map_lookup_elem()

(e) array_of_maps
no deletion, so no RCU

(f) sockmap
rcu_read_lock_held() in __sock_map_lookup_elem()

(g) sockhash
rcu_read_lock_held() in __sock_hash_lookup_elem()

(h) devmap
rcu_read_lock_bh_held() in __dev_map_lookup_elem()

(i) devmap_hash (incorrect assertion; see the sketch after this list)
No rcu_read_lock_bh_held() check in __dev_map_hash_lookup_elem()

(j) xskmap
rcu_read_lock_bh_held() in __xsk_map_lookup_elem()

2. bpf_map_update_elem helper
(a) hash/lru_hash/percpu_hash/lru_percpu_hash
with !rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held() in
htab_map_update_elem()/htab_lru_map_update_elem()/__htab_percpu_map_update_elem()/__htab_lru_percpu_map_update_elem()

(b) array/percpu_array
no RCU

(c) lpm_trie
use spin-lock, and no RCU

(d) sockmap
use spin-lock & with rcu_read_lock_held() in sock_map_update_common()

(e) sockhash
use spin-lock & with rcu_read_lock_held() in sock_hash_update_common()

3. bpf_map_delete_elem helper
(a) hash/lru_hash/percpu_hash/lru_percpu_hash
with !rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held() in htab_map_delete_elem/htab_lru_map_delete_elem

(b) array/percpu_array
no support

(c) lpm_trie
use spin-lock, no rcu

(d) sockmap
use spin-lock

(e) sockhash
use spin-lock

4. bpf_map_lookup_percpu_elem
(a) percpu_hash/lru_percpu_hash
with !rcu_read_lock_held() && !rcu_read_lock_trace_held() &&
!rcu_read_lock_bh_held() in __htab_map_lookup_elem()

(b) percpu_array
no deletion, no RCU

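For 1(i) above, the array-based lookup already accepts the RCU-bh read
side, so a hypothetical fix is to make the hash variant's walk use a
matching condition. The existing check in __dev_map_lookup_elem() is:

    obj = rcu_dereference_check(dtab->netdev_map[key],
                                rcu_read_lock_bh_held());

and the hlist_for_each_entry_rcu() walk in __dev_map_hash_lookup_elem()
could pass an equivalent lockdep expression.
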
>> We can replace them with a comment that explains this inlining logic
>> and where the real WARNs are.

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 5b9146fa825f..ec3bdcc6a3cf 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2523,7 +2523,13 @@  int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
 
+	/* The htab bucket lock is always held during update operations in fd
+	 * htab map, and the following rcu_read_lock() is only used to avoid
+	 * the WARN_ON_ONCE in htab_map_update_elem().
+	 */
+	rcu_read_lock();
 	ret = htab_map_update_elem(map, key, &ptr, map_flags);
+	rcu_read_unlock();
 	if (ret)
 		map->ops->map_fd_put_ptr(map, ptr, false);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d63c1ed42412..3fcf7741146a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -184,15 +184,11 @@  static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 		err = bpf_percpu_cgroup_storage_update(map, key, value,
 						       flags);
 	} else if (IS_FD_ARRAY(map)) {
-		rcu_read_lock();
 		err = bpf_fd_array_map_update_elem(map, map_file, key, value,
 						   flags);
-		rcu_read_unlock();
 	} else if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS) {
-		rcu_read_lock();
 		err = bpf_fd_htab_map_update_elem(map, map_file, key, value,
 						  flags);
-		rcu_read_unlock();
 	} else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) {
 		/* rcu_read_lock() is not needed */
 		err = bpf_fd_reuseport_array_update_elem(map, key, value,