
[bpf-next,7/7] bpf: Wait for sleepable BPF program in maybe_wait_bpf_programs()

Message ID 20231208102355.2628918-8-houtao@huaweicloud.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series bpf: Fixes for maybe_wait_bpf_programs()

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success SINGLE THREAD; Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1117 this patch: 1117
netdev/cc_maintainers success CCed 12 of 12 maintainers
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1144 this patch: 1144
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-16 / veristat

Commit Message

Hou Tao Dec. 8, 2023, 10:23 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

Since commit 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in
sleepable programs"), sleepable BPF programs can use map-in-map, but
maybe_wait_bpf_programs() doesn't take that into account.

So check the value of sleepable_refcnt in maybe_wait_bpf_programs():
if it is non-zero, use synchronize_rcu_mult() to wait for both sleepable
and non-sleepable BPF programs. However, a bpf syscall issued from a
syscall program is special, because the bpf syscall is invoked with
rcu_read_lock_trace() held, and there would be a dead-lock if
synchronize_rcu_mult() were used to wait for the exit of sleepable BPF
programs, so just skip the waiting for sleepable BPF programs when the
bpf syscall comes from a syscall program.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)
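
For context, the scenario the last paragraph of the commit message
guards against can be sketched as follows. This is a hypothetical
BPF_PROG_TYPE_SYSCALL program (struct layout and names are illustrative,
not from this series); the kernel runs such programs under
rcu_read_lock_trace(), so any wait the update triggers must not include
a tasks-trace grace period:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Filled in by the loader; names are illustrative. */
struct args {
	int outer_map_fd;
	__u32 slot;
	int inner_map_fd;
};

static inline __u64 ptr_to_u64(const void *ptr)
{
	return (__u64)(unsigned long)ptr;
}

SEC("syscall")
int update_outer_map(struct args *ctx)
{
	union bpf_attr attr = {};
	__u32 slot = ctx->slot;
	int inner_fd = ctx->inner_map_fd;

	attr.map_fd = ctx->outer_map_fd;
	attr.key = ptr_to_u64(&slot);
	attr.value = ptr_to_u64(&inner_fd);
	/* Reaches map_update_elem() -> maybe_wait_bpf_programs() while
	 * rcu_read_lock_trace() is still held around this program.
	 */
	return bpf_sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

char _license[] SEC("license") = "GPL";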

Comments

Andrii Nakryiko Dec. 8, 2023, 10:30 p.m. UTC | #1
On Fri, Dec 8, 2023 at 2:23 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Since commit 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in
> sleepable programs"), sleepable BPF programs can use map-in-map, but
> maybe_wait_bpf_programs() doesn't take that into account.
>
> So check the value of sleepable_refcnt in maybe_wait_bpf_programs():
> if it is non-zero, use synchronize_rcu_mult() to wait for both sleepable
> and non-sleepable BPF programs. However, a bpf syscall issued from a
> syscall program is special, because the bpf syscall is invoked with
> rcu_read_lock_trace() held, and there would be a dead-lock if
> synchronize_rcu_mult() were used to wait for the exit of sleepable BPF
> programs, so just skip the waiting for sleepable BPF programs when the
> bpf syscall comes from a syscall program.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/syscall.c | 28 +++++++++++++++++++---------
>  1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d2641e51a1a7..6b9d7990d95f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -35,6 +35,7 @@
>  #include <linux/rcupdate_trace.h>
>  #include <linux/memcontrol.h>
>  #include <linux/trace_events.h>
> +#include <linux/rcupdate_wait.h>
>
>  #include <net/netfilter/nf_bpf_link.h>
>  #include <net/netkit.h>
> @@ -140,15 +141,24 @@ static u32 bpf_map_value_size(const struct bpf_map *map)
>                 return  map->value_size;
>  }
>
> -static void maybe_wait_bpf_programs(struct bpf_map *map)
> +static void maybe_wait_bpf_programs(struct bpf_map *map, bool rcu_trace_lock_held)
>  {
> -       /* Wait for any running BPF programs to complete so that
> -        * userspace, when we return to it, knows that all programs
> -        * that could be running use the new map value.
> +       /* Wait for any running non-sleepable and sleepable BPF programs to
> +        * complete, so that userspace, when we return to it, knows that all
> +        * programs that could be running use the new map value. However, a
> +        * syscall program can also use the bpf syscall to update or delete an
> +        * inner map in an outer map, and it holds rcu_read_lock_trace() while
> +        * doing so. Using synchronize_rcu_mult(call_rcu_tasks_trace) to wait
> +        * for the exit of running sleepable BPF programs would then dead-lock,
> +        * so skip the wait for sleepable BPF programs for a syscall program.
>          */
>         if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
> -           map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
> -               synchronize_rcu();
> +           map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
> +               if (atomic64_read(&map->sleepable_refcnt) && !rcu_trace_lock_held)

why is this correct and non-racy without holding used_maps_mutex under
which this sleepable_refcnt is incremented?

> +                       synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
> +               else
> +                       synchronize_rcu();
> +       }
>  }
>
>  static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
> @@ -1561,7 +1571,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>
>         err = bpf_map_update_value(map, f.file, key, value, attr->flags);
>         if (!err)
> -               maybe_wait_bpf_programs(map);
> +               maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
>
>         kvfree(value);
>  free_key:
> @@ -1618,7 +1628,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
>         rcu_read_unlock();
>         bpf_enable_instrumentation();
>         if (!err)
> -               maybe_wait_bpf_programs(map);
> +               maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
>  out:
>         kvfree(key);
>  err_put:
> @@ -4973,7 +4983,7 @@ static int bpf_map_do_batch(union bpf_attr *attr,
>  err_put:
>         if (has_write) {
>                 if (attr->batch.count)
> -                       maybe_wait_bpf_programs(map);
> +                       maybe_wait_bpf_programs(map, false);
>                 bpf_map_write_active_dec(map);
>         }
>         fdput(f);
> --
> 2.29.2
>
Alexei Starovoitov Dec. 10, 2023, 2:11 a.m. UTC | #2
On Fri, Dec 8, 2023 at 2:22 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> +       /* Wait for any running non-sleepable and sleepable BPF programs to
> +        * complete, so that userspace, when we return to it, knows that all
> +        * programs that could be running use the new map value.

which could be forever... and the user space task doing a simple map update
will never know why it got stuck in the syscall waiting... forever...
Synchronous waiting for tasks_trace is never an option.
Hou Tao Dec. 11, 2023, 1:17 a.m. UTC | #3
Hi,

On 12/9/2023 6:30 AM, Andrii Nakryiko wrote:
> On Fri, Dec 8, 2023 at 2:23 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Since commit 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in
>> sleepable programs"), sleepable BPF programs can use map-in-map, but
>> maybe_wait_bpf_programs() doesn't take that into account.
>>
>> So check the value of sleepable_refcnt in maybe_wait_bpf_programs():
>> if it is non-zero, use synchronize_rcu_mult() to wait for both sleepable
>> and non-sleepable BPF programs. However, a bpf syscall issued from a
>> syscall program is special, because the bpf syscall is invoked with
>> rcu_read_lock_trace() held, and there would be a dead-lock if
>> synchronize_rcu_mult() were used to wait for the exit of sleepable BPF
>> programs, so just skip the waiting for sleepable BPF programs when the
>> bpf syscall comes from a syscall program.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  kernel/bpf/syscall.c | 28 +++++++++++++++++++---------
>>  1 file changed, 19 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index d2641e51a1a7..6b9d7990d95f 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -35,6 +35,7 @@
>>  #include <linux/rcupdate_trace.h>
>>  #include <linux/memcontrol.h>
>>  #include <linux/trace_events.h>
>> +#include <linux/rcupdate_wait.h>
>>
>>  #include <net/netfilter/nf_bpf_link.h>
>>  #include <net/netkit.h>
>> @@ -140,15 +141,24 @@ static u32 bpf_map_value_size(const struct bpf_map *map)
>>                 return  map->value_size;
>>  }
>>
>> -static void maybe_wait_bpf_programs(struct bpf_map *map)
>> +static void maybe_wait_bpf_programs(struct bpf_map *map, bool rcu_trace_lock_held)
>>  {
>> -       /* Wait for any running BPF programs to complete so that
>> -        * userspace, when we return to it, knows that all programs
>> -        * that could be running use the new map value.
>> +       /* Wait for any running non-sleepable and sleepable BPF programs to
>> +        * complete, so that userspace, when we return to it, knows that all
>> +        * programs that could be running use the new map value. However, a
>> +        * syscall program can also use the bpf syscall to update or delete an
>> +        * inner map in an outer map, and it holds rcu_read_lock_trace() while
>> +        * doing so. Using synchronize_rcu_mult(call_rcu_tasks_trace) to wait
>> +        * for the exit of running sleepable BPF programs would then dead-lock,
>> +        * so skip the wait for sleepable BPF programs for a syscall program.
>>          */
>>         if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
>> -           map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
>> -               synchronize_rcu();
>> +           map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
>> +               if (atomic64_read(&map->sleepable_refcnt) && !rcu_trace_lock_held)
> why is this correct and non-racy without holding used_maps_mutex under
> which this sleepable_refcnt is incremented?

You mean bpf_prog_bind_map(), right? A program which binds to the map
that way doesn't access it, so even if the atomic64_inc() is missed
here, it is still OK. But if that implementation changes afterwards,
there will be a problem.
>
>> +                       synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>> +               else
>> +                       synchronize_rcu();
>> +       }
>>  }
>>
>>  static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
>> @@ -1561,7 +1571,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
>>
>>         err = bpf_map_update_value(map, f.file, key, value, attr->flags);
>>         if (!err)
>> -               maybe_wait_bpf_programs(map);
>> +               maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
>>
>>         kvfree(value);
>>  free_key:
>> @@ -1618,7 +1628,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
>>         rcu_read_unlock();
>>         bpf_enable_instrumentation();
>>         if (!err)
>> -               maybe_wait_bpf_programs(map);
>> +               maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
>>  out:
>>         kvfree(key);
>>  err_put:
>> @@ -4973,7 +4983,7 @@ static int bpf_map_do_batch(union bpf_attr *attr,
>>  err_put:
>>         if (has_write) {
>>                 if (attr->batch.count)
>> -                       maybe_wait_bpf_programs(map);
>> +                       maybe_wait_bpf_programs(map, false);
>>                 bpf_map_write_active_dec(map);
>>         }
>>         fdput(f);
>> --
>> 2.29.2
>>
Hou Tao Dec. 11, 2023, 2:07 a.m. UTC | #4
Hi Alexei,

On 12/10/2023 10:11 AM, Alexei Starovoitov wrote:
> On Fri, Dec 8, 2023 at 2:22 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> +       /* Wait for any running non-sleepable and sleepable BPF programs to
>> +        * complete, so that userspace, when we return to it, knows that all
>> +        * programs that could be running use the new map value.
> which could be forever... and the user space task doing a simple map update
> will never know why it got stuck in the syscall waiting... forever...
> Synchronous waiting for tasks_trace is never an option.

Could you please elaborate on why there is a dead-lock problem?
In my naive understanding, synchronize_rcu_tasks_trace() only waits for
the end of rcu_read_lock_trace()/rcu_read_unlock_trace() sections, so if
no rcu_read_lock_trace() is held, there will be no dead-lock.
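
For reference, a minimal kernel-side sketch of the two waiting
strategies under discussion, assuming only that synchronize_rcu_mult()
starts the listed grace periods concurrently before waiting on them:

#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/rcupdate_wait.h>

/* Sketch only: wait until both normal-RCU readers and tasks-trace
 * readers (i.e. running sleepable BPF programs) are done.
 */
static void wait_for_all_bpf_readers(void)
{
	/* A naive sequential wait would cost the sum of the two grace
	 * periods:
	 *
	 *	synchronize_rcu();
	 *	synchronize_rcu_tasks_trace();
	 *
	 * synchronize_rcu_mult() starts both grace periods at once and
	 * waits for both, so the delay is roughly the maximum of the
	 * two rather than their sum.
	 */
	synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
}
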
Alexei Starovoitov Dec. 11, 2023, 2:56 a.m. UTC | #5
On Sun, Dec 10, 2023 at 6:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Alexei,
>
> On 12/10/2023 10:11 AM, Alexei Starovoitov wrote:
> > On Fri, Dec 8, 2023 at 2:22 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> +       /* Wait for any running non-sleepable and sleepable BPF programs to
> >> +        * complete, so that userspace, when we return to it, knows that all
> >> +        * programs that could be running use the new map value.
> > which could be forever... and the user space task doing a simple map update
> > will never know why it got stuck in the syscall waiting... forever...
> > Synchronous waiting for tasks_trace is never an option.
>
> Could you please elaborate on why there is a dead-lock problem?
> In my naive understanding, synchronize_rcu_tasks_trace() only waits for
> the end of rcu_read_lock_trace()/rcu_read_unlock_trace() sections, so if
> no rcu_read_lock_trace() is held, there will be no dead-lock.

I didn't say it's a dead-lock. An rcu_read_lock_trace() section can last
for a very long time. The user space shouldn't be exposed to such delays.
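
To make the concern concrete, consider a hypothetical sleepable program
(illustrative, not from this series): the kernel brackets each
invocation with rcu_read_lock_trace()/rcu_read_unlock_trace(), and a
sleepable helper such as bpf_copy_from_user() may fault and sleep, so
the read-side section can stay open for an unbounded time:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile __u64 user_buf; /* user address, set by the loader */

SEC("fentry.s/__x64_sys_nanosleep")
int BPF_PROG(slow_tasks_trace_reader)
{
	char buf[64];

	/* May block on a page fault while this program's tasks-trace
	 * read-side section is open, stalling any task that
	 * synchronously waits for the grace period to end.
	 */
	bpf_copy_from_user(buf, sizeof(buf), (const void *)(long)user_buf);
	return 0;
}

char _license[] SEC("license") = "GPL";
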
Hou Tao Dec. 11, 2023, 3:03 a.m. UTC | #6
Hi,

On 12/11/2023 10:56 AM, Alexei Starovoitov wrote:
> On Sun, Dec 10, 2023 at 6:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Alexei,
>>
>> On 12/10/2023 10:11 AM, Alexei Starovoitov wrote:
>>> On Fri, Dec 8, 2023 at 2:22 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> +       /* Wait for any running non-sleepable and sleepable BPF programs to
>>>> +        * complete, so that userspace, when we return to it, knows that all
>>>> +        * programs that could be running use the new map value.
>>> which could be forever... and the user space task doing a simple map update
>>> will never know why it got stuck in the syscall waiting... forever...
>>> Synchronous waiting for tasks_trace is never an option.
>> Could you please elaborate on why there is a dead-lock problem?
>> In my naive understanding, synchronize_rcu_tasks_trace() only waits for
>> the end of rcu_read_lock_trace()/rcu_read_unlock_trace() sections, so if
>> no rcu_read_lock_trace() is held, there will be no dead-lock.
> I didn't say it's a dead-lock. An rcu_read_lock_trace() section can last
> for a very long time. The user space shouldn't be exposed to such delays.

I see. Thanks for the explanation. Will update the comments in
maybe_wait_bpf_programs() in a new patch.

Patch

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d2641e51a1a7..6b9d7990d95f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -35,6 +35,7 @@ 
 #include <linux/rcupdate_trace.h>
 #include <linux/memcontrol.h>
 #include <linux/trace_events.h>
+#include <linux/rcupdate_wait.h>
 
 #include <net/netfilter/nf_bpf_link.h>
 #include <net/netkit.h>
@@ -140,15 +141,24 @@  static u32 bpf_map_value_size(const struct bpf_map *map)
 		return  map->value_size;
 }
 
-static void maybe_wait_bpf_programs(struct bpf_map *map)
+static void maybe_wait_bpf_programs(struct bpf_map *map, bool rcu_trace_lock_held)
 {
-	/* Wait for any running BPF programs to complete so that
-	 * userspace, when we return to it, knows that all programs
-	 * that could be running use the new map value.
+	/* Wait for any running non-sleepable and sleepable BPF programs to
+	 * complete, so that userspace, when we return to it, knows that all
+	 * programs that could be running use the new map value. However, a
+	 * syscall program can also use the bpf syscall to update or delete an
+	 * inner map in an outer map, and it holds rcu_read_lock_trace() while
+	 * doing so. Using synchronize_rcu_mult(call_rcu_tasks_trace) to wait
+	 * for the exit of running sleepable BPF programs would then dead-lock,
+	 * so skip the wait for sleepable BPF programs for a syscall program.
 	 */
 	if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
-	    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
-		synchronize_rcu();
+	    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) {
+		if (atomic64_read(&map->sleepable_refcnt) && !rcu_trace_lock_held)
+			synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
+		else
+			synchronize_rcu();
+	}
 }
 
 static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
@@ -1561,7 +1571,7 @@  static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 
 	err = bpf_map_update_value(map, f.file, key, value, attr->flags);
 	if (!err)
-		maybe_wait_bpf_programs(map);
+		maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
 
 	kvfree(value);
 free_key:
@@ -1618,7 +1628,7 @@  static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 	rcu_read_unlock();
 	bpf_enable_instrumentation();
 	if (!err)
-		maybe_wait_bpf_programs(map);
+		maybe_wait_bpf_programs(map, bpfptr_is_kernel(uattr));
 out:
 	kvfree(key);
 err_put:
@@ -4973,7 +4983,7 @@  static int bpf_map_do_batch(union bpf_attr *attr,
 err_put:
 	if (has_write) {
 		if (attr->batch.count)
-			maybe_wait_bpf_programs(map);
+			maybe_wait_bpf_programs(map, false);
 		bpf_map_write_active_dec(map);
 	}
 	fdput(f);
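
Seen from userspace, the change shows up only as extra latency on
outer-map updates. A hedged usage sketch (function name and descriptors
are assumed, error handling elided):

#include <bpf/bpf.h>

/* Replace one inner map in a BPF_MAP_TYPE_ARRAY_OF_MAPS outer map.
 * With this patch, if a sleepable program also holds a reference to the
 * outer map, the syscall returns only after both a normal-RCU and a
 * tasks-trace grace period have elapsed in maybe_wait_bpf_programs().
 */
int replace_inner_map(int outer_map_fd, __u32 slot, int new_inner_map_fd)
{
	return bpf_map_update_elem(outer_map_fd, &slot, &new_inner_map_fd,
				   BPF_ANY);
}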