
[bpf,v2,1/2] bpf: Check map->usercnt again after timer->timer is assigned

Message ID 20231020014214.2471419-2-houtao@huaweicloud.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: Fix bpf timer kmemleak | expand

Checks

Context Check Description
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-7 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-10 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-11 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-13 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-14 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-16 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-28 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1403 this patch: 1403
netdev/cc_maintainers fail 1 blamed authors not CCed: toke@redhat.com; 1 maintainers not CCed: toke@redhat.com
netdev/build_clang success Errors and warnings before: 1388 this patch: 1388
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1427 this patch: 1427
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 38 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hou Tao Oct. 20, 2023, 1:42 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

When there are concurrent uref release and bpf timer init operations,
the following sequence diagram is possible and it will lead to memory
leak:

bpf program X

bpf_timer_init()
  lock timer->lock
    read timer->timer as NULL
    read map->usercnt != 0

                process Y

                close(map_fd)
                  // put last uref
                  bpf_map_put_uref()
                    atomic_dec_and_test(map->usercnt)
                      array_map_free_timers()
                        bpf_timer_cancel_and_free()
                          read timer->timer as NULL
                          // just return and leak the
                          // soon-to-be-allocated timer

    t = bpf_map_kmalloc_node()
    timer->timer = t
  unlock timer->lock

Fix the problem by checking map->usercnt again after timer->timer is
assigned, so that when there are concurrent uref release and bpf timer
init operations, either bpf_timer_cancel_and_free() from the uref
release reads a non-NULL timer, or the newly-added check of
map->usercnt reads a zero usercnt.

Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer)
in bpf_timer_cancel_and_free() are not protected by a lock, add a
memory barrier to guarantee the order between map->usercnt and
timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the
lockless read of timer->timer.

Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/helpers.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Alexei Starovoitov Oct. 20, 2023, 2:14 a.m. UTC | #1
On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> [...]
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 757b99c1e613f..a7d92c3ddc3dd 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1156,7 +1156,7 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>            u64, flags)
>  {
>         clockid_t clockid = flags & (MAX_CLOCKS - 1);
> -       struct bpf_hrtimer *t;
> +       struct bpf_hrtimer *t, *to_free = NULL;
>         int ret = 0;
>
>         BUILD_BUG_ON(MAX_CLOCKS != 16);
> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>         rcu_assign_pointer(t->callback_fn, NULL);
>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>         t->timer.function = bpf_timer_cb;
> -       timer->timer = t;
> +       WRITE_ONCE(timer->timer, t);
> +       /* Guarantee order between timer->timer and map->usercnt. So when
> +        * there are concurrent uref release and bpf timer init, either
> +        * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
> +        * timer or atomic64_read() below reads a zero usercnt.
> +        */
> +       smp_mb();
> +       if (!atomic64_read(&map->usercnt)) {
> +               WRITE_ONCE(timer->timer, NULL);
> +               to_free = t;

just kfree(t); here.

> +               ret = -EPERM;
> +       }

This will add a second atomic64_read(&map->usercnt) in the same function.
Let's remove the first one?

>  out:
>         __bpf_spin_unlock_irqrestore(&timer->lock);
> +       kfree(to_free);
>         return ret;
>  }
>
> @@ -1372,7 +1384,7 @@ void bpf_timer_cancel_and_free(void *val)
>         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>          * this timer, since it won't be initialized.
>          */
> -       timer->timer = NULL;
> +       WRITE_ONCE(timer->timer, NULL);
>  out:
>         __bpf_spin_unlock_irqrestore(&timer->lock);
>         if (!t)
> --
> 2.29.2
>
Hou Tao Oct. 20, 2023, 7:31 a.m. UTC | #2
Hi,

On 10/20/2023 10:14 AM, Alexei Starovoitov wrote:
> On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> [...]
>> @@ -1197,9 +1197,21 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
>>         rcu_assign_pointer(t->callback_fn, NULL);
>>         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
>>         t->timer.function = bpf_timer_cb;
>> -       timer->timer = t;
>> +       WRITE_ONCE(timer->timer, t);
>> +       /* Guarantee order between timer->timer and map->usercnt. So when
>> +        * there are concurrent uref release and bpf timer init, either
>> +        * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
>> +        * timer or atomic64_read() below reads a zero usercnt.
>> +        */
>> +       smp_mb();
>> +       if (!atomic64_read(&map->usercnt)) {
>> +               WRITE_ONCE(timer->timer, NULL);
>> +               to_free = t;
> just kfree(t); here.

Will do. It is a slow path, so I think doing kfree() under spin-lock is
acceptable.
>
>> +               ret = -EPERM;
>> +       }
> This will add a second atomic64_read(&map->usercnt) in the same function.
> Let's remove the first one ?

I prefer to still keep it, because it can detect the release of the map
uref early, and the handling of the uref release there is simpler than
at the second atomic64_read(). Do you have a strong preference?
>
>>  out:
>>         __bpf_spin_unlock_irqrestore(&timer->lock);
>> +       kfree(to_free);
>>         return ret;
>>  }
>>
>> @@ -1372,7 +1384,7 @@ void bpf_timer_cancel_and_free(void *val)
>>         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>>          * this timer, since it won't be initialized.
>>          */
>> -       timer->timer = NULL;
>> +       WRITE_ONCE(timer->timer, NULL);
>>  out:
>>         __bpf_spin_unlock_irqrestore(&timer->lock);
>>         if (!t)
>> --
>> 2.29.2
>>
> .
Alexei Starovoitov Oct. 20, 2023, 2:57 p.m. UTC | #3
On Fri, Oct 20, 2023 at 12:31 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 10/20/2023 10:14 AM, Alexei Starovoitov wrote:
> > On Thu, Oct 19, 2023 at 6:41 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> [...]
> >> +       if (!atomic64_read(&map->usercnt)) {
> >> +               WRITE_ONCE(timer->timer, NULL);
> >> +               to_free = t;
> > just kfree(t); here.
>
> Will do. It is a slow path, so I think doing kfree() under spin-lock is
> acceptable.
> >
> >> +               ret = -EPERM;
> >> +       }
> > This will add a second atomic64_read(&map->usercnt) in the same function.
> > Let's remove the first one ?
>
> I prefer to still keep it. Because it can detect the release of map uref
> early and the handle of uref release is simple compared with the second
> atomic64_read(). Do you have a strong preference ?

I bet somebody will send a patch to remove the first one as redundant,
so let's do it now.
The only reason we do a repeated early check elsewhere is to avoid
taking a lock. Here, doing an extra early check just to avoid a kmalloc
is overkill: that check is highly unlikely to hit, while for locks it
is a likely one. Hence the extra check is justified for locks, but not
here.

Reading your other email it looks like this patchset is incomplete anyway?

Patch

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 757b99c1e613f..a7d92c3ddc3dd 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1156,7 +1156,7 @@  BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	   u64, flags)
 {
 	clockid_t clockid = flags & (MAX_CLOCKS - 1);
-	struct bpf_hrtimer *t;
+	struct bpf_hrtimer *t, *to_free = NULL;
 	int ret = 0;
 
 	BUILD_BUG_ON(MAX_CLOCKS != 16);
@@ -1197,9 +1197,21 @@  BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	rcu_assign_pointer(t->callback_fn, NULL);
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
 	t->timer.function = bpf_timer_cb;
-	timer->timer = t;
+	WRITE_ONCE(timer->timer, t);
+	/* Guarantee order between timer->timer and map->usercnt. So when
+	 * there are concurrent uref release and bpf timer init, either
+	 * bpf_timer_cancel_and_free() called by uref release reads a non-NULL
+	 * timer or atomic64_read() below reads a zero usercnt.
+	 */
+	smp_mb();
+	if (!atomic64_read(&map->usercnt)) {
+		WRITE_ONCE(timer->timer, NULL);
+		to_free = t;
+		ret = -EPERM;
+	}
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
+	kfree(to_free);
 	return ret;
 }
 
@@ -1372,7 +1384,7 @@  void bpf_timer_cancel_and_free(void *val)
 	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
 	 * this timer, since it won't be initialized.
 	 */
-	timer->timer = NULL;
+	WRITE_ONCE(timer->timer, NULL);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	if (!t)