
[1/3] bpf: Disable preemption when increasing per-cpu map_locked

Message ID 20220821033223.2598791-2-houtao@huaweicloud.com (mailing list archive)
State Superseded
Delegated to: BPF
Series fixes for concurrent htab updates

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-4 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for Kernel LATEST on ubuntu-latest with llvm-16
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 353 this patch: 348
netdev/cc_maintainers warning 2 maintainers not CCed: song@kernel.org martin.lau@linux.dev
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 353 this patch: 348
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 48 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hou Tao Aug. 21, 2022, 3:32 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

Per-cpu htab->map_locked is used to prevent concurrent accesses
from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
Make migrate_disable/enable() independent of RT"), migrate_disable()
is also preemptible in the !PREEMPT_RT case, so map_locked now also
rejects concurrent updates from normal contexts (e.g. userspace
processes) unexpectedly, as shown below:

process A                      process B

htab_map_update_elem()
  htab_lock_bucket()
    migrate_disable()
    /* return 1 */
    __this_cpu_inc_return()
    /* preempted by B */

                               htab_map_update_elem()
                                 /* the same bucket as A */
                                 htab_lock_bucket()
                                   migrate_disable()
                                   /* return 2, so lock fails */
                                   __this_cpu_inc_return()
                                   return -EBUSY

A fix that seems feasible is to use in_nmi() in htab_lock_bucket() and
to check the value of map_locked only in NMI context. But that would
re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
re-entered through a non-tracing program (e.g. a fentry program).

So fix it by using preempt_disable() instead of migrate_disable() when
increasing htab->map_locked. However, when htab_use_raw_lock() is false,
the bucket lock is a sleepable spin-lock that is incompatible with
preempt_disable(), so keep using migrate_disable() in the spin-lock case.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)
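
For reference, the pre-patch locking sequence in htab_lock_bucket() looks
roughly as follows. It is reconstructed from the context lines of the diff at
the end of this page; the parameter list and the trailing lines are filled in
and may differ slightly from the actual source.

/* Pre-patch htab_lock_bucket(), reconstructed from the diff context. */
static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	migrate_disable();	/* pins the CPU, but does NOT disable preemption */
	/*
	 * Process A from the changelog is preempted after the increment below
	 * and before it unlocks the bucket (and decrements map_locked again).
	 * Process B, scheduled on the same CPU, then increments the same
	 * per-cpu counter, reads 2 and fails with -EBUSY although no NMI
	 * re-entrancy is involved.
	 */
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		migrate_enable();
		return -EBUSY;
	}

	if (htab_use_raw_lock(htab))
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}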

Comments

Hao Luo Aug. 21, 2022, 4:42 p.m. UTC | #1
Hi Hou Tao,

On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> Make migrate_disable/enable() independent of RT"), migrations_disable()
> is also preemptible under !PREEMPT_RT case, so now map_locked also
> disallows concurrent updates from normal contexts (e.g. userspace
> processes) unexpectedly as shown below:
>
> process A                      process B
>
> htab_map_update_elem()
>   htab_lock_bucket()
>     migrate_disable()
>     /* return 1 */
>     __this_cpu_inc_return()
>     /* preempted by B */
>
>                                htab_map_update_elem()
>                                  /* the same bucket as A */
>                                  htab_lock_bucket()
>                                    migrate_disable()
>                                    /* return 2, so lock fails */
>                                    __this_cpu_inc_return()
>                                    return -EBUSY
>
> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> only checking the value of map_locked for nmi context. But it will
> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> through non-tracing program (e.g. fentry program).
>
> So fixing it by using disable_preempt() instead of migrate_disable() when
> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> so still use migrate_disable() for spin-lock case.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---

IIUC, this patch enlarges the scope of preemption disabling to cover the
increment of map_locked. But I don't think the change is meaningful.

This patch only affects the case when the raw lock is used. In the raw
lock case, irqs are disabled for the b->raw_lock protected critical
section. A raw spin lock itself doesn't block on either RT or non-RT.
So my understanding of this patch is that it just makes sure preemption
doesn't happen on the exact __this_cpu_inc_return. But the window is so
small that it should be really unlikely to happen.

>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6c530a5e560a..ad09da139589 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                    unsigned long *pflags)
>  {
>         unsigned long flags;
> +       bool use_raw_lock;
>
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>
> -       migrate_disable();
> +       use_raw_lock = htab_use_raw_lock(htab);
> +       if (use_raw_lock)
> +               preempt_disable();
> +       else
> +               migrate_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
> -               migrate_enable();
> +               if (use_raw_lock)
> +                       preempt_enable();
> +               else
> +                       migrate_enable();
>                 return -EBUSY;
>         }
>
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         else
>                 spin_lock_irqsave(&b->lock, flags);
> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>                                       struct bucket *b, u32 hash,
>                                       unsigned long flags)
>  {
> +       bool use_raw_lock = htab_use_raw_lock(htab);
> +
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>         else
>                 spin_unlock_irqrestore(&b->lock, flags);
>         __this_cpu_dec(*(htab->map_locked[hash]));
> -       migrate_enable();
> +       if (use_raw_lock)
> +               preempt_enable();
> +       else
> +               migrate_enable();
>  }
>
>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> --
> 2.29.2
>
Hou Tao Aug. 22, 2022, 1:27 a.m. UTC | #2
Hi,

On 8/22/2022 12:42 AM, Hao Luo wrote:
> Hi Hou Tao,
>
> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>> Make migrate_disable/enable() independent of RT"), migrations_disable()
>> is also preemptible under !PREEMPT_RT case, so now map_locked also
>> disallows concurrent updates from normal contexts (e.g. userspace
>> processes) unexpectedly as shown below:
>>
>> process A                      process B
>>
>> htab_map_update_elem()
>>   htab_lock_bucket()
>>     migrate_disable()
>>     /* return 1 */
>>     __this_cpu_inc_return()
>>     /* preempted by B */
>>
>>                                htab_map_update_elem()
>>                                  /* the same bucket as A */
>>                                  htab_lock_bucket()
>>                                    migrate_disable()
>>                                    /* return 2, so lock fails */
>>                                    __this_cpu_inc_return()
>>                                    return -EBUSY
>>
>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>> only checking the value of map_locked for nmi context. But it will
>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>> through non-tracing program (e.g. fentry program).
>>
>> So fixing it by using disable_preempt() instead of migrate_disable() when
>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>> so still use migrate_disable() for spin-lock case.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
> IIUC, this patch enlarges the scope of preemption disable to cover inc
> map_locked. But I don't think the change is meaningful.
Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
case, so I don't think that the change is meaningless.
>
> This patch only affects the case when raw lock is used. In the case of
> raw lock, irq is disabled for b->raw_lock protected critical section.
> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> understanding about this patch is, it just makes sure preemption
> doesn't happen on the exact __this_cpu_inc_return. But the window is
> so small that it should be really unlikely to happen.
No, it can easily be reproduced by running multiple htab update processes on
the same CPU. Will add a selftest to demonstrate that.
>
>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 6c530a5e560a..ad09da139589 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>                                    unsigned long *pflags)
>>  {
>>         unsigned long flags;
>> +       bool use_raw_lock;
>>
>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>
>> -       migrate_disable();
>> +       use_raw_lock = htab_use_raw_lock(htab);
>> +       if (use_raw_lock)
>> +               preempt_disable();
>> +       else
>> +               migrate_disable();
>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>> -               migrate_enable();
>> +               if (use_raw_lock)
>> +                       preempt_enable();
>> +               else
>> +                       migrate_enable();
>>                 return -EBUSY;
>>         }
>>
>> -       if (htab_use_raw_lock(htab))
>> +       if (use_raw_lock)
>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>         else
>>                 spin_lock_irqsave(&b->lock, flags);
>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>                                       struct bucket *b, u32 hash,
>>                                       unsigned long flags)
>>  {
>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>> +
>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>> -       if (htab_use_raw_lock(htab))
>> +       if (use_raw_lock)
>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>         else
>>                 spin_unlock_irqrestore(&b->lock, flags);
>>         __this_cpu_dec(*(htab->map_locked[hash]));
>> -       migrate_enable();
>> +       if (use_raw_lock)
>> +               preempt_enable();
>> +       else
>> +               migrate_enable();
>>  }
>>
>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>> --
>> 2.29.2
>>
> .
Hao Luo Aug. 22, 2022, 3:21 a.m. UTC | #3
Hi, Hou Tao

On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > Hi Hou Tao,
> >
> > On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> >> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >> Make migrate_disable/enable() independent of RT"), migrations_disable()
> >> is also preemptible under !PREEMPT_RT case, so now map_locked also
> >> disallows concurrent updates from normal contexts (e.g. userspace
> >> processes) unexpectedly as shown below:
> >>
> >> process A                      process B
> >>
> >> htab_map_update_elem()
> >>   htab_lock_bucket()
> >>     migrate_disable()
> >>     /* return 1 */
> >>     __this_cpu_inc_return()
> >>     /* preempted by B */
> >>
> >>                                htab_map_update_elem()
> >>                                  /* the same bucket as A */
> >>                                  htab_lock_bucket()
> >>                                    migrate_disable()
> >>                                    /* return 2, so lock fails */
> >>                                    __this_cpu_inc_return()
> >>                                    return -EBUSY
> >>
> >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >> only checking the value of map_locked for nmi context. But it will
> >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> >> through non-tracing program (e.g. fentry program).
> >>
> >> So fixing it by using disable_preempt() instead of migrate_disable() when
> >> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> >> so still use migrate_disable() for spin-lock case.
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> > IIUC, this patch enlarges the scope of preemption disable to cover inc
> > map_locked. But I don't think the change is meaningful.
> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> case, so I don't think that the change is meaningless.
> >
> > This patch only affects the case when raw lock is used. In the case of
> > raw lock, irq is disabled for b->raw_lock protected critical section.
> > A raw spin lock itself doesn't block in both RT and non-RT. So, my
> > understanding about this patch is, it just makes sure preemption
> > doesn't happen on the exact __this_cpu_inc_return. But the window is
> > so small that it should be really unlikely to happen.
> No, it can be easily reproduced by running multiple htab update processes in the
> same CPU. Will add selftest to demonstrate that.

Can you clarify what you demonstrate?

Here is my theory, but please correct me if I'm wrong; I haven't
tested it yet. In non-RT, I doubt preemptions are likely to happen
after migrate_disable. That is because very soon after migrate_disable,
we enter the critical section of b->raw_lock with irqs disabled. In RT,
preemptions can happen on acquiring b->lock; that is certainly
possible, but this is the !use_raw_lock path, which isn't side-stepped
by this patch.

> >
> >>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>  1 file changed, 18 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >> index 6c530a5e560a..ad09da139589 100644
> >> --- a/kernel/bpf/hashtab.c
> >> +++ b/kernel/bpf/hashtab.c
> >> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>                                    unsigned long *pflags)
> >>  {
> >>         unsigned long flags;
> >> +       bool use_raw_lock;
> >>
> >>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>
> >> -       migrate_disable();
> >> +       use_raw_lock = htab_use_raw_lock(htab);
> >> +       if (use_raw_lock)
> >> +               preempt_disable();
> >> +       else
> >> +               migrate_disable();
> >>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >> -               migrate_enable();
> >> +               if (use_raw_lock)
> >> +                       preempt_enable();
> >> +               else
> >> +                       migrate_enable();
> >>                 return -EBUSY;
> >>         }
> >>
> >> -       if (htab_use_raw_lock(htab))
> >> +       if (use_raw_lock)
> >>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>         else
> >>                 spin_lock_irqsave(&b->lock, flags);
> >> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>                                       struct bucket *b, u32 hash,
> >>                                       unsigned long flags)
> >>  {
> >> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >> +
> >>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >> -       if (htab_use_raw_lock(htab))
> >> +       if (use_raw_lock)
> >>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>         else
> >>                 spin_unlock_irqrestore(&b->lock, flags);
> >>         __this_cpu_dec(*(htab->map_locked[hash]));
> >> -       migrate_enable();
> >> +       if (use_raw_lock)
> >> +               preempt_enable();
> >> +       else
> >> +               migrate_enable();
> >>  }
> >>
> >>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >> --
> >> 2.29.2
> >>
> > .
>
Sebastian Andrzej Siewior Aug. 22, 2022, 8:13 a.m. UTC | #4
On 2022-08-21 11:32:21 [+0800], Hou Tao wrote:
> process A                      process B
> 
> htab_map_update_elem()
>   htab_lock_bucket()
>     migrate_disable()
>     /* return 1 */
>     __this_cpu_inc_return()
>     /* preempted by B */
> 
>                                htab_map_update_elem()
>                                  /* the same bucket as A */
>                                  htab_lock_bucket()
>                                    migrate_disable()
>                                    /* return 2, so lock fails */
>                                    __this_cpu_inc_return()
>                                    return -EBUSY
> 
> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> only checking the value of map_locked for nmi context. But it will
> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> through non-tracing program (e.g. fentry program).
> 
> So fixing it by using disable_preempt() instead of migrate_disable() when
> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> so still use migrate_disable() for spin-lock case.

But isn't the RT case still affected by the very same problem?

> Signed-off-by: Hou Tao <houtao1@huawei.com>

Sebastian
Hou Tao Aug. 22, 2022, 12:07 p.m. UTC | #5
Hi,

On 8/22/2022 11:21 AM, Hao Luo wrote:
> Hi, Hou Tao
>
> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>> Hi Hou Tao,
>>>
>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
>>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>> processes) unexpectedly as shown below:
>>>>
>>>> process A                      process B
>>>>
>>>> htab_map_update_elem()
>>>>   htab_lock_bucket()
>>>>     migrate_disable()
>>>>     /* return 1 */
>>>>     __this_cpu_inc_return()
>>>>     /* preempted by B */
>>>>
>>>>                                htab_map_update_elem()
>>>>                                  /* the same bucket as A */
>>>>                                  htab_lock_bucket()
>>>>                                    migrate_disable()
>>>>                                    /* return 2, so lock fails */
>>>>                                    __this_cpu_inc_return()
>>>>                                    return -EBUSY
>>>>
>>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>>>> only checking the value of map_locked for nmi context. But it will
>>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>>>> through non-tracing program (e.g. fentry program).
>>>>
>>>> So fixing it by using disable_preempt() instead of migrate_disable() when
>>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>>>> so still use migrate_disable() for spin-lock case.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>> ---
>>> IIUC, this patch enlarges the scope of preemption disable to cover inc
>>> map_locked. But I don't think the change is meaningful.
>> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
>> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
>> case, so I don't think that the change is meaningless.
>>> This patch only affects the case when raw lock is used. In the case of
>>> raw lock, irq is disabled for b->raw_lock protected critical section.
>>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
>>> understanding about this patch is, it just makes sure preemption
>>> doesn't happen on the exact __this_cpu_inc_return. But the window is
>>> so small that it should be really unlikely to happen.
>> No, it can be easily reproduced by running multiple htab update processes in the
>> same CPU. Will add selftest to demonstrate that.
> Can you clarify what you demonstrate?
First, please enable CONFIG_PREEMPT for the running kernel and then run
the test program as shown below.

# sudo taskset -c 2 ./update.bin
thread nr 2
wait for error
update error -16
all threads exit

If there is no "update error -16", you can try to create more map update
threads. For example running 16 update threads:

# sudo taskset -c 2 ./update.bin 16
thread nr 16
wait for error
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
all threads exit

The following is the source code for update.bin:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

#include "bpf.h"
#include "libbpf.h"

static bool stop;
static int fd;

/* Each thread hammers the single-entry hash map until one update fails */
static void *update_fn(void *arg)
{
        while (!stop) {
                unsigned int key = 0, value = 1;
                int err;

                err = bpf_map_update_elem(fd, &key, &value, 0);
                if (err) {
                        printf("update error %d\n", err);
                        stop = true;
                        break;
                }
        }

        return NULL;
}

int main(int argc, char **argv)
{
        LIBBPF_OPTS(bpf_map_create_opts, opts);
        unsigned int i, nr;
        pthread_t *tids;

        nr = 2;
        if (argc > 1)
                nr = atoi(argv[1]);
        printf("thread nr %u\n", nr);

        libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
        /* 1-entry hash map with 4-byte key and 4-byte value */
        fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
        if (fd < 0) {
                printf("create map error %d\n", fd);
                return 1;
        }

        tids = malloc(nr * sizeof(*tids));
        for (i = 0; i < nr; i++)
                pthread_create(&tids[i], NULL, update_fn, NULL);

        printf("wait for error\n");
        for (i = 0; i < nr; i++)
                pthread_join(tids[i], NULL);

        printf("all threads exit\n");

        free(tids);
        close(fd);

        return 0;
}
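
As a side note, the CPU pinning done with "taskset -c 2" above could also be
done from inside the reproducer. A minimal sketch of such a helper (not part
of the posted program, purely illustrative) follows; each updater thread
would call pin_to_cpu(2) before entering its loop:

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Pin the calling thread to one CPU so that all updater threads contend
 * on the same CPU, like "taskset -c 2" does for the whole process.
 */
static int pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}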

>
> Here is my theory, but please correct me if I'm wrong, I haven't
> tested yet. In non-RT, I doubt preemptions are likely to happen after
> migrate_disable. That is because very soon after migrate_disable, we
> enter the critical section of b->raw_lock with irq disabled. In RT,
> preemptions can happen on acquiring b->lock, that is certainly
> possible, but this is the !use_raw_lock path, which isn't side-stepped
> by this patch.
>
>>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>> index 6c530a5e560a..ad09da139589 100644
>>>> --- a/kernel/bpf/hashtab.c
>>>> +++ b/kernel/bpf/hashtab.c
>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>                                    unsigned long *pflags)
>>>>  {
>>>>         unsigned long flags;
>>>> +       bool use_raw_lock;
>>>>
>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>
>>>> -       migrate_disable();
>>>> +       use_raw_lock = htab_use_raw_lock(htab);
>>>> +       if (use_raw_lock)
>>>> +               preempt_disable();
>>>> +       else
>>>> +               migrate_disable();
>>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>> -               migrate_enable();
>>>> +               if (use_raw_lock)
>>>> +                       preempt_enable();
>>>> +               else
>>>> +                       migrate_enable();
>>>>                 return -EBUSY;
>>>>         }
>>>>
>>>> -       if (htab_use_raw_lock(htab))
>>>> +       if (use_raw_lock)
>>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>         else
>>>>                 spin_lock_irqsave(&b->lock, flags);
>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>                                       struct bucket *b, u32 hash,
>>>>                                       unsigned long flags)
>>>>  {
>>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>>>> +
>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>> -       if (htab_use_raw_lock(htab))
>>>> +       if (use_raw_lock)
>>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>         else
>>>>                 spin_unlock_irqrestore(&b->lock, flags);
>>>>         __this_cpu_dec(*(htab->map_locked[hash]));
>>>> -       migrate_enable();
>>>> +       if (use_raw_lock)
>>>> +               preempt_enable();
>>>> +       else
>>>> +               migrate_enable();
>>>>  }
>>>>
>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>> --
>>>> 2.29.2
>>>>
>>> .
> .
Hou Tao Aug. 22, 2022, 12:09 p.m. UTC | #6
Hi,

On 8/22/2022 4:13 PM, Sebastian Andrzej Siewior wrote:
> On 2022-08-21 11:32:21 [+0800], Hou Tao wrote:
>> process A                      process B
>>
>> htab_map_update_elem()
>>   htab_lock_bucket()
>>     migrate_disable()
>>     /* return 1 */
>>     __this_cpu_inc_return()
>>     /* preempted by B */
>>
>>                                htab_map_update_elem()
>>                                  /* the same bucket as A */
>>                                  htab_lock_bucket()
>>                                    migrate_disable()
>>                                    /* return 2, so lock fails */
>>                                    __this_cpu_inc_return()
>>                                    return -EBUSY
>>
>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
>> only checking the value of map_locked for nmi context. But it will
>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
>> through non-tracing program (e.g. fentry program).
>>
>> So fixing it by using disable_preempt() instead of migrate_disable() when
>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
>> so still use migrate_disable() for spin-lock case.
> But isn't the RT case still affected by the very same problem?
As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in
patch 2.
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> Sebastian
> .
Sebastian Andrzej Siewior Aug. 22, 2022, 3:30 p.m. UTC | #7
On 2022-08-22 20:09:47 [+0800], Hou Tao wrote:
> Hi,
Hi,

> > But isn't the RT case still affected by the very same problem?
> As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in
> patch 2.

ups, sorry.

Sebastian
Hao Luo Aug. 22, 2022, 6:01 p.m. UTC | #8
On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2022 11:21 AM, Hao Luo wrote:
> > Hi, Hou Tao
> >
> > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> >>> Hi Hou Tao,
> >>>
> >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> From: Hou Tao <houtao1@huawei.com>
> >>>>
> >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> >>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
> >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
> >>>> disallows concurrent updates from normal contexts (e.g. userspace
> >>>> processes) unexpectedly as shown below:
> >>>>
> >>>> process A                      process B
> >>>>
> >>>> htab_map_update_elem()
> >>>>   htab_lock_bucket()
> >>>>     migrate_disable()
> >>>>     /* return 1 */
> >>>>     __this_cpu_inc_return()
> >>>>     /* preempted by B */
> >>>>
> >>>>                                htab_map_update_elem()
> >>>>                                  /* the same bucket as A */
> >>>>                                  htab_lock_bucket()
> >>>>                                    migrate_disable()
> >>>>                                    /* return 2, so lock fails */
> >>>>                                    __this_cpu_inc_return()
> >>>>                                    return -EBUSY
> >>>>
> >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> >>>> only checking the value of map_locked for nmi context. But it will
> >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> >>>> through non-tracing program (e.g. fentry program).
> >>>>
> >>>> So fixing it by using disable_preempt() instead of migrate_disable() when
> >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> >>>> so still use migrate_disable() for spin-lock case.
> >>>>
> >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >>>> ---
> >>> IIUC, this patch enlarges the scope of preemption disable to cover inc
> >>> map_locked. But I don't think the change is meaningful.
> >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> >> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> >> case, so I don't think that the change is meaningless.
> >>> This patch only affects the case when raw lock is used. In the case of
> >>> raw lock, irq is disabled for b->raw_lock protected critical section.
> >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> >>> understanding about this patch is, it just makes sure preemption
> >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> >>> so small that it should be really unlikely to happen.
> >> No, it can be easily reproduced by running multiple htab update processes in the
> >> same CPU. Will add selftest to demonstrate that.
> > Can you clarify what you demonstrate?
> First please enable CONFIG_PREEMPT for the running kernel and then run the
> following test program as shown below.
>

Ah, fully preemptive kernel. It's worth mentioning that in the commit
message. Then it seems promoting migrate_disable to preempt_disable
may be the best way to solve the problem you described.

> # sudo taskset -c 2 ./update.bin
> thread nr 2
> wait for error
> update error -16
> all threads exit
>
> If there is no "update error -16", you can try to create more map update
> threads. For example running 16 update threads:
>
> # sudo taskset -c 2 ./update.bin 16
> thread nr 16
> wait for error
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> update error -16
> all threads exit
>
> The following is the source code for update.bin:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdbool.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <errno.h>
> #include <pthread.h>
>
> #include "bpf.h"
> #include "libbpf.h"
>
> static bool stop;
> static int fd;
>
> static void *update_fn(void *arg)
> {
>         while (!stop) {
>                 unsigned int key = 0, value = 1;
>                 int err;
>
>                 err = bpf_map_update_elem(fd, &key, &value, 0);
>                 if (err) {
>                         printf("update error %d\n", err);
>                         stop = true;
>                         break;
>                 }
>         }
>
>         return NULL;
> }
>
> int main(int argc, char **argv)
> {
>         LIBBPF_OPTS(bpf_map_create_opts, opts);
>         unsigned int i, nr;
>         pthread_t *tids;
>
>         nr = 2;
>         if (argc > 1)
>                 nr = atoi(argv[1]);
>         printf("thread nr %u\n", nr);
>
>         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
>         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
>         if (fd < 0) {
>                 printf("create map error %d\n", fd);
>                 return 1;
>         }
>
>         tids = malloc(nr * sizeof(*tids));
>         for (i = 0; i < nr; i++)
>                 pthread_create(&tids[i], NULL, update_fn, NULL);
>
>         printf("wait for error\n");
>         for (i = 0; i < nr; i++)
>                 pthread_join(tids[i], NULL);
>
>         printf("all threads exit\n");
>
>         free(tids);
>         close(fd);
>
>         return 0;
> }
>
> >
> > Here is my theory, but please correct me if I'm wrong, I haven't
> > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > migrate_disable. That is because very soon after migrate_disable, we
> > enter the critical section of b->raw_lock with irq disabled. In RT,
> > preemptions can happen on acquiring b->lock, that is certainly
> > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > by this patch.
> >
> >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>>> index 6c530a5e560a..ad09da139589 100644
> >>>> --- a/kernel/bpf/hashtab.c
> >>>> +++ b/kernel/bpf/hashtab.c
> >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> >>>>                                    unsigned long *pflags)
> >>>>  {
> >>>>         unsigned long flags;
> >>>> +       bool use_raw_lock;
> >>>>
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>>
> >>>> -       migrate_disable();
> >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_disable();
> >>>> +       else
> >>>> +               migrate_disable();
> >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -               migrate_enable();
> >>>> +               if (use_raw_lock)
> >>>> +                       preempt_enable();
> >>>> +               else
> >>>> +                       migrate_enable();
> >>>>                 return -EBUSY;
> >>>>         }
> >>>>
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_lock_irqsave(&b->lock, flags);
> >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> >>>>                                       struct bucket *b, u32 hash,
> >>>>                                       unsigned long flags)
> >>>>  {
> >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> >>>> +
> >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> >>>> -       if (htab_use_raw_lock(htab))
> >>>> +       if (use_raw_lock)
> >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> >>>>         else
> >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> >>>> -       migrate_enable();
> >>>> +       if (use_raw_lock)
> >>>> +               preempt_enable();
> >>>> +       else
> >>>> +               migrate_enable();
> >>>>  }
> >>>>
> >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> >>>> --
> >>>> 2.29.2
> >>>>
> >>> .
> > .
>
Hao Luo Aug. 23, 2022, 12:56 a.m. UTC | #9
On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > Hi, Hou Tao
> > >
> > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi,
> > >>
> > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > >>> Hi Hou Tao,
> > >>>
> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>
> > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > >>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
> > >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
> > >>>> disallows concurrent updates from normal contexts (e.g. userspace
> > >>>> processes) unexpectedly as shown below:
> > >>>>
> > >>>> process A                      process B
> > >>>>
> > >>>> htab_map_update_elem()
> > >>>>   htab_lock_bucket()
> > >>>>     migrate_disable()
> > >>>>     /* return 1 */
> > >>>>     __this_cpu_inc_return()
> > >>>>     /* preempted by B */
> > >>>>
> > >>>>                                htab_map_update_elem()
> > >>>>                                  /* the same bucket as A */
> > >>>>                                  htab_lock_bucket()
> > >>>>                                    migrate_disable()
> > >>>>                                    /* return 2, so lock fails */
> > >>>>                                    __this_cpu_inc_return()
> > >>>>                                    return -EBUSY
> > >>>>
> > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > >>>> only checking the value of map_locked for nmi context. But it will
> > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> > >>>> through non-tracing program (e.g. fentry program).
> > >>>>
> > >>>> So fixing it by using disable_preempt() instead of migrate_disable() when
> > >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> > >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> > >>>> so still use migrate_disable() for spin-lock case.
> > >>>>
> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >>>> ---
> > >>> IIUC, this patch enlarges the scope of preemption disable to cover inc
> > >>> map_locked. But I don't think the change is meaningful.
> > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> > >> RT"),  the preemption is disabled before increasing map_locked for !PREEMPT_RT
> > >> case, so I don't think that the change is meaningless.
> > >>> This patch only affects the case when raw lock is used. In the case of
> > >>> raw lock, irq is disabled for b->raw_lock protected critical section.
> > >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
> > >>> understanding about this patch is, it just makes sure preemption
> > >>> doesn't happen on the exact __this_cpu_inc_return. But the window is
> > >>> so small that it should be really unlikely to happen.
> > >> No, it can be easily reproduced by running multiple htab update processes in the
> > >> same CPU. Will add selftest to demonstrate that.
> > > Can you clarify what you demonstrate?
> > First please enable CONFIG_PREEMPT for the running kernel and then run the
> > following test program as shown below.
> >
>
> Ah, fully preemptive kernel. It's worth mentioning that in the commit
> message. Then it seems promoting migrate_disable to preempt_disable
> may be the best way to solve the problem you described.
>
> > # sudo taskset -c 2 ./update.bin
> > thread nr 2
> > wait for error
> > update error -16
> > all threads exit
> >
> > If there is no "update error -16", you can try to create more map update
> > threads. For example running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static bool stop;
> > static int fd;
> >
> > static void *update_fn(void *arg)
> > {
> >         while (!stop) {
> >                 unsigned int key = 0, value = 1;
> >                 int err;
> >
> >                 err = bpf_map_update_elem(fd, &key, &value, 0);
> >                 if (err) {
> >                         printf("update error %d\n", err);
> >                         stop = true;
> >                         break;
> >                 }
> >         }
> >
> >         return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> >         LIBBPF_OPTS(bpf_map_create_opts, opts);
> >         unsigned int i, nr;
> >         pthread_t *tids;
> >
> >         nr = 2;
> >         if (argc > 1)
> >                 nr = atoi(argv[1]);
> >         printf("thread nr %u\n", nr);
> >
> >         libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> >         fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> >         if (fd < 0) {
> >                 printf("create map error %d\n", fd);
> >                 return 1;
> >         }
> >
> >         tids = malloc(nr * sizeof(*tids));
> >         for (i = 0; i < nr; i++)
> >                 pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> >         printf("wait for error\n");
> >         for (i = 0; i < nr; i++)
> >                 pthread_join(tids[i], NULL);
> >
> >         printf("all threads exit\n");
> >
> >         free(tids);
> >         close(fd);
> >
> >         return 0;
> > }
> >

Tao, thanks very much for the test. I played it a bit and I can
confirm that map_update failures are seen under CONFIG_PREEMPT. The
failures are not present under CONFIG_PREEMPT_NONE or
CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
thinking of and they didn't work. It looks like Hou Tao's idea,
promoting migrate_disable to preempt_disable, is probably the best we
can do for the non-RT case. So

Reviewed-by: Hao Luo <haoluo@google.com>

But, I am not sure if we want to get rid of preemption-caused batch
map updates on preemptive kernels, or if there are better solutions to
address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.

> > >
> > > Here is my theory, but please correct me if I'm wrong, I haven't
> > > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > > migrate_disable. That is because very soon after migrate_disable, we
> > > enter the critical section of b->raw_lock with irq disabled. In RT,
> > > preemptions can happen on acquiring b->lock, that is certainly
> > > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
> > >
> > >>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
> > >>>>  1 file changed, 18 insertions(+), 5 deletions(-)
> > >>>>
> > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > >>>> index 6c530a5e560a..ad09da139589 100644
> > >>>> --- a/kernel/bpf/hashtab.c
> > >>>> +++ b/kernel/bpf/hashtab.c
> > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
> > >>>>                                    unsigned long *pflags)
> > >>>>  {
> > >>>>         unsigned long flags;
> > >>>> +       bool use_raw_lock;
> > >>>>
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>>
> > >>>> -       migrate_disable();
> > >>>> +       use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_disable();
> > >>>> +       else
> > >>>> +               migrate_disable();
> > >>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
> > >>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -               migrate_enable();
> > >>>> +               if (use_raw_lock)
> > >>>> +                       preempt_enable();
> > >>>> +               else
> > >>>> +                       migrate_enable();
> > >>>>                 return -EBUSY;
> > >>>>         }
> > >>>>
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_lock_irqsave(&b->lock, flags);
> > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
> > >>>>                                       struct bucket *b, u32 hash,
> > >>>>                                       unsigned long flags)
> > >>>>  {
> > >>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
> > >>>> +
> > >>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> > >>>> -       if (htab_use_raw_lock(htab))
> > >>>> +       if (use_raw_lock)
> > >>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
> > >>>>         else
> > >>>>                 spin_unlock_irqrestore(&b->lock, flags);
> > >>>>         __this_cpu_dec(*(htab->map_locked[hash]));
> > >>>> -       migrate_enable();
> > >>>> +       if (use_raw_lock)
> > >>>> +               preempt_enable();
> > >>>> +       else
> > >>>> +               migrate_enable();
> > >>>>  }
> > >>>>
> > >>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> > >>>> --
> > >>>> 2.29.2
> > >>>>
> > >>> .
> > > .
> >
Alexei Starovoitov Aug. 23, 2022, 1:29 a.m. UTC | #10
On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >
> > > Hi,
> > >
> > > On 8/22/2022 11:21 AM, Hao Luo wrote:
> > > > Hi, Hou Tao
> > > >
> > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >> Hi,
> > > >>
> > > >> On 8/22/2022 12:42 AM, Hao Luo wrote:
> > > >>> Hi Hou Tao,
> > > >>>
> > > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > > >>>> From: Hou Tao <houtao1@huawei.com>
> > > >>>>
> > > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
> > > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> > > >>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
> > > >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
> > > >>>> disallows concurrent updates from normal contexts (e.g. userspace
> > > >>>> processes) unexpectedly as shown below:
> > > >>>>
> > > >>>> process A                      process B
> > > >>>>
> > > >>>> htab_map_update_elem()
> > > >>>>   htab_lock_bucket()
> > > >>>>     migrate_disable()
> > > >>>>     /* return 1 */
> > > >>>>     __this_cpu_inc_return()
> > > >>>>     /* preempted by B */
> > > >>>>
> > > >>>>                                htab_map_update_elem()
> > > >>>>                                  /* the same bucket as A */
> > > >>>>                                  htab_lock_bucket()
> > > >>>>                                    migrate_disable()
> > > >>>>                                    /* return 2, so lock fails */
> > > >>>>                                    __this_cpu_inc_return()
> > > >>>>                                    return -EBUSY
> > > >>>>
> > > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> > > >>>> only checking the value of map_locked for nmi context. But it will
> > > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered
> > > >>>> through non-tracing program (e.g. fentry program).
> > > >>>>
> > > >>>> So fixing it by using disable_preempt() instead of migrate_disable() when
> > > >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false,
> > > >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(),
> > > >>>> so still use migrate_disable() for spin-lock case.
> > > >>>>
> > > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> Tao, thanks very much for the test. I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So

preempt_disable is also faster than migrate_disable,
so patch 1 will not only fix this issue but will also improve performance.

Patch 2 is too hacky though.
I think it's better to wait until my bpf_mem_alloc patches land.
RT case won't be special anymore. We will be able to remove
htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
With bpf_mem_alloc there is no inline memory allocation anymore.

So please address Hao's comments, add a test and
resubmit patches 1 and 3.
Also please use [PATCH bpf-next] in the subject to help BPF CI
and patchwork scripts.
Hou Tao Aug. 23, 2022, 2:54 a.m. UTC | #11
Hi,

On 8/23/2022 8:56 AM, Hao Luo wrote:
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 8/22/2022 11:21 AM, Hao Luo wrote:
>>>> Hi, Hou Tao
>>>>
>>>> On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/22/2022 12:42 AM, Hao Luo wrote:
>>>>>> Hi Hou Tao,
>>>>>>
>>>>>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>
>>>>>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses
>>>>>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
>>>>>>> Make migrate_disable/enable() independent of RT"), migrations_disable()
>>>>>>> is also preemptible under !PREEMPT_RT case, so now map_locked also
>>>>>>> disallows concurrent updates from normal contexts (e.g. userspace
>>>>>>> processes) unexpectedly as shown below:
>>>>>>>
>>>>>>> process A                      process B
>>>>>>>
SNIP
>>>>>>> First please enable CONFIG_PREEMPT for the running kernel and then run the
>>>>>>> following test program as shown below.
>>>>>>>
>> Ah, fully preemptive kernel. It's worth mentioning that in the commit
>> message. Then it seems promoting migrate_disable to preempt_disable
>> may be the best way to solve the problem you described.
Sorry, I missed that. Will add it in v2.
>>
>>> # sudo taskset -c 2 ./update.bin
>>> thread nr 2
>>> wait for error
>>> update error -16
>>> all threads exit
>>>
>>> If there is no "update error -16", you can try to create more map update
>>> threads. For example running 16 update threads:
>>>
>>> # sudo taskset -c 2 ./update.bin 16
>>> thread nr 16
>>> wait for error
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> all threads exit
SNIP
> Tao, thanks very much for the test. I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So
>
> Reviewed-by: Hao Luo <haoluo@google.com>
>
> But, I am not sure if we want to get rid of preemption-caused batch
> map updates on preemptive kernels, or if there are better solutions to
> address [0].
Thanks for your review.
Under a preemptive kernel, if preemption is disabled during batch map lookup
or update, htab_lock_bucket() from userspace will not fail, so I think it is
OK for now.
>
> Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.
Will do.
>>>> Here is my theory, but please correct me if I'm wrong, I haven't
>>>> tested yet. In non-RT, I doubt preemptions are likely to happen after
>>>> migrate_disable. That is because very soon after migrate_disable, we
>>>> enter the critical section of b->raw_lock with irq disabled. In RT,
>>>> preemptions can happen on acquiring b->lock, that is certainly
>>>> possible, but this is the !use_raw_lock path, which isn't side-stepped
>>>> by this patch.
>>>>
>>>>>>>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>>>>>>>  1 file changed, 18 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>>>> index 6c530a5e560a..ad09da139589 100644
>>>>>>> --- a/kernel/bpf/hashtab.c
>>>>>>> +++ b/kernel/bpf/hashtab.c
>>>>>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>>>>>>>                                    unsigned long *pflags)
>>>>>>>  {
>>>>>>>         unsigned long flags;
>>>>>>> +       bool use_raw_lock;
>>>>>>>
>>>>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>>>>
>>>>>>> -       migrate_disable();
>>>>>>> +       use_raw_lock = htab_use_raw_lock(htab);
>>>>>>> +       if (use_raw_lock)
>>>>>>> +               preempt_disable();
>>>>>>> +       else
>>>>>>> +               migrate_disable();
>>>>>>>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>>>>>>>                 __this_cpu_dec(*(htab->map_locked[hash]));
>>>>>>> -               migrate_enable();
>>>>>>> +               if (use_raw_lock)
>>>>>>> +                       preempt_enable();
>>>>>>> +               else
>>>>>>> +                       migrate_enable();
>>>>>>>                 return -EBUSY;
>>>>>>>         }
>>>>>>>
>>>>>>> -       if (htab_use_raw_lock(htab))
>>>>>>> +       if (use_raw_lock)
>>>>>>>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>>>>>>>         else
>>>>>>>                 spin_lock_irqsave(&b->lock, flags);
>>>>>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>>>>>>>                                       struct bucket *b, u32 hash,
>>>>>>>                                       unsigned long flags)
>>>>>>>  {
>>>>>>> +       bool use_raw_lock = htab_use_raw_lock(htab);
>>>>>>> +
>>>>>>>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>>>>>>> -       if (htab_use_raw_lock(htab))
>>>>>>> +       if (use_raw_lock)
>>>>>>>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>>>>>>>         else
>>>>>>>                 spin_unlock_irqrestore(&b->lock, flags);
>>>>>>>         __this_cpu_dec(*(htab->map_locked[hash]));
>>>>>>> -       migrate_enable();
>>>>>>> +       if (use_raw_lock)
>>>>>>> +               preempt_enable();
>>>>>>> +       else
>>>>>>> +               migrate_enable();
>>>>>>>  }
>>>>>>>
>>>>>>>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
>>>>>>> --
>>>>>>> 2.29.2
>>>>>>>
>>>>>> .
>>>> .
Hou Tao Aug. 23, 2022, 2:57 a.m. UTC | #12
Hi,

On 8/23/2022 9:29 AM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>>
SNIP
>> Tao, thanks very much for the test. I played it a bit and I can
>> confirm that map_update failures are seen under CONFIG_PREEMPT. The
>> failures are not present under CONFIG_PREEMPT_NONE or
>> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
>> thinking of and they didn't work. It looks like Hou Tao's idea,
>> promoting migrate_disable to preempt_disable, is probably the best we
>> can do for the non-RT case. So
> preempt_disable is also faster than migrate_disable,
> so patch 1 will not only fix this issue, but will improve performance.
>
> Patch 2 is too hacky though.
> I think it's better to wait until my bpf_mem_alloc patches land.
> RT case won't be special anymore. We will be able to remove
> htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
> With bpf_mem_alloc there is no inline memory allocation anymore.
OK. Looking forward to the landing of the BPF-specific memory allocator.
>
> So please address Hao's comments, add a test and
> resubmit patches 1 and 3.
> Also please use [PATCH bpf-next] in the subject to help BPF CI
> and patchwork scripts.
Will do. And to bpf-next instead of bpf?
> .
Alexei Starovoitov Aug. 23, 2022, 4:50 a.m. UTC | #13
On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/23/2022 9:29 AM, Alexei Starovoitov wrote:
> > On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
> >>
> SNIP
> >> Tao, thanks very much for the test. I played it a bit and I can
> >> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> >> failures are not present under CONFIG_PREEMPT_NONE or
> >> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> >> thinking of and they didn't work. It looks like Hou Tao's idea,
> >> promoting migrate_disable to preempt_disable, is probably the best we
> >> can do for the non-RT case. So
> > preempt_disable is also faster than migrate_disable,
> > so patch 1 will not only fix this issue, but will improve performance.
> >
> > Patch 2 is too hacky though.
> > I think it's better to wait until my bpf_mem_alloc patches land.
> > RT case won't be special anymore. We will be able to remove
> > htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
> > With bpf_mem_alloc there is no inline memory allocation anymore.
> OK. Look forwards to the landing of BPF specific memory allocator.
> >
> > So please address Hao's comments, add a test and
> > resubmit patches 1 and 3.
> > Also please use [PATCH bpf-next] in the subject to help BPF CI
> > and patchwork scripts.
> Will do. And to bpf-next instead of bpf ?

bpf-next is almost always preferred for fixes for corner
cases that have been around for some time.
The bpf tree is for security and high-priority fixes.
bpf-next gives fixes time to prove themselves.
Hou Tao Aug. 23, 2022, 6:41 a.m. UTC | #14
Hi,

On 8/23/2022 12:50 PM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
SNIP
>>> So please address Hao's comments, add a test and
>>> resubmit patches 1 and 3.
>>> Also please use [PATCH bpf-next] in the subject to help BPF CI
>>> and patchwork scripts.
>> Will do. And to bpf-next instead of bpf ?
> bpf-next is almost always prefered for fixes for corner
> cases that have been around for some time.
> bpf tree is for security and high pri fixes.
> bpf-next gives time for fixes to prove themselves.
> .
Understood. Thanks for the explanation.

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6c530a5e560a..ad09da139589 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -162,17 +162,25 @@  static inline int htab_lock_bucket(const struct bpf_htab *htab,
 				   unsigned long *pflags)
 {
 	unsigned long flags;
+	bool use_raw_lock;
 
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
 
-	migrate_disable();
+	use_raw_lock = htab_use_raw_lock(htab);
+	if (use_raw_lock)
+		preempt_disable();
+	else
+		migrate_disable();
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
 		__this_cpu_dec(*(htab->map_locked[hash]));
-		migrate_enable();
+		if (use_raw_lock)
+			preempt_enable();
+		else
+			migrate_enable();
 		return -EBUSY;
 	}
 
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_lock_irqsave(&b->raw_lock, flags);
 	else
 		spin_lock_irqsave(&b->lock, flags);
@@ -185,13 +193,18 @@  static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      struct bucket *b, u32 hash,
 				      unsigned long flags)
 {
+	bool use_raw_lock = htab_use_raw_lock(htab);
+
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
 	else
 		spin_unlock_irqrestore(&b->lock, flags);
 	__this_cpu_dec(*(htab->map_locked[hash]));
-	migrate_enable();
+	if (use_raw_lock)
+		preempt_enable();
+	else
+		migrate_enable();
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
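
With the hunks above applied, htab_lock_bucket() reads roughly as follows
(the parameter list and the trailing *pflags assignment are filled in from
the surrounding kernel code and may differ slightly):

static inline int htab_lock_bucket(const struct bpf_htab *htab,
				   struct bucket *b, u32 hash,
				   unsigned long *pflags)
{
	unsigned long flags;
	bool use_raw_lock;

	hash = hash & HASHTAB_MAP_LOCK_MASK;

	use_raw_lock = htab_use_raw_lock(htab);
	if (use_raw_lock)
		preempt_disable();	/* closes the preemption window before the check */
	else
		migrate_disable();	/* the sleepable spin_lock below rules out preempt_disable() */
	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
		__this_cpu_dec(*(htab->map_locked[hash]));
		if (use_raw_lock)
			preempt_enable();
		else
			migrate_enable();
		return -EBUSY;
	}

	if (use_raw_lock)
		raw_spin_lock_irqsave(&b->raw_lock, flags);
	else
		spin_lock_irqsave(&b->lock, flags);
	*pflags = flags;

	return 0;
}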