Message ID | 20220821033223.2598791-2-houtao@huaweicloud.com (mailing list archive)
---|---
State | Superseded
Delegated to | BPF
Series | fixes for concurrent htab updates
Hi Hou Tao,

On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Per-cpu htab->map_locked is used to prohibit concurrent accesses
> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched:
> Make migrate_disable/enable() independent of RT"), migrate_disable()
> is also preemptible in the !PREEMPT_RT case, so now map_locked also
> unexpectedly disallows concurrent updates from normal contexts
> (e.g. userspace processes) as shown below:
>
>         process A                       process B
>
>         htab_map_update_elem()
>           htab_lock_bucket()
>             migrate_disable()
>             /* return 1 */
>             __this_cpu_inc_return()
>             /* preempted by B */
>
>                                         htab_map_update_elem()
>                                         /* the same bucket as A */
>                                         htab_lock_bucket()
>                                           migrate_disable()
>                                           /* return 2, so lock fails */
>                                           __this_cpu_inc_return()
>                                           return -EBUSY
>
> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and
> only checking the value of map_locked for NMI context. But that would
> re-introduce a deadlock on the bucket lock if htab_lock_bucket() is
> re-entered through a non-tracing program (e.g. a fentry program).
>
> So fix it by using preempt_disable() instead of migrate_disable() when
> increasing htab->map_locked. However, when htab_use_raw_lock() is false,
> the bucket lock is a sleepable spin-lock and that breaks preempt_disable(),
> so still use migrate_disable() for the spin-lock case.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---

IIUC, this patch enlarges the scope of preemption disable to cover the
increment of map_locked. But I don't think the change is meaningful.

This patch only affects the case when the raw lock is used. In the case of
the raw lock, irqs are disabled for the b->raw_lock protected critical
section. A raw spin lock itself doesn't block, in both RT and non-RT. So my
understanding of this patch is that it just makes sure preemption doesn't
happen on the exact __this_cpu_inc_return. But the window is so small that
it should be really unlikely to happen.
>  kernel/bpf/hashtab.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 6c530a5e560a..ad09da139589 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
>                                     unsigned long *pflags)
>  {
>         unsigned long flags;
> +       bool use_raw_lock;
>
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
>
> -       migrate_disable();
> +       use_raw_lock = htab_use_raw_lock(htab);
> +       if (use_raw_lock)
> +               preempt_disable();
> +       else
> +               migrate_disable();
>         if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
>                 __this_cpu_dec(*(htab->map_locked[hash]));
> -               migrate_enable();
> +               if (use_raw_lock)
> +                       preempt_enable();
> +               else
> +                       migrate_enable();
>                 return -EBUSY;
>         }
>
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_lock_irqsave(&b->raw_lock, flags);
>         else
>                 spin_lock_irqsave(&b->lock, flags);
> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
>                                       struct bucket *b, u32 hash,
>                                       unsigned long flags)
>  {
> +       bool use_raw_lock = htab_use_raw_lock(htab);
> +
>         hash = hash & HASHTAB_MAP_LOCK_MASK;
> -       if (htab_use_raw_lock(htab))
> +       if (use_raw_lock)
>                 raw_spin_unlock_irqrestore(&b->raw_lock, flags);
>         else
>                 spin_unlock_irqrestore(&b->lock, flags);
>         __this_cpu_dec(*(htab->map_locked[hash]));
> -       migrate_enable();
> +       if (use_raw_lock)
> +               preempt_enable();
> +       else
> +               migrate_enable();
>  }
>
>  static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);
> --
> 2.29.2
>
Hi, On 8/22/2022 12:42 AM, Hao Luo wrote: > Hi Hou Tao, > > On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> Per-cpu htab->map_locked is used to prohibit the concurrent accesses >> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: >> Make migrate_disable/enable() independent of RT"), migrations_disable() >> is also preemptible under !PREEMPT_RT case, so now map_locked also >> disallows concurrent updates from normal contexts (e.g. userspace >> processes) unexpectedly as shown below: >> >> process A process B >> >> htab_map_update_elem() >> htab_lock_bucket() >> migrate_disable() >> /* return 1 */ >> __this_cpu_inc_return() >> /* preempted by B */ >> >> htab_map_update_elem() >> /* the same bucket as A */ >> htab_lock_bucket() >> migrate_disable() >> /* return 2, so lock fails */ >> __this_cpu_inc_return() >> return -EBUSY >> >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and >> only checking the value of map_locked for nmi context. But it will >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered >> through non-tracing program (e.g. fentry program). >> >> So fixing it by using disable_preempt() instead of migrate_disable() when >> increasing htab->map_locked. However when htab_use_raw_lock() is false, >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), >> so still use migrate_disable() for spin-lock case. >> >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> --- > IIUC, this patch enlarges the scope of preemption disable to cover inc > map_locked. But I don't think the change is meaningful. Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of RT"), the preemption is disabled before increasing map_locked for !PREEMPT_RT case, so I don't think that the change is meaningless. > > This patch only affects the case when raw lock is used. 
In the case of > raw lock, irq is disabled for b->raw_lock protected critical section. > A raw spin lock itself doesn't block in both RT and non-RT. So, my > understanding about this patch is, it just makes sure preemption > doesn't happen on the exact __this_cpu_inc_return. But the window is > so small that it should be really unlikely to happen. No, it can be easily reproduced by running multiple htab update processes in the same CPU. Will add selftest to demonstrate that. > >> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- >> 1 file changed, 18 insertions(+), 5 deletions(-) >> >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 6c530a5e560a..ad09da139589 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, >> unsigned long *pflags) >> { >> unsigned long flags; >> + bool use_raw_lock; >> >> hash = hash & HASHTAB_MAP_LOCK_MASK; >> >> - migrate_disable(); >> + use_raw_lock = htab_use_raw_lock(htab); >> + if (use_raw_lock) >> + preempt_disable(); >> + else >> + migrate_disable(); >> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { >> __this_cpu_dec(*(htab->map_locked[hash])); >> - migrate_enable(); >> + if (use_raw_lock) >> + preempt_enable(); >> + else >> + migrate_enable(); >> return -EBUSY; >> } >> >> - if (htab_use_raw_lock(htab)) >> + if (use_raw_lock) >> raw_spin_lock_irqsave(&b->raw_lock, flags); >> else >> spin_lock_irqsave(&b->lock, flags); >> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, >> struct bucket *b, u32 hash, >> unsigned long flags) >> { >> + bool use_raw_lock = htab_use_raw_lock(htab); >> + >> hash = hash & HASHTAB_MAP_LOCK_MASK; >> - if (htab_use_raw_lock(htab)) >> + if (use_raw_lock) >> raw_spin_unlock_irqrestore(&b->raw_lock, flags); >> else >> spin_unlock_irqrestore(&b->lock, flags); >> __this_cpu_dec(*(htab->map_locked[hash])); >> - migrate_enable(); >> + 
if (use_raw_lock) >> + preempt_enable(); >> + else >> + migrate_enable(); >> } >> >> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); >> -- >> 2.29.2 >> > .
Hi, Hou Tao On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 8/22/2022 12:42 AM, Hao Luo wrote: > > Hi Hou Tao, > > > > On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote: > >> From: Hou Tao <houtao1@huawei.com> > >> > >> Per-cpu htab->map_locked is used to prohibit the concurrent accesses > >> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: > >> Make migrate_disable/enable() independent of RT"), migrations_disable() > >> is also preemptible under !PREEMPT_RT case, so now map_locked also > >> disallows concurrent updates from normal contexts (e.g. userspace > >> processes) unexpectedly as shown below: > >> > >> process A process B > >> > >> htab_map_update_elem() > >> htab_lock_bucket() > >> migrate_disable() > >> /* return 1 */ > >> __this_cpu_inc_return() > >> /* preempted by B */ > >> > >> htab_map_update_elem() > >> /* the same bucket as A */ > >> htab_lock_bucket() > >> migrate_disable() > >> /* return 2, so lock fails */ > >> __this_cpu_inc_return() > >> return -EBUSY > >> > >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and > >> only checking the value of map_locked for nmi context. But it will > >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered > >> through non-tracing program (e.g. fentry program). > >> > >> So fixing it by using disable_preempt() instead of migrate_disable() when > >> increasing htab->map_locked. However when htab_use_raw_lock() is false, > >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), > >> so still use migrate_disable() for spin-lock case. > >> > >> Signed-off-by: Hou Tao <houtao1@huawei.com> > >> --- > > IIUC, this patch enlarges the scope of preemption disable to cover inc > > map_locked. But I don't think the change is meaningful. 
> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
> RT"), the preemption is disabled before increasing map_locked for the
> !PREEMPT_RT case, so I don't think that the change is meaningless.
> > This patch only affects the case when raw lock is used. In the case of
> > raw lock, irq is disabled for b->raw_lock protected critical section.
> > A raw spin lock itself doesn't block in both RT and non-RT. So, my
> > understanding about this patch is, it just makes sure preemption
> > doesn't happen on the exact __this_cpu_inc_return. But the window is
> > so small that it should be really unlikely to happen.
> No, it can be easily reproduced by running multiple htab update processes on
> the same CPU. Will add a selftest to demonstrate that.

Can you clarify what you demonstrate?

Here is my theory, but please correct me if I'm wrong; I haven't tested yet.
In non-RT, I doubt preemptions are likely to happen after migrate_disable().
That is because very soon after migrate_disable(), we enter the critical
section of b->raw_lock with irqs disabled. In RT, preemptions can happen on
acquiring b->lock; that is certainly possible, but this is the !use_raw_lock
path, which isn't side-stepped by this patch.
> > > >> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- > >> 1 file changed, 18 insertions(+), 5 deletions(-) > >> > >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > >> index 6c530a5e560a..ad09da139589 100644 > >> --- a/kernel/bpf/hashtab.c > >> +++ b/kernel/bpf/hashtab.c > >> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, > >> unsigned long *pflags) > >> { > >> unsigned long flags; > >> + bool use_raw_lock; > >> > >> hash = hash & HASHTAB_MAP_LOCK_MASK; > >> > >> - migrate_disable(); > >> + use_raw_lock = htab_use_raw_lock(htab); > >> + if (use_raw_lock) > >> + preempt_disable(); > >> + else > >> + migrate_disable(); > >> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { > >> __this_cpu_dec(*(htab->map_locked[hash])); > >> - migrate_enable(); > >> + if (use_raw_lock) > >> + preempt_enable(); > >> + else > >> + migrate_enable(); > >> return -EBUSY; > >> } > >> > >> - if (htab_use_raw_lock(htab)) > >> + if (use_raw_lock) > >> raw_spin_lock_irqsave(&b->raw_lock, flags); > >> else > >> spin_lock_irqsave(&b->lock, flags); > >> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, > >> struct bucket *b, u32 hash, > >> unsigned long flags) > >> { > >> + bool use_raw_lock = htab_use_raw_lock(htab); > >> + > >> hash = hash & HASHTAB_MAP_LOCK_MASK; > >> - if (htab_use_raw_lock(htab)) > >> + if (use_raw_lock) > >> raw_spin_unlock_irqrestore(&b->raw_lock, flags); > >> else > >> spin_unlock_irqrestore(&b->lock, flags); > >> __this_cpu_dec(*(htab->map_locked[hash])); > >> - migrate_enable(); > >> + if (use_raw_lock) > >> + preempt_enable(); > >> + else > >> + migrate_enable(); > >> } > >> > >> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); > >> -- > >> 2.29.2 > >> > > . >
On 2022-08-21 11:32:21 [+0800], Hou Tao wrote: > process A process B > > htab_map_update_elem() > htab_lock_bucket() > migrate_disable() > /* return 1 */ > __this_cpu_inc_return() > /* preempted by B */ > > htab_map_update_elem() > /* the same bucket as A */ > htab_lock_bucket() > migrate_disable() > /* return 2, so lock fails */ > __this_cpu_inc_return() > return -EBUSY > > A fix that seems feasible is using in_nmi() in htab_lock_bucket() and > only checking the value of map_locked for nmi context. But it will > re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered > through non-tracing program (e.g. fentry program). > > So fixing it by using disable_preempt() instead of migrate_disable() when > increasing htab->map_locked. However when htab_use_raw_lock() is false, > bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), > so still use migrate_disable() for spin-lock case. But isn't the RT case still affected by the very same problem? > Signed-off-by: Hou Tao <houtao1@huawei.com> Sebastian
Hi, On 8/22/2022 11:21 AM, Hao Luo wrote: > Hi, Hou Tao > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 8/22/2022 12:42 AM, Hao Luo wrote: >>> Hi Hou Tao, >>> >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>> From: Hou Tao <houtao1@huawei.com> >>>> >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: >>>> Make migrate_disable/enable() independent of RT"), migrations_disable() >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also >>>> disallows concurrent updates from normal contexts (e.g. userspace >>>> processes) unexpectedly as shown below: >>>> >>>> process A process B >>>> >>>> htab_map_update_elem() >>>> htab_lock_bucket() >>>> migrate_disable() >>>> /* return 1 */ >>>> __this_cpu_inc_return() >>>> /* preempted by B */ >>>> >>>> htab_map_update_elem() >>>> /* the same bucket as A */ >>>> htab_lock_bucket() >>>> migrate_disable() >>>> /* return 2, so lock fails */ >>>> __this_cpu_inc_return() >>>> return -EBUSY >>>> >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and >>>> only checking the value of map_locked for nmi context. But it will >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered >>>> through non-tracing program (e.g. fentry program). >>>> >>>> So fixing it by using disable_preempt() instead of migrate_disable() when >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false, >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), >>>> so still use migrate_disable() for spin-lock case. >>>> >>>> Signed-off-by: Hou Tao <houtao1@huawei.com> >>>> --- >>> IIUC, this patch enlarges the scope of preemption disable to cover inc >>> map_locked. But I don't think the change is meaningful. 
>> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of
>> RT"), the preemption is disabled before increasing map_locked for !PREEMPT_RT
>> case, so I don't think that the change is meaningless.
>>> This patch only affects the case when raw lock is used. In the case of
>>> raw lock, irq is disabled for b->raw_lock protected critical section.
>>> A raw spin lock itself doesn't block in both RT and non-RT. So, my
>>> understanding about this patch is, it just makes sure preemption
>>> doesn't happen on the exact __this_cpu_inc_return. But the window is
>>> so small that it should be really unlikely to happen.
>> No, it can be easily reproduced by running multiple htab update processes in the
>> same CPU. Will add selftest to demonstrate that.
> Can you clarify what you demonstrate?
First please enable CONFIG_PREEMPT for the running kernel and then run the
following test program as shown below.

# sudo taskset -c 2 ./update.bin
thread nr 2
wait for error
update error -16
all threads exit

If there is no "update error -16", you can try to create more map update
threads. For example running 16 update threads:

# sudo taskset -c 2 ./update.bin 16
thread nr 16
wait for error
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
update error -16
all threads exit

The following is the source code for update.bin:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

#include "bpf.h"
#include "libbpf.h"

static bool stop;
static int fd;

static void *update_fn(void *arg)
{
	while (!stop) {
		unsigned int key = 0, value = 1;
		int err;

		err = bpf_map_update_elem(fd, &key, &value, 0);
		if (err) {
			printf("update error %d\n", err);
			stop = true;
			break;
		}
	}

	return NULL;
}

int main(int argc, char **argv)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts);
	unsigned int i, nr;
	pthread_t *tids;

	nr = 2;
	if (argc > 1)
		nr = atoi(argv[1]);
	printf("thread nr %u\n", nr);

	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
	fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
	if (fd < 0) {
		printf("create map error %d\n", fd);
		return 1;
	}

	tids = malloc(nr * sizeof(*tids));
	for (i = 0; i < nr; i++)
		pthread_create(&tids[i], NULL, update_fn, NULL);

	printf("wait for error\n");
	for (i = 0; i < nr; i++)
		pthread_join(tids[i], NULL);

	printf("all threads exit\n");

	free(tids);
	close(fd);

	return 0;
}

> Here is my theory, but please correct me if I'm wrong, I haven't
> tested yet. In non-RT, I doubt preemptions are likely to happen after
> migrate_disable. That is because very soon after migrate_disable, we
> enter the critical section of b->raw_lock with irq disabled. In RT,
> preemptions can happen on acquiring b->lock, that is certainly
> possible, but this is the !use_raw_lock path, which isn't side-stepped
> by this patch.
> >>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- >>>> 1 file changed, 18 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >>>> index 6c530a5e560a..ad09da139589 100644 >>>> --- a/kernel/bpf/hashtab.c >>>> +++ b/kernel/bpf/hashtab.c >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, >>>> unsigned long *pflags) >>>> { >>>> unsigned long flags; >>>> + bool use_raw_lock; >>>> >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; >>>> >>>> - migrate_disable(); >>>> + use_raw_lock = htab_use_raw_lock(htab); >>>> + if (use_raw_lock) >>>> + preempt_disable(); >>>> + else >>>> + migrate_disable(); >>>> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { >>>> __this_cpu_dec(*(htab->map_locked[hash])); >>>> - migrate_enable(); >>>> + if (use_raw_lock) >>>> + preempt_enable(); >>>> + else >>>> + migrate_enable(); >>>> return -EBUSY; >>>> } >>>> >>>> - if (htab_use_raw_lock(htab)) >>>> + if (use_raw_lock) >>>> raw_spin_lock_irqsave(&b->raw_lock, flags); >>>> else >>>> spin_lock_irqsave(&b->lock, flags); >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, >>>> struct bucket *b, u32 hash, >>>> unsigned long flags) >>>> { >>>> + bool use_raw_lock = htab_use_raw_lock(htab); >>>> + >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; >>>> - if (htab_use_raw_lock(htab)) >>>> + if (use_raw_lock) >>>> raw_spin_unlock_irqrestore(&b->raw_lock, flags); >>>> else >>>> spin_unlock_irqrestore(&b->lock, flags); >>>> __this_cpu_dec(*(htab->map_locked[hash])); >>>> - migrate_enable(); >>>> + if (use_raw_lock) >>>> + preempt_enable(); >>>> + else >>>> + migrate_enable(); >>>> } >>>> >>>> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); >>>> -- >>>> 2.29.2 >>>> >>> . > .
Hi, On 8/22/2022 4:13 PM, Sebastian Andrzej Siewior wrote: > On 2022-08-21 11:32:21 [+0800], Hou Tao wrote: >> process A process B >> >> htab_map_update_elem() >> htab_lock_bucket() >> migrate_disable() >> /* return 1 */ >> __this_cpu_inc_return() >> /* preempted by B */ >> >> htab_map_update_elem() >> /* the same bucket as A */ >> htab_lock_bucket() >> migrate_disable() >> /* return 2, so lock fails */ >> __this_cpu_inc_return() >> return -EBUSY >> >> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and >> only checking the value of map_locked for nmi context. But it will >> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered >> through non-tracing program (e.g. fentry program). >> >> So fixing it by using disable_preempt() instead of migrate_disable() when >> increasing htab->map_locked. However when htab_use_raw_lock() is false, >> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), >> so still use migrate_disable() for spin-lock case. > But isn't the RT case still affected by the very same problem? As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in patch 2. > >> Signed-off-by: Hou Tao <houtao1@huawei.com> > Sebastian > .
On 2022-08-22 20:09:47 [+0800], Hou Tao wrote: > Hi, Hi, > > But isn't the RT case still affected by the very same problem? > As said in patch 0, the CONFIG_PREEMPT_RT && non-preallocated case is fixed in > patch 2. ups, sorry. Sebastian
On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 8/22/2022 11:21 AM, Hao Luo wrote: > > Hi, Hou Tao > > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hi, > >> > >> On 8/22/2022 12:42 AM, Hao Luo wrote: > >>> Hi Hou Tao, > >>> > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote: > >>>> From: Hou Tao <houtao1@huawei.com> > >>>> > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: > >>>> Make migrate_disable/enable() independent of RT"), migrations_disable() > >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also > >>>> disallows concurrent updates from normal contexts (e.g. userspace > >>>> processes) unexpectedly as shown below: > >>>> > >>>> process A process B > >>>> > >>>> htab_map_update_elem() > >>>> htab_lock_bucket() > >>>> migrate_disable() > >>>> /* return 1 */ > >>>> __this_cpu_inc_return() > >>>> /* preempted by B */ > >>>> > >>>> htab_map_update_elem() > >>>> /* the same bucket as A */ > >>>> htab_lock_bucket() > >>>> migrate_disable() > >>>> /* return 2, so lock fails */ > >>>> __this_cpu_inc_return() > >>>> return -EBUSY > >>>> > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and > >>>> only checking the value of map_locked for nmi context. But it will > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered > >>>> through non-tracing program (e.g. fentry program). > >>>> > >>>> So fixing it by using disable_preempt() instead of migrate_disable() when > >>>> increasing htab->map_locked. However when htab_use_raw_lock() is false, > >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), > >>>> so still use migrate_disable() for spin-lock case. 
> >>>> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com> > >>>> --- > >>> IIUC, this patch enlarges the scope of preemption disable to cover inc > >>> map_locked. But I don't think the change is meaningful. > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of > >> RT"), the preemption is disabled before increasing map_locked for !PREEMPT_RT > >> case, so I don't think that the change is meaningless. > >>> This patch only affects the case when raw lock is used. In the case of > >>> raw lock, irq is disabled for b->raw_lock protected critical section. > >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my > >>> understanding about this patch is, it just makes sure preemption > >>> doesn't happen on the exact __this_cpu_inc_return. But the window is > >>> so small that it should be really unlikely to happen. > >> No, it can be easily reproduced by running multiple htab update processes in the > >> same CPU. Will add selftest to demonstrate that. > > Can you clarify what you demonstrate? > First please enable CONFIG_PREEMPT for the running kernel and then run the > following test program as shown below. > Ah, fully preemptive kernel. It's worth mentioning that in the commit message. Then it seems promoting migrate_disable to preempt_disable may be the best way to solve the problem you described. > # sudo taskset -c 2 ./update.bin > thread nr 2 > wait for error > update error -16 > all threads exit > > If there is no "update error -16", you can try to create more map update > threads. 
For example running 16 update threads: > > # sudo taskset -c 2 ./update.bin 16 > thread nr 16 > wait for error > update error -16 > update error -16 > update error -16 > update error -16 > update error -16 > update error -16 > update error -16 > update error -16 > all threads exit > > The following is the source code for update.bin: > > #define _GNU_SOURCE > #include <stdio.h> > #include <stdbool.h> > #include <stdlib.h> > #include <unistd.h> > #include <string.h> > #include <errno.h> > #include <pthread.h> > > #include "bpf.h" > #include "libbpf.h" > > static bool stop; > static int fd; > > static void *update_fn(void *arg) > { > while (!stop) { > unsigned int key = 0, value = 1; > int err; > > err = bpf_map_update_elem(fd, &key, &value, 0); > if (err) { > printf("update error %d\n", err); > stop = true; > break; > } > } > > return NULL; > } > > int main(int argc, char **argv) > { > LIBBPF_OPTS(bpf_map_create_opts, opts); > unsigned int i, nr; > pthread_t *tids; > > nr = 2; > if (argc > 1) > nr = atoi(argv[1]); > printf("thread nr %u\n", nr); > > libbpf_set_strict_mode(LIBBPF_STRICT_ALL); > fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts); > if (fd < 0) { > printf("create map error %d\n", fd); > return 1; > } > > tids = malloc(nr * sizeof(*tids)); > for (i = 0; i < nr; i++) > pthread_create(&tids[i], NULL, update_fn, NULL); > > printf("wait for error\n"); > for (i = 0; i < nr; i++) > pthread_join(tids[i], NULL); > > printf("all threads exit\n"); > > free(tids); > close(fd); > > return 0; > } > > > > > Here is my theory, but please correct me if I'm wrong, I haven't > > tested yet. In non-RT, I doubt preemptions are likely to happen after > > migrate_disable. That is because very soon after migrate_disable, we > > enter the critical section of b->raw_lock with irq disabled. In RT, > > preemptions can happen on acquiring b->lock, that is certainly > > possible, but this is the !use_raw_lock path, which isn't side-stepped > > by this patch. 
> > > >>>> kernel/bpf/hashtab.c | 23 ++++++++++++++++++----- > >>>> 1 file changed, 18 insertions(+), 5 deletions(-) > >>>> > >>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c > >>>> index 6c530a5e560a..ad09da139589 100644 > >>>> --- a/kernel/bpf/hashtab.c > >>>> +++ b/kernel/bpf/hashtab.c > >>>> @@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab, > >>>> unsigned long *pflags) > >>>> { > >>>> unsigned long flags; > >>>> + bool use_raw_lock; > >>>> > >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; > >>>> > >>>> - migrate_disable(); > >>>> + use_raw_lock = htab_use_raw_lock(htab); > >>>> + if (use_raw_lock) > >>>> + preempt_disable(); > >>>> + else > >>>> + migrate_disable(); > >>>> if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) { > >>>> __this_cpu_dec(*(htab->map_locked[hash])); > >>>> - migrate_enable(); > >>>> + if (use_raw_lock) > >>>> + preempt_enable(); > >>>> + else > >>>> + migrate_enable(); > >>>> return -EBUSY; > >>>> } > >>>> > >>>> - if (htab_use_raw_lock(htab)) > >>>> + if (use_raw_lock) > >>>> raw_spin_lock_irqsave(&b->raw_lock, flags); > >>>> else > >>>> spin_lock_irqsave(&b->lock, flags); > >>>> @@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, > >>>> struct bucket *b, u32 hash, > >>>> unsigned long flags) > >>>> { > >>>> + bool use_raw_lock = htab_use_raw_lock(htab); > >>>> + > >>>> hash = hash & HASHTAB_MAP_LOCK_MASK; > >>>> - if (htab_use_raw_lock(htab)) > >>>> + if (use_raw_lock) > >>>> raw_spin_unlock_irqrestore(&b->raw_lock, flags); > >>>> else > >>>> spin_unlock_irqrestore(&b->lock, flags); > >>>> __this_cpu_dec(*(htab->map_locked[hash])); > >>>> - migrate_enable(); > >>>> + if (use_raw_lock) > >>>> + preempt_enable(); > >>>> + else > >>>> + migrate_enable(); > >>>> } > >>>> > >>>> static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); > >>>> -- > >>>> 2.29.2 > >>>> > >>> . > > . >
On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote: > > On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote: > > > > Hi, > > > > On 8/22/2022 11:21 AM, Hao Luo wrote: > > > Hi, Hou Tao > > > > > > On Sun, Aug 21, 2022 at 6:28 PM Hou Tao <houtao@huaweicloud.com> wrote: > > >> Hi, > > >> > > >> On 8/22/2022 12:42 AM, Hao Luo wrote: > > >>> Hi Hou Tao, > > >>> > > >>> On Sat, Aug 20, 2022 at 8:14 PM Hou Tao <houtao@huaweicloud.com> wrote: > > >>>> From: Hou Tao <houtao1@huawei.com> > > >>>> > > >>>> Per-cpu htab->map_locked is used to prohibit the concurrent accesses > > >>>> from both NMI and non-NMI contexts. But since 74d862b682f51 ("sched: > > >>>> Make migrate_disable/enable() independent of RT"), migrations_disable() > > >>>> is also preemptible under !PREEMPT_RT case, so now map_locked also > > >>>> disallows concurrent updates from normal contexts (e.g. userspace > > >>>> processes) unexpectedly as shown below: > > >>>> > > >>>> process A process B > > >>>> > > >>>> htab_map_update_elem() > > >>>> htab_lock_bucket() > > >>>> migrate_disable() > > >>>> /* return 1 */ > > >>>> __this_cpu_inc_return() > > >>>> /* preempted by B */ > > >>>> > > >>>> htab_map_update_elem() > > >>>> /* the same bucket as A */ > > >>>> htab_lock_bucket() > > >>>> migrate_disable() > > >>>> /* return 2, so lock fails */ > > >>>> __this_cpu_inc_return() > > >>>> return -EBUSY > > >>>> > > >>>> A fix that seems feasible is using in_nmi() in htab_lock_bucket() and > > >>>> only checking the value of map_locked for nmi context. But it will > > >>>> re-introduce dead-lock on bucket lock if htab_lock_bucket() is re-entered > > >>>> through non-tracing program (e.g. fentry program). > > >>>> > > >>>> So fixing it by using disable_preempt() instead of migrate_disable() when > > >>>> increasing htab->map_locked. 
However when htab_use_raw_lock() is false, > > >>>> bucket lock will be a sleepable spin-lock and it breaks disable_preempt(), > > >>>> so still use migrate_disable() for spin-lock case. > > >>>> > > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com> > > >>>> --- > > >>> IIUC, this patch enlarges the scope of preemption disable to cover inc > > >>> map_locked. But I don't think the change is meaningful. > > >> Before 74d862b682f51 ("sched: Make migrate_disable/enable() independent of > > >> RT"), the preemption is disabled before increasing map_locked for !PREEMPT_RT > > >> case, so I don't think that the change is meaningless. > > >>> This patch only affects the case when raw lock is used. In the case of > > >>> raw lock, irq is disabled for b->raw_lock protected critical section. > > >>> A raw spin lock itself doesn't block in both RT and non-RT. So, my > > >>> understanding about this patch is, it just makes sure preemption > > >>> doesn't happen on the exact __this_cpu_inc_return. But the window is > > >>> so small that it should be really unlikely to happen. > > >> No, it can be easily reproduced by running multiple htab update processes in the > > >> same CPU. Will add selftest to demonstrate that. > > > Can you clarify what you demonstrate? > > First please enable CONFIG_PREEMPT for the running kernel and then run the > > following test program as shown below. > > > > Ah, fully preemptive kernel. It's worth mentioning that in the commit > message. Then it seems promoting migrate_disable to preempt_disable > may be the best way to solve the problem you described. > > > # sudo taskset -c 2 ./update.bin > > thread nr 2 > > wait for error > > update error -16 > > all threads exit > > > > If there is no "update error -16", you can try to create more map update > > threads. 
For example running 16 update threads:
> >
> > # sudo taskset -c 2 ./update.bin 16
> > thread nr 16
> > wait for error
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > update error -16
> > all threads exit
> >
> > The following is the source code for update.bin:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdbool.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <string.h>
> > #include <errno.h>
> > #include <pthread.h>
> >
> > #include "bpf.h"
> > #include "libbpf.h"
> >
> > static bool stop;
> > static int fd;
> >
> > static void *update_fn(void *arg)
> > {
> > 	while (!stop) {
> > 		unsigned int key = 0, value = 1;
> > 		int err;
> >
> > 		err = bpf_map_update_elem(fd, &key, &value, 0);
> > 		if (err) {
> > 			printf("update error %d\n", err);
> > 			stop = true;
> > 			break;
> > 		}
> > 	}
> >
> > 	return NULL;
> > }
> >
> > int main(int argc, char **argv)
> > {
> > 	LIBBPF_OPTS(bpf_map_create_opts, opts);
> > 	unsigned int i, nr;
> > 	pthread_t *tids;
> >
> > 	nr = 2;
> > 	if (argc > 1)
> > 		nr = atoi(argv[1]);
> > 	printf("thread nr %u\n", nr);
> >
> > 	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> > 	fd = bpf_map_create(BPF_MAP_TYPE_HASH, "batch", 4, 4, 1, &opts);
> > 	if (fd < 0) {
> > 		printf("create map error %d\n", fd);
> > 		return 1;
> > 	}
> >
> > 	tids = malloc(nr * sizeof(*tids));
> > 	for (i = 0; i < nr; i++)
> > 		pthread_create(&tids[i], NULL, update_fn, NULL);
> >
> > 	printf("wait for error\n");
> > 	for (i = 0; i < nr; i++)
> > 		pthread_join(tids[i], NULL);
> >
> > 	printf("all threads exit\n");
> >
> > 	free(tids);
> > 	close(fd);
> >
> > 	return 0;
> > }

Tao, thanks very much for the test. I played it a bit and I can confirm that map_update failures are seen under CONFIG_PREEMPT. The failures are not present under CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was thinking of and they didn't work.
It looks like Hou Tao's idea, promoting migrate_disable to preempt_disable, is probably the best we can do for the non-RT case. So

Reviewed-by: Hao Luo <haoluo@google.com>

But, I am not sure if we want to get rid of preemption-caused batch map updates on preemptive kernels, or if there are better solutions to address [0].

Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.

> > > Here is my theory, but please correct me if I'm wrong, I haven't
> > > tested yet. In non-RT, I doubt preemptions are likely to happen after
> > > migrate_disable. That is because very soon after migrate_disable, we
> > > enter the critical section of b->raw_lock with irq disabled. In RT,
> > > preemptions can happen on acquiring b->lock, that is certainly
> > > possible, but this is the !use_raw_lock path, which isn't side-stepped
> > > by this patch.
> > >
> > >>>> SNIP
On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
> >
> > SNIP
>
> Tao, thanks very much for the test. I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So

preempt_disable is also faster than migrate_disable, so patch 1 will not only fix this issue, but will improve performance.

Patch 2 is too hacky though. I think it's better to wait until my bpf_mem_alloc patches land. RT case won't be special anymore. We will be able to remove htab_use_raw_lock() helper and unconditionally use raw_spin_lock. With bpf_mem_alloc there is no inline memory allocation anymore.

So please address Hao's comments, add a test and resubmit patches 1 and 3. Also please use [PATCH bpf-next] in the subject to help BPF CI and patchwork scripts.
Hi,

On 8/23/2022 8:56 AM, Hao Luo wrote:
> On Mon, Aug 22, 2022 at 11:01 AM Hao Luo <haoluo@google.com> wrote:
>> On Mon, Aug 22, 2022 at 5:08 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>> SNIP
>>>>>>> First please enable CONFIG_PREEMPT for the running kernel and then run the
>>>>>>> following test program as shown below.
>>>>>>>
>> Ah, fully preemptive kernel. It's worth mentioning that in the commit
>> message. Then it seems promoting migrate_disable to preempt_disable
>> may be the best way to solve the problem you described.
Sorry, I missed that. Will add in v2.
>>
>>> # sudo taskset -c 2 ./update.bin
>>> thread nr 2
>>> wait for error
>>> update error -16
>>> all threads exit
>>>
>>> If there is no "update error -16", you can try to create more map update
>>> threads. For example running 16 update threads:
>>>
>>> # sudo taskset -c 2 ./update.bin 16
>>> thread nr 16
>>> wait for error
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> update error -16
>>> all threads exit
SNIP
> Tao, thanks very much for the test.
I played it a bit and I can
> confirm that map_update failures are seen under CONFIG_PREEMPT. The
> failures are not present under CONFIG_PREEMPT_NONE or
> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
> thinking of and they didn't work. It looks like Hou Tao's idea,
> promoting migrate_disable to preempt_disable, is probably the best we
> can do for the non-RT case. So
>
> Reviewed-by: Hao Luo <haoluo@google.com>
>
> But, I am not sure if we want to get rid of preemption-caused batch
> map updates on preemptive kernels, or if there are better solutions to
> address [0].
Thanks for your review. Under a preemptive kernel, if preemption is disabled during batch map lookup or update, htab_lock_bucket() from userspace will not fail, so I think it is OK for now.
>
> Tao, could you mention CONFIG_PREEMPT in your commit message? Thanks.
Will do.
>>>> Here is my theory, but please correct me if I'm wrong, I haven't
>>>> tested yet. In non-RT, I doubt preemptions are likely to happen after
>>>> migrate_disable. That is because very soon after migrate_disable, we
>>>> enter the critical section of b->raw_lock with irq disabled. In RT,
>>>> preemptions can happen on acquiring b->lock, that is certainly
>>>> possible, but this is the !use_raw_lock path, which isn't side-stepped
>>>> by this patch.
>>>>
>>>>>>> SNIP
Hi,

On 8/23/2022 9:29 AM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 5:56 PM Hao Luo <haoluo@google.com> wrote:
>> SNIP
>> Tao, thanks very much for the test. I played it a bit and I can
>> confirm that map_update failures are seen under CONFIG_PREEMPT. The
>> failures are not present under CONFIG_PREEMPT_NONE or
>> CONFIG_PREEMPT_VOLUNTARY. I experimented with a few alternatives I was
>> thinking of and they didn't work. It looks like Hou Tao's idea,
>> promoting migrate_disable to preempt_disable, is probably the best we
>> can do for the non-RT case. So
> preempt_disable is also faster than migrate_disable,
> so patch 1 will not only fix this issue, but will improve performance.
>
> Patch 2 is too hacky though.
> I think it's better to wait until my bpf_mem_alloc patches land.
> RT case won't be special anymore. We will be able to remove
> htab_use_raw_lock() helper and unconditionally use raw_spin_lock.
> With bpf_mem_alloc there is no inline memory allocation anymore.
OK. Looking forward to the landing of the BPF-specific memory allocator.
>
> So please address Hao's comments, add a test and
> resubmit patches 1 and 3.
> Also please use [PATCH bpf-next] in the subject to help BPF CI
> and patchwork scripts.
Will do. And to bpf-next instead of bpf?
On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> SNIP
> Will do. And to bpf-next instead of bpf ?

bpf-next is almost always preferred for fixes for corner cases that have been around for some time. The bpf tree is for security and high-priority fixes. bpf-next gives time for fixes to prove themselves.
Hi,

On 8/23/2022 12:50 PM, Alexei Starovoitov wrote:
> On Mon, Aug 22, 2022 at 7:57 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> SNIP
>>> So please address Hao's comments, add a test and
>>> resubmit patches 1 and 3.
>>> Also please use [PATCH bpf-next] in the subject to help BPF CI
>>> and patchwork scripts.
>> Will do. And to bpf-next instead of bpf ?
> bpf-next is almost always preferred for fixes for corner
> cases that have been around for some time.
> bpf tree is for security and high pri fixes.
> bpf-next gives time for fixes to prove themselves.
Understood. Thanks for the explanation.
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6c530a5e560a..ad09da139589 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -162,17 +162,25 @@ static inline int htab_lock_bucket(const struct bpf_htab *htab,
 				   unsigned long *pflags)
 {
 	unsigned long flags;
+	bool use_raw_lock;
 
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
 
-	migrate_disable();
+	use_raw_lock = htab_use_raw_lock(htab);
+	if (use_raw_lock)
+		preempt_disable();
+	else
+		migrate_disable();
 	if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
 		__this_cpu_dec(*(htab->map_locked[hash]));
-		migrate_enable();
+		if (use_raw_lock)
+			preempt_enable();
+		else
+			migrate_enable();
 		return -EBUSY;
 	}
 
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_lock_irqsave(&b->raw_lock, flags);
 	else
 		spin_lock_irqsave(&b->lock, flags);
@@ -185,13 +193,18 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab,
 				      struct bucket *b, u32 hash,
 				      unsigned long flags)
 {
+	bool use_raw_lock = htab_use_raw_lock(htab);
+
 	hash = hash & HASHTAB_MAP_LOCK_MASK;
-	if (htab_use_raw_lock(htab))
+	if (use_raw_lock)
 		raw_spin_unlock_irqrestore(&b->raw_lock, flags);
 	else
 		spin_unlock_irqrestore(&b->lock, flags);
 	__this_cpu_dec(*(htab->map_locked[hash]));
-	migrate_enable();
+	if (use_raw_lock)
+		preempt_enable();
+	else
+		migrate_enable();
 }
 
 static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node);