| Message ID | 20230822133807.3198625-2-houtao@huaweicloud.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | BPF |
| Series | bpf: Enable preemption after irq_work_raise() completes |
On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: > > From: Hou Tao <houtao1@huawei.com> > > When doing stress test for qp-trie, bpf_mem_alloc() returned NULL > unexpectedly because all qp-trie operations were initiated from > bpf syscalls and there was still available free memory. bpf_obj_new() > has the same problem as shown by the following selftest. > > The failure is due to the preemption. irq_work_raise() will invoke > irq_work_claim() first to mark the irq work as pending and then inovke > __irq_work_queue_local() to raise an IPI. So when the current task > which is invoking irq_work_raise() is preempted by other task, > unit_alloc() may return NULL for preemptive task as shown below: > > task A task B > > unit_alloc() > // low_watermark = 32 > // free_cnt = 31 after alloc > irq_work_raise() > // mark irq work as IRQ_WORK_PENDING > irq_work_claim() > > // task B preempts task A > unit_alloc() > // free_cnt = 30 after alloc > // irq work is already PENDING, > // so just return > irq_work_raise() > // does unit_alloc() 30-times > ...... > unit_alloc() > // free_cnt = 0 before alloc > return NULL > > Fix it by invoking preempt_disable_notrace() before allocation and > invoking preempt_enable_notrace() to enable preemption after > irq_work_raise() completes. An alternative fix is to move > local_irq_restore() after the invocation of irq_work_raise(), but it > will enlarge the irq-disabled region. Another feasible fix is to only > disable preemption before invoking irq_work_queue() and enable > preemption after the invocation in irq_work_raise(), but it can't > handle the case when c->low_watermark is 1. > > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > kernel/bpf/memalloc.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > index 9c49ae53deaf..83f8913ebb0a 100644 > --- a/kernel/bpf/memalloc.c > +++ b/kernel/bpf/memalloc.c > @@ -6,6 +6,7 @@ > #include <linux/irq_work.h> > #include <linux/bpf_mem_alloc.h> > #include <linux/memcontrol.h> > +#include <linux/preempt.h> > #include <asm/local.h> > > /* Any context (including NMI) BPF specific memory allocator. > @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > * Use per-cpu 'active' counter to order free_list access between > * unit_alloc/unit_free/bpf_mem_refill. > */ > + preempt_disable_notrace(); > local_irq_save(flags); > if (local_inc_return(&c->active) == 1) { > llnode = __llist_del_first(&c->free_llist); > @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > > if (cnt < c->low_watermark) > (c); > + /* Enable preemption after the enqueue of irq work completes, > + * so free_llist may be refilled by irq work before other task > + * preempts current task. > + */ > + preempt_enable_notrace(); So this helps qp-trie init, since it's doing bpf_mem_alloc from syscall context and helps bpf_obj_new from bpf prog, since prog is non-migrateable, but preemptable. It's not an issue for htab doing during map_update, since it's under htab bucket lock. Let's introduce minimal: /* big comment here explaining the reason of extra preempt disable */ static void bpf_memalloc_irq_work_raise(...) { preempt_disable_notrace(); irq_work_raise(); preempt_enable_notrace(); } it will have the same effect, right?
Hi, On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: > On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL >> unexpectedly because all qp-trie operations were initiated from >> bpf syscalls and there was still available free memory. bpf_obj_new() >> has the same problem as shown by the following selftest. >> >> The failure is due to the preemption. irq_work_raise() will invoke >> irq_work_claim() first to mark the irq work as pending and then inovke >> __irq_work_queue_local() to raise an IPI. So when the current task >> which is invoking irq_work_raise() is preempted by other task, >> unit_alloc() may return NULL for preemptive task as shown below: >> >> task A task B >> >> unit_alloc() >> // low_watermark = 32 >> // free_cnt = 31 after alloc >> irq_work_raise() >> // mark irq work as IRQ_WORK_PENDING >> irq_work_claim() >> >> // task B preempts task A >> unit_alloc() >> // free_cnt = 30 after alloc >> // irq work is already PENDING, >> // so just return >> irq_work_raise() >> // does unit_alloc() 30-times >> ...... >> unit_alloc() >> // free_cnt = 0 before alloc >> return NULL >> >> Fix it by invoking preempt_disable_notrace() before allocation and >> invoking preempt_enable_notrace() to enable preemption after >> irq_work_raise() completes. An alternative fix is to move >> local_irq_restore() after the invocation of irq_work_raise(), but it >> will enlarge the irq-disabled region. Another feasible fix is to only >> disable preemption before invoking irq_work_queue() and enable >> preemption after the invocation in irq_work_raise(), but it can't >> handle the case when c->low_watermark is 1. >> >> Signed-off-by: Hou Tao <houtao1@huawei.com> >> --- >> kernel/bpf/memalloc.c | 8 ++++++++ >> 1 file changed, 8 insertions(+) >> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >> index 9c49ae53deaf..83f8913ebb0a 100644 >> --- a/kernel/bpf/memalloc.c >> +++ b/kernel/bpf/memalloc.c >> @@ -6,6 +6,7 @@ >> #include <linux/irq_work.h> >> #include <linux/bpf_mem_alloc.h> >> #include <linux/memcontrol.h> >> +#include <linux/preempt.h> >> #include <asm/local.h> >> >> /* Any context (including NMI) BPF specific memory allocator. >> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >> * Use per-cpu 'active' counter to order free_list access between >> * unit_alloc/unit_free/bpf_mem_refill. >> */ >> + preempt_disable_notrace(); >> local_irq_save(flags); >> if (local_inc_return(&c->active) == 1) { >> llnode = __llist_del_first(&c->free_llist); >> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >> >> if (cnt < c->low_watermark) >> (c); >> + /* Enable preemption after the enqueue of irq work completes, >> + * so free_llist may be refilled by irq work before other task >> + * preempts current task. >> + */ >> + preempt_enable_notrace(); > So this helps qp-trie init, since it's doing bpf_mem_alloc from > syscall context and helps bpf_obj_new from bpf prog, since prog is > non-migrateable, but preemptable. It's not an issue for htab doing > during map_update, since > it's under htab bucket lock. > Let's introduce minimal: > /* big comment here explaining the reason of extra preempt disable */ > static void bpf_memalloc_irq_work_raise(...) > { > preempt_disable_notrace(); > irq_work_raise(); > preempt_enable_notrace(); > } > > it will have the same effect, right? > . No. 
As I said in the commit message, when c->low_watermark is 1, the above fix doesn't work, as shown below:

task A                          task B

unit_alloc()
  // low_watermark = 1
  // free_cnt = 0 after alloc

                                // task B preempts task A
                                unit_alloc()
                                  // free_cnt = 0 before alloc
                                  return NULL

irq_work_raise()
On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: > > On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: > >> From: Hou Tao <houtao1@huawei.com> > >> > >> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL > >> unexpectedly because all qp-trie operations were initiated from > >> bpf syscalls and there was still available free memory. bpf_obj_new() > >> has the same problem as shown by the following selftest. > >> > >> The failure is due to the preemption. irq_work_raise() will invoke > >> irq_work_claim() first to mark the irq work as pending and then inovke > >> __irq_work_queue_local() to raise an IPI. So when the current task > >> which is invoking irq_work_raise() is preempted by other task, > >> unit_alloc() may return NULL for preemptive task as shown below: > >> > >> task A task B > >> > >> unit_alloc() > >> // low_watermark = 32 > >> // free_cnt = 31 after alloc > >> irq_work_raise() > >> // mark irq work as IRQ_WORK_PENDING > >> irq_work_claim() > >> > >> // task B preempts task A > >> unit_alloc() > >> // free_cnt = 30 after alloc > >> // irq work is already PENDING, > >> // so just return > >> irq_work_raise() > >> // does unit_alloc() 30-times > >> ...... > >> unit_alloc() > >> // free_cnt = 0 before alloc > >> return NULL > >> > >> Fix it by invoking preempt_disable_notrace() before allocation and > >> invoking preempt_enable_notrace() to enable preemption after > >> irq_work_raise() completes. An alternative fix is to move > >> local_irq_restore() after the invocation of irq_work_raise(), but it > >> will enlarge the irq-disabled region. Another feasible fix is to only > >> disable preemption before invoking irq_work_queue() and enable > >> preemption after the invocation in irq_work_raise(), but it can't > >> handle the case when c->low_watermark is 1. > >> > >> Signed-off-by: Hou Tao <houtao1@huawei.com> > >> --- > >> kernel/bpf/memalloc.c | 8 ++++++++ > >> 1 file changed, 8 insertions(+) > >> > >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >> index 9c49ae53deaf..83f8913ebb0a 100644 > >> --- a/kernel/bpf/memalloc.c > >> +++ b/kernel/bpf/memalloc.c > >> @@ -6,6 +6,7 @@ > >> #include <linux/irq_work.h> > >> #include <linux/bpf_mem_alloc.h> > >> #include <linux/memcontrol.h> > >> +#include <linux/preempt.h> > >> #include <asm/local.h> > >> > >> /* Any context (including NMI) BPF specific memory allocator. > >> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >> * Use per-cpu 'active' counter to order free_list access between > >> * unit_alloc/unit_free/bpf_mem_refill. > >> */ > >> + preempt_disable_notrace(); > >> local_irq_save(flags); > >> if (local_inc_return(&c->active) == 1) { > >> llnode = __llist_del_first(&c->free_llist); > >> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >> > >> if (cnt < c->low_watermark) > >> (c); > >> + /* Enable preemption after the enqueue of irq work completes, > >> + * so free_llist may be refilled by irq work before other task > >> + * preempts current task. > >> + */ > >> + preempt_enable_notrace(); > > So this helps qp-trie init, since it's doing bpf_mem_alloc from > > syscall context and helps bpf_obj_new from bpf prog, since prog is > > non-migrateable, but preemptable. It's not an issue for htab doing > > during map_update, since > > it's under htab bucket lock. 
> > Let's introduce minimal:
> > /* big comment here explaining the reason of extra preempt disable */
> > static void bpf_memalloc_irq_work_raise(...)
> > {
> >         preempt_disable_notrace();
> >         irq_work_raise();
> >         preempt_enable_notrace();
> > }
> >
> > it will have the same effect, right?
> > .
>
> No. As I said in commit message, when c->low_watermark is 1, the above
> fix doesn't work as shown below:

Yes. I got the mark=1 part. I just don't think it's worth the complexity.
Hi, On 8/23/2023 9:57 AM, Alexei Starovoitov wrote: > On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: >>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: >>>> From: Hou Tao <houtao1@huawei.com> >>>> >>>> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL >>>> unexpectedly because all qp-trie operations were initiated from >>>> bpf syscalls and there was still available free memory. bpf_obj_new() >>>> has the same problem as shown by the following selftest. >>>> >>>> The failure is due to the preemption. irq_work_raise() will invoke >>>> irq_work_claim() first to mark the irq work as pending and then inovke >>>> __irq_work_queue_local() to raise an IPI. So when the current task >>>> which is invoking irq_work_raise() is preempted by other task, >>>> unit_alloc() may return NULL for preemptive task as shown below: >>>> >>>> task A task B >>>> >>>> unit_alloc() >>>> // low_watermark = 32 >>>> // free_cnt = 31 after alloc >>>> irq_work_raise() >>>> // mark irq work as IRQ_WORK_PENDING >>>> irq_work_claim() >>>> >>>> // task B preempts task A >>>> unit_alloc() >>>> // free_cnt = 30 after alloc >>>> // irq work is already PENDING, >>>> // so just return >>>> irq_work_raise() >>>> // does unit_alloc() 30-times >>>> ...... >>>> unit_alloc() >>>> // free_cnt = 0 before alloc >>>> return NULL >>>> >>>> Fix it by invoking preempt_disable_notrace() before allocation and >>>> invoking preempt_enable_notrace() to enable preemption after >>>> irq_work_raise() completes. An alternative fix is to move >>>> local_irq_restore() after the invocation of irq_work_raise(), but it >>>> will enlarge the irq-disabled region. Another feasible fix is to only >>>> disable preemption before invoking irq_work_queue() and enable >>>> preemption after the invocation in irq_work_raise(), but it can't >>>> handle the case when c->low_watermark is 1. >>>> >>>> Signed-off-by: Hou Tao <houtao1@huawei.com> >>>> --- >>>> kernel/bpf/memalloc.c | 8 ++++++++ >>>> 1 file changed, 8 insertions(+) >>>> >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>>> index 9c49ae53deaf..83f8913ebb0a 100644 >>>> --- a/kernel/bpf/memalloc.c >>>> +++ b/kernel/bpf/memalloc.c >>>> @@ -6,6 +6,7 @@ >>>> #include <linux/irq_work.h> >>>> #include <linux/bpf_mem_alloc.h> >>>> #include <linux/memcontrol.h> >>>> +#include <linux/preempt.h> >>>> #include <asm/local.h> >>>> >>>> /* Any context (including NMI) BPF specific memory allocator. >>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>> * Use per-cpu 'active' counter to order free_list access between >>>> * unit_alloc/unit_free/bpf_mem_refill. >>>> */ >>>> + preempt_disable_notrace(); >>>> local_irq_save(flags); >>>> if (local_inc_return(&c->active) == 1) { >>>> llnode = __llist_del_first(&c->free_llist); >>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>> >>>> if (cnt < c->low_watermark) >>>> (c); >>>> + /* Enable preemption after the enqueue of irq work completes, >>>> + * so free_llist may be refilled by irq work before other task >>>> + * preempts current task. >>>> + */ >>>> + preempt_enable_notrace(); >>> So this helps qp-trie init, since it's doing bpf_mem_alloc from >>> syscall context and helps bpf_obj_new from bpf prog, since prog is >>> non-migrateable, but preemptable. It's not an issue for htab doing >>> during map_update, since >>> it's under htab bucket lock. 
>>> Let's introduce minimal:
>>> /* big comment here explaining the reason of extra preempt disable */
>>> static void bpf_memalloc_irq_work_raise(...)
>>> {
>>>         preempt_disable_notrace();
>>>         irq_work_raise();
>>>         preempt_enable_notrace();
>>> }
>>>
>>> it will have the same effect, right?
>>> .
>> No. As I said in commit message, when c->low_watermark is 1, the above
>> fix doesn't work as shown below:
> Yes. I got mark=1 part. I just don't think it's worth the complexity.

I just found out that for bpf_obj_new() the minimal low_watermark is 2
instead of 1 (unit_size = 4096 instead of 4096 + 8). But even with
low_watermark as 2, the above fix may not work when there is nested
preemption: task A (free_cnt = 1 after alloc) -> preempted by task B
(free_cnt = 0 after alloc) -> preempted by task C (fails to allocate), as
sketched below. And in my naive understanding of the bpf memory allocator,
these fixes are simple. Why do you think they will introduce extra
complexity? Do you mean preempt_disable_notrace() could be used to trigger
the running of a bpf program? If that is the problem, I think we should fix
it instead.
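A sketch of that nested-preemption case, in the same style as the trace in
the commit message (it assumes low_watermark = 2 and that each preemption
happens before the preempted task reaches irq_work_raise()):

task A                     task B                     task C

unit_alloc()
  // low_watermark = 2
  // free_cnt = 1 after alloc

                           // task B preempts task A
                           unit_alloc()
                             // free_cnt = 0 after alloc

                                                      // task C preempts task B
                                                      unit_alloc()
                                                        // free_cnt = 0 before alloc
                                                        return NULL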
On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 8/23/2023 9:57 AM, Alexei Starovoitov wrote: > > On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hi, > >> > >> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: > >>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: > >>>> From: Hou Tao <houtao1@huawei.com> > >>>> > >>>> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL > >>>> unexpectedly because all qp-trie operations were initiated from > >>>> bpf syscalls and there was still available free memory. bpf_obj_new() > >>>> has the same problem as shown by the following selftest. > >>>> > >>>> The failure is due to the preemption. irq_work_raise() will invoke > >>>> irq_work_claim() first to mark the irq work as pending and then inovke > >>>> __irq_work_queue_local() to raise an IPI. So when the current task > >>>> which is invoking irq_work_raise() is preempted by other task, > >>>> unit_alloc() may return NULL for preemptive task as shown below: > >>>> > >>>> task A task B > >>>> > >>>> unit_alloc() > >>>> // low_watermark = 32 > >>>> // free_cnt = 31 after alloc > >>>> irq_work_raise() > >>>> // mark irq work as IRQ_WORK_PENDING > >>>> irq_work_claim() > >>>> > >>>> // task B preempts task A > >>>> unit_alloc() > >>>> // free_cnt = 30 after alloc > >>>> // irq work is already PENDING, > >>>> // so just return > >>>> irq_work_raise() > >>>> // does unit_alloc() 30-times > >>>> ...... > >>>> unit_alloc() > >>>> // free_cnt = 0 before alloc > >>>> return NULL > >>>> > >>>> Fix it by invoking preempt_disable_notrace() before allocation and > >>>> invoking preempt_enable_notrace() to enable preemption after > >>>> irq_work_raise() completes. An alternative fix is to move > >>>> local_irq_restore() after the invocation of irq_work_raise(), but it > >>>> will enlarge the irq-disabled region. Another feasible fix is to only > >>>> disable preemption before invoking irq_work_queue() and enable > >>>> preemption after the invocation in irq_work_raise(), but it can't > >>>> handle the case when c->low_watermark is 1. > >>>> > >>>> Signed-off-by: Hou Tao <houtao1@huawei.com> > >>>> --- > >>>> kernel/bpf/memalloc.c | 8 ++++++++ > >>>> 1 file changed, 8 insertions(+) > >>>> > >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >>>> index 9c49ae53deaf..83f8913ebb0a 100644 > >>>> --- a/kernel/bpf/memalloc.c > >>>> +++ b/kernel/bpf/memalloc.c > >>>> @@ -6,6 +6,7 @@ > >>>> #include <linux/irq_work.h> > >>>> #include <linux/bpf_mem_alloc.h> > >>>> #include <linux/memcontrol.h> > >>>> +#include <linux/preempt.h> > >>>> #include <asm/local.h> > >>>> > >>>> /* Any context (including NMI) BPF specific memory allocator. > >>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >>>> * Use per-cpu 'active' counter to order free_list access between > >>>> * unit_alloc/unit_free/bpf_mem_refill. > >>>> */ > >>>> + preempt_disable_notrace(); > >>>> local_irq_save(flags); > >>>> if (local_inc_return(&c->active) == 1) { > >>>> llnode = __llist_del_first(&c->free_llist); > >>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >>>> > >>>> if (cnt < c->low_watermark) > >>>> (c); > >>>> + /* Enable preemption after the enqueue of irq work completes, > >>>> + * so free_llist may be refilled by irq work before other task > >>>> + * preempts current task. 
> >>>> + */ > >>>> + preempt_enable_notrace(); > >>> So this helps qp-trie init, since it's doing bpf_mem_alloc from > >>> syscall context and helps bpf_obj_new from bpf prog, since prog is > >>> non-migrateable, but preemptable. It's not an issue for htab doing > >>> during map_update, since > >>> it's under htab bucket lock. > >>> Let's introduce minimal: > >>> /* big comment here explaining the reason of extra preempt disable */ > >>> static void bpf_memalloc_irq_work_raise(...) > >>> { > >>> preempt_disable_notrace(); > >>> irq_work_raise(); > >>> preempt_enable_notrace(); > >>> } > >>> > >>> it will have the same effect, right? > >>> . > >> No. As I said in commit message, when c->low_watermark is 1, the above > >> fix doesn't work as shown below: > > Yes. I got mark=1 part. I just don't think it's worth the complexity. > > Just find out that for bpf_obj_new() the minimal low_watermark is 2 > instead of 1 (unit_size= 4096 instead of 4096 + 8). But even with > low_watermark as 2, the above fix may don't work when there are nested > preemption: task A (free_cnt = 1 after alloc) -> preempted by task B > (free_cnt = 0 after alloc) -> preempted by task C (fail to do > allocation). And in my naive understanding of bpf memory allocate, these > fixes are simple. Why do you think it will introduce extra complexity ? > Do you mean preempt_disable_notrace() could be used to trigger the > running of bpf program ? If it is the problem, I think we should fix it > instead. I'm not worried about recursive calls from _notrace(). That shouldn't be possible. I'm just saying that disabling preemption around irq_work_raise() helps a bit while disable around the whole unit_alloc/free is a snake oil. bpf prog could be running in irq disabled context and preempt disabled unit_alloc vs irq_work_raise won't make any difference. Both will return NULL. Same with batched htab update. It will hit NULL too. So from my pov you're trying to fix something that is not fixable. Batched alloc will fail. The users have to use bpf_ma differently. In places where they cannot afford alloc failures they need to pre-allocate. We were planning to introduce bpf_obj_new() that does kmalloc when bpf prog is sleepable. Then bpf prog can stash such manually pre-allocated object and use it later.
Hi, On 8/24/2023 12:33 AM, Alexei Starovoitov wrote: > On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote: >>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>> Hi, >>>> >>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: >>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>> From: Hou Tao <houtao1@huawei.com> >>>>>> >>>>>> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL >>>>>> unexpectedly because all qp-trie operations were initiated from >>>>>> bpf syscalls and there was still available free memory. bpf_obj_new() >>>>>> has the same problem as shown by the following selftest. >>>>>> >>>>>> The failure is due to the preemption. irq_work_raise() will invoke >>>>>> irq_work_claim() first to mark the irq work as pending and then inovke >>>>>> __irq_work_queue_local() to raise an IPI. So when the current task >>>>>> which is invoking irq_work_raise() is preempted by other task, >>>>>> unit_alloc() may return NULL for preemptive task as shown below: >>>>>> >>>>>> task A task B >>>>>> >>>>>> unit_alloc() >>>>>> // low_watermark = 32 >>>>>> // free_cnt = 31 after alloc >>>>>> irq_work_raise() >>>>>> // mark irq work as IRQ_WORK_PENDING >>>>>> irq_work_claim() >>>>>> >>>>>> // task B preempts task A >>>>>> unit_alloc() >>>>>> // free_cnt = 30 after alloc >>>>>> // irq work is already PENDING, >>>>>> // so just return >>>>>> irq_work_raise() >>>>>> // does unit_alloc() 30-times >>>>>> ...... >>>>>> unit_alloc() >>>>>> // free_cnt = 0 before alloc >>>>>> return NULL >>>>>> >>>>>> Fix it by invoking preempt_disable_notrace() before allocation and >>>>>> invoking preempt_enable_notrace() to enable preemption after >>>>>> irq_work_raise() completes. An alternative fix is to move >>>>>> local_irq_restore() after the invocation of irq_work_raise(), but it >>>>>> will enlarge the irq-disabled region. Another feasible fix is to only >>>>>> disable preemption before invoking irq_work_queue() and enable >>>>>> preemption after the invocation in irq_work_raise(), but it can't >>>>>> handle the case when c->low_watermark is 1. >>>>>> >>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com> >>>>>> --- >>>>>> kernel/bpf/memalloc.c | 8 ++++++++ >>>>>> 1 file changed, 8 insertions(+) >>>>>> >>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>>>>> index 9c49ae53deaf..83f8913ebb0a 100644 >>>>>> --- a/kernel/bpf/memalloc.c >>>>>> +++ b/kernel/bpf/memalloc.c >>>>>> @@ -6,6 +6,7 @@ >>>>>> #include <linux/irq_work.h> >>>>>> #include <linux/bpf_mem_alloc.h> >>>>>> #include <linux/memcontrol.h> >>>>>> +#include <linux/preempt.h> >>>>>> #include <asm/local.h> >>>>>> >>>>>> /* Any context (including NMI) BPF specific memory allocator. >>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>>>> * Use per-cpu 'active' counter to order free_list access between >>>>>> * unit_alloc/unit_free/bpf_mem_refill. >>>>>> */ >>>>>> + preempt_disable_notrace(); >>>>>> local_irq_save(flags); >>>>>> if (local_inc_return(&c->active) == 1) { >>>>>> llnode = __llist_del_first(&c->free_llist); >>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>>>> >>>>>> if (cnt < c->low_watermark) >>>>>> (c); >>>>>> + /* Enable preemption after the enqueue of irq work completes, >>>>>> + * so free_llist may be refilled by irq work before other task >>>>>> + * preempts current task. 
>>>>>> + */ >>>>>> + preempt_enable_notrace(); >>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from >>>>> syscall context and helps bpf_obj_new from bpf prog, since prog is >>>>> non-migrateable, but preemptable. It's not an issue for htab doing >>>>> during map_update, since >>>>> it's under htab bucket lock. >>>>> Let's introduce minimal: >>>>> /* big comment here explaining the reason of extra preempt disable */ >>>>> static void bpf_memalloc_irq_work_raise(...) >>>>> { >>>>> preempt_disable_notrace(); >>>>> irq_work_raise(); >>>>> preempt_enable_notrace(); >>>>> } >>>>> >>>>> it will have the same effect, right? >>>>> . >>>> No. As I said in commit message, when c->low_watermark is 1, the above >>>> fix doesn't work as shown below: >>> Yes. I got mark=1 part. I just don't think it's worth the complexity. >> Just find out that for bpf_obj_new() the minimal low_watermark is 2 >> instead of 1 (unit_size= 4096 instead of 4096 + 8). But even with >> low_watermark as 2, the above fix may don't work when there are nested >> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B >> (free_cnt = 0 after alloc) -> preempted by task C (fail to do >> allocation). And in my naive understanding of bpf memory allocate, these >> fixes are simple. Why do you think it will introduce extra complexity ? >> Do you mean preempt_disable_notrace() could be used to trigger the >> running of bpf program ? If it is the problem, I think we should fix it >> instead. > I'm not worried about recursive calls from _notrace(). That shouldn't > be possible. OK > I'm just saying that disabling preemption around irq_work_raise() helps a bit > while disable around the whole unit_alloc/free is a snake oil. > bpf prog could be running in irq disabled context and preempt disabled > unit_alloc vs irq_work_raise won't make any difference. Both will return NULL. > Same with batched htab update. It will hit NULL too. > So from my pov you're trying to fix something that is not fixable. The patch set didn't try to fix the problem for all possible context, especially the irq disable context. It just tries to fix the ENOMEM problem for process context which is the major context. I still think disabling preemption around the whole unit_alloc/free is much solider than just do that for irq_work_raise() (e.g., for the nested preemption case). But if you have a strong preference for only disabling preemption for irq_work_raise(), I will post v2 to do that. > Batched alloc will fail. The users have to use bpf_ma differently. At least under x86-64 and process context, I don't think batched alloc will fail as shown in the selftest in which it tries to allocate 512 4096-sized objects in a bpf_loop. And it also didn't fail when I changed bpf_mem_free() to bpf_mem_free_rcu() in bpf_obj_drop() to disable immediate reuse. > In places where they cannot afford alloc failures they need to pre-allocate. > We were planning to introduce bpf_obj_new() that does kmalloc when > bpf prog is sleepable. Then bpf prog can stash such manually pre-allocated > object and use it later. I also thought about the problem when written the patch set. Maybe we could introduce bpf_mem_preload() or something similar to do that. For non-sleepable bpf program, maybe we need to introduce some userspace APIs to do pre-allocation.
On Thu, Aug 24, 2023 at 7:07 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 8/24/2023 12:33 AM, Alexei Starovoitov wrote: > > On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hi, > >> > >> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote: > >>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: > >>>> Hi, > >>>> > >>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: > >>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: > >>>>>> From: Hou Tao <houtao1@huawei.com> > >>>>>> > >>>>>> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL > >>>>>> unexpectedly because all qp-trie operations were initiated from > >>>>>> bpf syscalls and there was still available free memory. bpf_obj_new() > >>>>>> has the same problem as shown by the following selftest. > >>>>>> > >>>>>> The failure is due to the preemption. irq_work_raise() will invoke > >>>>>> irq_work_claim() first to mark the irq work as pending and then inovke > >>>>>> __irq_work_queue_local() to raise an IPI. So when the current task > >>>>>> which is invoking irq_work_raise() is preempted by other task, > >>>>>> unit_alloc() may return NULL for preemptive task as shown below: > >>>>>> > >>>>>> task A task B > >>>>>> > >>>>>> unit_alloc() > >>>>>> // low_watermark = 32 > >>>>>> // free_cnt = 31 after alloc > >>>>>> irq_work_raise() > >>>>>> // mark irq work as IRQ_WORK_PENDING > >>>>>> irq_work_claim() > >>>>>> > >>>>>> // task B preempts task A > >>>>>> unit_alloc() > >>>>>> // free_cnt = 30 after alloc > >>>>>> // irq work is already PENDING, > >>>>>> // so just return > >>>>>> irq_work_raise() > >>>>>> // does unit_alloc() 30-times > >>>>>> ...... > >>>>>> unit_alloc() > >>>>>> // free_cnt = 0 before alloc > >>>>>> return NULL > >>>>>> > >>>>>> Fix it by invoking preempt_disable_notrace() before allocation and > >>>>>> invoking preempt_enable_notrace() to enable preemption after > >>>>>> irq_work_raise() completes. An alternative fix is to move > >>>>>> local_irq_restore() after the invocation of irq_work_raise(), but it > >>>>>> will enlarge the irq-disabled region. Another feasible fix is to only > >>>>>> disable preemption before invoking irq_work_queue() and enable > >>>>>> preemption after the invocation in irq_work_raise(), but it can't > >>>>>> handle the case when c->low_watermark is 1. > >>>>>> > >>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com> > >>>>>> --- > >>>>>> kernel/bpf/memalloc.c | 8 ++++++++ > >>>>>> 1 file changed, 8 insertions(+) > >>>>>> > >>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >>>>>> index 9c49ae53deaf..83f8913ebb0a 100644 > >>>>>> --- a/kernel/bpf/memalloc.c > >>>>>> +++ b/kernel/bpf/memalloc.c > >>>>>> @@ -6,6 +6,7 @@ > >>>>>> #include <linux/irq_work.h> > >>>>>> #include <linux/bpf_mem_alloc.h> > >>>>>> #include <linux/memcontrol.h> > >>>>>> +#include <linux/preempt.h> > >>>>>> #include <asm/local.h> > >>>>>> > >>>>>> /* Any context (including NMI) BPF specific memory allocator. > >>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >>>>>> * Use per-cpu 'active' counter to order free_list access between > >>>>>> * unit_alloc/unit_free/bpf_mem_refill. 
> >>>>>> */ > >>>>>> + preempt_disable_notrace(); > >>>>>> local_irq_save(flags); > >>>>>> if (local_inc_return(&c->active) == 1) { > >>>>>> llnode = __llist_del_first(&c->free_llist); > >>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) > >>>>>> > >>>>>> if (cnt < c->low_watermark) > >>>>>> (c); > >>>>>> + /* Enable preemption after the enqueue of irq work completes, > >>>>>> + * so free_llist may be refilled by irq work before other task > >>>>>> + * preempts current task. > >>>>>> + */ > >>>>>> + preempt_enable_notrace(); > >>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from > >>>>> syscall context and helps bpf_obj_new from bpf prog, since prog is > >>>>> non-migrateable, but preemptable. It's not an issue for htab doing > >>>>> during map_update, since > >>>>> it's under htab bucket lock. > >>>>> Let's introduce minimal: > >>>>> /* big comment here explaining the reason of extra preempt disable */ > >>>>> static void bpf_memalloc_irq_work_raise(...) > >>>>> { > >>>>> preempt_disable_notrace(); > >>>>> irq_work_raise(); > >>>>> preempt_enable_notrace(); > >>>>> } > >>>>> > >>>>> it will have the same effect, right? > >>>>> . > >>>> No. As I said in commit message, when c->low_watermark is 1, the above > >>>> fix doesn't work as shown below: > >>> Yes. I got mark=1 part. I just don't think it's worth the complexity. > >> Just find out that for bpf_obj_new() the minimal low_watermark is 2 > >> instead of 1 (unit_size= 4096 instead of 4096 + 8). But even with > >> low_watermark as 2, the above fix may don't work when there are nested > >> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B > >> (free_cnt = 0 after alloc) -> preempted by task C (fail to do > >> allocation). And in my naive understanding of bpf memory allocate, these > >> fixes are simple. Why do you think it will introduce extra complexity ? > >> Do you mean preempt_disable_notrace() could be used to trigger the > >> running of bpf program ? If it is the problem, I think we should fix it > >> instead. > > I'm not worried about recursive calls from _notrace(). That shouldn't > > be possible. > > OK > > I'm just saying that disabling preemption around irq_work_raise() helps a bit > > while disable around the whole unit_alloc/free is a snake oil. > > bpf prog could be running in irq disabled context and preempt disabled > > unit_alloc vs irq_work_raise won't make any difference. Both will return NULL. > > Same with batched htab update. It will hit NULL too. > > So from my pov you're trying to fix something that is not fixable. > > The patch set didn't try to fix the problem for all possible context, > especially the irq disable context. It just tries to fix the ENOMEM > problem for process context which is the major context. I still think > disabling preemption around the whole unit_alloc/free is much solider > than just do that for irq_work_raise() (e.g., for the nested preemption > case). But if you have a strong preference for only disabling preemption > for irq_work_raise(), I will post v2 to do that. In process ctx the preempt_disable/enable across unit_alloc will keep asking kernel to consider preemption on every unit_alloc call which can be a lot. If I'm reading the code correctly preempt_schedule() is quite heavy. Doing it every unit_alloc is a performance concern. 
Could you try the following:

diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9c49ae53deaf..ee8262f58c5a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)

 static void notrace irq_work_raise(struct bpf_mem_cache *c)
 {
-       irq_work_queue(&c->refill_work);
+       if (!irq_work_queue(&c->refill_work)) {
+               preempt_disable_notrace();
+               preempt_enable_notrace();
+       }
 }

The idea is that it will ask for resched if preemptible.
Will it address the issue you're seeing?
Hi, On 8/25/2023 11:48 AM, Alexei Starovoitov wrote: > On Thu, Aug 24, 2023 at 7:07 AM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 8/24/2023 12:33 AM, Alexei Starovoitov wrote: >>> On Tue, Aug 22, 2023 at 9:39 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>> Hi, >>>> >>>> On 8/23/2023 9:57 AM, Alexei Starovoitov wrote: >>>>> On Tue, Aug 22, 2023 at 5:51 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>> Hi, >>>>>> >>>>>> On 8/23/2023 8:05 AM, Alexei Starovoitov wrote: >>>>>>> On Tue, Aug 22, 2023 at 6:06 AM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>>>> From: Hou Tao <houtao1@huawei.com> >>>>>>>> >>>>>>>> When doing stress test for qp-trie, bpf_mem_alloc() returned NULL >>>>>>>> unexpectedly because all qp-trie operations were initiated from >>>>>>>> bpf syscalls and there was still available free memory. bpf_obj_new() >>>>>>>> has the same problem as shown by the following selftest. >>>>>>>> >>>>>>>> The failure is due to the preemption. irq_work_raise() will invoke >>>>>>>> irq_work_claim() first to mark the irq work as pending and then inovke >>>>>>>> __irq_work_queue_local() to raise an IPI. So when the current task >>>>>>>> which is invoking irq_work_raise() is preempted by other task, >>>>>>>> unit_alloc() may return NULL for preemptive task as shown below: >>>>>>>> >>>>>>>> task A task B >>>>>>>> >>>>>>>> unit_alloc() >>>>>>>> // low_watermark = 32 >>>>>>>> // free_cnt = 31 after alloc >>>>>>>> irq_work_raise() >>>>>>>> // mark irq work as IRQ_WORK_PENDING >>>>>>>> irq_work_claim() >>>>>>>> >>>>>>>> // task B preempts task A >>>>>>>> unit_alloc() >>>>>>>> // free_cnt = 30 after alloc >>>>>>>> // irq work is already PENDING, >>>>>>>> // so just return >>>>>>>> irq_work_raise() >>>>>>>> // does unit_alloc() 30-times >>>>>>>> ...... >>>>>>>> unit_alloc() >>>>>>>> // free_cnt = 0 before alloc >>>>>>>> return NULL >>>>>>>> >>>>>>>> Fix it by invoking preempt_disable_notrace() before allocation and >>>>>>>> invoking preempt_enable_notrace() to enable preemption after >>>>>>>> irq_work_raise() completes. An alternative fix is to move >>>>>>>> local_irq_restore() after the invocation of irq_work_raise(), but it >>>>>>>> will enlarge the irq-disabled region. Another feasible fix is to only >>>>>>>> disable preemption before invoking irq_work_queue() and enable >>>>>>>> preemption after the invocation in irq_work_raise(), but it can't >>>>>>>> handle the case when c->low_watermark is 1. >>>>>>>> >>>>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com> >>>>>>>> --- >>>>>>>> kernel/bpf/memalloc.c | 8 ++++++++ >>>>>>>> 1 file changed, 8 insertions(+) >>>>>>>> >>>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >>>>>>>> index 9c49ae53deaf..83f8913ebb0a 100644 >>>>>>>> --- a/kernel/bpf/memalloc.c >>>>>>>> +++ b/kernel/bpf/memalloc.c >>>>>>>> @@ -6,6 +6,7 @@ >>>>>>>> #include <linux/irq_work.h> >>>>>>>> #include <linux/bpf_mem_alloc.h> >>>>>>>> #include <linux/memcontrol.h> >>>>>>>> +#include <linux/preempt.h> >>>>>>>> #include <asm/local.h> >>>>>>>> >>>>>>>> /* Any context (including NMI) BPF specific memory allocator. >>>>>>>> @@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>>>>>> * Use per-cpu 'active' counter to order free_list access between >>>>>>>> * unit_alloc/unit_free/bpf_mem_refill. 
>>>>>>>> */ >>>>>>>> + preempt_disable_notrace(); >>>>>>>> local_irq_save(flags); >>>>>>>> if (local_inc_return(&c->active) == 1) { >>>>>>>> llnode = __llist_del_first(&c->free_llist); >>>>>>>> @@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c) >>>>>>>> >>>>>>>> if (cnt < c->low_watermark) >>>>>>>> (c); >>>>>>>> + /* Enable preemption after the enqueue of irq work completes, >>>>>>>> + * so free_llist may be refilled by irq work before other task >>>>>>>> + * preempts current task. >>>>>>>> + */ >>>>>>>> + preempt_enable_notrace(); >>>>>>> So this helps qp-trie init, since it's doing bpf_mem_alloc from >>>>>>> syscall context and helps bpf_obj_new from bpf prog, since prog is >>>>>>> non-migrateable, but preemptable. It's not an issue for htab doing >>>>>>> during map_update, since >>>>>>> it's under htab bucket lock. >>>>>>> Let's introduce minimal: >>>>>>> /* big comment here explaining the reason of extra preempt disable */ >>>>>>> static void bpf_memalloc_irq_work_raise(...) >>>>>>> { >>>>>>> preempt_disable_notrace(); >>>>>>> irq_work_raise(); >>>>>>> preempt_enable_notrace(); >>>>>>> } >>>>>>> >>>>>>> it will have the same effect, right? >>>>>>> . >>>>>> No. As I said in commit message, when c->low_watermark is 1, the above >>>>>> fix doesn't work as shown below: >>>>> Yes. I got mark=1 part. I just don't think it's worth the complexity. >>>> Just find out that for bpf_obj_new() the minimal low_watermark is 2 >>>> instead of 1 (unit_size= 4096 instead of 4096 + 8). But even with >>>> low_watermark as 2, the above fix may don't work when there are nested >>>> preemption: task A (free_cnt = 1 after alloc) -> preempted by task B >>>> (free_cnt = 0 after alloc) -> preempted by task C (fail to do >>>> allocation). And in my naive understanding of bpf memory allocate, these >>>> fixes are simple. Why do you think it will introduce extra complexity ? >>>> Do you mean preempt_disable_notrace() could be used to trigger the >>>> running of bpf program ? If it is the problem, I think we should fix it >>>> instead. >>> I'm not worried about recursive calls from _notrace(). That shouldn't >>> be possible. >> OK >>> I'm just saying that disabling preemption around irq_work_raise() helps a bit >>> while disable around the whole unit_alloc/free is a snake oil. >>> bpf prog could be running in irq disabled context and preempt disabled >>> unit_alloc vs irq_work_raise won't make any difference. Both will return NULL. >>> Same with batched htab update. It will hit NULL too. >>> So from my pov you're trying to fix something that is not fixable. >> The patch set didn't try to fix the problem for all possible context, >> especially the irq disable context. It just tries to fix the ENOMEM >> problem for process context which is the major context. I still think >> disabling preemption around the whole unit_alloc/free is much solider >> than just do that for irq_work_raise() (e.g., for the nested preemption >> case). But if you have a strong preference for only disabling preemption >> for irq_work_raise(), I will post v2 to do that. > In process ctx the preempt_disable/enable across unit_alloc will keep > asking kernel to consider preemption on every unit_alloc call > which can be a lot. > If I'm reading the code correctly preempt_schedule() is quite heavy. > Doing it every unit_alloc is a performance concern. 
>
> Could you try the following:
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9c49ae53deaf..ee8262f58c5a 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)
>
>  static void notrace irq_work_raise(struct bpf_mem_cache *c)
>  {
> -       irq_work_queue(&c->refill_work);
> +       if (!irq_work_queue(&c->refill_work)) {
> +               preempt_disable_notrace();
> +               preempt_enable_notrace();
> +       }
>  }
>
> The idea that it will ask for resched if preemptible.
> will it address the issue you're seeing?
>
> .

No. It didn't work. If you are concerned about the overhead of
preempt_enable_notrace(), we could use local_irq_save() and
local_irq_restore() instead.
On Thu, Aug 24, 2023 at 11:04 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> > Could you try the following:
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 9c49ae53deaf..ee8262f58c5a 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)
> >
> >  static void notrace irq_work_raise(struct bpf_mem_cache *c)
> >  {
> > -       irq_work_queue(&c->refill_work);
> > +       if (!irq_work_queue(&c->refill_work)) {
> > +               preempt_disable_notrace();
> > +               preempt_enable_notrace();
> > +       }
> >  }
> >
> > The idea that it will ask for resched if preemptible.
> > will it address the issue you're seeing?
> >
> > .
>
> No. It didn't work.

why?

> If you are concerning about the overhead of
> preempt_enabled_notrace(), we could use local_irq_save() and
> local_irq_restore() instead.

That's much better.
Moving local_irq_restore() after irq_work_raise() in process ctx
would mean that irq_work will execute immediately after local_irq_restore(),
which would make bpf_ma behave like sync allocation.
Which is the ideal situation. The preempt disable/enable game is more fragile.
On 8/26/2023 1:16 AM, Alexei Starovoitov wrote:
> On Thu, Aug 24, 2023 at 11:04 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Could you try the following:
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 9c49ae53deaf..ee8262f58c5a 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work)
>>>
>>>  static void notrace irq_work_raise(struct bpf_mem_cache *c)
>>>  {
>>> -       irq_work_queue(&c->refill_work);
>>> +       if (!irq_work_queue(&c->refill_work)) {
>>> +               preempt_disable_notrace();
>>> +               preempt_enable_notrace();
>>> +       }
>>>  }
>>>
>>> The idea that it will ask for resched if preemptible.
>>> will it address the issue you're seeing?
>>>
>>> .
>> No. It didn't work.
> why?

I don't know the exact reason yet. It seems preempt_enable_notrace() invoked
in the preempting task doesn't return the CPU back to the preempted task.
Will add some debug info to check that.

>
>> If you are concerning about the overhead of
>> preempt_enabled_notrace(), we could use local_irq_save() and
>> local_irq_restore() instead.
> That's much better.
> Moving local_irq_restore() after irq_work_raise() in process ctx
> would mean that irq_work will execute immediately after local_irq_restore()
> which would make bpf_ma to behave like sync allocation.
> Which is the ideal situation. preempt disable/enable game is more fragile.

OK. So you are OK to wrap the whole implementation of unit_alloc() and
unit_free() with local_irq_save() and local_irq_restore(), right?
On Fri, Aug 25, 2023 at 7:22 PM Hou Tao <houtao@huaweicloud.com> wrote: > > > > On 8/26/2023 1:16 AM, Alexei Starovoitov wrote: > > On Thu, Aug 24, 2023 at 11:04 PM Hou Tao <houtao@huaweicloud.com> wrote: > >>> Could you try the following: > >>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > >>> index 9c49ae53deaf..ee8262f58c5a 100644 > >>> --- a/kernel/bpf/memalloc.c > >>> +++ b/kernel/bpf/memalloc.c > >>> @@ -442,7 +442,10 @@ static void bpf_mem_refill(struct irq_work *work) > >>> > >>> static void notrace irq_work_raise(struct bpf_mem_cache *c) > >>> { > >>> - irq_work_queue(&c->refill_work); > >>> + if (!irq_work_queue(&c->refill_work)) { > >>> + preempt_disable_notrace(); > >>> + preempt_enable_notrace(); > >>> + } > >>> } > >>> > >>> The idea that it will ask for resched if preemptible. > >>> will it address the issue you're seeing? > >>> > >>> . > >> No. It didn't work. > > why? > > Don't know the extra reason. It seems preempt_enable_notrace() inovked > in the preemption task doesn't return the CPU back to the preempted > task. Will add some debug info to check that. > > > >> If you are concerning about the overhead of > >> preempt_enabled_notrace(), we could use local_irq_save() and > >> local_irq_restore() instead. > > That's much better. > > Moving local_irq_restore() after irq_work_raise() in process ctx > > would mean that irq_work will execute immediately after local_irq_restore() > > which would make bpf_ma to behave like sync allocation. > > Which is the ideal situation. preempt disable/enable game is more fragile. > > OK. So you are OK to wrap the whole implementation of unit_alloc() and > unit_free() by local_irq_saved() and local_irq_restore(), right ? You don't need to wrap it. Just need to move local_irq_restore() in unit_alloc/unit_free/unit_free_rcu couple lines down.
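For reference, a minimal sketch of that suggestion applied to unit_alloc(),
pieced together from the unit_alloc() fragments quoted earlier in the thread
(context lines such as local_dec(&c->active) are assumed from those fragments;
this illustrates the idea and is not an actual patch):

@@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	local_dec(&c->active);
-	local_irq_restore(flags);
 
 	if (cnt < c->low_watermark)
 		irq_work_raise(c);
+	/* irqs are still disabled here, so the irq work raised above runs
+	 * right after local_irq_restore(), refilling free_llist before a
+	 * preempting task can call unit_alloc() again on this CPU.
+	 */
+	local_irq_restore(flags);
 	return llnode;
 }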
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9c49ae53deaf..83f8913ebb0a 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -6,6 +6,7 @@
 #include <linux/irq_work.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/preempt.h>
 #include <asm/local.h>
 
 /* Any context (including NMI) BPF specific memory allocator.
@@ -725,6 +726,7 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 	 * Use per-cpu 'active' counter to order free_list access between
 	 * unit_alloc/unit_free/bpf_mem_refill.
 	 */
+	preempt_disable_notrace();
 	local_irq_save(flags);
 	if (local_inc_return(&c->active) == 1) {
 		llnode = __llist_del_first(&c->free_llist);
@@ -740,6 +742,12 @@ static void notrace *unit_alloc(struct bpf_mem_cache *c)
 
 	if (cnt < c->low_watermark)
 		irq_work_raise(c);
+	/* Enable preemption after the enqueue of irq work completes,
+	 * so free_llist may be refilled by irq work before other task
+	 * preempts current task.
+	 */
+	preempt_enable_notrace();
+
 	return llnode;
 }