| Message ID | 20230706033447.54696-10-alexei.starovoitov@gmail.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | 04fabf00b4d3aff5d010ecb617001814e409e24a |
| Delegated to | BPF |
| Series | bpf: Introduce bpf_mem_cache_free_rcu(). |
Hi,

On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/memalloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 9986c6b7df4d..e5a87f6cf2cc 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	if (i >= cnt)
>  		return;
>
> +	for (; i < cnt; i++) {
> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
> +		if (!obj)
> +			break;
> +		add_obj_to_free_list(c, obj);
> +	}
> +	if (i >= cnt)
> +		return;

I still think using llist_del_first() here is not safe, as reported in
[1]. I am not sure whether invoking enque_to_free() first for
free_llist_extra would close the race completely. Will check later.

[1]: https://lore.kernel.org/rcu/957dd5cd-0855-1197-7045-4cb1590bd753@huaweicloud.com/

> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (; i < cnt; i++) {
> @@ -295,12 +304,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
>
>  	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
>  	llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
> -		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
> -		 * It doesn't race with llist_del_all either.
> -		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
> -		 * from __free_rcu() and from drain_mem_cache().
> -		 */
> -		__llist_add(llnode, &c->waiting_for_gp_ttrace);
> +		llist_add(llnode, &c->waiting_for_gp_ttrace);
>
>  	if (unlikely(READ_ONCE(c->draining))) {
>  		__free_rcu(&c->rcu_ttrace);
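For context on the race being discussed: llist_del_first() is only safe with a
single concurrent remover. The sketch below paraphrases its implementation
(modeled on lib/llist.c; treat it as illustrative rather than verbatim), with a
comment marking where the ABA window opens:

```c
/* Illustrative paraphrase of llist_del_first() from lib/llist.c.
 * Single-consumer only: a second remover opens the ABA window below.
 */
struct llist_node *llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (entry == NULL)
			return NULL;
		next = READ_ONCE(entry->next);
		/* ABA window: if another CPU drains the list here, frees
		 * "entry", and the same address is later pushed back with
		 * a different successor, the cmpxchg below still succeeds
		 * and installs the stale "next" pointer as the list head.
		 */
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}
```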
On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 7/6/2023 11:34 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
SNIP
> I still think using llist_del_first() here is not safe, as reported in
> [1]. I am not sure whether invoking enque_to_free() first for
> free_llist_extra would close the race completely. Will check later.

lol. see my reply a second ago in the other thread.

and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.
Hi,

On 7/7/2023 10:12 AM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 7:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>> I still think using llist_del_first() here is not safe, as reported in
>> [1]. I am not sure whether invoking enque_to_free() first for
>> free_llist_extra would close the race completely. Will check later.
> lol. see my reply a second ago in the other thread.
>
> and it's not just waiting_for_gp_ttrace. free_by_rcu_ttrace is similar.

I think free_by_rcu_ttrace is different, because the reuse is only
possible after one tasks trace RCU grace period, as shown below, and the
concurrent llist_del_first() must have been completed by the time the
head is reused and re-added into free_by_rcu_ttrace again.

// c0->free_by_rcu_ttrace
A -> B -> C -> nil

P1:
alloc_bulk()
    llist_del_first(&c->free_by_rcu_ttrace)
        entry = A
        next = B

P2:
do_call_rcu_ttrace()
    // c->free_by_rcu_ttrace->first = NULL
    llist_del_all(&c->free_by_rcu_ttrace)
        move to c->waiting_for_gp_ttrace

P1:
    llist_del_first()
        return NULL

// A is only reusable after one tasks trace RCU grace period,
// so llist_del_first() must have been completed
__free_rcu_tasks_trace
    free_all(llist_del_all(&c->waiting_for_gp_ttrace))
On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
> P1:
>     llist_del_first()
>         return NULL
>
> // A is only reusable after one tasks trace RCU grace period,
> // so llist_del_first() must have been completed

"must have been completed" ?

I guess you're assuming that alloc_bulk() from irq_work
is running within an rcu_tasks_trace critical section,
so the __free_rcu_tasks_trace() callback will execute after
the irq work has completed?
I don't think that's the case.

If the vCPU running P1 is stopped for a looong time by the host,
P2 can execute __free_rcu_tasks_trace (or P3, since
tasks trace callbacks execute in a kthread that is not bound
to any cpu).
__free_rcu_tasks_trace() will free it into slab.
Then kmalloc the same obj and eventually put it back into
free_by_rcu_ttrace.

Since you believe that waiting_for_gp_ttrace ABA is possible
here, it's the same probability. imo both are lower than a bit flip due
to cosmic rays, which is actually observable in practice.
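The interleaving Alexei describes can be replayed deterministically in
userspace. The toy program below is not kernel code; it splits a
llist_del_first()-style pop into its read and cmpxchg halves so the
recycle-and-re-add step can be forced in between:

```c
#include <stdio.h>
#include <stddef.h>

/* Minimal stand-ins for struct llist_node / struct llist_head. */
struct node { struct node *next; int id; };
struct head { struct node *first; };

/* First half of a llist_del_first()-style pop: snapshot entry and next. */
static struct node *pop_read(struct head *h, struct node **next_out)
{
	struct node *entry = h->first;
	if (entry)
		*next_out = entry->next;	/* snapshot that can go stale */
	return entry;
}

/* Second half: single-threaded stand-in for try_cmpxchg(&h->first, ...). */
static int pop_cmpxchg(struct head *h, struct node *entry, struct node *next)
{
	if (h->first != entry)
		return 0;
	h->first = next;			/* may install a stale pointer */
	return 1;
}

int main(void)
{
	struct node C = { NULL, 3 }, B = { &C, 2 }, A = { &B, 1 };
	struct head h = { &A };
	struct node *entry, *next;

	/* P1 starts llist_del_first(): entry = A, next = B. */
	entry = pop_read(&h, &next);

	/* P2: llist_del_all() drains the list; the elements are freed after
	 * a grace period; A's address is then recycled and pushed back alone.
	 */
	h.first = NULL;
	A.next = NULL;
	h.first = &A;

	/* P1 resumes: the compare still sees A at the head, "succeeds",
	 * and installs B -- a node that was already freed -- as the head.
	 */
	if (pop_cmpxchg(&h, entry, next))
		printf("popped %d, head now points at freed node %d\n",
		       entry->id, h.first->id);
	return 0;
}
```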
Hi,

On 7/7/2023 12:16 PM, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 8:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
> "must have been completed" ?
>
> I guess you're assuming that alloc_bulk() from irq_work
> is running within an rcu_tasks_trace critical section,
> so the __free_rcu_tasks_trace() callback will execute after
> the irq work has completed?
> I don't think that's the case.

Yes. The following is my original reasoning. Correct me if I am wrong:

1. llist_del_first() must be running concurrently with llist_del_all().
   If llist_del_first() runs after llist_del_all(), it will return NULL
   directly.
2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
   elements in free_by_rcu_ttrace will not be freed back to slab.
3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
   before calling __free_rcu_tasks_trace().
4. llist_del_first() runs in a context with irqs disabled, so the tasks
   trace RCU grace period will wait for the end of llist_del_first().

It seems you think step 4) is not true, right ?

> If the vCPU running P1 is stopped for a looong time by the host,
SNIP
On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
> Yes. The following is my original reasoning. Correct me if I am wrong:
>
> 1. llist_del_first() must be running concurrently with llist_del_all().
>    If llist_del_first() runs after llist_del_all(), it will return NULL
>    directly.
> 2. call_rcu_tasks_trace() must happen after llist_del_all(), else the
>    elements in free_by_rcu_ttrace will not be freed back to slab.
> 3. call_rcu_tasks_trace() will wait for one tasks trace RCU grace period
>    before calling __free_rcu_tasks_trace().
> 4. llist_del_first() runs in a context with irqs disabled, so the tasks
>    trace RCU grace period will wait for the end of llist_del_first().
>
> It seems you think step 4) is not true, right ?

Yes. I think so. For two reasons:

1.
I believe an irq-disabled region isn't considered equivalent
to an rcu_read_lock_trace() region.

Paul,
could you clarify ?

2.
Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
runs "in a per-CPU thread in preemptible context."
See irq_work_run_list.
On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
> > It seems you think step 4) is not true, right ?
>
> Yes. I think so. For two reasons:
>
> 1.
> I believe an irq-disabled region isn't considered equivalent
> to an rcu_read_lock_trace() region.
>
> Paul,
> could you clarify ?

You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
count irq-disabled regions of code as readers.

But why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?

For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
Hou Tao would be correct from a strict current-implementation
viewpoint. The reason is that, given the current implementation in
CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
take an IPI in order for the grace-period machinery to realize that this
task is done with all prior readers.

However, we need to account for the possibility of IPI-free
implementations, for example, if the real-time guys decide to start
making heavy use of BPF sleepable programs. They would then insist on
getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
point, irq-disabled regions of code will absolutely not act as
RCU Tasks Trace readers.

Again, why not just put an rcu_read_lock_trace() and a matching
rcu_read_unlock_trace() within that irq-disabled region of code?

Or maybe there is a better workaround.

> 2.
> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> runs "in a per-CPU thread in preemptible context."
> See irq_work_run_list.

Agreed, under RT, "interrupt handlers" often run in task context.

							Thanx, Paul
On Fri, Jul 7, 2023 at 1:47 PM Paul E. McKenney <paulmck@kernel.org> wrote:
SNIP
> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs. They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
> point, irq-disabled regions of code will absolutely not act as
> RCU Tasks Trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?

If I remember correctly, the general guidance is to always put an
explicit marker if it is in an RCU reader, instead of relying on
implementation details. So the suggestion to put the marker instead of
relying on IRQ disabling does align with that.

Thanks.
Hi,

On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
> On Fri, Jul 07, 2023 at 09:11:22AM -0700, Alexei Starovoitov wrote:
>> On Thu, Jul 6, 2023 at 9:37 PM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>> 1.
>> I believe an irq-disabled region isn't considered equivalent
>> to an rcu_read_lock_trace() region.
>>
>> Paul,
>> could you clarify ?
> You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
> count irq-disabled regions of code as readers.

I see. But I still have one question: considering that in the current
implementation one Tasks Trace RCU grace period implies one vanilla RCU
grace period (aka rcu_trace_implies_rcu_gp()), in my naive understanding
of RCU, does that mean __free_rcu_tasks_trace() will be invoked after
the expiration of the current Tasks Trace RCU grace period, and also
after the expiration of the current vanilla RCU grace period? If both
of these are true, does it mean __free_rcu_tasks_trace() will wait for
the irq-disabled code region ?

> But why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> For completeness, if it were not for CONFIG_TASKS_TRACE_RCU_READ_MB,
> Hou Tao would be correct from a strict current-implementation
> viewpoint. The reason is that, given the current implementation in
> CONFIG_TASKS_TRACE_RCU_READ_MB=n kernels, a task must either block or
> take an IPI in order for the grace-period machinery to realize that this
> task is done with all prior readers.

Thanks for the detailed explanation.

> However, we need to account for the possibility of IPI-free
> implementations, for example, if the real-time guys decide to start
> making heavy use of BPF sleepable programs. They would then insist on
> getting rid of those IPIs for CONFIG_PREEMPT_RT=y kernels. At which
> point, irq-disabled regions of code will absolutely not act as
> RCU Tasks Trace readers.
>
> Again, why not just put an rcu_read_lock_trace() and a matching
> rcu_read_unlock_trace() within that irq-disabled region of code?
>
> Or maybe there is a better workaround.

Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
problem for free_by_rcu_ttrace. See the sketch after this message.

>> 2.
>> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
>> runs "in a per-CPU thread in preemptible context."
>> See irq_work_run_list.
> Agreed, under RT, "interrupt handlers" often run in task context.

Yes, I missed that. I misread alloc_bulk(), and it seems it only does
inc_active() for c->free_llist.
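To make the proposed direction concrete, a fix along these lines might look
like the following sketch against the reuse loop in alloc_bulk()
(illustrative only, not necessarily the change that was ultimately committed):

```c
/* Sketch of Paul's suggestion: hold an explicit tasks-trace read-side
 * lock around llist_del_first() so the grace period -- and therefore
 * the kfree of the elements being removed -- cannot complete while a
 * removal is in flight. Whether this closes every window was still
 * under discussion in this thread.
 */
rcu_read_lock_trace();
for (; i < cnt; i++) {
	obj = llist_del_first(&c->waiting_for_gp_ttrace);
	if (!obj)
		break;
	add_obj_to_free_list(c, obj);
}
rcu_read_unlock_trace();
if (i >= cnt)
	return;
```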
On Sat, Jul 08, 2023 at 03:03:40PM +0800, Hou Tao wrote:
> Hi,
>
> On 7/8/2023 1:47 AM, Paul E. McKenney wrote:
SNIP
> > You are correct, Alexei. Unlike vanilla RCU, RCU Tasks Trace does not
> > count irq-disabled regions of code as readers.
>
> I see. But I still have one question: considering that in the current
> implementation one Tasks Trace RCU grace period implies one vanilla RCU
> grace period (aka rcu_trace_implies_rcu_gp()), in my naive understanding
> of RCU, does that mean __free_rcu_tasks_trace() will be invoked after
> the expiration of the current Tasks Trace RCU grace period, and also
> after the expiration of the current vanilla RCU grace period? If both
> of these are true, does it mean __free_rcu_tasks_trace() will wait for
> the irq-disabled code region ?

First, good show digging into the code!

However, this is guaranteed only if rcu_trace_implies_rcu_gp(), which
right now happens to return the constant true. In other words, that is
an accident of the current implementation. To rely on this, you must
check the return value of rcu_trace_implies_rcu_gp() and then have some
alternative code that does not rely on that synchronize_rcu().

> > Again, why not just put an rcu_read_lock_trace() and a matching
> > rcu_read_unlock_trace() within that irq-disabled region of code?
> >
> > Or maybe there is a better workaround.
>
> Yes. I think we could use rcu_read_{lock,unlock}_trace to fix the ABA
> problem for free_by_rcu_ttrace.

That sounds good to me!

							Thanx, Paul

> >> 2.
> >> Even if 1 is incorrect, in RT llist_del_first() from alloc_bulk()
> >> runs "in a per-CPU thread in preemptible context."
> >> See irq_work_run_list.
> > Agreed, under RT, "interrupt handlers" often run in task context.
>
> Yes, I missed that. I misread alloc_bulk(), and it seems it only does
> inc_active() for c->free_llist.
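For reference, the check Paul describes is already how this allocator chains
the two grace periods; __free_rcu_tasks_trace() in kernel/bpf/memalloc.c of
that era looked roughly like this (reconstructed from memory, so treat the
details as illustrative rather than a verbatim quote):

```c
/* Roughly the shape of __free_rcu_tasks_trace() in kernel/bpf/memalloc.c
 * at the time of this thread (illustrative, not verbatim).
 */
static void __free_rcu_tasks_trace(struct rcu_head *head)
{
	/* If an RCU Tasks Trace grace period also implies a vanilla RCU
	 * grace period, the objects are already safe from regular RCU
	 * readers and can be freed right away; otherwise chain a
	 * call_rcu() to wait for the vanilla grace period explicitly.
	 */
	if (rcu_trace_implies_rcu_gp())
		__free_rcu(head);
	else
		call_rcu(head, __free_rcu);
}
```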
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9986c6b7df4d..e5a87f6cf2cc 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -212,6 +212,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
 
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (; i < cnt; i++) {
@@ -295,12 +304,7 @@ static void do_call_rcu_ttrace(struct bpf_mem_cache *c)
 
 	WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp_ttrace));
 	llist_for_each_safe(llnode, t, llist_del_all(&c->free_by_rcu_ttrace))
-		/* There is no concurrent __llist_add(waiting_for_gp_ttrace) access.
-		 * It doesn't race with llist_del_all either.
-		 * But there could be two concurrent llist_del_all(waiting_for_gp_ttrace):
-		 * from __free_rcu() and from drain_mem_cache().
-		 */
-		__llist_add(llnode, &c->waiting_for_gp_ttrace);
+		llist_add(llnode, &c->waiting_for_gp_ttrace);
 
 	if (unlikely(READ_ONCE(c->draining))) {
 		__free_rcu(&c->rcu_ttrace);