Message ID | 20230624031333.96597-10-alexei.starovoitov@gmail.com (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | bpf: Introduce bpf_mem_cache_free_rcu().
Hi,

On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  kernel/bpf/memalloc.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 692a9a30c1dc..666917c16e87 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	if (i >= cnt)
>  		return;
>
> +	for (; i < cnt; i++) {
> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);

After allowing reuse of elements from waiting_for_gp_ttrace, there may be
concurrent llist_del_first() and llist_del_all() as shown below, and
llist_del_first() is not safe because the elements freed from free_rcu()
could be reused immediately, so head->first may be added back to
c0->waiting_for_gp_ttrace by another process.

// c0
alloc_bulk()
    llist_del_first(&c->waiting_for_gp_ttrace)

// c1->tgt = c0
free_rcu()
    llist_del_all(&c->waiting_for_gp_ttrace)

> +		if (!obj)
> +			break;
> +		add_obj_to_free_list(c, obj);
> +	}
> +	if (i >= cnt)
> +		return;
> +
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (; i < cnt; i++) {
On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> > Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  kernel/bpf/memalloc.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> > index 692a9a30c1dc..666917c16e87 100644
> > --- a/kernel/bpf/memalloc.c
> > +++ b/kernel/bpf/memalloc.c
> > @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
> >  	if (i >= cnt)
> >  		return;
> >
> > +	for (; i < cnt; i++) {
> > +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
>
> After allowing reuse of elements from waiting_for_gp_ttrace, there may be
> concurrent llist_del_first() and llist_del_all() as shown below, and
> llist_del_first() is not safe because the elements freed from free_rcu()
> could be reused immediately, so head->first may be added back to
> c0->waiting_for_gp_ttrace by another process.
>
> // c0
> alloc_bulk()
>     llist_del_first(&c->waiting_for_gp_ttrace)
>
> // c1->tgt = c0
> free_rcu()
>     llist_del_all(&c->waiting_for_gp_ttrace)

I'm still thinking about how to fix the other issues you've reported,
but this one, I believe, is fine.
Are you basing 'not safe' on a comment?
Why is xchg(&head->first, NULL); on one cpu and
try_cmpxchg(&head->first, &entry, next); on another unsafe?
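For reference when reading the exchange above, here is a simplified sketch of
the two llist operations being compared, modeled loosely on the kernel's
include/linux/llist.h and lib/llist.c (names prefixed with sketch_ to make
clear this is not verbatim kernel code): llist_del_all() detaches the whole
list with a single xchg(), while llist_del_first() pops one node with a
try_cmpxchg() loop that re-reads entry->next, which is exactly where element
reuse can bite.

/* Simplified sketch, modeled loosely on lib/llist.c; not verbatim kernel code. */
struct llist_node { struct llist_node *next; };
struct llist_head { struct llist_node *first; };

/* llist_del_all(): detach the entire list in one atomic exchange. */
static inline struct llist_node *sketch_llist_del_all(struct llist_head *head)
{
	return xchg(&head->first, NULL);
}

/*
 * llist_del_first(): pop a single node with a try_cmpxchg() loop.
 * The loop only checks that 'first' still equals 'entry'; if the node
 * was freed, reallocated, and pushed back in the meantime, the stale
 * 'next' pointer is installed anyway (the classic ABA pattern).
 */
static struct llist_node *sketch_llist_del_first(struct llist_head *head)
{
	struct llist_node *entry, *next;

	entry = smp_load_acquire(&head->first);
	do {
		if (!entry)
			return NULL;
		next = READ_ONCE(entry->next);
	} while (!try_cmpxchg(&head->first, &entry, next));

	return entry;
}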
Hi,

On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>> ---
>>>  kernel/bpf/memalloc.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>> index 692a9a30c1dc..666917c16e87 100644
>>> --- a/kernel/bpf/memalloc.c
>>> +++ b/kernel/bpf/memalloc.c
>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>  	if (i >= cnt)
>>>  		return;
>>>
>>> +	for (; i < cnt; i++) {
>>> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>
>> After allowing reuse of elements from waiting_for_gp_ttrace, there may be
>> concurrent llist_del_first() and llist_del_all() as shown below, and
>> llist_del_first() is not safe because the elements freed from free_rcu()
>> could be reused immediately, so head->first may be added back to
>> c0->waiting_for_gp_ttrace by another process.
>>
>> // c0
>> alloc_bulk()
>>     llist_del_first(&c->waiting_for_gp_ttrace)
>>
>> // c1->tgt = c0
>> free_rcu()
>>     llist_del_all(&c->waiting_for_gp_ttrace)
>
> I'm still thinking about how to fix the other issues you've reported,
> but this one, I believe, is fine.
> Are you basing 'not safe' on a comment?
> Why is xchg(&head->first, NULL); on one cpu and
> try_cmpxchg(&head->first, &entry, next); on another unsafe?

No, I didn't reason it only based on the comment in llist.h. The reason
I thought it was not safe is that llist_del_first() may have the ABA
problem you pointed out earlier: after __free_rcu(), the elements on
waiting_for_gp_ttrace become available immediately, so
waiting_for_gp_ttrace->first may be reused and then added back to
waiting_for_gp_ttrace->first again, as shown below.

// c0->waiting_for_gp_ttrace
A -> B -> C -> nil

P1:
alloc_bulk(c0)
    llist_del_first(&c0->waiting_for_gp_ttrace)
        entry = A
        next = B

P2: __free_rcu
    // A/B/C is freed back to slab
    // c0->waiting_for_gp_ttrace->first = NULL
    free_all(llist_del_all(c0))

P3: alloc_bulk(c1)
    // got A
    __kmalloc()

// free A (from c1), ..., last free X (allocated from c0)
P3: unit_free(c1)
    // the last freed element X is from c0
    c1->tgt = c0
    c1->free_llist->first -> X -> Y -> ... -> A

P3: free_bulk(c1)
    enque_to_free(c0)
        c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
    __llist_add_batch(c0->waiting_for_gp_ttrace)
        c0->waiting_for_gp_ttrace = A -> ... -> Y -> X

P1:
    // A is added back as first again
    // but llist_del_first() didn't know
    try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
    // c0->waiting_for_gp_ttrace is corrupted
    c0->waiting_for_gp_ttrace->first = B
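To make the interleaving above concrete, here is a minimal, self-contained
user-space sketch (hypothetical names, C11 atomics instead of kernel
primitives) that replays the same steps in a single thread: the popper reads
entry = A and next = B, the list is drained, A is pushed back on top of a new
tail X, and the final compare-and-swap still succeeds, installing the stale B
and losing X.

#include <stdatomic.h>
#include <stdio.h>

struct node { struct node *next; };
struct head { _Atomic(struct node *) first; };

/* Analogue of llist_add(): push one node with a cmpxchg loop. */
static void push(struct head *h, struct node *n)
{
	n->next = atomic_load(&h->first);
	while (!atomic_compare_exchange_weak(&h->first, &n->next, n))
		;
}

int main(void)
{
	struct node A, B, C, X = { .next = NULL };
	struct head h = { .first = NULL };

	/* waiting_for_gp_ttrace: A -> B -> C -> nil */
	push(&h, &C);
	push(&h, &B);
	push(&h, &A);

	/* P1: llist_del_first() reads entry and next, then stalls. */
	struct node *entry = atomic_load(&h.first);	/* A */
	struct node *next = entry->next;		/* B */

	/* P2: __free_rcu() drains the whole list (llist_del_all()). */
	atomic_exchange(&h.first, NULL);

	/* P3: A is reallocated, freed again, and re-added on top of a
	 * different tail X, so the list is now A -> X. */
	push(&h, &X);
	push(&h, &A);

	/* P1 resumes: the compare still sees A, so it succeeds and
	 * installs the stale 'next' (B); X is lost and the list now
	 * points at nodes that were already drained (ABA corruption). */
	if (atomic_compare_exchange_strong(&h.first, &entry, next))
		printf("cmpxchg succeeded; head is now stale B=%p, X=%p lost\n",
		       (void *)atomic_load(&h.first), (void *)&X);
	return 0;
}

In the real allocator the drained nodes would have been kfree()'d, so P1's
stale access would also be a use-after-free; the sketch keeps them on the
stack only so it stays runnable.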
On 6/26/23 12:16 AM, Hou Tao wrote:
> Hi,
>
> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
>>>> From: Alexei Starovoitov <ast@kernel.org>
>>>>
>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc().
>>>>
>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>>> ---
>>>>  kernel/bpf/memalloc.c | 9 +++++++++
>>>>  1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>> index 692a9a30c1dc..666917c16e87 100644
>>>> --- a/kernel/bpf/memalloc.c
>>>> +++ b/kernel/bpf/memalloc.c
>>>> @@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>>>  	if (i >= cnt)
>>>>  		return;
>>>>
>>>> +	for (; i < cnt; i++) {
>>>> +		obj = llist_del_first(&c->waiting_for_gp_ttrace);
>>>
>>> After allowing reuse of elements from waiting_for_gp_ttrace, there may be
>>> concurrent llist_del_first() and llist_del_all() as shown below, and
>>> llist_del_first() is not safe because the elements freed from free_rcu()
>>> could be reused immediately, so head->first may be added back to
>>> c0->waiting_for_gp_ttrace by another process.
>>>
>>> // c0
>>> alloc_bulk()
>>>     llist_del_first(&c->waiting_for_gp_ttrace)
>>>
>>> // c1->tgt = c0
>>> free_rcu()
>>>     llist_del_all(&c->waiting_for_gp_ttrace)
>>
>> I'm still thinking about how to fix the other issues you've reported,
>> but this one, I believe, is fine.
>> Are you basing 'not safe' on a comment?
>> Why is xchg(&head->first, NULL); on one cpu and
>> try_cmpxchg(&head->first, &entry, next); on another unsafe?
>
> No, I didn't reason it only based on the comment in llist.h. The reason
> I thought it was not safe is that llist_del_first() may have the ABA
> problem you pointed out earlier: after __free_rcu(), the elements on
> waiting_for_gp_ttrace become available immediately, so
> waiting_for_gp_ttrace->first may be reused and then added back to
> waiting_for_gp_ttrace->first again, as shown below.
>
> // c0->waiting_for_gp_ttrace
> A -> B -> C -> nil
>
> P1:
> alloc_bulk(c0)
>     llist_del_first(&c0->waiting_for_gp_ttrace)
>         entry = A
>         next = B
>
> P2: __free_rcu
>     // A/B/C is freed back to slab
>     // c0->waiting_for_gp_ttrace->first = NULL
>     free_all(llist_del_all(c0))
>
> P3: alloc_bulk(c1)
>     // got A
>     __kmalloc()
>
> // free A (from c1), ..., last free X (allocated from c0)
> P3: unit_free(c1)
>     // the last freed element X is from c0
>     c1->tgt = c0
>     c1->free_llist->first -> X -> Y -> ... -> A
>
> P3: free_bulk(c1)
>     enque_to_free(c0)
>         c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
>     __llist_add_batch(c0->waiting_for_gp_ttrace)
>         c0->waiting_for_gp_ttrace = A -> ... -> Y -> X

In theory that's possible, but for this to happen one cpu needs to be a
thousand times slower than all others, and since there is no preemption
in llist_del_first() I don't think we need to worry about it.

Also, with removal of the _tail optimization the above
llist_add_batch(waiting_for_gp_ttrace) will become a loop, so the reused
element will end up at the very end instead of on top, and one cpu would
have to be a million times slower, which is not realistic.

> P1:
>     // A is added back as first again
>     // but llist_del_first() didn't know
>     try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
>     // c0->waiting_for_gp_ttrace is corrupted
>     c0->waiting_for_gp_ttrace->first = B
>
On Wed, Jun 28, 2023 at 04:09:14PM +0800, Hou Tao wrote:
> Hi,
>
> On 6/28/2023 8:59 AM, Alexei Starovoitov wrote:
> > On 6/26/23 12:16 AM, Hou Tao wrote:
> >> Hi,
> >>
> >> On 6/26/2023 12:42 PM, Alexei Starovoitov wrote:
> >>> On Sun, Jun 25, 2023 at 8:30 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 6/24/2023 11:13 AM, Alexei Starovoitov wrote:
> >>>>> From: Alexei Starovoitov <ast@kernel.org>
> >>>>>
> >>>>> alloc_bulk() can reuse elements from free_by_rcu_ttrace.
> >>>>> Let it reuse from waiting_for_gp_ttrace as well to avoid
> >>>>> unnecessary kmalloc().
> >>>>>
> >>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >>>>> ---
> >>>>>  kernel/bpf/memalloc.c | 9 +++++++++
> >>>>>  1 file changed, 9 insertions(+)
> >>>>>
> SNIP
> >> // free A (from c1), ..., last free X (allocated from c0)
> >> P3: unit_free(c1)
> >>     // the last freed element X is from c0
> >>     c1->tgt = c0
> >>     c1->free_llist->first -> X -> Y -> ... -> A
> >>
> >> P3: free_bulk(c1)
> >>     enque_to_free(c0)
> >>         c0->free_by_rcu_ttrace->first -> A -> ... -> Y -> X
> >>     __llist_add_batch(c0->waiting_for_gp_ttrace)
> >>         c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
> >
> > In theory that's possible, but for this to happen one cpu needs to be a
> > thousand times slower than all others, and since there is no preemption
> > in llist_del_first() I don't think we need to worry about it.
>
> Not sure whether or not such a case is possible in a VM; after all,
> CPU X is just a thread on the host and it may be preempted at any time
> and for any duration.

vCPU preemption can happen even with guest-OS interrupts disabled, and
such preemption can persist for hundreds of milliseconds, or even for
several seconds.  So admittedly quite rare, but also quite possible.

							Thanx, Paul

> > Also, with removal of the _tail optimization the above
> > llist_add_batch(waiting_for_gp_ttrace) will become a loop, so the reused
> > element will end up at the very end instead of on top, and one cpu would
> > have to be a million times slower, which is not realistic.
>
> It is still possible that A will be added back as
> waiting_for_gp_ttrace->first after switching to llist_add(), as shown
> below. My question is: how much is the benefit of reusing from
> waiting_for_gp_ttrace?
>
> // free A (from c1), ..., last free X (allocated from c0)
> P3: unit_free(c1)
>     // the last freed element X is allocated from c0
>     c1->tgt = c0
>     c1->free_llist->first -> A -> ... -> Y
>     c1->free_llist_extra -> X
>
> P3: free_bulk(c1)
>     enque_to_free(c0)
>         c0->free_by_rcu_ttrace->first -> Y -> ... -> A
>         c0->free_by_rcu_ttrace->first -> X -> Y -> ... -> A
>
>     llist_add(c0->waiting_for_gp_ttrace)
>         c0->waiting_for_gp_ttrace = A -> ... -> Y -> X
>
> >> P1:
> >>     // A is added back as first again
> >>     // but llist_del_first() didn't know
> >>     try_cmpxchg(&c0->waiting_for_gp_ttrace->first, A, B)
> >>     // c0->waiting_for_gp_ttrace is corrupted
> >>     c0->waiting_for_gp_ttrace->first = B
> >>
>
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 692a9a30c1dc..666917c16e87 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -203,6 +203,15 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	if (i >= cnt)
 		return;
 
+	for (; i < cnt; i++) {
+		obj = llist_del_first(&c->waiting_for_gp_ttrace);
+		if (!obj)
+			break;
+		add_obj_to_free_list(c, obj);
+	}
+	if (i >= cnt)
+		return;
+
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (; i < cnt; i++) {