Message ID | 20221206042946.686847-2-houtao@huaweicloud.com |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | Misc optimizations for bpf mem allocator |
On 12/5/22 8:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When there are batched freeing operations on a specific CPU, part of
> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> be moved to waiting_for_gp list and the remaining part will be left in
> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> period and the next invocation of free_bulk().

The change below LGTM. However, the above description seems not precise.
IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
call_rcu_in_progress is true or not. If it is true, free_by_rcu list
will remain intact and not moving into waiting_for_gp list.
So it is not 'the remaining part will be left in free_by_rcu'.
It is all elements in free_by_rcu to waiting_for_gp or none.

>
> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
> allocate a new object, in alloc_bulk() just check whether or not there
> is freed element in free_by_rcu and reuse it if available.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>

LGTM except the above suggestions.

Acked-by: Yonghong Song <yhs@fb.com>

> ---
>  kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 8f0d65f2474a..7daf147bc8f6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>  	memcg = get_memcg(c);
>  	old_memcg = set_active_memcg(memcg);
>  	for (i = 0; i < cnt; i++) {
> -		obj = __alloc(c, node);
> -		if (!obj)
> -			break;
> +		/*
> +		 * free_by_rcu is only manipulated by irq work refill_work().
> +		 * IRQ works on the same CPU are called sequentially, so it is
> +		 * safe to use __llist_del_first() here. If alloc_bulk() is
> +		 * invoked by the initial prefill, there will be no running
> +		 * irq work, so __llist_del_first() is fine as well.
> +		 *
> +		 * In most cases, objects on free_by_rcu are from the same CPU.
> +		 * If some objects come from other CPUs, it doesn't incur any
> +		 * harm because NUMA_NO_NODE means the preference for current
> +		 * numa node and it is not a guarantee.
> +		 */
> +		obj = __llist_del_first(&c->free_by_rcu);
> +		if (!obj) {
> +			obj = __alloc(c, node);
> +			if (!obj)
> +				break;
> +		}
>  		if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  			/* In RT irq_work runs in per-cpu kthread, so disable
>  			 * interrupts to avoid preemption and interrupts and
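For reference, the all-or-none behaviour discussed above is implemented by
do_call_rcu(). A condensed sketch, paraphrased from kernel/bpf/memalloc.c of
this era (comments added here; details may differ from the exact tree the
patch targets):

static void do_call_rcu(struct bpf_mem_cache *c)
{
        struct llist_node *llnode, *t;

        /* An RCU-tasks-trace callback is still pending: leave free_by_rcu
         * fully intact, nothing moves to waiting_for_gp this time.
         */
        if (atomic_xchg(&c->call_rcu_in_progress, 1))
                return;

        /* Otherwise move *all* elements of free_by_rcu to waiting_for_gp
         * and start a single grace period for the whole batch.
         */
        WARN_ON_ONCE(!llist_empty(&c->waiting_for_gp));
        llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
                __llist_add(llnode, &c->waiting_for_gp);
        call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}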
Hi,

On 12/7/2022 9:52 AM, Yonghong Song wrote:
>
>
> On 12/5/22 8:29 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When there are batched freeing operations on a specific CPU, part of
>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>> be moved to waiting_for_gp list and the remaining part will be left in
>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>> period and the next invocation of free_bulk().
>
> The change below LGTM. However, the above description seems not precise.
> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> will remain intact and not moving into waiting_for_gp list.
> So it is not 'the remaining part will be left in free_by_rcu'.
> It is all elements in free_by_rcu to waiting_for_gp or none.
Thanks for the review and the suggestions. I tried to say that moving from
free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
stacked on free_by_rcu list. So how about the following rephrasing, or do you
still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
waiting_for_gp list and the remaining part will be left in free_by_rcu list.
These elements in free_by_rcu list will be moved into waiting_for_gp list
after one RCU-tasks-trace grace period and another invocation of free_bulk(),
so there may be many free elements being stacked on the free_by_rcu list.

>>
>> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
>> allocate a new object, in alloc_bulk() just check whether or not there
>> is freed element in free_by_rcu and reuse it if available.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>
> LGTM except the above suggestions.
>
> Acked-by: Yonghong Song <yhs@fb.com>
>
>> ---
>>  kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>>  1 file changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>> index 8f0d65f2474a..7daf147bc8f6 100644
>> --- a/kernel/bpf/memalloc.c
>> +++ b/kernel/bpf/memalloc.c
>> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>>  	memcg = get_memcg(c);
>>  	old_memcg = set_active_memcg(memcg);
>>  	for (i = 0; i < cnt; i++) {
>> -		obj = __alloc(c, node);
>> -		if (!obj)
>> -			break;
>> +		/*
>> +		 * free_by_rcu is only manipulated by irq work refill_work().
>> +		 * IRQ works on the same CPU are called sequentially, so it is
>> +		 * safe to use __llist_del_first() here. If alloc_bulk() is
>> +		 * invoked by the initial prefill, there will be no running
>> +		 * irq work, so __llist_del_first() is fine as well.
>> +		 *
>> +		 * In most cases, objects on free_by_rcu are from the same CPU.
>> +		 * If some objects come from other CPUs, it doesn't incur any
>> +		 * harm because NUMA_NO_NODE means the preference for current
>> +		 * numa node and it is not a guarantee.
>> +		 */
>> +		obj = __llist_del_first(&c->free_by_rcu);
>> +		if (!obj) {
>> +			obj = __alloc(c, node);
>> +			if (!obj)
>> +				break;
>> +		}
>>  		if (IS_ENABLED(CONFIG_PREEMPT_RT))
>>  			/* In RT irq_work runs in per-cpu kthread, so disable
>>  			 * interrupts to avoid preemption and interrupts and
>
> .
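The "(high_watermark - lower_watermark) / 2 + 1" figure above comes from the
drain loop in free_bulk(), sketched below (again paraphrased from
kernel/bpf/memalloc.c with comments added; the helper names are real, but
details may differ): free_bulk() pops free_llist down to the midpoint between
the two watermarks and pushes every popped element onto free_by_rcu, so while
call_rcu_in_progress stays set each invocation stacks another batch there.

static void free_bulk(struct bpf_mem_cache *c)
{
        struct llist_node *llnode, *t;
        unsigned long flags;
        int cnt;

        /* Pop free_llist down to the midpoint between the watermarks and
         * push each popped element onto free_by_rcu via enque_to_free().
         */
        do {
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_save(flags);
                WARN_ON_ONCE(local_inc_return(&c->active) != 1);
                llnode = __llist_del_first(&c->free_llist);
                if (llnode)
                        cnt = --c->free_cnt;
                else
                        cnt = 0;
                local_dec(&c->active);
                if (IS_ENABLED(CONFIG_PREEMPT_RT))
                        local_irq_restore(flags);
                if (llnode)
                        enque_to_free(c, llnode);
        } while (cnt > (c->high_watermark + c->low_watermark) / 2);

        /* And drain free_llist_extra as well. */
        llist_for_each_safe(llnode, t, llist_del_all(&c->free_llist_extra))
                enque_to_free(c, llnode);

        /* All or none: a no-op while call_rcu_in_progress is still set. */
        do_call_rcu(c);
}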
On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 12/7/2022 9:52 AM, Yonghong Song wrote:
> >
> >
> > On 12/5/22 8:29 PM, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> When there are batched freeing operations on a specific CPU, part of
> >> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
> >> be moved to waiting_for_gp list and the remaining part will be left in
> >> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> >> period and the next invocation of free_bulk().
> >
> > The change below LGTM. However, the above description seems not precise.
> > IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
> > call_rcu_in_progress is true or not. If it is true, free_by_rcu list
> > will remain intact and not moving into waiting_for_gp list.
> > So it is not 'the remaining part will be left in free_by_rcu'.
> > It is all elements in free_by_rcu to waiting_for_gp or none.
> Thanks for the review and the suggestions. I tried to say that moving from
> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
> stacked on free_by_rcu list. So how about the following rephrasing, or do you
> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."?
>
> When there are batched freeing operations on a specific CPU, part of the freed
> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
> waiting_for_gp list and the remaining part will be left in free_by_rcu list.

I agree with Yonghong. The above sentence is not just imprecise:
the 'elements moved to waiting_for_gp list' part is not correct.
The elements are never moved into it directly, only via the
free_by_rcu list.

All or none also matters.
Hi,

On 12/7/2022 10:58 AM, Alexei Starovoitov wrote:
> On Tue, Dec 6, 2022 at 6:20 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 12/7/2022 9:52 AM, Yonghong Song wrote:
>>>
>>> On 12/5/22 8:29 PM, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> When there are batched freeing operations on a specific CPU, part of
>>>> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will
>>>> be moved to waiting_for_gp list and the remaining part will be left in
>>>> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
>>>> period and the next invocation of free_bulk().
>>> The change below LGTM. However, the above description seems not precise.
>>> IIUC, free_by_rcu list => waiting_for_gp is controlled by whether
>>> call_rcu_in_progress is true or not. If it is true, free_by_rcu list
>>> will remain intact and not moving into waiting_for_gp list.
>>> So it is not 'the remaining part will be left in free_by_rcu'.
>>> It is all elements in free_by_rcu to waiting_for_gp or none.
>> Thanks for the review and the suggestions. I tried to say that moving from
>> free_by_rcu to waiting_for_gp is slow, and there can be many free elements being
>> stacked on free_by_rcu list. So how about the following rephrasing, or do you
>> still prefer "It is all elements in free_by_rcu to waiting_for_gp or none."?
>>
>> When there are batched freeing operations on a specific CPU, part of the freed
>> elements ((high_watermark - lower_watermark) / 2 + 1) will be moved to
>> waiting_for_gp list and the remaining part will be left in free_by_rcu list.
> I agree with Yonghong. The above sentence is not just imprecise:
> the 'elements moved to waiting_for_gp list' part is not correct.
> The elements are never moved into it directly, only via the
> free_by_rcu list.
Yes.
>
> All or none also matters.
I am still confused about "all or none". Does it mean that all elements in the
free_by_rcu list will be moved into waiting_for_gp, or that none will be moved
if call_rcu_in_progress is true? So how about the following rephrasing?

When there are batched freeing operations on a specific CPU, part of the freed
elements ((high_watermark - lower_watermark) / 2 + 1) will be indirectly moved
into the waiting_for_gp list through the free_by_rcu list. After
call_rcu_in_progress becomes false again, the remaining elements in the
free_by_rcu list will be moved to the waiting_for_gp list by the next
invocation of free_bulk(). However, if the expiration of the RCU-tasks-trace
grace period is relatively slow, no element in the free_by_rcu list will be
moved.

So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to allocate a
new object, in alloc_bulk() just check whether or not there is a freed element
in the free_by_rcu list and reuse it if available.
> .
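The "relatively slow" case above hinges on when call_rcu_in_progress is
cleared, which only happens in the grace-period callback. A condensed sketch
of that callback (paraphrased from kernel/bpf/memalloc.c of this era; the
intermediate __free_rcu_tasks_trace() step is omitted here):

static void __free_rcu(struct rcu_head *head)
{
        struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);

        /* The grace period has expired: actually free the batch ... */
        free_all(llist_del_all(&c->waiting_for_gp), !!c->percpu_size);

        /* ... and only now allow free_by_rcu to drain again. A long
         * RCU-tasks-trace grace period therefore keeps elements stacked
         * on free_by_rcu in the meantime.
         */
        atomic_set(&c->call_rcu_in_progress, 0);
}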
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 8f0d65f2474a..7daf147bc8f6 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
 	memcg = get_memcg(c);
 	old_memcg = set_active_memcg(memcg);
 	for (i = 0; i < cnt; i++) {
-		obj = __alloc(c, node);
-		if (!obj)
-			break;
+		/*
+		 * free_by_rcu is only manipulated by irq work refill_work().
+		 * IRQ works on the same CPU are called sequentially, so it is
+		 * safe to use __llist_del_first() here. If alloc_bulk() is
+		 * invoked by the initial prefill, there will be no running
+		 * irq work, so __llist_del_first() is fine as well.
+		 *
+		 * In most cases, objects on free_by_rcu are from the same CPU.
+		 * If some objects come from other CPUs, it doesn't incur any
+		 * harm because NUMA_NO_NODE means the preference for current
+		 * numa node and it is not a guarantee.
+		 */
+		obj = __llist_del_first(&c->free_by_rcu);
+		if (!obj) {
+			obj = __alloc(c, node);
+			if (!obj)
+				break;
+		}
 		if (IS_ENABLED(CONFIG_PREEMPT_RT))
 			/* In RT irq_work runs in per-cpu kthread, so disable
 			 * interrupts to avoid preemption and interrupts and
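For context on the "only manipulated by irq work refill_work()" comment: both
bulk paths are driven from the same per-CPU irq work, which is why the
lockless __llist_del_first() needs no further protection. A condensed sketch
of the dispatcher (paraphrased from kernel/bpf/memalloc.c; details may differ
from the exact tree the patch targets):

static void bpf_mem_refill(struct irq_work *work)
{
        struct bpf_mem_cache *c = container_of(work, struct bpf_mem_cache,
                                               refill_work);
        int cnt;

        /* Racy access to free_cnt; it doesn't need to be 100% accurate. */
        cnt = c->free_cnt;
        if (cnt < c->low_watermark)
                /* irq_work runs on this CPU, so an allocation with
                 * NUMA_NO_NODE will prefer the current numa node,
                 * which is what we want.
                 */
                alloc_bulk(c, c->batch, NUMA_NO_NODE);
        else if (cnt > c->high_watermark)
                free_bulk(c);
}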