Message ID | 20231221050010.1971932-1-yonghong.song@linux.dev (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | bpf: Reduce memory usage for bpf_global_percpu_ma | expand |
Hi, On 12/21/2023 1:00 PM, Yonghong Song wrote: > Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") > added support for non-fix-size percpu memory allocation. > Such allocation will allocate percpu memory for all buckets on all > cpus and the memory consumption is in the order to quadratic. > For example, let us say, 4 cpus, unit size 16 bytes, so each > cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. > Then let us say, 8 cpus with the same unit size, each cpu > has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. > So if the number of cpus doubles, the number of memory consumption > will be 4 times. So for a system with large number of cpus, the > memory consumption goes up quickly with quadratic order. > For example, for 4KB percpu allocation, 128 cpus. The total memory > consumption will 4KB * 128 * 128 = 64MB. Things will become > worse if the number of cpus is bigger (e.g., 512, 1024, etc.) > > In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is > done in boot time, so for system with large number of cpus, the initial > percpu memory consumption is very visible. For example, for 128 cpu > system, the total percpu memory allocation will be at least > (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) > * 128 * 128 = ~138MB. > which is pretty big. It will be even bigger for larger number of cpus. SNIP > + > static void drain_mem_cache(struct bpf_mem_cache *c) > { > bool percpu = !!c->percpu_size; > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index f13008d27f35..08f9a49cc11c 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, > if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) > return -ENOMEM; > > - if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { > - if (!bpf_global_percpu_ma_set) { > - mutex_lock(&bpf_percpu_ma_lock); > - if (!bpf_global_percpu_ma_set) { > - err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); > - if (!err) > - bpf_global_percpu_ma_set = true; > - } > - mutex_unlock(&bpf_percpu_ma_lock); > - if (err) > - return err; > - } > - } > - > if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) { > verbose(env, "local type ID argument must be in range [0, U32_MAX]\n"); > return -EINVAL; > @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, > return -EINVAL; > } > > + if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { > + if (!bpf_global_percpu_ma_set) { > + mutex_lock(&bpf_percpu_ma_lock); > + if (!bpf_global_percpu_ma_set) { > + err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I think the memory account will be incorrect, right ? Maybe we should pass objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think the objcg is NULL. > + if (!err) > + bpf_global_percpu_ma_set = true; > + } > + mutex_unlock(&bpf_percpu_ma_lock); > + if (err) > + return err; > + } > + > + mutex_lock(&bpf_percpu_ma_lock); > + err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); > + mutex_unlock(&bpf_percpu_ma_lock); > + if (err) > + return err; > + } > + > struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id); > if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { > if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
On 12/20/23 10:26 PM, Hou Tao wrote: > Hi, > > On 12/21/2023 1:00 PM, Yonghong Song wrote: >> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") >> added support for non-fix-size percpu memory allocation. >> Such allocation will allocate percpu memory for all buckets on all >> cpus and the memory consumption is in the order to quadratic. >> For example, let us say, 4 cpus, unit size 16 bytes, so each >> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. >> Then let us say, 8 cpus with the same unit size, each cpu >> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. >> So if the number of cpus doubles, the number of memory consumption >> will be 4 times. So for a system with large number of cpus, the >> memory consumption goes up quickly with quadratic order. >> For example, for 4KB percpu allocation, 128 cpus. The total memory >> consumption will 4KB * 128 * 128 = 64MB. Things will become >> worse if the number of cpus is bigger (e.g., 512, 1024, etc.) >> >> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is >> done in boot time, so for system with large number of cpus, the initial >> percpu memory consumption is very visible. For example, for 128 cpu >> system, the total percpu memory allocation will be at least >> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) >> * 128 * 128 = ~138MB. >> which is pretty big. It will be even bigger for larger number of cpus. > SNIP >> + >> static void drain_mem_cache(struct bpf_mem_cache *c) >> { >> bool percpu = !!c->percpu_size; >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index f13008d27f35..08f9a49cc11c 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, >> if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) >> return -ENOMEM; >> >> - if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >> - if (!bpf_global_percpu_ma_set) { >> - mutex_lock(&bpf_percpu_ma_lock); >> - if (!bpf_global_percpu_ma_set) { >> - err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); >> - if (!err) >> - bpf_global_percpu_ma_set = true; >> - } >> - mutex_unlock(&bpf_percpu_ma_lock); >> - if (err) >> - return err; >> - } >> - } >> - >> if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) { >> verbose(env, "local type ID argument must be in range [0, U32_MAX]\n"); >> return -EINVAL; >> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, >> return -EINVAL; >> } >> >> + if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >> + if (!bpf_global_percpu_ma_set) { >> + mutex_lock(&bpf_percpu_ma_lock); >> + if (!bpf_global_percpu_ma_set) { >> + err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); > Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I > think the memory account will be incorrect, right ? Maybe we should pass > objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think > the objcg is NULL. You are correct. Calling bpf_mem_alloc_percpu_init() in init stage is exactly the reason to have proper root memcg for objcg. Sorry I missed it. I remembered I indeed traced it a few days ago and indeed it is NULL. There are three ways to resolve this: 1 Just do 'ma->objcg = NULL' unconditionally in bpf_mem_alloc_percpu_init(). 2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at init stage, e.g., in bpf_global_ma_init() init function (core.c), and later it can be used in bpf_mem_alloc_percpu_init(). 3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize ma->objcg properly. But delay __alloc_percpu_gfp() later when verifier found a call to bpf_percpu_obj_new(). We could add a call bpf_mem_alloc_percpu_init_caches() to do __alloc_percpu_grp(). I prefer option 3, what do you think? >> + if (!err) >> + bpf_global_percpu_ma_set = true; >> + } >> + mutex_unlock(&bpf_percpu_ma_lock); >> + if (err) >> + return err; >> + } >> + >> + mutex_lock(&bpf_percpu_ma_lock); >> + err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); >> + mutex_unlock(&bpf_percpu_ma_lock); >> + if (err) >> + return err; >> + } >> + >> struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id); >> if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >> if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) { >
On 12/20/23 11:16 PM, Yonghong Song wrote: > > On 12/20/23 10:26 PM, Hou Tao wrote: >> Hi, >> >> On 12/21/2023 1:00 PM, Yonghong Song wrote: >>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem >>> allocation") >>> added support for non-fix-size percpu memory allocation. >>> Such allocation will allocate percpu memory for all buckets on all >>> cpus and the memory consumption is in the order to quadratic. >>> For example, let us say, 4 cpus, unit size 16 bytes, so each >>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 >>> bytes. >>> Then let us say, 8 cpus with the same unit size, each cpu >>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 >>> bytes. >>> So if the number of cpus doubles, the number of memory consumption >>> will be 4 times. So for a system with large number of cpus, the >>> memory consumption goes up quickly with quadratic order. >>> For example, for 4KB percpu allocation, 128 cpus. The total memory >>> consumption will 4KB * 128 * 128 = 64MB. Things will become >>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.) >>> >>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is >>> done in boot time, so for system with large number of cpus, the initial >>> percpu memory consumption is very visible. For example, for 128 cpu >>> system, the total percpu memory allocation will be at least >>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) >>> * 128 * 128 = ~138MB. >>> which is pretty big. It will be even bigger for larger number of cpus. >> SNIP >>> + >>> static void drain_mem_cache(struct bpf_mem_cache *c) >>> { >>> bool percpu = !!c->percpu_size; >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >>> index f13008d27f35..08f9a49cc11c 100644 >>> --- a/kernel/bpf/verifier.c >>> +++ b/kernel/bpf/verifier.c >>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct >>> bpf_verifier_env *env, struct bpf_insn *insn, >>> if (meta.func_id == >>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) >>> return -ENOMEM; >>> - if (meta.func_id == >>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>> - if (!bpf_global_percpu_ma_set) { >>> - mutex_lock(&bpf_percpu_ma_lock); >>> - if (!bpf_global_percpu_ma_set) { >>> - err = >>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); >>> - if (!err) >>> - bpf_global_percpu_ma_set = true; >>> - } >>> - mutex_unlock(&bpf_percpu_ma_lock); >>> - if (err) >>> - return err; >>> - } >>> - } >>> - >>> if (((u64)(u32)meta.arg_constant.value) != >>> meta.arg_constant.value) { >>> verbose(env, "local type ID argument must be >>> in range [0, U32_MAX]\n"); >>> return -EINVAL; >>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct >>> bpf_verifier_env *env, struct bpf_insn *insn, >>> return -EINVAL; >>> } >>> + if (meta.func_id == >>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>> + if (!bpf_global_percpu_ma_set) { >>> + mutex_lock(&bpf_percpu_ma_lock); >>> + if (!bpf_global_percpu_ma_set) { >>> + err = >>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); >> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I >> think the memory account will be incorrect, right ? Maybe we should pass >> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think >> the objcg is NULL. > > You are correct. Calling bpf_mem_alloc_percpu_init() in init stage > is exactly the reason to have proper root memcg for objcg. Sorry I > missed it. > > I remembered I indeed traced it a few days ago and indeed it is NULL. > There are three ways to resolve this: > 1 Just do 'ma->objcg = NULL' unconditionally in > bpf_mem_alloc_percpu_init(). > 2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at > init stage, > e.g., in bpf_global_ma_init() init function (core.c), and later > it can > be used in bpf_mem_alloc_percpu_init(). > 3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize > ma->objcg > properly. But delay __alloc_percpu_gfp() later when verifier > found a call > to bpf_percpu_obj_new(). We could add a call > bpf_mem_alloc_percpu_init_caches() > to do __alloc_percpu_grp(). > > I prefer option 3, what do you think? The option 4 below: diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 984c83ecace9..f90989cc9cbc 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -122,6 +122,7 @@ struct bpf_mem_caches { }; static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; +static struct obj_cgroup *objcg_at_init __ro_after_init; static struct llist_node notrace *__llist_del_first(struct llist_head *head) { @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma) ma->percpu = true; #ifdef CONFIG_MEMCG_KMEM - ma->objcg = get_obj_cgroup_from_current(); + ma->objcg = objcg_at_init; #else ma->objcg = NULL; #endif @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags) return !ret ? NULL : ret + LLIST_NODE_SZ; } + +static int __init find_objcg_at_init(void) +{ + objcg_at_init = get_obj_cgroup_from_current(); + return 0; +} +late_initcall(find_objcg_at_init); It seems this is better? > >>> + if (!err) >>> + bpf_global_percpu_ma_set = true; >>> + } >>> + mutex_unlock(&bpf_percpu_ma_lock); >>> + if (err) >>> + return err; >>> + } >>> + >>> + mutex_lock(&bpf_percpu_ma_lock); >>> + err = >>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); >>> + mutex_unlock(&bpf_percpu_ma_lock); >>> + if (err) >>> + return err; >>> + } >>> + >>> struct_meta = btf_find_struct_meta(ret_btf, >>> ret_btf_id); >>> if (meta.func_id == >>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>> if (!__btf_type_is_scalar_struct(env, ret_btf, >>> ret_t, 0)) { >> >
Hi, On 12/21/2023 3:52 PM, Yonghong Song wrote: > > On 12/20/23 11:16 PM, Yonghong Song wrote: >> >> On 12/20/23 10:26 PM, Hou Tao wrote: >>> Hi, >>> >>> On 12/21/2023 1:00 PM, Yonghong Song wrote: >>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem >>>> allocation") >>>> added support for non-fix-size percpu memory allocation. >>>> Such allocation will allocate percpu memory for all buckets on all >>>> cpus and the memory consumption is in the order to quadratic. >>>> For example, let us say, 4 cpus, unit size 16 bytes, so each >>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 >>>> bytes. >>>> Then let us say, 8 cpus with the same unit size, each cpu >>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 >>>> bytes. >>>> So if the number of cpus doubles, the number of memory consumption >>>> will be 4 times. So for a system with large number of cpus, the >>>> memory consumption goes up quickly with quadratic order. >>>> For example, for 4KB percpu allocation, 128 cpus. The total memory >>>> consumption will 4KB * 128 * 128 = 64MB. Things will become >>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.) >>>> >>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is >>>> done in boot time, so for system with large number of cpus, the >>>> initial >>>> percpu memory consumption is very visible. For example, for 128 cpu >>>> system, the total percpu memory allocation will be at least >>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) >>>> * 128 * 128 = ~138MB. >>>> which is pretty big. It will be even bigger for larger number of cpus. >>> SNIP >>>> + >>>> static void drain_mem_cache(struct bpf_mem_cache *c) >>>> { >>>> bool percpu = !!c->percpu_size; >>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >>>> index f13008d27f35..08f9a49cc11c 100644 >>>> --- a/kernel/bpf/verifier.c >>>> +++ b/kernel/bpf/verifier.c >>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct >>>> bpf_verifier_env *env, struct bpf_insn *insn, >>>> if (meta.func_id == >>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) >>>> return -ENOMEM; >>>> - if (meta.func_id == >>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>> - if (!bpf_global_percpu_ma_set) { >>>> - mutex_lock(&bpf_percpu_ma_lock); >>>> - if (!bpf_global_percpu_ma_set) { >>>> - err = >>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); >>>> - if (!err) >>>> - bpf_global_percpu_ma_set = true; >>>> - } >>>> - mutex_unlock(&bpf_percpu_ma_lock); >>>> - if (err) >>>> - return err; >>>> - } >>>> - } >>>> - >>>> if (((u64)(u32)meta.arg_constant.value) != >>>> meta.arg_constant.value) { >>>> verbose(env, "local type ID argument must be >>>> in range [0, U32_MAX]\n"); >>>> return -EINVAL; >>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct >>>> bpf_verifier_env *env, struct bpf_insn *insn, >>>> return -EINVAL; >>>> } >>>> + if (meta.func_id == >>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>> + if (!bpf_global_percpu_ma_set) { >>>> + mutex_lock(&bpf_percpu_ma_lock); >>>> + if (!bpf_global_percpu_ma_set) { >>>> + err = >>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); >>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I >>> think the memory account will be incorrect, right ? Maybe we should >>> pass >>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think >>> the objcg is NULL. >> >> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage >> is exactly the reason to have proper root memcg for objcg. Sorry I >> missed it. >> >> I remembered I indeed traced it a few days ago and indeed it is NULL. >> There are three ways to resolve this: >> 1 Just do 'ma->objcg = NULL' unconditionally in >> bpf_mem_alloc_percpu_init(). >> 2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at >> init stage, >> e.g., in bpf_global_ma_init() init function (core.c), and later >> it can >> be used in bpf_mem_alloc_percpu_init(). >> 3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize >> ma->objcg >> properly. But delay __alloc_percpu_gfp() later when verifier >> found a call >> to bpf_percpu_obj_new(). We could add a call >> bpf_mem_alloc_percpu_init_caches() >> to do __alloc_percpu_grp(). >> >> I prefer option 3, what do you think? > > The option 4 below: > > diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c > index 984c83ecace9..f90989cc9cbc 100644 > --- a/kernel/bpf/memalloc.c > +++ b/kernel/bpf/memalloc.c > @@ -122,6 +122,7 @@ struct bpf_mem_caches { > }; > > static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, > 512, 1024, 2048, 4096}; > +static struct obj_cgroup *objcg_at_init __ro_after_init; > > static struct llist_node notrace *__llist_del_first(struct llist_head > *head) > { > @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc > *ma) > ma->percpu = true; > > #ifdef CONFIG_MEMCG_KMEM > - ma->objcg = get_obj_cgroup_from_current(); > + ma->objcg = objcg_at_init; > #else > ma->objcg = NULL; > #endif > @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct > bpf_mem_alloc *ma, gfp_t flags) > > return !ret ? NULL : ret + LLIST_NODE_SZ; > } > + > +static int __init find_objcg_at_init(void) > +{ > + objcg_at_init = get_obj_cgroup_from_current(); > + return 0; > +} > +late_initcall(find_objcg_at_init); > > It seems this is better? It seems that you are worried about the objcg of root memcg may change from NULL to something else one day, right ? If it is the case, I think both option 3 and 4 are fine. But I still think passing the desired objcg to bpf_mem_alloc_percpu_init() directly is better. If the objcg of root memcg is not NULL afterwards, we can update the passed parameter accordingly. > >> >>>> + if (!err) >>>> + bpf_global_percpu_ma_set = true; >>>> + } >>>> + mutex_unlock(&bpf_percpu_ma_lock); >>>> + if (err) >>>> + return err; >>>> + } >>>> + >>>> + mutex_lock(&bpf_percpu_ma_lock); >>>> + err = >>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); >>>> + mutex_unlock(&bpf_percpu_ma_lock); >>>> + if (err) >>>> + return err; >>>> + } >>>> + >>>> struct_meta = btf_find_struct_meta(ret_btf, >>>> ret_btf_id); >>>> if (meta.func_id == >>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>> if (!__btf_type_is_scalar_struct(env, >>>> ret_btf, ret_t, 0)) { >>> >>
On 12/21/23 12:42 AM, Hou Tao wrote: > Hi, > > On 12/21/2023 3:52 PM, Yonghong Song wrote: >> On 12/20/23 11:16 PM, Yonghong Song wrote: >>> On 12/20/23 10:26 PM, Hou Tao wrote: >>>> Hi, >>>> >>>> On 12/21/2023 1:00 PM, Yonghong Song wrote: >>>>> Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem >>>>> allocation") >>>>> added support for non-fix-size percpu memory allocation. >>>>> Such allocation will allocate percpu memory for all buckets on all >>>>> cpus and the memory consumption is in the order to quadratic. >>>>> For example, let us say, 4 cpus, unit size 16 bytes, so each >>>>> cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 >>>>> bytes. >>>>> Then let us say, 8 cpus with the same unit size, each cpu >>>>> has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 >>>>> bytes. >>>>> So if the number of cpus doubles, the number of memory consumption >>>>> will be 4 times. So for a system with large number of cpus, the >>>>> memory consumption goes up quickly with quadratic order. >>>>> For example, for 4KB percpu allocation, 128 cpus. The total memory >>>>> consumption will 4KB * 128 * 128 = 64MB. Things will become >>>>> worse if the number of cpus is bigger (e.g., 512, 1024, etc.) >>>>> >>>>> In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is >>>>> done in boot time, so for system with large number of cpus, the >>>>> initial >>>>> percpu memory consumption is very visible. For example, for 128 cpu >>>>> system, the total percpu memory allocation will be at least >>>>> (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) >>>>> * 128 * 128 = ~138MB. >>>>> which is pretty big. It will be even bigger for larger number of cpus. >>>> SNIP >>>>> + >>>>> static void drain_mem_cache(struct bpf_mem_cache *c) >>>>> { >>>>> bool percpu = !!c->percpu_size; >>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >>>>> index f13008d27f35..08f9a49cc11c 100644 >>>>> --- a/kernel/bpf/verifier.c >>>>> +++ b/kernel/bpf/verifier.c >>>>> @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct >>>>> bpf_verifier_env *env, struct bpf_insn *insn, >>>>> if (meta.func_id == >>>>> special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) >>>>> return -ENOMEM; >>>>> - if (meta.func_id == >>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>>> - if (!bpf_global_percpu_ma_set) { >>>>> - mutex_lock(&bpf_percpu_ma_lock); >>>>> - if (!bpf_global_percpu_ma_set) { >>>>> - err = >>>>> bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); >>>>> - if (!err) >>>>> - bpf_global_percpu_ma_set = true; >>>>> - } >>>>> - mutex_unlock(&bpf_percpu_ma_lock); >>>>> - if (err) >>>>> - return err; >>>>> - } >>>>> - } >>>>> - >>>>> if (((u64)(u32)meta.arg_constant.value) != >>>>> meta.arg_constant.value) { >>>>> verbose(env, "local type ID argument must be >>>>> in range [0, U32_MAX]\n"); >>>>> return -EINVAL; >>>>> @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct >>>>> bpf_verifier_env *env, struct bpf_insn *insn, >>>>> return -EINVAL; >>>>> } >>>>> + if (meta.func_id == >>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>>> + if (!bpf_global_percpu_ma_set) { >>>>> + mutex_lock(&bpf_percpu_ma_lock); >>>>> + if (!bpf_global_percpu_ma_set) { >>>>> + err = >>>>> bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); >>>> Because ma->objcg is assigned as get_obj_cgroup_from_current(), so I >>>> think the memory account will be incorrect, right ? Maybe we should >>>> pass >>>> objcg to bpf_mem_alloc_percpu_init() explicit. For root memcg, I think >>>> the objcg is NULL. >>> You are correct. Calling bpf_mem_alloc_percpu_init() in init stage >>> is exactly the reason to have proper root memcg for objcg. Sorry I >>> missed it. >>> >>> I remembered I indeed traced it a few days ago and indeed it is NULL. >>> There are three ways to resolve this: >>> 1 Just do 'ma->objcg = NULL' unconditionally in >>> bpf_mem_alloc_percpu_init(). >>> 2 Second, we can remember objcg = bpf_mem_alloc_percpu_init() at >>> init stage, >>> e.g., in bpf_global_ma_init() init function (core.c), and later >>> it can >>> be used in bpf_mem_alloc_percpu_init(). >>> 3 Still do bpf_mem_alloc_percpu_init() at init stage to initialize >>> ma->objcg >>> properly. But delay __alloc_percpu_gfp() later when verifier >>> found a call >>> to bpf_percpu_obj_new(). We could add a call >>> bpf_mem_alloc_percpu_init_caches() >>> to do __alloc_percpu_grp(). >>> >>> I prefer option 3, what do you think? >> The option 4 below: >> >> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c >> index 984c83ecace9..f90989cc9cbc 100644 >> --- a/kernel/bpf/memalloc.c >> +++ b/kernel/bpf/memalloc.c >> @@ -122,6 +122,7 @@ struct bpf_mem_caches { >> }; >> >> static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, >> 512, 1024, 2048, 4096}; >> +static struct obj_cgroup *objcg_at_init __ro_after_init; >> >> static struct llist_node notrace *__llist_del_first(struct llist_head >> *head) >> { >> @@ -590,7 +591,7 @@ int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc >> *ma) >> ma->percpu = true; >> >> #ifdef CONFIG_MEMCG_KMEM >> - ma->objcg = get_obj_cgroup_from_current(); >> + ma->objcg = objcg_at_init; >> #else >> ma->objcg = NULL; >> #endif >> @@ -1015,3 +1016,10 @@ void notrace *bpf_mem_cache_alloc_flags(struct >> bpf_mem_alloc *ma, gfp_t flags) >> >> return !ret ? NULL : ret + LLIST_NODE_SZ; >> } >> + >> +static int __init find_objcg_at_init(void) >> +{ >> + objcg_at_init = get_obj_cgroup_from_current(); >> + return 0; >> +} >> +late_initcall(find_objcg_at_init); >> >> It seems this is better? > It seems that you are worried about the objcg of root memcg may change > from NULL to something else one day, right ? If it is the case, I think > both option 3 and 4 are fine. But I still think passing the desired I indeed a little bit worried that objcg for root memcg may change to a non-NULL value. Another option is to add an inline static function like below in memcontrol.h file something like static inline struct obj_cgroup *get_obj_cgroup_from_root_memcg() { return NULL; } But it seems silly that we have such helper function as it is obvious from current implementation. > objcg to bpf_mem_alloc_percpu_init() directly is better. If the objcg of > root memcg is not NULL afterwards, we can update the passed parameter > accordingly. So I guess what you suggested also make sense. I can add a comment to the call site of bpf_mem_alloc_percpu_init() to explain why 'NULL' objcg. If in really unlikely case objcg for root memcg changed, we can fix it accordingly. Will send v6 to fix this later today. >>>>> + if (!err) >>>>> + bpf_global_percpu_ma_set = true; >>>>> + } >>>>> + mutex_unlock(&bpf_percpu_ma_lock); >>>>> + if (err) >>>>> + return err; >>>>> + } >>>>> + >>>>> + mutex_lock(&bpf_percpu_ma_lock); >>>>> + err = >>>>> bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); >>>>> + mutex_unlock(&bpf_percpu_ma_lock); >>>>> + if (err) >>>>> + return err; >>>>> + } >>>>> + >>>>> struct_meta = btf_find_struct_meta(ret_btf, >>>>> ret_btf_id); >>>>> if (meta.func_id == >>>>> special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { >>>>> if (!__btf_type_is_scalar_struct(env, >>>>> ret_btf, ret_t, 0)) {
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h index acef8c808599..d1403204379e 100644 --- a/include/linux/bpf_mem_alloc.h +++ b/include/linux/bpf_mem_alloc.h @@ -22,8 +22,15 @@ struct bpf_mem_alloc { * 'size = 0' is for bpf_mem_alloc which manages many fixed-size objects. * Alloc and free are done with bpf_mem_{alloc,free}() and the size of * the returned object is given by the size argument of bpf_mem_alloc(). + * If percpu equals true, error will be returned in order to avoid + * large memory consumption and the below bpf_mem_alloc_percpu_unit_init() + * should be used to do on-demand per-cpu allocation for each size. */ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu); +/* Initialize a non-fix-size percpu memory allocator */ +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma); +/* The percpu allocation with a specific unit size. */ +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size); void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma); /* kmalloc/kfree equivalent: */ diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c index 4a21050f0359..a9c87ef4b89a 100644 --- a/kernel/bpf/memalloc.c +++ b/kernel/bpf/memalloc.c @@ -121,6 +121,8 @@ struct bpf_mem_caches { struct bpf_mem_cache cache[NUM_CACHES]; }; +static const u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; + static struct llist_node notrace *__llist_del_first(struct llist_head *head) { struct llist_node *entry, *next; @@ -499,12 +501,14 @@ static void prefill_mem_cache(struct bpf_mem_cache *c, int cpu) */ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) { - static u16 sizes[NUM_CACHES] = {96, 192, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096}; struct bpf_mem_caches *cc, __percpu *pcc; struct bpf_mem_cache *c, __percpu *pc; struct obj_cgroup *objcg = NULL; int cpu, i, unit_size, percpu_size = 0; + if (percpu && size == 0) + return -EINVAL; + /* room for llist_node and per-cpu pointer */ if (percpu) percpu_size = LLIST_NODE_SZ + sizeof(void *); @@ -524,6 +528,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) objcg = get_obj_cgroup_from_current(); #endif ma->objcg = objcg; + for_each_possible_cpu(cpu) { c = per_cpu_ptr(pc, cpu); c->unit_size = unit_size; @@ -562,6 +567,61 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu) return 0; } +int bpf_mem_alloc_percpu_init(struct bpf_mem_alloc *ma) +{ + struct bpf_mem_caches __percpu *pcc; + + pcc = __alloc_percpu_gfp(sizeof(struct bpf_mem_caches), 8, GFP_KERNEL); + if (!pcc) + return -ENOMEM; + + ma->caches = pcc; + ma->percpu = true; + +#ifdef CONFIG_MEMCG_KMEM + ma->objcg = get_obj_cgroup_from_current(); +#else + ma->objcg = NULL; +#endif + return 0; +} + +int bpf_mem_alloc_percpu_unit_init(struct bpf_mem_alloc *ma, int size) +{ + struct bpf_mem_caches *cc, __percpu *pcc; + int cpu, i, unit_size, percpu_size; + struct obj_cgroup *objcg; + struct bpf_mem_cache *c; + + i = bpf_mem_cache_idx(size); + if (i < 0) + return -EINVAL; + + /* room for llist_node and per-cpu pointer */ + percpu_size = LLIST_NODE_SZ + sizeof(void *); + + unit_size = sizes[i]; + objcg = ma->objcg; + pcc = ma->caches; + + for_each_possible_cpu(cpu) { + cc = per_cpu_ptr(pcc, cpu); + c = &cc->cache[i]; + if (cpu == 0 && c->unit_size) + break; + + c->unit_size = unit_size; + c->objcg = objcg; + c->percpu_size = percpu_size; + c->tgt = c; + + init_refill_work(c); + prefill_mem_cache(c, cpu); + } + + return 0; +} + static void drain_mem_cache(struct bpf_mem_cache *c) { bool percpu = !!c->percpu_size; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f13008d27f35..08f9a49cc11c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -12141,20 +12141,6 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (meta.func_id == special_kfunc_list[KF_bpf_obj_new_impl] && !bpf_global_ma_set) return -ENOMEM; - if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { - if (!bpf_global_percpu_ma_set) { - mutex_lock(&bpf_percpu_ma_lock); - if (!bpf_global_percpu_ma_set) { - err = bpf_mem_alloc_init(&bpf_global_percpu_ma, 0, true); - if (!err) - bpf_global_percpu_ma_set = true; - } - mutex_unlock(&bpf_percpu_ma_lock); - if (err) - return err; - } - } - if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) { verbose(env, "local type ID argument must be in range [0, U32_MAX]\n"); return -EINVAL; @@ -12175,6 +12161,26 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, return -EINVAL; } + if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { + if (!bpf_global_percpu_ma_set) { + mutex_lock(&bpf_percpu_ma_lock); + if (!bpf_global_percpu_ma_set) { + err = bpf_mem_alloc_percpu_init(&bpf_global_percpu_ma); + if (!err) + bpf_global_percpu_ma_set = true; + } + mutex_unlock(&bpf_percpu_ma_lock); + if (err) + return err; + } + + mutex_lock(&bpf_percpu_ma_lock); + err = bpf_mem_alloc_percpu_unit_init(&bpf_global_percpu_ma, ret_t->size); + mutex_unlock(&bpf_percpu_ma_lock); + if (err) + return err; + } + struct_meta = btf_find_struct_meta(ret_btf, ret_btf_id); if (meta.func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) { if (!__btf_type_is_scalar_struct(env, ret_btf, ret_t, 0)) {
Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") added support for non-fix-size percpu memory allocation. Such allocation will allocate percpu memory for all buckets on all cpus and the memory consumption is in the order to quadratic. For example, let us say, 4 cpus, unit size 16 bytes, so each cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. Then let us say, 8 cpus with the same unit size, each cpu has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. So if the number of cpus doubles, the number of memory consumption will be 4 times. So for a system with large number of cpus, the memory consumption goes up quickly with quadratic order. For example, for 4KB percpu allocation, 128 cpus. The total memory consumption will 4KB * 128 * 128 = 64MB. Things will become worse if the number of cpus is bigger (e.g., 512, 1024, etc.) In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is done in boot time, so for system with large number of cpus, the initial percpu memory consumption is very visible. For example, for 128 cpu system, the total percpu memory allocation will be at least (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) * 128 * 128 = ~138MB. which is pretty big. It will be even bigger for larger number of cpus. Note that the current prefill also allocates 4 entries if the unit size is less than 256. So on top of 138MB memory consumption, this will add more consumption with 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. Next patch will try to reduce this memory consumption. Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init stage") moved the non-fix-size percpu memory allocation to bpf verificaiton stage. Once a particular bpf_percpu_obj_new() is called by bpf program, the memory allocator will try to fill in the cache with all sizes, causing the same amount of percpu memory consumption as in the boot stage. To reduce the initial percpu memory consumption for non-fix-size percpu memory allocation, instead of filling the cache with all supported allocation sizes, this patch intends to fill the cache only for the requested size. As typically users will not use large percpu data structure, this can save memory significantly. For example, the allocation size is 64 bytes with 128 cpus. Then total percpu memory amount will be 64 * 128 * 128 = 1MB, much less than previous 138MB. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> --- include/linux/bpf_mem_alloc.h | 7 ++++ kernel/bpf/memalloc.c | 62 ++++++++++++++++++++++++++++++++++- kernel/bpf/verifier.c | 34 +++++++++++-------- 3 files changed, 88 insertions(+), 15 deletions(-)