Message ID: 20200207081810.3918919-1-kafai@fb.com (mailing list archive)
State: Not Applicable, archived
Series: [bpf] bpf: Improve bucket_log calculation logic
On Fri, Feb 07, 2020 at 12:18:10AM -0800, Martin KaFai Lau wrote:
>
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index 458be6b3eda9..3ab23f698221 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -643,9 +643,10 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
>  		return ERR_PTR(-ENOMEM);
>  	bpf_map_init_from_attr(&smap->map, attr);
>
> +	nbuckets = roundup_pow_of_two(num_possible_cpus());
>  	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
> -	smap->bucket_log = max_t(u32, 1, ilog2(roundup_pow_of_two(num_possible_cpus())));
> -	nbuckets = 1U << smap->bucket_log;
> +	nbuckets = max_t(u32, 2, nbuckets);
> +	smap->bucket_log = ilog2(nbuckets);
>  	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
>
>  	ret = bpf_map_charge_init(&smap->map.memory, cost);

Yes, that's much nicer to read. Feel free to add my
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

-- Luc
On Fri, Feb 7, 2020 at 12:18 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> exponential effects on the number of states in the sparse checker.

Patch looks good, but I'd like to point out that it's not just sparse.

You can see it with a simple

    make net/core/bpf_sk_storage.i
    grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc

and see the end result:

          1  365071 2686974

That's one line (the assignment line) that is 2,686,974 characters in length.

Now, sparse does happen to react particularly badly to that (I didn't look into why, but I suspect it's just that evaluating all the types that don't actually ever end up getting used ends up being much more expensive than it should be), but I bet it's not good for gcc either.

I do think this is a good test-case for sparse. Luc, have you looked at what it is that then makes sparse use *so* much memory for this one line?

Linus
On Fri, Feb 7, 2020 at 10:07 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I do think this is a good test-case for sparse. Luc, have you looked
> at what it is that then makes sparse use *so* much memory for this one
> line?

Looking at the profile, it's doing a lot of "copy_expression()".

Which comes from inlining.

I think the problem may be that with that macro expansion from hell we end up with 28968 copies of cpumask_weight(), and sparse will inline every one of them into the parse tree - even though basically none of them are _used_.

In fact, it's worse than that: we end up having a few rounds of inlining thanks to

    static inline unsigned int cpumask_weight(const struct cpumask *srcp)
    {
            return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
    }

    static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)
    {
            if (small_const_nbits(nbits))
                    return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits));
            return __bitmap_weight(src, nbits);
    }

    static __always_inline unsigned long hweight_long(unsigned long w)
    {
            return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
    }

where those hweight*() things aren't simple either, they end up doing

    #define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
    #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))

where the __const_hweight*() in turn are more expansions of a macro with several levels in order to turn it all into a constant value.

So we may have "only" 28968 calls to cpumask_weight(), but it results in millions of expressions being expanded.

If we did some basic simplification of constant ops before inlining, that would likely help a lot.

But currently sparse does inline function expansion at type evaluation time - so long before it does any simplification of the tree at all.

So that explains why sparse happens to react _so_ badly to this thing. A real compiler would do inlining much later.

Inlining that early is partly because originally one of the design ideas in sparse was to make inline functions act basically as templates, so they'd react to the types of the context. But it really bites us in the ass here.

Luc, any ideas? Yes, this is solvable in the kernel, but it does show that sparse simply does a _lot_ of unnecessary work.

Linus
On Fri, Feb 07, 2020 at 10:07:32AM -0800, Linus Torvalds wrote:
> On Fri, Feb 7, 2020 at 12:18 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> > exponential effects on the number of states in the sparse checker.
>
> Patch looks good, but I'd like to point out that it's not just sparse.
>
> You can see it with a simple
>
>     make net/core/bpf_sk_storage.i
>     grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc
>
> and see the end result:
>
>           1  365071 2686974
>
> That's one line (the assignment line) that is 2,686,974 characters in length.

In addition to this patch I've tried:

    diff --git a/include/linux/log2.h b/include/linux/log2.h
    index 83a4a3ca3e8a..7363abf60854 100644
    --- a/include/linux/log2.h
    +++ b/include/linux/log2.h
    @@ -74,74 +74,76 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
      * Use this where sparse expects a true constant expression, e.g. for array
      * indices.
      */
    -#define const_ilog2(n) \
    -( \
    -	__builtin_constant_p(n) ? ( \
    -		(n) < 2 ? 0 : \
    -		(n) & (1ULL << 63) ? 63 : \
    -		(n) & (1ULL << 62) ? 62 : \
    ...
    +#define __const_ilog2(n, unique_n) ({ \
    +	typeof(n) unique_n = (n); \
    +	__builtin_constant_p(unique_n) ? ( \
    +	(unique_n) < 2 ? 0 : \
    +	(unique_n) & (1ULL << 63) ? 63 : \
    +	(unique_n) & (1ULL << 62) ? 62 : \
    +	(unique_n) & (1ULL << 61) ? 61 : \
    +	(unique_n) & (1ULL << 60) ? 60 : \
    +	(unique_n) & (1ULL << 59) ? 59 : \
    ...
    +	(unique_n) & (1ULL << 3) ? 3 : \
    +	(unique_n) & (1ULL << 2) ? 2 : \
     	1) : \
    -	-1)
    +	-1; })
    +
    +#define const_ilog2(n) __const_ilog2(n, __UNIQUE_ID(__n))

and for this nested ilog2() case that caused this explosion the line got shorter: from 2.6M characters to 144k. Still a lot.

Unfortunately this approach doesn't work in all cases:

    ../include/linux/log2.h:77:36: error: braced-group within expression allowed only inside a function
       77 | #define __const_ilog2(n, unique_n) ({ \
          |                                    ^
    ../include/linux/log2.h:146:24: note: in expansion of macro ‘__const_ilog2’
      146 | #define const_ilog2(n) __const_ilog2(n, __UNIQUE_ID(__n))
          |                        ^~~~~~~~~~~~~
    ../include/linux/log2.h:161:2: note: in expansion of macro ‘const_ilog2’
      161 |  const_ilog2(n) : \
          |  ^~~~~~~~~~~
    ../include/linux/blockgroup_lock.h:14:27: note: in expansion of macro ‘ilog2’
       14 | #define NR_BG_LOCKS (4 << ilog2(NR_CPUS < 32 ? NR_CPUS : 32))
          |                           ^~~~~
    ../include/linux/blockgroup_lock.h:24:24: note: in expansion of macro ‘NR_BG_LOCKS’
       24 |  struct bgl_lock locks[NR_BG_LOCKS];

Just FYI for folks who're looking at ilog2 and wondering why it was done this way without ({ }).
On Fri, Feb 07, 2020 at 11:39:24AM -0800, Linus Torvalds wrote:
> On Fri, Feb 7, 2020 at 10:07 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I do think this is a good test-case for sparse. Luc, have you looked
> > at what it is that then makes sparse use *so* much memory for this one
> > line?
>
> Looking at the profile, it's doing a lot of "copy_expression()".
>
> Which comes from inlining.
>
> I think the problem may be that with that macro expansion from hell we
> end up with 28968 copies of cpumask_weight(), and sparse will inline
> every one of them into the parse tree - even though basically none of
> them are _used_.

Yes, indeed. That's just what I saw too.

> In fact, it's worse than that: we end up having a few rounds of
> inlining thanks to

<snip>

> So we may have "only" 28968 calls to cpumask_weight(), but it results
> in millions of expressions being expanded.

Yes, roughly 1500 expressions per call (:

> If we did some basic simplification of constant ops before inlining,
> that would likely help a lot.
>
> But currently sparse does inline function expansion at type evaluation
> time - so long before it does any simplification of the tree at all.
>
> So that explains why sparse happens to react _so_ badly to this thing.
> A real compiler would do inlining much later.
>
> Inlining that early is partly because originally one of the design
> ideas in sparse was to make inline functions act basically as
> templates, so they'd react to the types of the context. But it really
> bites us in the ass here.
>
> Luc, any ideas? Yes, this is solvable in the kernel, but it does show
> that sparse simply does a _lot_ of unnecessary work.

I had never seen it this bad, but it's not the first time I've been bitten by the very early inlining. Independently of this, it would be handy to have an inliner at the IR level; it shouldn't be very difficult but ...

OTOH, it should be really straightforward to separate the current tree inliner from the type evaluation and instead run it just after expansion. The downsides would be:
* the tree would need to be walked once more;
* this may make the expansion less useful, but it could be run again after the inlining.

[ If we would like to keep inline-as-template, it would just need to be able to detect such inlines at type evaluation and only inline those. ]

I'll look more closely at all of it during the WE.

-- Luc
On Fri, Feb 7, 2020 at 12:13 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> In addition to this patch I've tried:
>
> +#define __const_ilog2(n, unique_n) ({ \
> +	typeof(n) unique_n = (n); \

Yeah, as you found out, this doesn't work for the case of having global initializers or things like array sizes.

Which people do do - often not directly, but through various size macros.

It's annoying, but one of the failures of C is not having a nice form of compile-time constant handling where you can do slightly smarter arithmetic. The definition of a "constant expression" is very, very limited, and hurts us exactly in the array declaration and constant initializer cases.

Oh well.

Linus
On Fri, Feb 7, 2020 at 12:41 PM Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:
>
> I had never seen it this bad, but it's not the first time I've been
> bitten by the very early inlining. Independently of this, it would be
> handy to have an inliner at the IR level; it shouldn't be very
> difficult but ...
> OTOH, it should be really straightforward to separate the
> current tree inliner from the type evaluation and instead run it just
> after expansion. The downsides would be:
> * the tree would need to be walked once more;

Actually, if we were to do the inlining _during_ expansion, we wouldn't add a new phase.

> * this may make the expansion less useful, but it could be run again
>   after the inlining.

Same comment: how about doing it as part of the expansion phase?

This is where we handle the built-ins too, it would kind of make sense to do inlining in expand_symbol_call(), I feel. An inline function is a "builtin" that the user has defined, after all.

And if we do it in that phase, we'd automatically avoid it for conditional expressions with a static conditional value, because expansion does the obvious trivial simplification as it goes along, and never expands the side that is trivially not taken.

Something like the attached completely broken patch. It builds but doesn't work, because "inline_function()" is currently designed to work during evaluation, not during expansion.

So the patch is complete garbage, but maybe could be the starting point for something that works.

Linus
On 2/7/20 9:18 AM, Martin KaFai Lau wrote:
> It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> exponential effects on the number of states in the sparse checker.
>
> This patch breaks them up by calculating the "nbuckets" first so
> that the "bucket_log" only needs to take ilog2().
>
> Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied (& improved changelog to clarify it's not just sparse), thanks!
On Sat, Feb 8, 2020 at 3:55 PM Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:
>
> > Something like the attached completely broken patch. It builds but
> > doesn't work, because "inline_function()" is currently designed to
> > work during evaluation, not during expansion.
> >
> > So the patch is complete garbage, but maybe could be the starting
> > point for something that works.
>
> It was just missing the part with current_fn (needed for the evaluation)
> and some adjustment to avoid recursive expansions.

Oh, good. I looked at that current_fn thing and decided it shouldn't possibly matter, so my hacked-up patch just dropped it. Blush.

And yeah, the recursion avoidance got broken because it only triggered during evaluation and that wasn't where it was recursing any more.

Anyway, your fixed patch looks good, and the numbers look lovely. I don't see why there would sometimes be extra memory use, but the patch feels like the right thing to do regardless.

Thanks,
Linus
On Sat, Feb 08, 2020 at 04:58:51PM -0800, Linus Torvalds wrote:
>
> Anyway, your fixed patch looks good, and the numbers look lovely. I
> don't see why there would sometimes be extra memory use, but the patch
> feels like the right thing to do regardless.

Yes, I'm quite happy with it. Thank you for the suggestion.

For the cases with extra memory consumption, I've investigated the most extreme one and it's quite interesting. The extra memory was used for basic blocks, instructions and pseudos, so more linearized code. I reduced it to:

    static inline int get_order(long size)
    {
        return __builtin_constant_p(size) ? 0 : 1;
    }

    int foo(void)
    {
        return get_order(0);
    }

Sparse used to not recognize the size as a constant (I don't understand why, but haven't investigated). Strangely, the builtin without the conditional gave the expected result.

Now, with the patch doing the inlining during expansion, the size is correctly recognized as a constant, with or without the conditional. The extra linearized code comes from some complex expression that is now selected instead of a function call (while reducing, I had the vague impression that the expression should have expanded further, but I haven't checked that yet).

-- Luc
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 458be6b3eda9..3ab23f698221 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -643,9 +643,10 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 	bpf_map_init_from_attr(&smap->map, attr);
 
+	nbuckets = roundup_pow_of_two(num_possible_cpus());
 	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
-	smap->bucket_log = max_t(u32, 1, ilog2(roundup_pow_of_two(num_possible_cpus())));
-	nbuckets = 1U << smap->bucket_log;
+	nbuckets = max_t(u32, 2, nbuckets);
+	smap->bucket_log = ilog2(nbuckets);
 	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
 
 	ret = bpf_map_charge_init(&smap->map.memory, cost);
It was reported that the max_t, ilog2, and roundup_pow_of_two macros have exponential effects on the number of states in the sparse checker.

This patch breaks them up by calculating the "nbuckets" first so that the "bucket_log" only needs to take ilog2().

Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/bpf_sk_storage.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)