Message ID: 20200207081810.3918919-1-kafai@fb.com (mailing list archive)
State: Not Applicable, archived
Series: [bpf] bpf: Improve bucket_log calculation logic
On Fri, Feb 07, 2020 at 12:18:10AM -0800, Martin KaFai Lau wrote:
>
> diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
> index 458be6b3eda9..3ab23f698221 100644
> --- a/net/core/bpf_sk_storage.c
> +++ b/net/core/bpf_sk_storage.c
> @@ -643,9 +643,10 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
>  		return ERR_PTR(-ENOMEM);
>  	bpf_map_init_from_attr(&smap->map, attr);
>
> +	nbuckets = roundup_pow_of_two(num_possible_cpus());
>  	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
> -	smap->bucket_log = max_t(u32, 1, ilog2(roundup_pow_of_two(num_possible_cpus())));
> -	nbuckets = 1U << smap->bucket_log;
> +	nbuckets = max_t(u32, 2, nbuckets);
> +	smap->bucket_log = ilog2(nbuckets);
>  	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
>
>  	ret = bpf_map_charge_init(&smap->map.memory, cost);

Yes, that's much nicer to read. Feel free to add my
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

-- Luc
On Fri, Feb 7, 2020 at 12:18 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> exponential effects on the number of states in the sparse checker.

Patch looks good, but I'd like to point out that it's not just sparse.

You can see it with a simple

    make net/core/bpf_sk_storage.i
    grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc

and see the end result:

          1  365071 2686974

That's one line (the assignment line) that is 2,686,974 characters in length.

Now, sparse does happen to react particularly badly to that (I didn't look into why, but I suspect it's just that evaluating all the types that don't actually ever end up getting used ends up being much more expensive than it should be), but I bet it's not good for gcc either.

I do think this is a good test-case for sparse. Luc, have you looked at what it is that then makes sparse use *so* much memory for this one line?

Linus
On Fri, Feb 7, 2020 at 10:07 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I do think this is a good test-case for sparse. Luc, have you looked
> at what it is that then makes sparse use *so* much memory for this one
> line?

Looking at the profile, it's doing a lot of "copy_expression()".

Which comes from inlining.

I think the problem may be that with that macro expansion from hell we end up with 28968 copies of cpumask_weight(), and sparse will inline every one of them into the parse tree - even though basically none of them are _used_.

In fact, it's worse than that: we end up having a few rounds of inlining thanks to

    static inline unsigned int cpumask_weight(const struct cpumask *srcp)
    {
            return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
    }

    static __always_inline int bitmap_weight(const unsigned long *src, unsigned int nbits)
    {
            if (small_const_nbits(nbits))
                    return hweight_long(*src & BITMAP_LAST_WORD_MASK(nbits));
            return __bitmap_weight(src, nbits);
    }

    static __always_inline unsigned long hweight_long(unsigned long w)
    {
            return sizeof(w) == 4 ? hweight32(w) : hweight64(w);
    }

where those hweight*() things aren't simple either, they end up doing

    #define hweight32(w) (__builtin_constant_p(w) ? __const_hweight32(w) : __arch_hweight32(w))
    #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))

where the __const_hweight*() in turn are more expansions of a macro with several levels in order to turn it all into a constant value.

So we may have "only" 28968 calls to cpumask_weight(), but it results in millions of expressions being expanded.

If we did some basic simplification of constant ops before inlining, that would likely help a lot.

But currently sparse does inline function expansion at type evaluation time - so long before it does any simplification of the tree at all.

So that explains why sparse happens to react _so_ badly to this thing. A real compiler would do inlining much later.

Inlining that early is partly because originally one of the design ideas in sparse was to make inline functions act basically as templates, so they'd react to the types of the context. But it really bites us in the ass here.

Luc, any ideas? Yes, this is solvable in the kernel, but it does show that sparse simply does a _lot_ of unnecessary work.

Linus
On Fri, Feb 07, 2020 at 10:07:32AM -0800, Linus Torvalds wrote:
> On Fri, Feb 7, 2020 at 12:18 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> > exponential effects on the number of states in the sparse checker.
>
> Patch looks good, but I'd like to point out that it's not just sparse.
>
> You can see it with a simple
>
>     make net/core/bpf_sk_storage.i
>     grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc
>
> and see the end result:
>
>           1  365071 2686974
>
> That's one line (the assignment line) that is 2,686,974 characters in length.

In addition to this patch I've tried:

    diff --git a/include/linux/log2.h b/include/linux/log2.h
    index 83a4a3ca3e8a..7363abf60854 100644
    --- a/include/linux/log2.h
    +++ b/include/linux/log2.h
    @@ -74,74 +74,76 @@ unsigned long __rounddown_pow_of_two(unsigned long n)
      * Use this where sparse expects a true constant expression, e.g. for array
      * indices.
      */
    -#define const_ilog2(n) \
    -( \
    -	__builtin_constant_p(n) ? ( \
    -		(n) < 2 ? 0 : \
    -		(n) & (1ULL << 63) ? 63 : \
    -		(n) & (1ULL << 62) ? 62 : \
    ...
    +#define __const_ilog2(n, unique_n) ({ \
    +	typeof(n) unique_n = (n); \
    +	__builtin_constant_p(unique_n) ? ( \
    +	(unique_n) < 2 ? 0 : \
    +	(unique_n) & (1ULL << 63) ? 63 : \
    +	(unique_n) & (1ULL << 62) ? 62 : \
    +	(unique_n) & (1ULL << 61) ? 61 : \
    +	(unique_n) & (1ULL << 60) ? 60 : \
    +	(unique_n) & (1ULL << 59) ? 59 : \
    ...
    +	(unique_n) & (1ULL << 3) ? 3 : \
    +	(unique_n) & (1ULL << 2) ? 2 : \
     	1) : \
    -	-1)
    +	-1; })
    +
    +#define const_ilog2(n) __const_ilog2(n, __UNIQUE_ID(__n))

and for this nested ilog2() case that caused this explosion the line got shorter: from 2.6M characters to 144k. Still a lot.

Unfortunately this approach doesn't work in all cases:

    ../include/linux/log2.h:77:36: error: braced-group within expression allowed only inside a function
       77 | #define __const_ilog2(n, unique_n) ({ \
          |                                    ^
    ../include/linux/log2.h:146:24: note: in expansion of macro ‘__const_ilog2’
      146 | #define const_ilog2(n) __const_ilog2(n, __UNIQUE_ID(__n))
          |                        ^~~~~~~~~~~~~
    ../include/linux/log2.h:161:2: note: in expansion of macro ‘const_ilog2’
      161 |  const_ilog2(n) : \
          |  ^~~~~~~~~~~
    ../include/linux/blockgroup_lock.h:14:27: note: in expansion of macro ‘ilog2’
       14 | #define NR_BG_LOCKS (4 << ilog2(NR_CPUS < 32 ? NR_CPUS : 32))
          |                           ^~~~~
    ../include/linux/blockgroup_lock.h:24:24: note: in expansion of macro ‘NR_BG_LOCKS’
       24 |  struct bgl_lock locks[NR_BG_LOCKS];

Just FYI for folks who're looking at ilog2 and wondering why it was done this way without ({ }).
On Fri, Feb 07, 2020 at 11:39:24AM -0800, Linus Torvalds wrote:
> On Fri, Feb 7, 2020 at 10:07 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I do think this is a good test-case for sparse. Luc, have you looked
> > at what it is that then makes sparse use *so* much memory for this one
> > line?
>
> Looking at the profile, it's doing a lot of "copy_expression()".
>
> Which comes from inlining.
>
> I think the problem may be that with that macro expansion from hell we
> end up with 28968 copies of cpumask_weight(), and sparse will inline
> every one of them into the parse tree - even though basically none of
> them are _used_.

Yes, indeed. That's just what I saw too.

> In fact, it's worse than that: we end up having a few rounds of
> inlining thanks to

<snip>

> So we may have "only" 28968 calls to cpumask_weight(), but it results
> in millions of expressions being expanded.

Yes, roughly 1500 expressions per call (:

> If we did some basic simplification of constant ops before inlining,
> that would likely help a lot.
>
> But currently sparse does inline function expansion at type evaluation
> time - so long before it does any simplification of the tree at all.
>
> So that explains why sparse happens to react _so_ badly to this thing.
> A real compiler would do inlining much later.
>
> Inlining that early is partly because originally one of the design
> ideas in sparse was to make inline functions act basically as
> templates, so they'd react to the types of the context. But it really
> bites us in the ass here.
>
> Luc, any ideas? Yes, this is solvable in the kernel, but it does show
> that sparse simply does a _lot_ of unnecessary work.

I had never seen it this bad, but it's not the first time I've been bitten by the very early inlining. Independently of this, it would be handy to have an inliner at the IR level; it shouldn't be very difficult but ...

OTOH, it should be really straightforward to separate the current tree inliner from the type evaluation and instead run it just after expansion. The downsides would be:
* the tree would need to be walked once more;
* this may make the expansion less useful, but it could be run again after the inlining.

[ If we would like to keep inline-as-template, it would just need to be able to detect such inlines at type evaluation and only inline those. ]

I'll look more closely at all of it during the WE.

-- Luc
On Fri, Feb 7, 2020 at 12:13 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> In addition to this patch I've tried:
>
> +#define __const_ilog2(n, unique_n) ({ \
> +	typeof(n) unique_n = (n); \

Yeah, as you found out, this doesn't work for the case of having global initializers or things like array sizes.

Which people do do - often not directly, but through various size macros.

It's annoying, but one of the failures of C is not having a nice form of compile-time constant handling where you can do slightly smarter arithmetic. The definition of a "constant expression" is very, very limited, and hurts us exactly in the array declaration and constant initializer cases.

Oh well.

Linus
On Fri, Feb 7, 2020 at 12:41 PM Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:
>
> I had never seen it this bad, but it's not the first time I've been
> bitten by the very early inlining. Independently of this, it would be
> handy to have an inliner at the IR level; it shouldn't be very
> difficult but ...
> OTOH, it should be really straightforward to separate the
> current tree inliner from the type evaluation and instead run it just
> after expansion. The downsides would be:
> * the tree would need to be walked once more;

Actually, if we were to do the inlining _during_ expansion, we wouldn't add a new phase.

> * this may make the expansion less useful, but it could be run again
>   after the inlining.

Same comment: how about doing it as part of the expansion phase?

This is where we handle the built-ins too, it would kind of make sense to do inlining in expand_symbol_call(), I feel. An inline function is a "builtin" that the user has defined, after all.

And if we do it in that phase, we'd automatically avoid it for conditional expressions with a static conditional value, because expansion does the obvious trivial simplification as it goes along, and never expands the side that is trivially not taken.

Something like the attached completely broken patch. It builds but doesn't work, because "inline_function()" is currently designed to work during evaluation, not during expansion.

So the patch is complete garbage, but maybe could be the starting point for something that works.

Linus
On 2/7/20 9:18 AM, Martin KaFai Lau wrote:
> It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
> exponential effects on the number of states in the sparse checker.
>
> This patch breaks them up by calculating the "nbuckets" first so
> that the "bucket_log" only needs to take ilog2().
>
> Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Applied (& improved changelog to clarify it's not just sparse), thanks!
On Sat, Feb 8, 2020 at 3:55 PM Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:
>
> > Something like the attached completely broken patch. It builds but
> > doesn't work, because "inline_function()" is currently designed to
> > work during evaluation, not during expansion.
> >
> > So the patch is complete garbage, but maybe could be the starting
> > point for something that works.
>
> It was just missing the part with current_fn (needed for the evaluation)
> and some adjustment to avoid recursive expansions.

Oh, good. I looked at that current_fn thing and decided it shouldn't possibly matter, so my hacked-up patch just dropped it. Blush.

And yeah, the recursion avoidance got broken because it only triggered during evaluation and that wasn't where it was recursing any more.

Anyway, your fixed patch looks good, and the numbers look lovely. I don't see why there would sometimes be extra memory use, but the patch feels like the right thing to do regardless.

Thanks,
Linus
On Sat, Feb 08, 2020 at 04:58:51PM -0800, Linus Torvalds wrote:
>
> Anyway, your fixed patch looks good, and the numbers look lovely. I
> don't see why there would sometimes be extra memory use, but the patch
> feels like the right thing to do regardless.

Yes, I'm quite happy with it. Thank you for the suggestion.

For the cases with extra memory consumption, I've investigated the most extreme one and it's quite interesting. The extra memory was used for basic blocks, instructions and pseudos, so more linearized code. I reduced it to:

    static inline int get_order(long size)
    {
        return __builtin_constant_p(size) ? 0 : 1;
    }

    int foo(void)
    {
        return get_order(0);
    }

Sparse used to not recognize the size as a constant (I don't understand why, but haven't investigated). Strangely, the builtin without the conditional gave the expected result.

Now, with the patch doing the inlining during expansion, the size is correctly recognized as a constant, with or without the conditional. The extra linearized code comes from some complex expression that is now selected instead of a function call (while reducing, I had the vague impression that the expression should have expanded further, but I haven't checked that yet).

-- Luc
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 458be6b3eda9..3ab23f698221 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -643,9 +643,10 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 	bpf_map_init_from_attr(&smap->map, attr);
 
+	nbuckets = roundup_pow_of_two(num_possible_cpus());
 	/* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */
-	smap->bucket_log = max_t(u32, 1, ilog2(roundup_pow_of_two(num_possible_cpus())));
-	nbuckets = 1U << smap->bucket_log;
+	nbuckets = max_t(u32, 2, nbuckets);
+	smap->bucket_log = ilog2(nbuckets);
 	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
 
 	ret = bpf_map_charge_init(&smap->map.memory, cost);
It was reported that the max_t, ilog2, and roundup_pow_of_two macros have exponential effects on the number of states in the sparse checker.

This patch breaks them up by calculating the "nbuckets" first so that the "bucket_log" only needs to take ilog2().

Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/bpf_sk_storage.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)