
[RFC,v2,bpf-next,0/4] bpf: add percpu stats for bpf_map

Message ID 20230622095330.1023453-1-aspsk@isovalent.com (mailing list archive)
Series: bpf: add percpu stats for bpf_map

Message

Anton Protopopov June 22, 2023, 9:53 a.m. UTC
This series adds a mechanism for maps to populate per-cpu counters of elements
on insertions/deletions. The sum of these counters can be accessed by a new
kfunc from a map iterator program.

The following patches are present in the series:

  * Patch 1 adds a generic per-cpu counter to struct bpf_map
  * Patch 2 utilizes this mechanism for hash-based maps
  * Patch 3 extends the preloaded map iterator to dump the sum
  * Patch 4 adds a self-test for the change

The reason for adding this functionality in our case (Cilium) is to get
signals about how full some heavily-used maps are, and what the actual dynamic
profile of map capacity is. In the case of LRU maps there is no other way to
get this information. See also [1].

This is a v2 of the series at https://lore.kernel.org/bpf/20230531110511.64612-1-aspsk@isovalent.com/T/#t
It was rewritten according to previous comments.  I've turned this series into
an RFC for two reasons:

1) This patch only works on systems where this_cpu_{inc,dec} is atomic for s64.
Systems which may write an s64 non-atomically would require some locking
mechanism to prevent readers from reading garbage via the
bpf_map_sum_elements_counter() kfunc (see patch 1)

2) In comparison with v1, we're adding extra instructions per map operation
(for preallocated as well as for non-preallocated maps). The only
functionality we're interested in at the moment is the number of elements present
in a map, not per-cpu statistics. This could be better achieved by using
the v1 version, which only adds computations for preallocated maps.

So, the question is: would it be acceptable to do the changes in the following way:

  * extend the preallocated hash maps to populate percpu batch counters as in v1
  * add a kfunc as in v2 to get the current sum

This works because:

  * nobody at the moment actually requires the per-cpu statistics
  * this implementation can be transparently turned into per-cpu statistics if
    such a need arises in practice (the only thing to change would be to
    re-implement the kfunc and, maybe, add more kfuncs to get per-cpu stats)
  * the "v1 way" is the least intrusive: it only affects preallocated maps, as
    other maps already provide the required functionality

  [1] https://lpc.events/event/16/contributions/1368/

v1 -> v2:
- make the counters generic part of struct bpf_map
- don't use map_info and /proc/self/fdinfo in favor of a kfunc

Anton Protopopov (4):
  bpf: add percpu stats for bpf_map elements insertions/deletions
  bpf: populate the per-cpu insertions/deletions counters for hashmaps
  bpf: make preloaded map iterators to display map elements count
  selftests/bpf: test map percpu stats

 include/linux/bpf.h                           |  30 +
 kernel/bpf/hashtab.c                          | 102 ++--
 kernel/bpf/map_iter.c                         |  48 +-
 kernel/bpf/preload/iterators/iterators.bpf.c  |   9 +-
 .../iterators/iterators.lskel-little-endian.h | 513 +++++++++---------
 .../bpf/map_tests/map_percpu_stats.c          | 336 ++++++++++++
 .../selftests/bpf/progs/map_percpu_stats.c    |  24 +
 7 files changed, 766 insertions(+), 296 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/map_tests/map_percpu_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_percpu_stats.c

Comments

Daniel Borkmann June 23, 2023, 9:53 a.m. UTC | #1
On 6/22/23 11:53 AM, Anton Protopopov wrote:
> This series adds a mechanism for maps to populate per-cpu counters of elements
> on insertions/deletions. The sum of these counters can be accessed by a new
> kfunc from a map iterator program.
> 
> [...]
> 
> v1 -> v2:
> - make the counters generic part of struct bpf_map
> - don't use map_info and /proc/self/fdinfo in favor of a kfunc

Tbh, I did like the v1 approach a bit better. We are trying to bend over backwards
just so that we don't add things to the uapi, but in the end we are also adding
it to maps.debug, etc. (yes, it has .debug in the name and all) ...

> for maps can be extended to print the element count, so the user can have
> convenient 'cat /sys/fs/bpf/maps.debug' way to debug maps.

... I feel that with convenient access, some folks in the wild might end up
building apps that parse it instead of writing iterators.

I would be curious to hear from Google and Meta folks how you observe map
fill-up or other useful map-related stats in production (unless you don't watch
out for this?), and whether there is a common denominator we could come up with
that would be really useful for all parties and thus potentially make sense for
fdinfo or as an extensible bpf_map_info stats extension, in case there is some
agreement?

In our case we don't really care about exact numbers, more about whether a
user is scaling up their k8s cluster to the point where the previous sizing
configuration, e.g. around LB mappings, would hit its limits in the foreseeable
future if they dial up further. This is exported to dashboards (e.g. Grafana)
for visibility so operators can know ahead of time.

The iterator approach is okay, if less convenient, but if there is indeed nothing
common from other production users that has helped with map visibility over the
years, then so be it. I thought to at least ask given we have this thread. :)

Thanks,
Daniel
Alexei Starovoitov June 24, 2023, 12:17 a.m. UTC | #2
On Fri, Jun 23, 2023 at 2:53 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/22/23 11:53 AM, Anton Protopopov wrote:
> > [...]
>
> Tbh, I did like the v1 approach a bit better. We are trying to bend over backwards
> just so that we don't add things to the uapi, but in the end we are also adding
> it to maps.debug, etc. (yes, it has .debug in the name and all) ...

I think we should keep bending even more backwards to avoid uapi burden.
All new features should be non-uapi no matter how many people
say "I'll definitely use it".

> or as an extensible bpf_map_info stats extension in case there is some agreement?

I'd rather not.
bpf_map_info returns what user space sent to the kernel earlier.
stats or anything that user space didn't explicitly give earlier
is quite different.
Same goes for bpf_prog_info.
We only added verified_insns there, which doesn't fit the above definition,
and that was a mistake. After almost 2 years it is still unused
and cannot be removed.
veristat is parsing non-uapi verifier log.
Tooling can live with non-uapi map stats just fine.
Daniel Borkmann June 26, 2023, 8:50 a.m. UTC | #3
On 6/24/23 2:17 AM, Alexei Starovoitov wrote:
> On Fri, Jun 23, 2023 at 2:53 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 6/22/23 11:53 AM, Anton Protopopov wrote:
>>> [...]
>>
>> Tbh, I did like the v1 approach a bit better. We are trying to bend over backwards
>> just so that we don't add things to the uapi, but in the end we are also adding
>> it to maps.debug, etc. (yes, it has .debug in the name and all) ...
> 
> I think we should keep bending even more backwards to avoid uapi burden.
> All new features should be non-uapi no matter how many people
> say "I'll definitely use it".
> 
>> or as an extensible bpf_map_info stats extension in case there is some agreement?
> 
> I'd rather not.
> bpf_map_info returns what user space sent to the kernel earlier.
> stats or anything that user space didn't explicitly give earlier
> is quite different.
> Same goes for bpf_prog_info.
> We only added verified_insns there, which doesn't fit the above definition,
> and that was a mistake. After almost 2 years it is still unused
> and cannot be removed.

Right, hence my question also on how others observe maps in production to see
if there are similar use cases and approaches as we have.

> veristat is parsing non-uapi verifier log.
> Tooling can live with non-uapi map stats just fine.

Sure, that's okay.

Thanks,
Daniel