[bpf-next,v3,00/18] bpf: bpf memory usage

Message ID 20230227152032.12359-1-laoar.shao@gmail.com

Message

Yafang Shao Feb. 27, 2023, 3:20 p.m. UTC
Currently we can't get bpf memory usage reliably. bpftool now shows the
bpf memory footprint, which differs from the actual bpf memory usage.
The difference can be quite large in some cases, for example:

- non-preallocated bpf map
  The memory usage of a non-preallocated bpf map changes dynamically:
  the allocated element count can be anywhere from 0 to max_entries.
  But the memory footprint reported by bpftool is always a fixed number.

- bpf metadata consumes more memory than bpf elements
  In some corner cases, the bpf metadata can consume a lot more memory
  than the bpf elements themselves, for example when the element size
  is quite small.

- some maps don't have a key, value or max_entries
  For example, the key_size and value_size of a ringbuf are 0, so its
  memlock is always 0 (the sketch below shows why).
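
For context, the memlock value bpftool shows today comes from a static
estimate in kernel/bpf/syscall.c, roughly:

  /* current in-tree estimate (simplified): computed only from the map
   * attributes, never from what is actually allocated
   */
  static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
  {
          unsigned long size;

          size = round_up(map->key_size + map->value_size, 8);

          return round_up(map->max_entries * size, PAGE_SIZE);
  }

This is why a non-preallocated map always reports the same fixed
number, and why a ringbuf (key_size == value_size == 0) always reports
a memlock of 0.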

We need a way to show the bpf memory usage, especially as more and more
bpf programs are running in production environments, where the bpf
memory usage is no longer trivial.

This patchset introduces a new map ops ->map_mem_usage to calculate the
memory usage. Note that we don't intend to make the memory usage 100%
accurate; the goal is to keep the difference between what bpftool
reports and the real usage small enough to be ignored relative to the
total. That is sufficient for monitoring bpf memory usage. For example,
the user can rely on this value to monitor the trend of bpf memory
usage, compare the bpf memory usage of different bpf program versions,
figure out which maps consume the most memory, etc.
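
In code, the change boils down to one new per-map-type callback; a
minimal sketch of its shape (the exact accounting for each map type
lives in the individual patches):

  /* include/linux/bpf.h */
  struct bpf_map_ops {
          /* ... existing callbacks ... */
          /* new: report the current memory usage of this map, in bytes */
          unsigned long (*map_mem_usage)(const struct bpf_map *map);
  };

  /* kernel/bpf/syscall.c: the memlock value in fdinfo (and thus in
   * bpftool) now comes from the per-map-type callback instead of the
   * static estimate based on key_size/value_size/max_entries
   */
  static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
  {
          return map->ops->map_mem_usage(map);
  }

The last patch in the series makes the callback mandatory by checking
for it at map creation time.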

This patchset implements the bpf memory usage for all maps, but there
is still work to do. We don't want to introduce runtime overhead in the
element update and delete paths, yet for some non-preallocated maps we
have to:
- devmap, xskmap
  Updating or deleting an element allocates or frees memory. In order
  to track this dynamic memory, we have to maintain an element count on
  the update and delete paths (a sketch follows this list).

- cpumap
  The size of each cpumap element is not fixed. If we wanted to track
  the usage, we would have to sum the sizes of all elements on the
  update and delete paths, so I have put it aside for now.

- local_storage, bpf_local_storage
  Attaching or detaching a cgroup allocates or frees memory. If we
  wanted to track this dynamic memory, we would also need to do
  something on the update and delete paths, so I have put it aside for
  now.

- offload map
  Element updates and deletes of an offload map go through the netdev
  dev_ops, which may dynamically allocate or free memory, but this
  dynamic memory currently isn't counted in the offload map's memory
  usage.
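
The per-element tracking mentioned for devmap/xskmap above boils down
to an atomic counter maintained on the update and delete paths. A
simplified sketch (the struct and function names here are illustrative,
not the exact code in the patches):

  struct my_elem {
          void *obj;              /* illustrative payload */
  };

  struct my_map {
          struct bpf_map map;
          atomic_t items;         /* live, dynamically allocated elements */
  };

  static int my_map_update_elem(struct bpf_map *map, void *key,
                                void *value, u64 flags)
  {
          struct my_map *m = container_of(map, struct my_map, map);

          /* ... allocate and install the new element ... */
          atomic_inc(&m->items);
          return 0;
  }

  static int my_map_delete_elem(struct bpf_map *map, void *key)
  {
          struct my_map *m = container_of(map, struct my_map, map);

          /* ... remove and free the element ... */
          atomic_dec(&m->items);
          return 0;
  }

  static unsigned long my_map_mem_usage(const struct bpf_map *map)
  {
          struct my_map *m = container_of(map, struct my_map, map);
          unsigned long usage = sizeof(*m);

          /* fixed part: the entry array sized at map creation */
          usage += (unsigned long)map->max_entries * sizeof(void *);
          /* dynamic part: elements counted on the update/delete paths */
          usage += (unsigned long)atomic_read(&m->items) *
                   sizeof(struct my_elem);
          return usage;
  }

The atomic_inc()/atomic_dec() on the update and delete paths is the
runtime overhead mentioned above.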

The resulting memory usage of each map type can be found in the
individual patches.

Changes:
v2->v3: check callback at map creation time and avoid warning (Alexei)
        fix build error under CONFIG_BPF=n (lkp@intel.com)
v1->v2: calculate the memory usage within bpf (Alexei)
- [v1] bpf, mm: bpf memory usage
  https://lwn.net/Articles/921991/
- [RFC PATCH v2] mm, bpf: Add BPF into /proc/meminfo
  https://lwn.net/Articles/919848/
- [RFC PATCH v1] mm, bpf: Add BPF into /proc/meminfo
  https://lwn.net/Articles/917647/ 


Yafang Shao (18):
  bpf: add new map ops ->map_mem_usage
  bpf: lpm_trie memory usage
  bpf: hashtab memory usage
  bpf: arraymap memory usage
  bpf: stackmap memory usage
  bpf: reuseport_array memory usage
  bpf: ringbuf memory usage
  bpf: bloom_filter memory usage
  bpf: cpumap memory usage
  bpf: devmap memory usage
  bpf: queue_stack_maps memory usage
  bpf: bpf_struct_ops memory usage
  bpf: local_storage memory usage
  bpf, net: bpf_local_storage memory usage
  bpf, net: sock_map memory usage
  bpf, net: xskmap memory usage
  bpf: offload map memory usage
  bpf: enforce all maps having memory usage callback

 include/linux/bpf.h               |  8 ++++++++
 include/linux/bpf_local_storage.h |  1 +
 include/net/xdp_sock.h            |  1 +
 kernel/bpf/arraymap.c             | 28 +++++++++++++++++++++++++
 kernel/bpf/bloom_filter.c         | 12 +++++++++++
 kernel/bpf/bpf_cgrp_storage.c     |  1 +
 kernel/bpf/bpf_inode_storage.c    |  1 +
 kernel/bpf/bpf_local_storage.c    | 10 +++++++++
 kernel/bpf/bpf_struct_ops.c       | 16 +++++++++++++++
 kernel/bpf/bpf_task_storage.c     |  1 +
 kernel/bpf/cpumap.c               | 10 +++++++++
 kernel/bpf/devmap.c               | 26 +++++++++++++++++++++--
 kernel/bpf/hashtab.c              | 43 +++++++++++++++++++++++++++++++++++++++
 kernel/bpf/local_storage.c        |  7 +++++++
 kernel/bpf/lpm_trie.c             | 11 ++++++++++
 kernel/bpf/offload.c              |  6 ++++++
 kernel/bpf/queue_stack_maps.c     | 10 +++++++++
 kernel/bpf/reuseport_array.c      |  8 ++++++++
 kernel/bpf/ringbuf.c              | 19 +++++++++++++++++
 kernel/bpf/stackmap.c             | 14 +++++++++++++
 kernel/bpf/syscall.c              | 20 ++++++++----------
 net/core/bpf_sk_storage.c         |  1 +
 net/core/sock_map.c               | 20 ++++++++++++++++++
 net/xdp/xskmap.c                  | 13 ++++++++++++
 24 files changed, 273 insertions(+), 14 deletions(-)

Comments

Daniel Borkmann Feb. 27, 2023, 10:37 p.m. UTC | #1
On 2/27/23 4:20 PM, Yafang Shao wrote:
> [...]
> 
> This patchset introduces a new map ops ->map_mem_usage to calculate the
> memory usage. Note that we don't intend to make the memory usage 100%
> accurate; the goal is to keep the difference between what bpftool
> reports and the real usage small enough to be ignored relative to the
> total. That is sufficient for monitoring bpf memory usage. For example,
> the user can rely on this value to monitor the trend of bpf memory
> usage, compare the bpf memory usage of different bpf program versions,
> figure out which maps consume the most memory, etc.

Now that there is cgroup.memory=nobpf, this is rebuilding the memory
accounting as a band-aid that you would otherwise get for free via memcg.. :/
Can't you instead move the selectable memcg forward? Tejun and others have
brought up the resource domain concept; have you looked into it?

Thanks,
Daniel
Yafang Shao Feb. 28, 2023, 2:53 a.m. UTC | #2
On Tue, Feb 28, 2023 at 6:37 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 2/27/23 4:20 PM, Yafang Shao wrote:
> > [...]
>
> Now that there is cgroup.memory=nobpf, this is rebuilding the memory
> accounting as a band-aid that you would otherwise get for free via memcg.. :/

No, we can't get it for free via memcg, because there is no "bpf" item
in memory.stat, only "kmem", "sock" and "vmalloc". With these three
items we still can't figure out the bpf memory usage, because the bpf
memory usage may be far less than kmem; the dentry cache, for example,
may consume lots of kmem.
Furthermore, with memcg we still can't get the memory usage of each
individual map, but we can with bpftool. As Alexei explained in another
thread [1], "bpftool map show | awk" can cover all cases.

I tried earlier to add a bpf item into memory.stat [2], but it seems
we'd better add "memcg_id" or "memcg_path" to
bpftool-{map,prog}-show [3] instead.

[1]. https://lore.kernel.org/bpf/CAADnVQJGF5Xthpn7D2DgHHvZz8+dnuz2xMi6yoSziuauXO7ncA@mail.gmail.com/
[2]. https://lore.kernel.org/bpf/20220921170002.29557-1-laoar.shao@gmail.com/
[3]. https://lore.kernel.org/bpf/CALOAHbCY4fGyAN6q3dd+hULs3hRJcYgvMR7M5wg1yb3vPiK=mw@mail.gmail.com/


> Can't you instead move the selectable memcg forward? Tejun and others have
> brought up the resource domain concept; have you looked into it?
>

I will take a look at the resource domain concept and try to move
selectable memcg forward again, but it doesn't conflict with this
series.