[RFC,bpf-next,v2,00/11] mm, bpf: Add BPF into /proc/meminfo

Message ID 20230112155326.26902-1-laoar.shao@gmail.com (mailing list archive)

Message

Yafang Shao Jan. 12, 2023, 3:53 p.m. UTC
Currently there's no way to get the BPF memory usage; we can only
estimate it via bpftool or memcg, neither of which is reliable.

- bpftool
  `bpftool {map,prog} show` can show us the memlock of each map and
  prog, but the memlock value differs from the real memory size. The
  memlock of a bpf object is approximately
  `round_up(key_size + value_size, 8) * max_entries`,
  so 1) it doesn't apply to non-preallocated bpf maps, whose real
  memory size may grow or shrink dynamically; 2) the element size of
  some bpf maps is not `key_size + value_size`, for example the
  element size of htab is
  `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
  That said, the difference between these two values may be very large
  if the key_size and value_size are small. For example, in my
  verification the memlock and real memory size of a preallocated hash
  map are:

  $ grep BPF /proc/meminfo
  BPF:                 350 kB  <<< the size of preallocated memalloc pool

  (create hash map)

  $ bpftool map show
  41549: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 8388608B

  $ grep BPF /proc/meminfo
  BPF:               82284 kB
 
  So the real memory size is $((82284 - 350)), i.e. 81934 kB, while
  the memlock is only 8192 kB. (A rough breakdown of where this
  difference comes from follows after this list.)

- memcg
  With memcg we only know that the BPF memory usage is less than
  memory.kmem.usage_in_bytes (or memory.current in v2). Worse, if the
  BPF objects are charged to the root memcg, all we know is that the
  usage is less than $MemTotal :)
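
A rough breakdown of where the bpftool vs real-memory gap above comes
from, assuming sizeof(struct htab_elem) is about 48 bytes and each hash
bucket is about 16 bytes on x86_64 (both ballpark figures, not taken
from the patches):

  memlock estimate: round_up(4 + 4, 8) * 1048576           =   8192 kB
  element size:     ~48 + round_up(4, 8) + round_up(4, 8)  =    ~64 bytes
  elements:         ~64 * 1048576                          = ~65536 kB
  buckets:          ~16 * 1048576                          = ~16384 kB
  total:                                                     ~81920 kB  (observed: 81934 kB)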

So we need a way to get the BPF memory usage, especially as more and
more bpf programs are running in production environments. The BPF
memory usage is not trivial, and it deserves a new item in
/proc/meminfo.

There're some ways to calculate the BPF memory usage. They all have pros
and cons.

- Option 1: Annotate BPF memory allocation only
  This is how I implemented it in RFC v1. See the link below[1] for the
  details and the discussion on it.
  - pros
    We only need to annotate the BPF memory allocation, and the
    allocated memory is then found automatically in the free path. So
    it is very easy to use, and we don't need to worry about stat leaks.
  - cons
    We must store information about the allocated memory, in particular
    about the allocated slab objects, which takes extra memory. If we
    introduce a new member into struct page or add one to page_ext, it
    will take at least 0.2% of the total memory on a 64bit system,
    which is not acceptable.
    One way to reduce this memory overhead is to introduce dynamic page
    extension, but that would take significant effort and may not be
    worth it.

- Option 2: Annotate both allocation and free
  This is similar to how I implemented it in an earlier version[2].
  - pros
    There's almost no memory overhead.
  - cons
    All BPF memory allocations and frees must go through the BPF
    helpers; the generic helpers like kfree/vfree/free_percpu can't be
    used. So if a user forgets to use the helpers we introduced to
    allocate or free BPF memory, the stat will leak.
    It is also not easy to annotate some deferred frees, in particular
    kfree_rcu(), so the user has to use call_rcu() instead of
    kfree_rcu(). Another risk is that if we introduce other deferred
    free helpers in the future, this BPF statistic may break easily.
    
- Option 3: Calculate the memory size via the pointer
  This is how I implemented it in this patchset.
  After allocating some BPF memory, we get the full size from the
  pointer and add it to the statistic; before freeing the BPF memory,
  we get the full size from the pointer and subtract it.
  - pros
    No memory overhead.
    No code churn in the MM core allocation and free paths.
    The implementation is quite clear and easy to maintain.
  - cons
    The calculation is not embedded in the MM allocation/free paths, so
    some extra code has to run to get the size from the pointer.
    BPF memory allocation and free must use the helpers we introduced,
    otherwise the stat will leak.

I prefer option 3. Its cons can be justified:
- bpf_map_free should be paired with bpf_map_alloc; that's reasonable.
- Regarding the extra cpu cycles it may take, users should not allocate
  and free memory in the critical path anyway if it is latency
  sensitive.
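
A minimal sketch of what option 3 looks like in practice, getting the
full size from the pointer with ksize(). The helper names and the
single global counter below are illustrative only; they are not the
helpers added by this patchset.

#include <linux/slab.h>
#include <linux/atomic.h>

static atomic_long_t bpf_mem_allocated;	/* bytes reported as "BPF:" in /proc/meminfo */

static inline void *bpf_meminfo_kzalloc(size_t size, gfp_t flags)
{
	void *ptr = kzalloc(size, flags);

	/* account the full slab object size, not just the requested size */
	if (ptr)
		atomic_long_add(ksize(ptr), &bpf_mem_allocated);
	return ptr;
}

static inline void bpf_meminfo_kfree(const void *ptr)
{
	if (!ptr)
		return;
	/* must pair with bpf_meminfo_kzalloc(), otherwise the stat leaks */
	atomic_long_sub(ksize(ptr), &bpf_mem_allocated);
	kfree(ptr);
}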

[1]. https://lwn.net/Articles/917647/
[2]. https://lore.kernel.org/linux-mm/20220921170002.29557-1-laoar.shao@gmail.com/

v1->v2: don't use page_ext (Vlastimil, Hyeonggon)

Yafang Shao (11):
  mm: percpu: count memcg relevant memory only when kmemcg is enabled
  mm: percpu: introduce percpu_size()
  mm: slab: rename obj_full_size()
  mm: slab: introduce ksize_full()
  mm: vmalloc: introduce vsize()
  mm: util: introduce kvsize()
  bpf: introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: use bpf_map_kzalloc in arraymap
  bpf: use bpf_map_kvcalloc in bpf_local_storage
  bpf: add and use bpf map free helpers
  bpf: introduce bpf memory statistics

 fs/proc/meminfo.c              |   4 ++
 include/linux/bpf.h            | 115 +++++++++++++++++++++++++++++++++++++++--
 include/linux/percpu.h         |   1 +
 include/linux/slab.h           |  10 ++++
 include/linux/vmalloc.h        |  15 ++++++
 kernel/bpf/arraymap.c          |  20 +++----
 kernel/bpf/bpf_cgrp_storage.c  |   2 +-
 kernel/bpf/bpf_inode_storage.c |   2 +-
 kernel/bpf/bpf_local_storage.c |  24 ++++-----
 kernel/bpf/bpf_task_storage.c  |   2 +-
 kernel/bpf/cpumap.c            |  13 +++--
 kernel/bpf/devmap.c            |  10 ++--
 kernel/bpf/hashtab.c           |   8 +--
 kernel/bpf/helpers.c           |   2 +-
 kernel/bpf/local_storage.c     |  12 ++---
 kernel/bpf/lpm_trie.c          |  14 ++---
 kernel/bpf/memalloc.c          |  19 ++++++-
 kernel/bpf/ringbuf.c           |  75 ++++++++++++++++++---------
 kernel/bpf/syscall.c           |  54 ++++++++++++++++++-
 mm/percpu-internal.h           |   4 +-
 mm/percpu.c                    |  35 +++++++++++++
 mm/slab.h                      |  19 ++++---
 mm/slab_common.c               |  52 +++++++++++++------
 mm/slob.c                      |   2 +-
 mm/util.c                      |  15 ++++++
 net/core/bpf_sk_storage.c      |   4 +-
 net/core/sock_map.c            |   2 +-
 net/xdp/xskmap.c               |   2 +-
 28 files changed, 422 insertions(+), 115 deletions(-)

Comments

Alexei Starovoitov Jan. 12, 2023, 9:05 p.m. UTC | #1
On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Currently there's no way to get BPF memory usage, while we can only
> estimate the usage by bpftool or memcg, both of which are not reliable.
>
> - bpftool
>   `bpftool {map,prog} show` can show us the memlock of each map and
>   prog, but the memlock is vary from the real memory size. The memlock
>   of a bpf object is approximately
>   `round_up(key_size + value_size, 8) * max_entries`,
>   so 1) it can't apply to the non-preallocated bpf map which may
>   increase or decrease the real memory size dynamically. 2) the element
>   size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>   That said the differece between these two values may be very great if
>   the key_size and value_size is small. For example in my verifaction,
>   the size of memlock and real memory of a preallocated hash map are,
>
>   $ grep BPF /proc/meminfo
>   BPF:                 350 kB  <<< the size of preallocated memalloc pool
>
>   (create hash map)
>
>   $ bpftool map show
>   41549: hash  name count_map  flags 0x0
>         key 4B  value 4B  max_entries 1048576  memlock 8388608B
>
>   $ grep BPF /proc/meminfo
>   BPF:               82284 kB
>
>   So the real memory size is $((82284 - 350)) which is 81934 kB
>   while the memlock is only 8192 kB.

hashmap with key 4b and value 4b looks artificial to me,
but since you're concerned with accuracy of bpftool reporting,
please fix the estimation in bpf_map_memory_footprint().
You're correct that:

> size of some bpf map is not `key_size + value_size`, for example the
>   element size of htab is
>   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`

So just teach bpf_map_memory_footprint() to do this more accurately.
Add bucket size to it as well.
Make it even more accurate with prealloc vs not.
Much simpler change than adding run-time overhead to every alloc/free
on bpf side.

Higher level point:
bpf side tracks all of its allocation. There is no need to do that
in generic mm side.
Exposing an aggregated single number in /proc/meminfo also looks wrong.
People should be able to "bpftool map show|awk sum of fields"
and get the same number.
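
For example, summing the per-map memlock values could look roughly like
this (the awk field handling and the bytes_memlock JSON key are
assumptions about the current bpftool output format, not verified
against every version):

  $ bpftool map show | \
      awk '{ for (i = 1; i < NF; i++) if ($i == "memlock") sum += $(i + 1) } END { print sum " bytes" }'

  $ bpftool -j map show | jq '[.[].bytes_memlock] | add'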
Yafang Shao Jan. 13, 2023, 11:53 a.m. UTC | #2
On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Currently there's no way to get BPF memory usage, while we can only
> > estimate the usage by bpftool or memcg, both of which are not reliable.
> >
> > - bpftool
> >   `bpftool {map,prog} show` can show us the memlock of each map and
> >   prog, but the memlock is vary from the real memory size. The memlock
> >   of a bpf object is approximately
> >   `round_up(key_size + value_size, 8) * max_entries`,
> >   so 1) it can't apply to the non-preallocated bpf map which may
> >   increase or decrease the real memory size dynamically. 2) the element
> >   size of some bpf map is not `key_size + value_size`, for example the
> >   element size of htab is
> >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >   That said the differece between these two values may be very great if
> >   the key_size and value_size is small. For example in my verifaction,
> >   the size of memlock and real memory of a preallocated hash map are,
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> >
> >   (create hash map)
> >
> >   $ bpftool map show
> >   41549: hash  name count_map  flags 0x0
> >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> >
> >   $ grep BPF /proc/meminfo
> >   BPF:               82284 kB
> >
> >   So the real memory size is $((82284 - 350)) which is 81934 kB
> >   while the memlock is only 8192 kB.
>
> hashmap with key 4b and value 4b looks artificial to me,
> but since you're concerned with accuracy of bpftool reporting,
> please fix the estimation in bpf_map_memory_footprint().

I thought bpf_map_memory_footprint() was deprecated, so I didn't try
to fix it before.

> You're correct that:
>
> > size of some bpf map is not `key_size + value_size`, for example the
> >   element size of htab is
> >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
>
> So just teach bpf_map_memory_footprint() to do this more accurately.
> Add bucket size to it as well.
> Make it even more accurate with prealloc vs not.
> Much simpler change than adding run-time overhead to every alloc/free
> on bpf side.
>

It seems that we'd better introduce ->memory_footprint for some
specific bpf maps. I will think about it.

> Higher level point:

Thanks for your thoughts.

> bpf side tracks all of its allocation. There is no need to do that
> in generic mm side.
> Exposing an aggregated single number if /proc/meminfo also looks wrong.

Do you mean that we shouldn't expose it in /proc/meminfo ?

> People should be able to "bpftool map show|awk sum of fields"
> and get the same number.
Alexei Starovoitov Jan. 17, 2023, 5:25 p.m. UTC | #3
On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Currently there's no way to get BPF memory usage, while we can only
> > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > >
> > > - bpftool
> > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > >   prog, but the memlock is vary from the real memory size. The memlock
> > >   of a bpf object is approximately
> > >   `round_up(key_size + value_size, 8) * max_entries`,
> > >   so 1) it can't apply to the non-preallocated bpf map which may
> > >   increase or decrease the real memory size dynamically. 2) the element
> > >   size of some bpf map is not `key_size + value_size`, for example the
> > >   element size of htab is
> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >   That said the differece between these two values may be very great if
> > >   the key_size and value_size is small. For example in my verifaction,
> > >   the size of memlock and real memory of a preallocated hash map are,
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > >
> > >   (create hash map)
> > >
> > >   $ bpftool map show
> > >   41549: hash  name count_map  flags 0x0
> > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > >
> > >   $ grep BPF /proc/meminfo
> > >   BPF:               82284 kB
> > >
> > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > >   while the memlock is only 8192 kB.
> >
> > hashmap with key 4b and value 4b looks artificial to me,
> > but since you're concerned with accuracy of bpftool reporting,
> > please fix the estimation in bpf_map_memory_footprint().
>
> I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> to fix it before.

It's not deprecated. It's trying to be accurate.
See bpf_map_value_size().
In the past we had to be precise when we calculated the required memory
before we allocated and that was causing ongoing maintenance issues.
Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
It can be made more accurate for this map with corner case key/value sizes.

> > You're correct that:
> >
> > > size of some bpf map is not `key_size + value_size`, for example the
> > >   element size of htab is
> > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> >
> > So just teach bpf_map_memory_footprint() to do this more accurately.
> > Add bucket size to it as well.
> > Make it even more accurate with prealloc vs not.
> > Much simpler change than adding run-time overhead to every alloc/free
> > on bpf side.
> >
>
> It seems that we'd better introduce ->memory_footprint for some
> specific bpf maps. I will think about it.

No. Don't build it into a replica of what we had before.
Make the existing bpf_map_memory_footprint() more accurate.

> > bpf side tracks all of its allocation. There is no need to do that
> > in generic mm side.
> > Exposing an aggregated single number if /proc/meminfo also looks wrong.
>
> Do you mean that we shouldn't expose it in /proc/meminfo ?

We should not because it helps one particular use case only.
Somebody else might want map mem info per container,
then somebody would need it per user, etc.
bpftool map show | awk
solves all those cases without adding new uapi-s.
Yafang Shao Jan. 18, 2023, 3:07 a.m. UTC | #4
On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Currently there's no way to get BPF memory usage, while we can only
> > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > >
> > > > - bpftool
> > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > >   of a bpf object is approximately
> > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > >   increase or decrease the real memory size dynamically. 2) the element
> > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > >   element size of htab is
> > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > >   That said the differece between these two values may be very great if
> > > >   the key_size and value_size is small. For example in my verifaction,
> > > >   the size of memlock and real memory of a preallocated hash map are,
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > >
> > > >   (create hash map)
> > > >
> > > >   $ bpftool map show
> > > >   41549: hash  name count_map  flags 0x0
> > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > >
> > > >   $ grep BPF /proc/meminfo
> > > >   BPF:               82284 kB
> > > >
> > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > >   while the memlock is only 8192 kB.
> > >
> > > hashmap with key 4b and value 4b looks artificial to me,
> > > but since you're concerned with accuracy of bpftool reporting,
> > > please fix the estimation in bpf_map_memory_footprint().
> >
> > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > to fix it before.
>
> It's not deprecated. It's trying to be accurate.
> See bpf_map_value_size().
> In the past we had to be precise when we calculated the required memory
> before we allocated and that was causing ongoing maintenance issues.
> Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> It can be made more accurate for this map with corner case key/value sizes.
>

Thanks for the clarification.

> > > You're correct that:
> > >
> > > > size of some bpf map is not `key_size + value_size`, for example the
> > > >   element size of htab is
> > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > >
> > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > Add bucket size to it as well.
> > > Make it even more accurate with prealloc vs not.
> > > Much simpler change than adding run-time overhead to every alloc/free
> > > on bpf side.
> > >
> >
> > It seems that we'd better introduce ->memory_footprint for some
> > specific bpf maps. I will think about it.
>
> No. Don't build it into a replica of what we had before.
> Making existing bpf_map_memory_footprint() more accurate.
>

I just don't want to add many if-elses or switch-cases into
bpf_map_memory_footprint(), because I think it is a little ugly.
Introducing a new map ops could make it more clear.  For example,
static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
{
    unsigned long size;

    if (map->ops->map_mem_footprint)
        return map->ops->map_mem_footprint(map);

    size = round_up(map->key_size + bpf_map_value_size(map), 8);
    return round_up(map->max_entries * size, PAGE_SIZE);
}

> > > bpf side tracks all of its allocation. There is no need to do that
> > > in generic mm side.
> > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> >
> > Do you mean that we shouldn't expose it in /proc/meminfo ?
>
> We should not because it helps one particular use case only.
> Somebody else might want map mem info per container,
> then somebody would need it per user, etc.

It seems we should show memcg info and user info in bpftool map show.

> bpftool map show | awk
> solves all those cases without adding new uapi-s.

Makes sense to me.
Alexei Starovoitov Jan. 18, 2023, 5:39 a.m. UTC | #5
On Tue, Jan 17, 2023 at 7:08 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Currently there's no way to get BPF memory usage, while we can only
> > > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > > >
> > > > > - bpftool
> > > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > > >   of a bpf object is approximately
> > > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > > >   increase or decrease the real memory size dynamically. 2) the element
> > > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > > >   element size of htab is
> > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > >   That said the differece between these two values may be very great if
> > > > >   the key_size and value_size is small. For example in my verifaction,
> > > > >   the size of memlock and real memory of a preallocated hash map are,
> > > > >
> > > > >   $ grep BPF /proc/meminfo
> > > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > > >
> > > > >   (create hash map)
> > > > >
> > > > >   $ bpftool map show
> > > > >   41549: hash  name count_map  flags 0x0
> > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > >
> > > > >   $ grep BPF /proc/meminfo
> > > > >   BPF:               82284 kB
> > > > >
> > > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > > >   while the memlock is only 8192 kB.
> > > >
> > > > hashmap with key 4b and value 4b looks artificial to me,
> > > > but since you're concerned with accuracy of bpftool reporting,
> > > > please fix the estimation in bpf_map_memory_footprint().
> > >
> > > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > > to fix it before.
> >
> > It's not deprecated. It's trying to be accurate.
> > See bpf_map_value_size().
> > In the past we had to be precise when we calculated the required memory
> > before we allocated and that was causing ongoing maintenance issues.
> > Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> > It can be made more accurate for this map with corner case key/value sizes.
> >
>
> Thanks for the clarification.
>
> > > > You're correct that:
> > > >
> > > > > size of some bpf map is not `key_size + value_size`, for example the
> > > > >   element size of htab is
> > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > >
> > > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > > Add bucket size to it as well.
> > > > Make it even more accurate with prealloc vs not.
> > > > Much simpler change than adding run-time overhead to every alloc/free
> > > > on bpf side.
> > > >
> > >
> > > It seems that we'd better introduce ->memory_footprint for some
> > > specific bpf maps. I will think about it.
> >
> > No. Don't build it into a replica of what we had before.
> > Making existing bpf_map_memory_footprint() more accurate.
> >
>
> I just don't want to add many if-elses or switch-cases into
> bpf_map_memory_footprint(), because I think it is a little ugly.
> Introducing a new map ops could make it more clear.  For example,
> static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> {
>     unsigned long size;
>
>     if (map->ops->map_mem_footprint)
>         return map->ops->map_mem_footprint(map);
>
>     size = round_up(map->key_size + bpf_map_value_size(map), 8);
>     return round_up(map->max_entries * size, PAGE_SIZE);
> }

It is also ugly, because bpf_map_value_size() already has if-stmt.
I prefer to keep all estimates in one place.
There is no need to be 100% accurate.
With a callback devs will start thinking that this is somehow
a requirement to report precise memory.

> > > > bpf side tracks all of its allocation. There is no need to do that
> > > > in generic mm side.
> > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > >
> > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> >
> > We should not because it helps one particular use case only.
> > Somebody else might want map mem info per container,
> > then somebody would need it per user, etc.
>
> It seems we should show memcg info and user info in bpftool map show.

Show memcg info? What do you have in mind?

The user info is often useless. We're printing it in bpftool prog show
and some folks suggested to remove it because it always prints 'uid 0'
Notice we use bpf iterators in both bpftool prog/map show
that prints the process that created the map.
That is much more useful than 'user id'.
In bpftool we can add 'verbosity' flag and print more things.
There is also json output.
And, of course, nothing stops you from having your own prog/map stats
collectors.
Yafang Shao Jan. 18, 2023, 6:49 a.m. UTC | #6
On Wed, Jan 18, 2023 at 1:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:08 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Jan 18, 2023 at 1:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 3:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 13, 2023 at 5:05 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jan 12, 2023 at 7:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Currently there's no way to get BPF memory usage, while we can only
> > > > > > estimate the usage by bpftool or memcg, both of which are not reliable.
> > > > > >
> > > > > > - bpftool
> > > > > >   `bpftool {map,prog} show` can show us the memlock of each map and
> > > > > >   prog, but the memlock is vary from the real memory size. The memlock
> > > > > >   of a bpf object is approximately
> > > > > >   `round_up(key_size + value_size, 8) * max_entries`,
> > > > > >   so 1) it can't apply to the non-preallocated bpf map which may
> > > > > >   increase or decrease the real memory size dynamically. 2) the element
> > > > > >   size of some bpf map is not `key_size + value_size`, for example the
> > > > > >   element size of htab is
> > > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > > >   That said the differece between these two values may be very great if
> > > > > >   the key_size and value_size is small. For example in my verifaction,
> > > > > >   the size of memlock and real memory of a preallocated hash map are,
> > > > > >
> > > > > >   $ grep BPF /proc/meminfo
> > > > > >   BPF:                 350 kB  <<< the size of preallocated memalloc pool
> > > > > >
> > > > > >   (create hash map)
> > > > > >
> > > > > >   $ bpftool map show
> > > > > >   41549: hash  name count_map  flags 0x0
> > > > > >         key 4B  value 4B  max_entries 1048576  memlock 8388608B
> > > > > >
> > > > > >   $ grep BPF /proc/meminfo
> > > > > >   BPF:               82284 kB
> > > > > >
> > > > > >   So the real memory size is $((82284 - 350)) which is 81934 kB
> > > > > >   while the memlock is only 8192 kB.
> > > > >
> > > > > hashmap with key 4b and value 4b looks artificial to me,
> > > > > but since you're concerned with accuracy of bpftool reporting,
> > > > > please fix the estimation in bpf_map_memory_footprint().
> > > >
> > > > I thought bpf_map_memory_footprint() was deprecated, so I didn't try
> > > > to fix it before.
> > >
> > > It's not deprecated. It's trying to be accurate.
> > > See bpf_map_value_size().
> > > In the past we had to be precise when we calculated the required memory
> > > before we allocated and that was causing ongoing maintenance issues.
> > > Now bpf_map_memory_footprint() is an estimate for show_fdinfo.
> > > It can be made more accurate for this map with corner case key/value sizes.
> > >
> >
> > Thanks for the clarification.
> >
> > > > > You're correct that:
> > > > >
> > > > > > size of some bpf map is not `key_size + value_size`, for example the
> > > > > >   element size of htab is
> > > > > >   `sizeof(struct htab_elem) + round_up(key_size, 8) + round_up(value_size, 8)`
> > > > >
> > > > > So just teach bpf_map_memory_footprint() to do this more accurately.
> > > > > Add bucket size to it as well.
> > > > > Make it even more accurate with prealloc vs not.
> > > > > Much simpler change than adding run-time overhead to every alloc/free
> > > > > on bpf side.
> > > > >
> > > >
> > > > It seems that we'd better introduce ->memory_footprint for some
> > > > specific bpf maps. I will think about it.
> > >
> > > No. Don't build it into a replica of what we had before.
> > > Making existing bpf_map_memory_footprint() more accurate.
> > >
> >
> > I just don't want to add many if-elses or switch-cases into
> > bpf_map_memory_footprint(), because I think it is a little ugly.
> > Introducing a new map ops could make it more clear.  For example,
> > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > {
> >     unsigned long size;
> >
> >     if (map->ops->map_mem_footprint)
> >         return map->ops->map_mem_footprint(map);
> >
> >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> >     return round_up(map->max_entries * size, PAGE_SIZE);
> > }
>
> It is also ugly, because bpf_map_value_size() already has if-stmt.
> I prefer to keep all estimates in one place.
> There is no need to be 100% accurate.

Per my investigation, it can be almost accurate with little effort.
Take the htab for example,
static unsigned long htab_mem_footprint(const struct bpf_map *map)
{
    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
    unsigned long size = 0;

    if (!htab_is_prealloc(htab)) {
        size += htab_elements_size(htab);
    }
    size += kvsize(htab->elems);
    size += percpu_size(htab->extra_elems);
    size += kvsize(htab->buckets);
    size += bpf_mem_alloc_size(&htab->pcpu_ma);
    size += bpf_mem_alloc_size(&htab->ma);
    if (htab->use_percpu_counter)
        size += percpu_size(htab->pcount.counters);
    size += percpu_size(htab->map_locked[0]) * HASHTAB_MAP_LOCK_COUNT;
    size += kvsize(htab);
    return size;
}

We just need to get the real memory size from the pointer instead of
calculating the size again.
For non-preallocated htab, it is a little trouble to get the element
size (not the unit_size), but it won't be a big deal.

> With a callback devs will start thinking that this is somehow
> a requirement to report precise memory.
>
> > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > in generic mm side.
> > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > >
> > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > >
> > > We should not because it helps one particular use case only.
> > > Somebody else might want map mem info per container,
> > > then somebody would need it per user, etc.
> >
> > It seems we should show memcg info and user info in bpftool map show.
>
> Show memcg info? What do you have in mind?
>

Each bpf map is charged to a memcg. If we know which memcg a bpf map
belongs to, we can get the map memory info per container.
Currently we can get the memcg info from the process which loaded the
map, but that doesn't work for pinned bpf maps.
So it would be better if we could show it in bpftool-map-show.

> The user info is often useless. We're printing it in bpftool prog show
> and some folks suggested to remove it because it always prints 'uid 0'
> Notice we use bpf iterators in both bpftool prog/map show
> that prints the process that created the map.
> That is much more useful than 'user id'.
> In bpftool we can add 'verbosity' flag and print more things.
> There is also json output.
> And, of course, nothing stops you from having your own prog/map stats
> collectors.
Alexei Starovoitov Jan. 26, 2023, 5:45 a.m. UTC | #7
On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > I just don't want to add many if-elses or switch-cases into
> > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > Introducing a new map ops could make it more clear.  For example,
> > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > {
> > >     unsigned long size;
> > >
> > >     if (map->ops->map_mem_footprint)
> > >         return map->ops->map_mem_footprint(map);
> > >
> > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > }
> >
> > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > I prefer to keep all estimates in one place.
> > There is no need to be 100% accurate.
>
> Per my investigation, it can be almost accurate with little effort.
> Take the htab for example,
> static unsigned long htab_mem_footprint(const struct bpf_map *map)
> {
>     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>     unsigned long size = 0;
>
>     if (!htab_is_prealloc(htab)) {
>         size += htab_elements_size(htab);
>     }
>     size += kvsize(htab->elems);
>     size += percpu_size(htab->extra_elems);
>     size += kvsize(htab->buckets);
>     size += bpf_mem_alloc_size(&htab->pcpu_ma);
>     size += bpf_mem_alloc_size(&htab->ma);
>     if (htab->use_percpu_counter)
>         size += percpu_size(htab->pcount.counters);
>     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
>     size += kvsize(htab);
>     return size;
> }

Please don't.
Above doesn't look maintainable.
Look at kvsize(htab). Do you really care about hundred bytes?
Just accept that there will be a small constant difference
between what show_fdinfo reports and the real memory.
You cannot make it 100%.
There is kfence that will allocate 4k though you asked kmalloc(8).

> We just need to get the real memory size from the pointer instead of
> calculating the size again.
> For non-preallocated htab, it is a little trouble to get the element
> size (not the unit_size), but it won't be a big deal.

You'd have to convince mm folks that kvsize() is worth doing.
I don't think it will be easy.

> > With a callback devs will start thinking that this is somehow
> > a requirement to report precise memory.
> >
> > > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > > in generic mm side.
> > > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > > >
> > > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > > >
> > > > We should not because it helps one particular use case only.
> > > > Somebody else might want map mem info per container,
> > > > then somebody would need it per user, etc.
> > >
> > > It seems we should show memcg info and user info in bpftool map show.
> >
> > Show memcg info? What do you have in mind?
> >
>
> Each bpf map is charged to a memcg. If we know a bpf map belongs to
> which memcg, we can know the map mem info per container.
> Currently we can get the memcg info from the process which loads it,
> but it can't apply to pinned-bpf-map.
> So it would be better if we can show it in bpftool-map-show.

That sounds useful.
Have you looked at bpf iterators and how bpftool is using
them to figure out which process loaded bpf prog and created particular map?
Yafang Shao Jan. 28, 2023, 11:49 a.m. UTC | #8
On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > I just don't want to add many if-elses or switch-cases into
> > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > Introducing a new map ops could make it more clear.  For example,
> > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > {
> > > >     unsigned long size;
> > > >
> > > >     if (map->ops->map_mem_footprint)
> > > >         return map->ops->map_mem_footprint(map);
> > > >
> > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > }
> > >
> > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > I prefer to keep all estimates in one place.
> > > There is no need to be 100% accurate.
> >
> > Per my investigation, it can be almost accurate with little effort.
> > Take the htab for example,
> > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > {
> >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >     unsigned long size = 0;
> >
> >     if (!htab_is_prealloc(htab)) {
> >         size += htab_elements_size(htab);
> >     }
> >     size += kvsize(htab->elems);
> >     size += percpu_size(htab->extra_elems);
> >     size += kvsize(htab->buckets);
> >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> >     size += bpf_mem_alloc_size(&htab->ma);
> >     if (htab->use_percpu_counter)
> >         size += percpu_size(htab->pcount.counters);
> >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> >     size += kvsize(htab);
> >     return size;
> > }
>
> Please don't.
> Above doesn't look maintainable.

It is similar to htab_map_free(). These pointers are the pointers
which will be freed in map_free().
We just need to keep map_mem_footprint() in sync with map_free(). It
won't be a problem for maintenance.

> Look at kvsize(htab). Do you really care about hundred bytes?
> Just accept that there will be a small constant difference
> between what show_fdinfo reports and the real memory.

The point is we don't have a clear idea what the margin is.

> You cannot make it 100%.
> There is kfence that will allocate 4k though you asked kmalloc(8).
>

We already have ksize()[1], which covers the kfence.

[1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431

> > We just need to get the real memory size from the pointer instead of
> > calculating the size again.
> > For non-preallocated htab, it is a little trouble to get the element
> > size (not the unit_size), but it won't be a big deal.
>
> You'd have to convince mm folks that kvsize() is worth doing.
> I don't think it will be easy.
>

As I mentioned above, we already have ksize(), so we only need to
introduce vsize().  Per my understanding, we can simply use
vm_struct->size to get the vmalloc size, see also the patch #5 in this
patchset[2].

Andrew, Uladzislau, Christoph,  do you have any comments on this newly
introduced vsize()[2] ?

[2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/

> > > With a callback devs will start thinking that this is somehow
> > > a requirement to report precise memory.
> > >
> > > > > > > bpf side tracks all of its allocation. There is no need to do that
> > > > > > > in generic mm side.
> > > > > > > Exposing an aggregated single number if /proc/meminfo also looks wrong.
> > > > > >
> > > > > > Do you mean that we shouldn't expose it in /proc/meminfo ?
> > > > >
> > > > > We should not because it helps one particular use case only.
> > > > > Somebody else might want map mem info per container,
> > > > > then somebody would need it per user, etc.
> > > >
> > > > It seems we should show memcg info and user info in bpftool map show.
> > >
> > > Show memcg info? What do you have in mind?
> > >
> >
> > Each bpf map is charged to a memcg. If we know a bpf map belongs to
> > which memcg, we can know the map mem info per container.
> > Currently we can get the memcg info from the process which loads it,
> > but it can't apply to pinned-bpf-map.
> > So it would be better if we can show it in bpftool-map-show.
>
> That sounds useful.
> Have you looked at bpf iterators and how bpftool is using
> them to figure out which process loaded bpf prog and created particular map?

Yes, I have looked at it.  I know what you mean. It seems we can
introduce a memcg_iter or something else to implement it.
Uladzislau Rezki Jan. 30, 2023, 1:14 p.m. UTC | #9
On Sat, Jan 28, 2023 at 07:49:08PM +0800, Yafang Shao wrote:
> On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > I just don't want to add many if-elses or switch-cases into
> > > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > > Introducing a new map ops could make it more clear.  For example,
> > > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > > {
> > > > >     unsigned long size;
> > > > >
> > > > >     if (map->ops->map_mem_footprint)
> > > > >         return map->ops->map_mem_footprint(map);
> > > > >
> > > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > > }
> > > >
> > > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > > I prefer to keep all estimates in one place.
> > > > There is no need to be 100% accurate.
> > >
> > > Per my investigation, it can be almost accurate with little effort.
> > > Take the htab for example,
> > > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > > {
> > >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > >     unsigned long size = 0;
> > >
> > >     if (!htab_is_prealloc(htab)) {
> > >         size += htab_elements_size(htab);
> > >     }
> > >     size += kvsize(htab->elems);
> > >     size += percpu_size(htab->extra_elems);
> > >     size += kvsize(htab->buckets);
> > >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> > >     size += bpf_mem_alloc_size(&htab->ma);
> > >     if (htab->use_percpu_counter)
> > >         size += percpu_size(htab->pcount.counters);
> > >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> > >     size += kvsize(htab);
> > >     return size;
> > > }
> >
> > Please don't.
> > Above doesn't look maintainable.
> 
> It is similar to htab_map_free(). These pointers are the pointers
> which will be freed in map_free().
> We just need to keep map_mem_footprint() in sync with map_free(). It
> won't be a problem for maintenance.
> 
> > Look at kvsize(htab). Do you really care about hundred bytes?
> > Just accept that there will be a small constant difference
> > between what show_fdinfo reports and the real memory.
> 
> The point is we don't have a clear idea what the margin is.
> 
> > You cannot make it 100%.
> > There is kfence that will allocate 4k though you asked kmalloc(8).
> >
> 
> We already have ksize()[1], which covers the kfence.
> 
> [1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431
> 
> > > We just need to get the real memory size from the pointer instead of
> > > calculating the size again.
> > > For non-preallocated htab, it is a little trouble to get the element
> > > size (not the unit_size), but it won't be a big deal.
> >
> > You'd have to convince mm folks that kvsize() is worth doing.
> > I don't think it will be easy.
> >
> 
> As I mentioned above, we already have ksize(), so we only need to
> introduce vsize().  Per my understanding, we can simply use
> vm_struct->size to get the vmalloc size, see also the patch #5 in this
> patchset[2].
> 
> Andrew, Uladzislau, Christoph,  do you have any comments on this newly
> introduced vsize()[2] ?
> 
> [2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/
> 
<snip>
+/* Report full size of underlying allocation of a vmalloc'ed addr */
+static inline size_t vsize(const void *addr)
+{
+	struct vm_struct *area;
+
+	if (!addr)
+		return 0;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return 0;
+
+	return area->size;
+}
<snip>

You cannot access the area after the lock is dropped. We do not have any
ref counters for VA objects. Therefore it should be done like below:


<snip>
  spin_lock(&vmap_area_lock);
  va = __find_vmap_area(addr, &vmap_area_root);
  if (va && va->vm)
    va_size = va->vm->size;
  spin_unlock(&vmap_area_lock);

  return va_size;
<snip>
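
Folded into a complete helper, that might look like the sketch below
(illustrative only; it would have to live in mm/vmalloc.c, since
vmap_area_lock, vmap_area_root and __find_vmap_area() are local to
that file):

size_t vsize(const void *addr)
{
	struct vmap_area *va;
	size_t va_size = 0;

	if (!addr)
		return 0;

	/* hold vmap_area_lock across both the lookup and the ->vm dereference */
	spin_lock(&vmap_area_lock);
	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
	if (va && va->vm)
		va_size = va->vm->size;
	spin_unlock(&vmap_area_lock);

	return va_size;
}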

--
Uladzislau Rezki
Yafang Shao Jan. 31, 2023, 6:28 a.m. UTC | #10
On Mon, Jan 30, 2023 at 9:14 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 07:49:08PM +0800, Yafang Shao wrote:
> > On Thu, Jan 26, 2023 at 1:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:49 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > I just don't want to add many if-elses or switch-cases into
> > > > > > bpf_map_memory_footprint(), because I think it is a little ugly.
> > > > > > Introducing a new map ops could make it more clear.  For example,
> > > > > > static unsigned long bpf_map_memory_footprint(const struct bpf_map *map)
> > > > > > {
> > > > > >     unsigned long size;
> > > > > >
> > > > > >     if (map->ops->map_mem_footprint)
> > > > > >         return map->ops->map_mem_footprint(map);
> > > > > >
> > > > > >     size = round_up(map->key_size + bpf_map_value_size(map), 8);
> > > > > >     return round_up(map->max_entries * size, PAGE_SIZE);
> > > > > > }
> > > > >
> > > > > It is also ugly, because bpf_map_value_size() already has if-stmt.
> > > > > I prefer to keep all estimates in one place.
> > > > > There is no need to be 100% accurate.
> > > >
> > > > Per my investigation, it can be almost accurate with little effort.
> > > > Take the htab for example,
> > > > static unsigned long htab_mem_footprint(const struct bpf_map *map)
> > > > {
> > > >     struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> > > >     unsigned long size = 0;
> > > >
> > > >     if (!htab_is_prealloc(htab)) {
> > > >         size += htab_elements_size(htab);
> > > >     }
> > > >     size += kvsize(htab->elems);
> > > >     size += percpu_size(htab->extra_elems);
> > > >     size += kvsize(htab->buckets);
> > > >     size += bpf_mem_alloc_size(&htab->pcpu_ma);
> > > >     size += bpf_mem_alloc_size(&htab->ma);
> > > >     if (htab->use_percpu_counter)
> > > >         size += percpu_size(htab->pcount.counters);
> > > >     size += percpu_size(htab->map_locked[i]) * HASHTAB_MAP_LOCK_COUNT;
> > > >     size += kvsize(htab);
> > > >     return size;
> > > > }
> > >
> > > Please don't.
> > > Above doesn't look maintainable.
> >
> > It is similar to htab_map_free(). These pointers are the pointers
> > which will be freed in map_free().
> > We just need to keep map_mem_footprint() in sync with map_free(). It
> > won't be a problem for maintenance.
> >
> > > Look at kvsize(htab). Do you really care about hundred bytes?
> > > Just accept that there will be a small constant difference
> > > between what show_fdinfo reports and the real memory.
> >
> > The point is we don't have a clear idea what the margin is.
> >
> > > You cannot make it 100%.
> > > There is kfence that will allocate 4k though you asked kmalloc(8).
> > >
> >
> > We already have ksize()[1], which covers the kfence.
> >
> > [1]. https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/mm/slab_common.c#n1431
> >
> > > > We just need to get the real memory size from the pointer instead of
> > > > calculating the size again.
> > > > For non-preallocated htab, it is a little trouble to get the element
> > > > size (not the unit_size), but it won't be a big deal.
> > >
> > > You'd have to convince mm folks that kvsize() is worth doing.
> > > I don't think it will be easy.
> > >
> >
> > As I mentioned above, we already have ksize(), so we only need to
> > introduce vsize().  Per my understanding, we can simply use
> > vm_struct->size to get the vmalloc size, see also the patch #5 in this
> > patchset[2].
> >
> > Andrew, Uladzislau, Christoph,  do you have any comments on this newly
> > introduced vsize()[2] ?
> >
> > [2]. https://lore.kernel.org/bpf/20230112155326.26902-6-laoar.shao@gmail.com/
> >
> <snip>
> +/* Report full size of underlying allocation of a vmalloc'ed addr */
> +static inline size_t vsize(const void *addr)
> +{
> +       struct vm_struct *area;
> +
> +       if (!addr)
> +               return 0;
> +
> +       area = find_vm_area(addr);
> +       if (unlikely(!area))
> +               return 0;
> +
> +       return area->size;
> +}
> <snip>
>
> You can not access area after the lock is dropped. We do not have any
> ref counters for VA objects. Therefore it should be done like below:
>
>
> <snip>
>   spin_lock(&vmap_area_lock);
>   va = __find_vmap_area(addr, &vmap_area_root);
>   if (va && va->vm)
>     va_size = va->vm->size;
>   spin_unlock(&vmap_area_lock);
>
>   return va_size;
> <snip>
>

Ah, it should take this global lock.  I missed that.
Many thanks for the detailed explanation.