
[bpf-next,7/7] bpf: hashtab memory usage

Message ID: 20230202014158.19616-8-laoar.shao@gmail.com
State: New
Series: bpf, mm: bpf memory usage

Commit Message

Yafang Shao Feb. 2, 2023, 1:41 a.m. UTC
Get the htab memory usage from the htab pointers we have allocated. Some
small pointers are ignored, as their sizes are quite small compared with
the total size.

The results are as follows:
- before this change
1: hash  name count_map  flags 0x0  <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B
3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
        key 16B  value 24B  max_entries 1048576  memlock 41943040B

The memlock is always a fixed number, regardless of whether the map is
preallocated and regardless of how many elements have been allocated.

- after this change
1: hash  name count_map  flags 0x0  <<<< prealloc
        key 16B  value 24B  max_entries 1048576  memlock 109064464B
2: hash  name count_map  flags 0x1  <<<< non prealloc, fully set
        key 16B  value 24B  max_entries 1048576  memlock 117464320B
3: hash  name count_map  flags 0x1  <<<< non prealloc, non set
        key 16B  value 24B  max_entries 1048576  memlock 16797952B

The memlock now reflects what the hashtab has actually allocated.

At worst, the difference can be 10x. For example:
- before this change
4: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 8388608B

- after this change
4: hash  name count_map  flags 0x0
        key 4B  value 4B  max_entries 1048576  memlock 83898640B
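
The 10x can be broken down roughly as follows (approximate; it assumes
sizeof(struct htab_elem) == 48 and sizeof(struct bucket) == 16, as on a
typical x86_64 build):

  before: memlock is a pure estimate,
      round_up(key_size + value_size, 8) * max_entries
      = round_up(4 + 4, 8) * 1048576                  =  8388608B
  after: memlock follows the actual allocations,
      (sizeof(struct htab_elem) + round_up(4, 8) + round_up(4, 8))
          * max_entries            /* 64B per prealloced element */
      + sizeof(struct bucket) * n_buckets     /* the bucket array */
      + smaller allocations                           ~= 83898640B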

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

Comments

John Fastabend Feb. 4, 2023, 2:01 a.m. UTC | #1
Yafang Shao wrote:
> [...]

This walks the entire map and buckets to get the size? Inside an
RCU critical section as well, it seems :/

What am I missing? If you know how many elements are added (which
you can track on map updates), why can't we just calculate the
memory size directly from this?

Thanks,
John
Yafang Shao Feb. 5, 2023, 3:55 a.m. UTC | #2
On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Yafang Shao wrote:
> > [...]
> >
>
> > This walks the entire map and buckets to get the size? Inside an
> > RCU critical section as well, it seems :/
>

No, it doesn't walk the entire map and buckets; it just gets one
random element.
So it won't be a problem to do it inside an RCU critical section.

> What am I missing? If you know how many elements are added (which
> you can track on map updates), why can't we just calculate the
> memory size directly from this?
>

It is less accurate and harder to understand. Take a non-preallocated
percpu hashtab, for example. The size can be calculated as follows:
    key_size = round_up(htab->map.key_size, 8);
    value_size = round_up(htab->map.value_size, 8);
    pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
    usage = (value_size * num_possible_cpus() +
             pcpu_meta_size + key_size) * max_entries;

That is quite unfriendly to newbies, and may be error-prone.

Furthermore, it is less accurate, because the underlying allocations in
the MM subsystem are rounded up (a 40-byte request may actually consume
a 64-byte slab object, for instance).
Now we can get a more accurate usage with little overhead. Why not do it?
Cong Wang Feb. 5, 2023, 10:14 p.m. UTC | #3
On Thu, Feb 02, 2023 at 01:41:58AM +0000, Yafang Shao wrote:
> [...]
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-

What about other maps, like the regular array map?

Thanks.
Yafang Shao Feb. 6, 2023, 11:52 a.m. UTC | #4
On Mon, Feb 6, 2023 at 6:14 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Feb 02, 2023 at 01:41:58AM +0000, Yafang Shao wrote:
> > [...]
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  kernel/bpf/hashtab.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>
> What about other maps, like the regular array map?
>

I haven't finished the work to support all bpf maps.

Most of the maps have a fixed number of entries, so we can get the usage
directly from the map pointer, or get the element size and then calculate
it. For example,
- arraymap usage
  size = kvsize(array);
- percpu arraymap usage
  size = kvsize(array);
  size += percpu_size(array->pptrs[0]) * array->map.max_entries;

But there are special cases like cgroup_storage, whose max_entries is 0,
for example:
122: cgroup_storage  flags 0x0
        key 16B  value 8B  max_entries 0  memlock 0B

For this case, we need to count the entries first, e.g. as in the sketch below.
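
A minimal sketch of that idea (illustrative only: the function name is
made up, and it assumes the internal storage list and lock that
kernel/bpf/local_storage.c keeps for this map type):

    static unsigned long cgroup_storage_mem_usage(struct bpf_map *map)
    {
            struct bpf_cgroup_storage_map *smap = map_to_storage(map);
            struct bpf_cgroup_storage *storage;
            unsigned long n = 0;

            /* max_entries is 0 for this map type, so count the live
             * entries first
             */
            spin_lock_bh(&smap->lock);
            list_for_each_entry(storage, &smap->list, list_map)
                    n++;
            spin_unlock_bh(&smap->lock);

            /* one storage struct plus one value buffer per entry */
            return sizeof(*smap) + n * (sizeof(*storage) + map->value_size);
    }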
Alexei Starovoitov Feb. 8, 2023, 1:56 a.m. UTC | #5
On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Feb 4, 2023 at 10:01 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > [...]
>
> > What am I missing? If you know how many elements are added (which
> > you can track on map updates), why can't we just calculate the
> > memory size directly from this?
> >
>
> It is less accurate and harder to understand. Take a non-preallocated
> percpu hashtab, for example. The size can be calculated as follows:
>     key_size = round_up(htab->map.key_size, 8);
>     value_size = round_up(htab->map.value_size, 8);
>     pcpu_meta_size = sizeof(struct llist_node) + sizeof(void *);
>     usage = (value_size * num_possible_cpus() +
>              pcpu_meta_size + key_size) * max_entries;
>
> That is quite unfriendly to newbies, and may be error-prone.

Please do that instead.
map_mem_usage callback is a no go as I mentioned earlier.

> Furthermore, it is less accurate, because the underlying allocations in
> the MM subsystem are rounded up (a 40-byte request may actually consume
> a 64-byte slab object, for instance).
> Now we can get a more accurate usage with little overhead. Why not do it?

because htab_mem_usage() is not maintainable long term.
100% accuracy is a non-goal.
Yafang Shao Feb. 8, 2023, 3:33 a.m. UTC | #6
On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Feb 4, 2023 at 7:56 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > [...]
> >
>
> Please do that instead.

I can do it as you suggested, but it seems we shouldn't keep all
estimates in one place. Because ...

> map_mem_usage callback is a no go as I mentioned earlier.

...we have to introduce the map_mem_usage callback. Take the lpm_trie,
for example; its usage is
usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;

I don't think we want to declare struct lpm_trie_node in kernel/bpf/syscall.c.
WDYT?

>
> > Furthermore, it is less accurate, because the underlying allocations in
> > the MM subsystem are rounded up (a 40-byte request may actually consume
> > a 64-byte slab object, for instance).
> > Now we can get a more accurate usage with little overhead. Why not do it?
>
> because htab_mem_usage() is not maintainable long term.
> 100% accuracy is a non-goal.

htab_mem_usage() gives us an option to continue making it more
accurate, at the cost of considerable overhead. But I won't insist on
it if you think we don't need to make it more accurate.
Alexei Starovoitov Feb. 8, 2023, 4:29 a.m. UTC | #7
On Tue, Feb 7, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > [...]
> >
> > Please do that instead.
>
> I can do it as you suggested, but it seems we shouldn't keep all
> estimates in one place. Because ...
>
> > map_mem_usage callback is a no go as I mentioned earlier.
>
> ...we have to introduce the map_mem_usage callback. Take the lpm_trie,
> for example; its usage is
> usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;

sizeof(struct lpm_trie_node) + trie->data_size + trie->map.value_size.

and it won't count the inner nodes, but _it is ok_.

> I don't think we want to declare struct lpm_trie_node in kernel/bpf/syscall.c.
> WDYT?

Good point. Fine. Let's go with the callback, but please keep it
to a single function, without loops like htab_non_prealloc_elems_size()
and htab_prealloc_elems_size().

Also please implement it for all maps.
Doing it just for hash and arguing that every byte of accuracy matters
while not addressing lpm and other maps doesn't give credibility
to the accuracy argument.
Yafang Shao Feb. 8, 2023, 2:22 p.m. UTC | #8
On Wed, Feb 8, 2023 at 12:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 7, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Feb 8, 2023 at 9:56 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > [...]
> > >
> > > Please do that instead.
> >
> > I can do it as you suggested, but it seems we shouldn't keep all
> > estimates in one place. Because ...
> >
> > > map_mem_usage callback is a no go as I mentioned earlier.
> >
> > ...we have to introduce the map_mem_usage callback. Take the lpm_trie,
> > for example; its usage is
> > usage = (sizeof(struct lpm_trie_node) + trie->data_size) * trie->n_entries;
>
> sizeof(struct lpm_trie_node) + trie->data_size + trie->map.value_size.
>

Thanks for correcting it.

> and it won't count the inner nodes, but _it is ok_.
>
> > I don't think we want to declare struct lpm_trie_node in kernel/bpf/syscall.c.
> > WDYT?
>
> Good point. Fine. Let's go with the callback, but please keep it
> to a single function, without loops like htab_non_prealloc_elems_size()
> and htab_prealloc_elems_size().
>
> Also please implement it for all maps.

Sure, I will do it.

> Doing it just for hash and arguing that every byte of accuracy matters
> while not addressing lpm and other maps doesn't give credibility
> to the accuracy argument.
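
For illustration, a single-function, loop-free htab callback along the
lines requested above might look like the sketch below. This is only a
sketch under the thread's assumptions (it reuses existing hashtab.c
fields such as htab->elem_size, htab->use_percpu_counter and
htab->pcount, and is not the final merged implementation):

    static unsigned long htab_mem_usage(const struct bpf_map *map)
    {
            struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
            u32 value_size = round_up(htab->map.value_size, 8);
            unsigned long usage = sizeof(struct bpf_htab);
            u64 num_entries;

            /* The bucket array is allocated up front in both modes. */
            usage += sizeof(struct bucket) * htab->n_buckets;

            /* Prealloc maps allocate all max_entries elements at create
             * time; otherwise use the element count the map already
             * maintains on updates.
             */
            if (htab_is_prealloc(htab))
                    num_entries = htab->map.max_entries;
            else
                    num_entries = htab->use_percpu_counter ?
                                  percpu_counter_sum(&htab->pcount) :
                                  atomic_read(&htab->count);

            usage += (u64)htab->elem_size * num_entries;
            /* Per-cpu values live outside the element itself. */
            if (htab_is_percpu(htab))
                    usage += (u64)value_size * num_possible_cpus() * num_entries;
            return usage;
    }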

Patch

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 66bded1..cba540b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -273,6 +273,25 @@  static void htab_free_elems(struct bpf_htab *htab)
 	bpf_map_area_free(htab->elems);
 }
 
+static unsigned long htab_prealloc_elems_size(struct bpf_htab *htab)
+{
+	unsigned long size = 0;
+	int i;
+
+	if (!htab_is_percpu(htab))
+		return kvsize(htab->elems);
+
+	for (i = 0; i < htab->map.max_entries; i++) {
+		void __percpu *pptr;
+
+		pptr = htab_elem_get_ptr(get_htab_elem(htab, i),
+					htab->map.key_size);
+		size += percpu_size(pptr);
+	}
+	size += kvsize(htab->elems);
+	return size;
+}
+
 /* The LRU list has a lock (lru_lock). Each htab bucket has a lock
  * (bucket_lock). If both locks need to be acquired together, the lock
  * order is always lru_lock -> bucket_lock and this only happens in
@@ -864,6 +883,16 @@  static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 	bpf_mem_cache_free(&htab->ma, l);
 }
 
+static unsigned long htab_elem_size(struct bpf_htab *htab, struct htab_elem *l)
+{
+	unsigned long size = 0;
+
+	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
+		size += bpf_mem_cache_elem_size(&htab->pcpu_ma, l->ptr_to_pptr);
+
+	return size + bpf_mem_cache_elem_size(&htab->ma, l);
+}
+
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
 {
 	struct bpf_map *map = &htab->map;
@@ -899,7 +928,6 @@  static void dec_elem_count(struct bpf_htab *htab)
 		atomic_dec(&htab->count);
 }
 
-
 static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 {
 	htab_put_fd_value(htab, l);
@@ -1457,6 +1485,31 @@  static void delete_all_elements(struct bpf_htab *htab)
 	migrate_enable();
 }
 
+static unsigned long htab_non_prealloc_elems_size(struct bpf_htab *htab)
+{
+	unsigned long size = 0;
+	unsigned long count;
+	int i;
+
+	rcu_read_lock();
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_nulls_head *head = select_bucket(htab, i);
+		struct hlist_nulls_node *n;
+		struct htab_elem *l;
+
+		hlist_nulls_for_each_entry(l, n, head, hash_node) {
+			size = htab_elem_size(htab, l);
+			goto out;
+		}
+	}
+out:
+	rcu_read_unlock();
+	count = htab->use_percpu_counter ? percpu_counter_sum(&htab->pcount) :
+			atomic_read(&htab->count);
+
+	return size * count;
+}
+
 static void htab_free_malloced_timers(struct bpf_htab *htab)
 {
 	int i;
@@ -1523,6 +1576,26 @@  static void htab_map_free(struct bpf_map *map)
 	bpf_map_area_free(htab);
 }
 
+/* Get the htab memory usage from pointers we have already allocated.
+ * Some minor pointers are ignored as their sizes are quite small compared
+ * with the total size.
+ */
+static unsigned long htab_mem_usage(const struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	unsigned long size = 0;
+
+	if (!htab_is_prealloc(htab))
+		size += htab_non_prealloc_elems_size(htab);
+	else
+		size += htab_prealloc_elems_size(htab);
+	size += percpu_size(htab->extra_elems);
+	size += kvsize(htab->buckets);
+	size += bpf_mem_alloc_size(&htab->pcpu_ma);
+	size += bpf_mem_alloc_size(&htab->ma);
+	return size;
+}
+
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
 				   struct seq_file *m)
 {
@@ -2191,6 +2264,7 @@  static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_seq_show_elem = htab_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2212,6 +2286,7 @@  static int bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_f
 	.map_seq_show_elem = htab_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_lru),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2363,6 +2438,7 @@  static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2382,6 +2458,7 @@  static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key,
 	.map_seq_show_elem = htab_percpu_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
 	.map_for_each_callback = bpf_for_each_hash_elem,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab_lru_percpu),
 	.map_btf_id = &htab_map_btf_ids[0],
 	.iter_seq_info = &iter_seq_info,
@@ -2519,6 +2596,7 @@  static void htab_of_map_free(struct bpf_map *map)
 	.map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem,
 	.map_gen_lookup = htab_of_map_gen_lookup,
 	.map_check_btf = map_check_no_btf,
+	.map_mem_usage = htab_mem_usage,
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 };
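
To see the effect, one can create and inspect a map with bpftool, e.g.
(the pin path and map name below are just examples):

    # create a pinned hash map matching example 4 in the commit message
    bpftool map create /sys/fs/bpf/count_map type hash key 4 value 4 \
            entries 1048576 name count_map

    # show it; with this series applied, memlock reflects the actual
    # allocations rather than a key/value estimate
    bpftool map show pinned /sys/fs/bpf/count_map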