[bpf-next,v9,00/34] bpf: switch to memcg-based memory accounting

Message ID 20201201215900.3569844-1-guro@fb.com (mailing list archive)

Message

Roman Gushchin Dec. 1, 2020, 9:58 p.m. UTC
Currently bpf uses the memlock rlimit for memory accounting.
This approach has its downsides and over time has created a significant
number of problems:

1) The limit is per-user, but because most bpf operations are performed
   as root, the limit has little value.

2) It's hard to come up with a specific maximum value, especially because
   the counter is shared with non-bpf use cases (e.g. mlock()).
   Any specific value is either too low and creates false failures
   or too high and useless.

3) Charging is not connected to the actual memory allocation. Bpf code
   has to manually calculate the estimated cost and charge the counter,
   and then take care of uncharging, including on all failure paths.
   This adds to the code complexity and makes it easy to leak a charge.

4) There is no simple way of getting the current value of the counter.
   We've used drgn for it, but it's far from being convenient.

5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
   a function to "explain" this case for users.

6) rlimits are generally considered (at least partially) obsolete.
   They do not provide a comprehensive system for the control of physical
   resources: memory, cpu, io etc. All resource control developments
   in recent years have been related to cgroups.

To overcome these problems, let's switch to memory cgroup-based
accounting of bpf objects. With the recent addition of percpu memory
accounting, it is now possible to provide comprehensive accounting
of the memory used by bpf programs and maps.

This approach has the following advantages:
1) The limit is per-cgroup and hierarchical. It's far more flexible and allows
   better control over memory usage by different workloads.

2) The actual memory consumption is taken into account. Charging happens
   automatically at allocation time if the __GFP_ACCOUNT flag is passed.
   Uncharging is also performed automatically when the memory is released,
   so the code on the bpf side becomes simpler and safer.

3) There is a simple way to get the current value and statistics.
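
As a rough illustration of point 3, the current charge of a cgroup can be read
from the cgroup v2 memory.current file. The sketch below is not part of the
series: the helper names are made up for illustration, and it assumes that
cgroup v2 is mounted at /sys/fs/cgroup. Note that memory.current covers all
memory charged to the cgroup, not only bpf objects.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.
 * The unified hierarchy entry has the form "0::<path>\n".
 * Returns 0 on success, -1 if no such entry exists or it does not fit.
 */
static int cgroup2_path(const char *proc_cgroup, char *out, size_t outlen)
{
	const char *line = proc_cgroup;

	while (line && *line) {
		if (strncmp(line, "0::", 3) == 0) {
			const char *path = line + 3;
			size_t len = strcspn(path, "\n");

			if (len + 1 > outlen)
				return -1;
			memcpy(out, path, len);
			out[len] = '\0';
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read memory.current for the calling process; returns -1 on any failure. */
static long long read_memory_current(void)
{
	char buf[4096], path[4096], file[4200];
	long long val = -1;
	size_t n;
	FILE *f;

	f = fopen("/proc/self/cgroup", "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	if (cgroup2_path(buf, path, sizeof(path)))
		return -1;

	/* Assumption: cgroup v2 is mounted at /sys/fs/cgroup. */
	snprintf(file, sizeof(file), "/sys/fs/cgroup%s/memory.current", path);
	f = fopen(file, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

A caller would simply invoke read_memory_current() and treat -1 as "not
available", which is what happens for the root cgroup or on a system without
the memory controller enabled.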

Cgroup-based accounting adds new requirements:
1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled.
   These options are usually enabled, maybe excluding tiny builds for embedded
   devices.
2) The system should have a configured cgroup hierarchy, including reasonable
   memory limits and/or guarantees. Modern systems usually delegate this task
   to systemd or similar task managers.

If these requirements are not met, there are no limits on how much memory bpf
can use, and a non-root user is able to hurt the system by allocating too much.
But because per-user rlimits do not provide a functional system to protect
and manage physical resources anyway, anyone who seriously depends on such
protection should use cgroups.

When a bpf map is created, the memory cgroup of the process which creates
the map is recorded. Subsequently, all memory allocations related to the bpf
map are charged to the same cgroup. This includes allocations made from
interrupt context and by any other process. Bpf program memory is charged to
the memory cgroup of the process which loads the program.
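
The charging rule described above can be pictured with a small userspace
model. This is only a toy and not kernel code: struct toy_memcg, struct
toy_map and the toy_* functions are hypothetical names, and a plain byte
counter stands in for the real memcg charge. It shows the two properties that
matter here: every allocation made on behalf of a map is charged to the cgroup
saved at creation time, and freeing the memory uncharges it without any
bookkeeping by the caller.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a memory cgroup: just a byte counter. */
struct toy_memcg {
	size_t charged;
};

/* Toy stand-in for a bpf map that remembers its creator's cgroup. */
struct toy_map {
	struct toy_memcg *memcg;
};

/* Header stored in front of each allocation so that free can uncharge. */
union toy_hdr {
	struct {
		struct toy_memcg *memcg;
		size_t size;
	} h;
	max_align_t align;	/* keeps the payload suitably aligned */
};

/* Record the creator's cgroup once, at map creation time. */
static void toy_map_init(struct toy_map *map, struct toy_memcg *creator)
{
	map->memcg = creator;
}

/* Allocate on behalf of the map; always charges the saved cgroup. */
static void *toy_map_alloc(struct toy_map *map, size_t size)
{
	union toy_hdr *hdr = malloc(sizeof(*hdr) + size);

	if (!hdr)
		return NULL;
	hdr->h.memcg = map->memcg;
	hdr->h.size = size;
	map->memcg->charged += size;
	return hdr + 1;
}

/* Free and uncharge in one step; the caller tracks nothing. */
static void toy_free(void *ptr)
{
	union toy_hdr *hdr;

	if (!ptr)
		return;
	hdr = (union toy_hdr *)ptr - 1;
	hdr->h.memcg->charged -= hdr->h.size;
	free(hdr);
}
```

In the series itself this role is played by helpers such as
bpf_map_kmalloc_node(), bpf_map_kzalloc() and bpf_map_alloc_percpu(), which
are named in the changelog below.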

The patchset consists of the following parts:
1) 4 patches on the mm side; they are required, otherwise vmallocs cannot be
   mapped to userspace
2) memcg-based accounting for various bpf objects: progs and maps
3) removal of the rlimit-based accounting
4) removal of rlimit adjustments in userspace samples
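
For reference, the rlimit adjustment that part 4 removes from the samples
typically looked like the sketch below. The helper name bump_memlock_rlimit()
is made up here; the underlying setrlimit(RLIMIT_MEMLOCK) call is the standard
POSIX interface. With memcg-based accounting this boilerplate is no longer
needed on new kernels.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Set both the soft and hard RLIMIT_MEMLOCK to the given value before
 * creating bpf objects. Returns 0 on success, -1 on failure (for example
 * when raising the hard limit without sufficient privileges).
 */
static int bump_memlock_rlimit(rlim_t limit)
{
	struct rlimit r = { .rlim_cur = limit, .rlim_max = limit };

	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
		perror("setrlimit(RLIMIT_MEMLOCK)");
		return -1;
	}
	return 0;
}
```

Programs that ran as root usually passed RLIM_INFINITY, which is exactly why
the limit had little value in practice.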


v9:
  - always charge the saved memory cgroup, by Daniel, Toke and Alexei
  - added bpf_map_kzalloc()
  - rebase and minor fixes

v8:
  - extended the cover letter to be more clear on new requirements, by Daniel
  - an approximate value is provided by map memlock info, by Alexei

v7:
  - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
  - switched allocations made from an interrupt context to new helpers,
    by Daniel
  - rebase and minor fixes

v6:
  - rebased to the latest version of the remote charging API
  - fixed signatures, added acks

v5:
  - rebased to the latest version of the remote charging API
  - implemented kmem accounting from an interrupt context, by Shakeel
  - rebased onto the latest mm changes allowing vmallocs to be mapped to
    userspace
  - fixed a build issue in kselftests, by Alexei
  - fixed a use-after-free bug in bpf_map_free_deferred()
  - added bpf line info coverage, by Shakeel
  - split bpf map charging preparations into a separate patch

v4:
  - covered allocations made from an interrupt context, by Daniel
  - added some clarifications to the cover letter

v3:
  - dropped the userspace part for further discussions/refinements,
    by Andrii and Song

v2:
  - fixed build issue, caused by the remaining rlimit-based accounting
    for sockhash maps


Roman Gushchin (34):
  mm: memcontrol: use helpers to read page's memcg data
  mm: memcontrol/slab: use helpers to access slab page's memcg_data
  mm: introduce page memcg flags
  mm: convert page kmemcg type to a page memcg flag
  bpf: memcg-based memory accounting for bpf progs
  bpf: prepare for memcg-based memory accounting for bpf maps
  bpf: memcg-based memory accounting for bpf maps
  bpf: refine memcg-based memory accounting for arraymap maps
  bpf: refine memcg-based memory accounting for cpumap maps
  bpf: memcg-based memory accounting for cgroup storage maps
  bpf: refine memcg-based memory accounting for devmap maps
  bpf: refine memcg-based memory accounting for hashtab maps
  bpf: memcg-based memory accounting for lpm_trie maps
  bpf: memcg-based memory accounting for bpf ringbuffer
  bpf: memcg-based memory accounting for bpf local storage maps
  bpf: refine memcg-based memory accounting for sockmap and sockhash
    maps
  bpf: refine memcg-based memory accounting for xskmap maps
  bpf: eliminate rlimit-based memory accounting for arraymap maps
  bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
  bpf: eliminate rlimit-based memory accounting for cpumap maps
  bpf: eliminate rlimit-based memory accounting for cgroup storage maps
  bpf: eliminate rlimit-based memory accounting for devmap maps
  bpf: eliminate rlimit-based memory accounting for hashtab maps
  bpf: eliminate rlimit-based memory accounting for lpm_trie maps
  bpf: eliminate rlimit-based memory accounting for queue_stack_maps
    maps
  bpf: eliminate rlimit-based memory accounting for reuseport_array maps
  bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
  bpf: eliminate rlimit-based memory accounting for sockmap and sockhash
    maps
  bpf: eliminate rlimit-based memory accounting for stackmap maps
  bpf: eliminate rlimit-based memory accounting for xskmap maps
  bpf: eliminate rlimit-based memory accounting for bpf local storage
    maps
  bpf: eliminate rlimit-based memory accounting infra for bpf maps
  bpf: eliminate rlimit-based memory accounting for bpf progs
  bpf: samples: do not touch RLIMIT_MEMLOCK

 fs/buffer.c                                   |   2 +-
 fs/iomap/buffered-io.c                        |   2 +-
 include/linux/bpf.h                           |  57 +++--
 include/linux/memcontrol.h                    | 215 +++++++++++++++-
 include/linux/mm.h                            |  22 --
 include/linux/mm_types.h                      |   5 +-
 include/linux/page-flags.h                    |  11 +-
 include/trace/events/writeback.h              |   2 +-
 kernel/bpf/arraymap.c                         |  30 +--
 kernel/bpf/bpf_local_storage.c                |  20 +-
 kernel/bpf/bpf_struct_ops.c                   |  19 +-
 kernel/bpf/core.c                             |  22 +-
 kernel/bpf/cpumap.c                           |  37 +--
 kernel/bpf/devmap.c                           |  25 +-
 kernel/bpf/hashtab.c                          |  43 ++--
 kernel/bpf/local_storage.c                    |  44 +---
 kernel/bpf/lpm_trie.c                         |  19 +-
 kernel/bpf/queue_stack_maps.c                 |  16 +-
 kernel/bpf/reuseport_array.c                  |  12 +-
 kernel/bpf/ringbuf.c                          |  35 +--
 kernel/bpf/stackmap.c                         |  16 +-
 kernel/bpf/syscall.c                          | 234 +++++++-----------
 kernel/fork.c                                 |   7 +-
 mm/debug.c                                    |   4 +-
 mm/huge_memory.c                              |   4 +-
 mm/memcontrol.c                               | 139 +++++------
 mm/page_alloc.c                               |   8 +-
 mm/page_io.c                                  |   6 +-
 mm/slab.h                                     |  38 +--
 mm/workingset.c                               |   2 +-
 net/core/sock_map.c                           |  42 +---
 net/xdp/xskmap.c                              |  15 +-
 samples/bpf/map_perf_test_user.c              |   6 -
 samples/bpf/offwaketime_user.c                |   6 -
 samples/bpf/sockex2_user.c                    |   2 -
 samples/bpf/sockex3_user.c                    |   2 -
 samples/bpf/spintest_user.c                   |   6 -
 samples/bpf/syscall_tp_user.c                 |   2 -
 samples/bpf/task_fd_query_user.c              |   6 -
 samples/bpf/test_lru_dist.c                   |   3 -
 samples/bpf/test_map_in_map_user.c            |   6 -
 samples/bpf/test_overhead_user.c              |   2 -
 samples/bpf/trace_event_user.c                |   2 -
 samples/bpf/tracex2_user.c                    |   6 -
 samples/bpf/tracex3_user.c                    |   6 -
 samples/bpf/tracex4_user.c                    |   6 -
 samples/bpf/tracex5_user.c                    |   3 -
 samples/bpf/tracex6_user.c                    |   3 -
 samples/bpf/xdp1_user.c                       |   6 -
 samples/bpf/xdp_adjust_tail_user.c            |   6 -
 samples/bpf/xdp_monitor_user.c                |   5 -
 samples/bpf/xdp_redirect_cpu_user.c           |   6 -
 samples/bpf/xdp_redirect_map_user.c           |   6 -
 samples/bpf/xdp_redirect_user.c               |   6 -
 samples/bpf/xdp_router_ipv4_user.c            |   6 -
 samples/bpf/xdp_rxq_info_user.c               |   6 -
 samples/bpf/xdp_sample_pkts_user.c            |   6 -
 samples/bpf/xdp_tx_iptunnel_user.c            |   6 -
 samples/bpf/xdpsock_user.c                    |   7 -
 .../selftests/bpf/progs/bpf_iter_bpf_map.c    |   2 +-
 .../selftests/bpf/progs/map_ptr_kern.c        |   7 -
 61 files changed, 533 insertions(+), 762 deletions(-)

Comments

patchwork-bot+netdevbpf@kernel.org Dec. 3, 2020, 2:50 a.m. UTC | #1
Hello:

This series was applied to bpf/bpf-next.git (refs/heads/master):

On Tue, 1 Dec 2020 13:58:26 -0800 you wrote:
> Currently bpf is using the memlock rlimit for the memory accounting.
> This approach has its downsides and over time has created a significant
> amount of problems:
> 
> 1) The limit is per-user, but because most bpf operations are performed
>    as root, the limit has a little value.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v9,01/34] mm: memcontrol: use helpers to read page's memcg data
    https://git.kernel.org/bpf/bpf-next/c/bcfe06bf2622
  - [bpf-next,v9,02/34] mm: memcontrol/slab: use helpers to access slab page's memcg_data
    https://git.kernel.org/bpf/bpf-next/c/270c6a71460e
  - [bpf-next,v9,03/34] mm: introduce page memcg flags
    https://git.kernel.org/bpf/bpf-next/c/87944e2992bd
  - [bpf-next,v9,04/34] mm: convert page kmemcg type to a page memcg flag
    https://git.kernel.org/bpf/bpf-next/c/18b2db3b0385
  - [bpf-next,v9,05/34] bpf: memcg-based memory accounting for bpf progs
    https://git.kernel.org/bpf/bpf-next/c/ddf8503c7c43
  - [bpf-next,v9,06/34] bpf: prepare for memcg-based memory accounting for bpf maps
    https://git.kernel.org/bpf/bpf-next/c/48edc1f78aab
  - [bpf-next,v9,07/34] bpf: memcg-based memory accounting for bpf maps
    https://git.kernel.org/bpf/bpf-next/c/d5299b67dd59
  - [bpf-next,v9,08/34] bpf: refine memcg-based memory accounting for arraymap maps
    https://git.kernel.org/bpf/bpf-next/c/6d192c7938b7
  - [bpf-next,v9,09/34] bpf: refine memcg-based memory accounting for cpumap maps
    https://git.kernel.org/bpf/bpf-next/c/e88cc05b61f3
  - [bpf-next,v9,10/34] bpf: memcg-based memory accounting for cgroup storage maps
    https://git.kernel.org/bpf/bpf-next/c/3a61c7c58b30
  - [bpf-next,v9,11/34] bpf: refine memcg-based memory accounting for devmap maps
    https://git.kernel.org/bpf/bpf-next/c/1440290adf7b
  - [bpf-next,v9,12/34] bpf: refine memcg-based memory accounting for hashtab maps
    https://git.kernel.org/bpf/bpf-next/c/881456811a33
  - [bpf-next,v9,13/34] bpf: memcg-based memory accounting for lpm_trie maps
    https://git.kernel.org/bpf/bpf-next/c/353e7af4bf5e
  - [bpf-next,v9,14/34] bpf: memcg-based memory accounting for bpf ringbuffer
    https://git.kernel.org/bpf/bpf-next/c/be4035c734d1
  - [bpf-next,v9,15/34] bpf: memcg-based memory accounting for bpf local storage maps
    https://git.kernel.org/bpf/bpf-next/c/e9aae8beba82
  - [bpf-next,v9,16/34] bpf: refine memcg-based memory accounting for sockmap and sockhash maps
    https://git.kernel.org/bpf/bpf-next/c/7846dd9f835e
  - [bpf-next,v9,17/34] bpf: refine memcg-based memory accounting for xskmap maps
    https://git.kernel.org/bpf/bpf-next/c/28e1dcdef0cb
  - [bpf-next,v9,18/34] bpf: eliminate rlimit-based memory accounting for arraymap maps
    https://git.kernel.org/bpf/bpf-next/c/1bc5975613ed
  - [bpf-next,v9,19/34] bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
    https://git.kernel.org/bpf/bpf-next/c/f043733f31e5
  - [bpf-next,v9,20/34] bpf: eliminate rlimit-based memory accounting for cpumap maps
    https://git.kernel.org/bpf/bpf-next/c/711cabaf1432
  - [bpf-next,v9,21/34] bpf: eliminate rlimit-based memory accounting for cgroup storage maps
    https://git.kernel.org/bpf/bpf-next/c/087b0d39fe22
  - [bpf-next,v9,22/34] bpf: eliminate rlimit-based memory accounting for devmap maps
    https://git.kernel.org/bpf/bpf-next/c/844f157f6c0a
  - [bpf-next,v9,23/34] bpf: eliminate rlimit-based memory accounting for hashtab maps
    https://git.kernel.org/bpf/bpf-next/c/755e5d55367a
  - [bpf-next,v9,24/34] bpf: eliminate rlimit-based memory accounting for lpm_trie maps
    https://git.kernel.org/bpf/bpf-next/c/cbddcb574d41
  - [bpf-next,v9,25/34] bpf: eliminate rlimit-based memory accounting for queue_stack_maps maps
    https://git.kernel.org/bpf/bpf-next/c/a37fb7ef24a4
  - [bpf-next,v9,26/34] bpf: eliminate rlimit-based memory accounting for reuseport_array maps
    https://git.kernel.org/bpf/bpf-next/c/db54330d3e13
  - [bpf-next,v9,27/34] bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
    https://git.kernel.org/bpf/bpf-next/c/abbdd0813f34
  - [bpf-next,v9,28/34] bpf: eliminate rlimit-based memory accounting for sockmap and sockhash maps
    https://git.kernel.org/bpf/bpf-next/c/0d2c4f964050
  - [bpf-next,v9,29/34] bpf: eliminate rlimit-based memory accounting for stackmap maps
    https://git.kernel.org/bpf/bpf-next/c/370868107bf6
  - [bpf-next,v9,30/34] bpf: eliminate rlimit-based memory accounting for xskmap maps
    https://git.kernel.org/bpf/bpf-next/c/819a4f323579
  - [bpf-next,v9,31/34] bpf: eliminate rlimit-based memory accounting for bpf local storage maps
    https://git.kernel.org/bpf/bpf-next/c/ab31be378a63
  - [bpf-next,v9,32/34] bpf: eliminate rlimit-based memory accounting infra for bpf maps
    https://git.kernel.org/bpf/bpf-next/c/80ee81e0403c
  - [bpf-next,v9,33/34] bpf: eliminate rlimit-based memory accounting for bpf progs
    https://git.kernel.org/bpf/bpf-next/c/3ac1f01b43b6
  - [bpf-next,v9,34/34] bpf: samples: do not touch RLIMIT_MEMLOCK
    https://git.kernel.org/bpf/bpf-next/c/5b0764b2d345

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Alexei Starovoitov Dec. 3, 2020, 2:54 a.m. UTC | #2
On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin <guro@fb.com> wrote:
>
> 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
>    a function to "explain" this case for users.
...
> v9:
>   - always charge the saved memory cgroup, by Daniel, Toke and Alexei
>   - added bpf_map_kzalloc()
>   - rebase and minor fixes

This looks great. Applied.
Please follow up with a change to libbpf's pr_perm_msg().
That helpful warning should stay for old kernels, but it would be
misleading for new kernels.
libbpf probably needs a feature check to make this warning conditional.

Thanks!

Roman Gushchin Dec. 3, 2020, 3:26 a.m. UTC | #3
On Wed, Dec 02, 2020 at 06:54:46PM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
> >    a function to "explain" this case for users.
> ...
> > v9:
> >   - always charge the saved memory cgroup, by Daniel, Toke and Alexei
> >   - added bpf_map_kzalloc()
> >   - rebase and minor fixes
> 
> This looks great. Applied.

Thanks!

> Please follow up with a change to libbpf's pr_perm_msg().
> That helpful warning should stay for old kernels, but it would be
> misleading for new kernels.
> libbpf probably needs a feature check to make this warning conditional.

I think we discussed it several months ago, and at that time we didn't
find a good way to check for this feature. I'll think about it again, but if
somebody has any ideas here, I'd appreciate it a lot.

Daniel Borkmann Dec. 5, 2020, 12:37 a.m. UTC | #4
On 12/3/20 4:26 AM, Roman Gushchin wrote:
> On Wed, Dec 02, 2020 at 06:54:46PM -0800, Alexei Starovoitov wrote:
>> On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin <guro@fb.com> wrote:
>>>
>>> 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
>>>     a function to "explain" this case for users.
>> ...
>>> v9:
>>>    - always charge the saved memory cgroup, by Daniel, Toke and Alexei
>>>    - added bpf_map_kzalloc()
>>>    - rebase and minor fixes
>>
>> This looks great. Applied.
> 
> Thanks!
> 
>> Please follow up with a change to libbpf's pr_perm_msg().
>> That helpful warning should stay for old kernels, but it would be
>> misleading for new kernels.
>> libbpf probably needs a feature check to make this warning conditional.
> 
> I think we've discussed it several months ago and at that time we didn't
> find a good way to check this feature. I'll think again, but if somebody
> has any ideas here, I'll appreciate a lot.

Hm, bit tricky, agree .. given we only throw the warning in pr_perm_msg() for
non-root and thus probing options are also limited, otherwise just probing for
a helper that was added in this same cycle would have been good enough as a
simple heuristic. I wonder if it would make sense to add some hint inside the
bpf_{prog,map}_show_fdinfo() to indicate that accounting with memcg is enabled
for the prog/map one way or another? Not just for the sake of pr_perm_msg(), but
in general for apps to stop messing with rlimit at this point. Maybe also bpftool
feature probe could be extended to indicate that as well (e.g. the json output
can be fed into Go natively).

Andrii Nakryiko Dec. 8, 2020, 2:53 a.m. UTC | #5
On Fri, Dec 4, 2020 at 4:37 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 12/3/20 4:26 AM, Roman Gushchin wrote:
> > On Wed, Dec 02, 2020 at 06:54:46PM -0800, Alexei Starovoitov wrote:
> >> On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin <guro@fb.com> wrote:
> >>>
> >>> 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
> >>>     a function to "explain" this case for users.
> >> ...
> >>> v9:
> >>>    - always charge the saved memory cgroup, by Daniel, Toke and Alexei
> >>>    - added bpf_map_kzalloc()
> >>>    - rebase and minor fixes
> >>
> >> This looks great. Applied.
> >
> > Thanks!
> >
> >> Please follow up with a change to libbpf's pr_perm_msg().
> >> That helpful warning should stay for old kernels, but it would be
> >> misleading for new kernels.
> >> libbpf probably needs a feature check to make this warning conditional.
> >
> > I think we've discussed it several months ago and at that time we didn't
> > find a good way to check this feature. I'll think again, but if somebody
> > has any ideas here, I'll appreciate a lot.
>
> Hm, bit tricky, agree .. given we only throw the warning in pr_perm_msg() for
> non-root and thus probing options are also limited, otherwise just probing for
> a helper that was added in this same cycle would have been good enough as a
> simple heuristic. I wonder if it would make sense to add some hint inside the
> bpf_{prog,map}_show_fdinfo() to indicate that accounting with memcg is enabled

I think the initial version was emitting 0 for memlock, so that was a
pretty simple way to probe stuff. But I think it was changed at the
last minute to emit some non-zero "estimate" of memory used or
something like that?

> for the prog/map one way or another? Not just for the sake of pr_perm_msg(), but
> in general for apps to stop messing with rlimit at this point. Maybe also bpftool
> feature probe could be extended to indicate that as well (e.g. the json output
> can be fed into Go natively).
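
The fdinfo-based probing discussed in this subthread would start from reading
the "memlock:" field that bpf maps and programs expose in
/proc/<pid>/fdinfo/<fd>. The sketch below only shows how that field could be
parsed from the fdinfo text; fdinfo_memlock() is a hypothetical helper, and
whether the value can distinguish memcg-based accounting from rlimit-based
accounting is exactly the open question raised above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find the "memlock:" field in fdinfo text and store its value in *val.
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
static int fdinfo_memlock(const char *fdinfo, unsigned long long *val)
{
	const char *line = fdinfo;

	while (line && *line) {
		/* The space in the format also matches the tab separator. */
		if (sscanf(line, "memlock: %llu", val) == 1)
			return 0;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read the whole fdinfo file into a buffer and pass it to
fdinfo_memlock(); a return value of -1 means the descriptor does not expose
the field at all.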