diff mbox series

[bpf-next] libbpf: Improve probing for memcg-based memory accounting

Message ID 20220609143614.97837-1-quentin@isovalent.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series [bpf-next] libbpf: Improve probing for memcg-based memory accounting | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 5 maintainers not CCed: songliubraving@fb.com yhs@fb.com john.fastabend@gmail.com kafai@fb.com kpsingh@kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 49 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR pending PR summary
bpf/vmtest-bpf-next-VM_Test-3 pending Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15

Commit Message

Quentin Monnet June 9, 2022, 2:36 p.m. UTC
To ensure that memory accounting will not hinder the load of BPF
objects, libbpf may raise the memlock rlimit before proceeding to some
operations. Whether this limit needs to be raised depends on the version
of the kernel: newer versions use cgroup-based (memcg) memory
accounting, and do not require any adjustment.

There is a probe in libbpf to determine whether memcg-based accounting
is supported. But this probe currently relies on the availability of a
given BPF helper, bpf_ktime_get_coarse_ns(), which landed in the same
kernel version as the memory accounting change. This works in the
generic case, but it may fail, for example, if the helper function has
been backported to an older kernel. This has been observed for Google
Cloud's Container-Optimized OS (COS), where the helper is available but
rlimit is still in use. The probe succeeds, the rlimit is not raised,
and probing features with bpftool, for example, fails.

Here we attempt to improve this probe and to effectively rely on memory
accounting. Function probe_memcg_account() in libbpf is updated to set
the rlimit to 0, then attempt to load a BPF object, and then to reset
the rlimit. If the load still succeeds, then this means we're running
with memcg-based accounting.

This probe was inspired by the similar one from the cilium/ebpf Go
library [0].

[0] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

Signed-off-by: Quentin Monnet <quentin@isovalent.com>
---
 tools/lib/bpf/bpf.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

Comments

Stanislav Fomichev June 9, 2022, 6:13 p.m. UTC | #1
On 06/09, Quentin Monnet wrote:
> To ensure that memory accounting will not hinder the load of BPF
> objects, libbpf may raise the memlock rlimit before proceeding to some
> operations. Whether this limit needs to be raised depends on the version
> of the kernel: newer versions use cgroup-based (memcg) memory
> accounting, and do not require any adjustment.

> There is a probe in libbpf to determine whether memcg-based accounting
> is supported. But this probe currently relies on the availability of a
> given BPF helper, bpf_ktime_get_coarse_ns(), which landed in the same
> kernel version as the memory accounting change. This works in the
> generic case, but it may fail, for example, if the helper function has
> been backported to an older kernel. This has been observed for Google
> Cloud's Container-Optimized OS (COS), where the helper is available but
> rlimit is still in use. The probe succeeds, the rlimit is not raised,
> and probing features with bpftool, for example, fails.

> Here we attempt to improve this probe and to effectively rely on memory
> accounting. Function probe_memcg_account() in libbpf is updated to set
> the rlimit to 0, then attempt to load a BPF object, and then to reset
> the rlimit. If the load still succeeds, then this means we're running
> with memcg-based accounting.

> This probe was inspired by the similar one from the cilium/ebpf Go
> library [0].

> [0] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39

> Signed-off-by: Quentin Monnet <quentin@isovalent.com>
> ---
>   tools/lib/bpf/bpf.c | 23 ++++++++++++++++++-----
>   1 file changed, 18 insertions(+), 5 deletions(-)

> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 240186aac8e6..781387e6f66b 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -99,31 +99,44 @@ static inline int sys_bpf_prog_load(union bpf_attr  
> *attr, unsigned int size, int

>   /* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>    * memcg-based memory accounting for BPF maps and progs. This was done  
> in [0].
> - * We use the support for bpf_ktime_get_coarse_ns() helper, which was  
> added in
> - * the same 5.11 Linux release ([1]), to detect memcg-based accounting  
> for BPF.
> + * To do so, we lower the soft memlock rlimit to 0 and attempt to create  
> a BPF
> + * object. If it succeeds, then memcg-based accounting for BPF is  
> available.
>    *
>    *   [0]  
> https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> - *   [1] d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
>    */
>   int probe_memcg_account(void)
>   {
>   	const size_t prog_load_attr_sz = offsetofend(union bpf_attr,  
> attach_btf_obj_fd);
>   	struct bpf_insn insns[] = {
> -		BPF_EMIT_CALL(BPF_FUNC_ktime_get_coarse_ns),
>   		BPF_EXIT_INSN(),
>   	};
> +	struct rlimit rlim_init, rlim_cur_zero = {};
>   	size_t insn_cnt = ARRAY_SIZE(insns);
>   	union bpf_attr attr;
>   	int prog_fd;

> -	/* attempt loading freplace trying to use custom BTF */
>   	memset(&attr, 0, prog_load_attr_sz);
>   	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
>   	attr.insns = ptr_to_u64(insns);
>   	attr.insn_cnt = insn_cnt;
>   	attr.license = ptr_to_u64("GPL");

> +	if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> +		return -1;
> +
> +	/* Drop the soft limit to zero. We maintain the hard limit to its
> +	 * current value, because lowering it would be a permanent operation
> +	 * for unprivileged users.
> +	 */
> +	rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> +	if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> +		return -1;
> +
>   	prog_fd = sys_bpf_fd(BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> +
> +	/* reset soft rlimit as soon as possible */
> +	setrlimit(RLIMIT_MEMLOCK, &rlim_init);

Isn't that adding more flakiness to the other daemons running as
the same user? Also, there might be surprises if another daemon that
has libbpf in it starts right when we've set the limit temporarily to zero.

Can we push these decisions to the users as part of libbpf 1.0 cleanup?

> +
>   	if (prog_fd >= 0) {
>   		close(prog_fd);
>   		return 1;
> --
> 2.34.1
Andrii Nakryiko June 9, 2022, 6:37 p.m. UTC | #2
On Thu, Jun 9, 2022 at 11:13 AM <sdf@google.com> wrote:
>
> On 06/09, Quentin Monnet wrote:
> > To ensure that memory accounting will not hinder the load of BPF
> > objects, libbpf may raise the memlock rlimit before proceeding to some
> > operations. Whether this limit needs to be raised depends on the version
> > of the kernel: newer versions use cgroup-based (memcg) memory
> > accounting, and do not require any adjustment.
>
> > There is a probe in libbpf to determine whether memcg-based accounting
> > is supported. But this probe currently relies on the availability of a
> > given BPF helper, bpf_ktime_get_coarse_ns(), which landed in the same
> > kernel version as the memory accounting change. This works in the
> > generic case, but it may fail, for example, if the helper function has
> > been backported to an older kernel. This has been observed for Google
> > Cloud's Container-Optimized OS (COS), where the helper is available but
> > rlimit is still in use. The probe succeeds, the rlimit is not raised,
> > and probing features with bpftool, for example, fails.
>
> > Here we attempt to improve this probe and to effectively rely on memory
> > accounting. Function probe_memcg_account() in libbpf is updated to set
> > the rlimit to 0, then attempt to load a BPF object, and then to reset
> > the rlimit. If the load still succeeds, then this means we're running
> > with memcg-based accounting.
>
> > This probe was inspired by the similar one from the cilium/ebpf Go
> > library [0].
>
> > [0] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>
> > Signed-off-by: Quentin Monnet <quentin@isovalent.com>
> > ---
> >   tools/lib/bpf/bpf.c | 23 ++++++++++++++++++-----
> >   1 file changed, 18 insertions(+), 5 deletions(-)
>
> > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > index 240186aac8e6..781387e6f66b 100644
> > --- a/tools/lib/bpf/bpf.c
> > +++ b/tools/lib/bpf/bpf.c
> > @@ -99,31 +99,44 @@ static inline int sys_bpf_prog_load(union bpf_attr
> > *attr, unsigned int size, int
>
> >   /* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> >    * memcg-based memory accounting for BPF maps and progs. This was done
> > in [0].
> > - * We use the support for bpf_ktime_get_coarse_ns() helper, which was
> > added in
> > - * the same 5.11 Linux release ([1]), to detect memcg-based accounting
> > for BPF.
> > + * To do so, we lower the soft memlock rlimit to 0 and attempt to create
> > a BPF
> > + * object. If it succeeds, then memcg-based accounting for BPF is
> > available.
> >    *
> >    *   [0]
> > https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> > - *   [1] d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
> >    */
> >   int probe_memcg_account(void)
> >   {
> >       const size_t prog_load_attr_sz = offsetofend(union bpf_attr,
> > attach_btf_obj_fd);
> >       struct bpf_insn insns[] = {
> > -             BPF_EMIT_CALL(BPF_FUNC_ktime_get_coarse_ns),
> >               BPF_EXIT_INSN(),
> >       };
> > +     struct rlimit rlim_init, rlim_cur_zero = {};
> >       size_t insn_cnt = ARRAY_SIZE(insns);
> >       union bpf_attr attr;
> >       int prog_fd;
>
> > -     /* attempt loading freplace trying to use custom BTF */
> >       memset(&attr, 0, prog_load_attr_sz);
> >       attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> >       attr.insns = ptr_to_u64(insns);
> >       attr.insn_cnt = insn_cnt;
> >       attr.license = ptr_to_u64("GPL");
>
> > +     if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> > +             return -1;
> > +
> > +     /* Drop the soft limit to zero. We maintain the hard limit to its
> > +      * current value, because lowering it would be a permanent operation
> > +      * for unprivileged users.
> > +      */
> > +     rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> > +     if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> > +             return -1;
> > +
> >       prog_fd = sys_bpf_fd(BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> > +
> > +     /* reset soft rlimit as soon as possible */
> > +     setrlimit(RLIMIT_MEMLOCK, &rlim_init);
>
> Isn't that adding more flakiness to the other daemons running as
> the same user? Also, there might be surprises if another daemon that
> has libbpf in it starts right when we've set the limit temporarily to zero.
>

I agree, it briefly changes global process state and can introduce
some undesirable (and very non-obvious) side effects.

> Can we push these decisions to the users as part of libbpf 1.0 cleanup?


Quentin, at least for bpftool, I think it's totally fine to just
always bump RLIMIT_MEMLOCK to avoid this issue. That would solve the
issue with probing, right? And for end applications I think I agree
with Stanislav that application might need to ensure rlimit bumping
for such backported changes.


>
> > +
> >       if (prog_fd >= 0) {
> >               close(prog_fd);
> >               return 1;
> > --
> > 2.34.1
>
Quentin Monnet June 9, 2022, 7:57 p.m. UTC | #3
On Thu, 9 Jun 2022 at 19:38, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jun 9, 2022 at 11:13 AM <sdf@google.com> wrote:
> >
> > On 06/09, Quentin Monnet wrote:
> > > To ensure that memory accounting will not hinder the load of BPF
> > > objects, libbpf may raise the memlock rlimit before proceeding to some
> > > operations. Whether this limit needs to be raised depends on the version
> > > of the kernel: newer versions use cgroup-based (memcg) memory
> > > accounting, and do not require any adjustment.
> >
> > > There is a probe in libbpf to determine whether memcg-based accounting
> > > is supported. But this probe currently relies on the availability of a
> > > given BPF helper, bpf_ktime_get_coarse_ns(), which landed in the same
> > > kernel version as the memory accounting change. This works in the
> > > generic case, but it may fail, for example, if the helper function has
> > > been backported to an older kernel. This has been observed for Google
> > > Cloud's Container-Optimized OS (COS), where the helper is available but
> > > rlimit is still in use. The probe succeeds, the rlimit is not raised,
> > > and probing features with bpftool, for example, fails.
> >
> > > Here we attempt to improve this probe and to effectively rely on memory
> > > accounting. Function probe_memcg_account() in libbpf is updated to set
> > > the rlimit to 0, then attempt to load a BPF object, and then to reset
> > > the rlimit. If the load still succeeds, then this means we're running
> > > with memcg-based accounting.
> >
> > > This probe was inspired by the similar one from the cilium/ebpf Go
> > > library [0].
> >
> > > [0] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >
> > > Signed-off-by: Quentin Monnet <quentin@isovalent.com>
> > > ---
> > >   tools/lib/bpf/bpf.c | 23 ++++++++++++++++++-----
> > >   1 file changed, 18 insertions(+), 5 deletions(-)
> >
> > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > > index 240186aac8e6..781387e6f66b 100644
> > > --- a/tools/lib/bpf/bpf.c
> > > +++ b/tools/lib/bpf/bpf.c
> > > @@ -99,31 +99,44 @@ static inline int sys_bpf_prog_load(union bpf_attr
> > > *attr, unsigned int size, int
> >
> > >   /* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> > >    * memcg-based memory accounting for BPF maps and progs. This was done
> > > in [0].
> > > - * We use the support for bpf_ktime_get_coarse_ns() helper, which was
> > > added in
> > > - * the same 5.11 Linux release ([1]), to detect memcg-based accounting
> > > for BPF.
> > > + * To do so, we lower the soft memlock rlimit to 0 and attempt to create
> > > a BPF
> > > + * object. If it succeeds, then memcg-based accounting for BPF is
> > > available.
> > >    *
> > >    *   [0]
> > > https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
> > > - *   [1] d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
> > >    */
> > >   int probe_memcg_account(void)
> > >   {
> > >       const size_t prog_load_attr_sz = offsetofend(union bpf_attr,
> > > attach_btf_obj_fd);
> > >       struct bpf_insn insns[] = {
> > > -             BPF_EMIT_CALL(BPF_FUNC_ktime_get_coarse_ns),
> > >               BPF_EXIT_INSN(),
> > >       };
> > > +     struct rlimit rlim_init, rlim_cur_zero = {};
> > >       size_t insn_cnt = ARRAY_SIZE(insns);
> > >       union bpf_attr attr;
> > >       int prog_fd;
> >
> > > -     /* attempt loading freplace trying to use custom BTF */
> > >       memset(&attr, 0, prog_load_attr_sz);
> > >       attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> > >       attr.insns = ptr_to_u64(insns);
> > >       attr.insn_cnt = insn_cnt;
> > >       attr.license = ptr_to_u64("GPL");
> >
> > > +     if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> > > +             return -1;
> > > +
> > > +     /* Drop the soft limit to zero. We maintain the hard limit to its
> > > +      * current value, because lowering it would be a permanent operation
> > > +      * for unprivileged users.
> > > +      */
> > > +     rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> > > +     if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> > > +             return -1;
> > > +
> > >       prog_fd = sys_bpf_fd(BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> > > +
> > > +     /* reset soft rlimit as soon as possible */
> > > +     setrlimit(RLIMIT_MEMLOCK, &rlim_init);
> >
> > Isn't that adding more flakiness to the other daemons running as
> > the same user? Also, there might be surprises if another daemon that
> > has libbpf in it starts right when we've set the limit temporarily to zero.
> >
>
> I agree, it briefly changes global process state and can introduce
> some undesirable (and very non-obvious) side effects.
>
> > Can we push these decisions to the users as part of libbpf 1.0 cleanup?
>
>
> Quentin, at least for bpftool, I think it's totally fine to just
> always bump RLIMIT_MEMLOCK to avoid this issue. That would solve the
> issue with probing, right? And for end applications I think I agree
> with Stanislav that application might need to ensure rlimit bumping
> for such backported changes.

Agreed, changing the rlimit in the probe is not ideal. We haven't
managed to find a better way to probe the feature, unfortunately. I
don't mind restoring the rlimit bump in bpftool - yes, it should be
fine for our use case. We already raise the limit in Cilium, but after
the feature probe from bpftool, so it's only the feature probing that
is an issue at the moment. I'll send a patch for it tomorrow.

I still believe it would be nice to improve the probe, I'll come back
to it if we ever find a better way to proceed.

Thanks,
Quentin
diff mbox series

Patch

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 240186aac8e6..781387e6f66b 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -99,31 +99,44 @@  static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int
 
 /* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
  * memcg-based memory accounting for BPF maps and progs. This was done in [0].
- * We use the support for bpf_ktime_get_coarse_ns() helper, which was added in
- * the same 5.11 Linux release ([1]), to detect memcg-based accounting for BPF.
+ * To do so, we lower the soft memlock rlimit to 0 and attempt to create a BPF
+ * object. If it succeeds, then memcg-based accounting for BPF is available.
  *
  *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
- *   [1] d05512618056 ("bpf: Add bpf_ktime_get_coarse_ns helper")
  */
 int probe_memcg_account(void)
 {
 	const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);
 	struct bpf_insn insns[] = {
-		BPF_EMIT_CALL(BPF_FUNC_ktime_get_coarse_ns),
 		BPF_EXIT_INSN(),
 	};
+	struct rlimit rlim_init, rlim_cur_zero = {};
 	size_t insn_cnt = ARRAY_SIZE(insns);
 	union bpf_attr attr;
 	int prog_fd;
 
-	/* attempt loading freplace trying to use custom BTF */
 	memset(&attr, 0, prog_load_attr_sz);
 	attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
 	attr.insns = ptr_to_u64(insns);
 	attr.insn_cnt = insn_cnt;
 	attr.license = ptr_to_u64("GPL");
 
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
+		return -1;
+
+	/* Drop the soft limit to zero. We maintain the hard limit to its
+	 * current value, because lowering it would be a permanent operation
+	 * for unprivileged users.
+	 */
+	rlim_cur_zero.rlim_max = rlim_init.rlim_max;
+	if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
+		return -1;
+
 	prog_fd = sys_bpf_fd(BPF_PROG_LOAD, &attr, prog_load_attr_sz);
+
+	/* reset soft rlimit as soon as possible */
+	setrlimit(RLIMIT_MEMLOCK, &rlim_init);
+
 	if (prog_fd >= 0) {
 		close(prog_fd);
 		return 1;