[bpf-next,v2,0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks

Message ID	20220816201214.2489910-1-sdf@google.com (mailing list archive)
Headers	show Return-Path: <bpf-owner@kernel.org> Date: Tue, 16 Aug 2022 13:12:11 -0700 Message-Id: <20220816201214.2489910-1-sdf@google.com> Mime-Version: 1.0 Subject: [PATCH bpf-next v2 0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks From: Stanislav Fomichev <sdf@google.com> To: bpf@vger.kernel.org Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, song@kernel.org, yhs@fb.com, john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	bpf: expose bpf_{g,s}et_retval to more cgroup hooks \| expand [bpf-next,v2,0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks [bpf-next,v2,1/3] bpf: Introduce cgroup_{common,current}_func_proto [bpf-next,v2,2/3] bpf: Use cgroup_{common,current}_func_proto in more hooks [bpf-next,v2,3/3] selftests/bpf: Make sure bpf_{g,s}et_retval is exposed everywhere

Stanislav Fomichev Aug. 16, 2022, 8:12 p.m. UTC

Apparently, only a small subset of cgroup hooks actually falls
back to cgroup_base_func_proto. This leads to unexpected result
where not all cgroup helpers have bpf_{g,s}et_retval.

It's getting harder and harder to manage which helpers are exported
to which hooks. We now have the following call chains:

- cg_skb_func_proto
  - sk_filter_func_proto
    - bpf_sk_base_func_proto
      - bpf_base_func_proto

So by looking at cg_skb_func_proto it's pretty hard to understand
what's going on.

For cgroup helpers, I'm proposing we do the following instead:

  func_proto = cgroup_common_func_proto();
  if (func_proto) return func_proto;

  /* optional, if hook has 'current' */
  func_proto = cgroup_current_func_proto();
  if (func_proto) return func_proto;

  ...

  switch (func_id) {
  /* hook specific helpers */
  case BPF_FUNC_hook_specific_helper:
    return &xyz;
  default:
    /* always fall back to plain bpf_base_func_proto */
    bpf_base_func_proto(func_id);
  }

If this turns out more workable, we can follow up with converting
the rest to the same pattern.

v2:
- move everything into kernel/bpf/cgroup.c instead (Martin)
- use cgroup_common_func_proto in lsm (Martin)

Stanislav Fomichev (3):
  bpf: Introduce cgroup_{common,current}_func_proto
  bpf: Use cgroup_{common,current}_func_proto in more hooks
  selftests/bpf: Make sure bpf_{g,s}et_retval is exposed everywhere

 include/linux/bpf.h                           |  18 +-
 kernel/bpf/bpf_lsm.c                          |  19 +-
 kernel/bpf/cgroup.c                           | 333 +++++++++++++++++-
 kernel/bpf/helpers.c                          | 205 +----------
 net/core/filter.c                             |  92 ++---
 tools/testing/selftests/bpf/Makefile          |   1 +
 .../bpf/cgroup_getset_retval_hooks.h          |  25 ++
 .../bpf/prog_tests/cgroup_getset_retval.c     |  48 +++
 .../bpf/progs/cgroup_getset_retval_hooks.c    |  16 +
 9 files changed, 466 insertions(+), 291 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/cgroup_getset_retval_hooks.h
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_getset_retval_hooks.c

Martin KaFai Lau Aug. 17, 2022, 7:07 p.m. UTC | #1

On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> Apparently, only a small subset of cgroup hooks actually falls
> back to cgroup_base_func_proto. This leads to unexpected result
> where not all cgroup helpers have bpf_{g,s}et_retval.
> 
> It's getting harder and harder to manage which helpers are exported
> to which hooks. We now have the following call chains:
> 
> - cg_skb_func_proto
>   - sk_filter_func_proto
>     - bpf_sk_base_func_proto
>       - bpf_base_func_proto
Could you explain how bpf_set_retval() will work with cgroup prog that
is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
It will be a useful doc to add to the uapi bpf.h for
the bpf_set_retval() helper.

Stanislav Fomichev Aug. 17, 2022, 10:41 p.m. UTC | #2

On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > Apparently, only a small subset of cgroup hooks actually falls
> > back to cgroup_base_func_proto. This leads to unexpected result
> > where not all cgroup helpers have bpf_{g,s}et_retval.
> >
> > It's getting harder and harder to manage which helpers are exported
> > to which hooks. We now have the following call chains:
> >
> > - cg_skb_func_proto
> >   - sk_filter_func_proto
> >     - bpf_sk_base_func_proto
> >       - bpf_base_func_proto
> Could you explain how bpf_set_retval() will work with cgroup prog that
> is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> It will be a useful doc to add to the uapi bpf.h for
> the bpf_set_retval() helper.

I think it's the same case as the case without bpf_set_retval? I don't
think the flags can be exported via bpf_set_retval, it just lets the
users override EPERM.
Let me verify and I can add a note to bpf_set_retval uapi definition
to mention that it just overrides EPERM. bpf_set_retval should
probably not talk about userspace/syscall and instead use the words
like "caller".

Martin KaFai Lau Aug. 17, 2022, 11:27 p.m. UTC | #3

On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > Apparently, only a small subset of cgroup hooks actually falls
> > > back to cgroup_base_func_proto. This leads to unexpected result
> > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > >
> > > It's getting harder and harder to manage which helpers are exported
> > > to which hooks. We now have the following call chains:
> > >
> > > - cg_skb_func_proto
> > >   - sk_filter_func_proto
> > >     - bpf_sk_base_func_proto
> > >       - bpf_base_func_proto
> > Could you explain how bpf_set_retval() will work with cgroup prog that
> > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > It will be a useful doc to add to the uapi bpf.h for
> > the bpf_set_retval() helper.
> 
> I think it's the same case as the case without bpf_set_retval? I don't
> think the flags can be exported via bpf_set_retval, it just lets the
> users override EPERM.
eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
is expecting after calling bpf_set_retval(-Exxxx) ?
Thinking more about it, should __cgroup_bpf_run_filter_skb() always
return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?

> Let me verify and I can add a note to bpf_set_retval uapi definition
> to mention that it just overrides EPERM. bpf_set_retval should
> probably not talk about userspace/syscall and instead use the words
> like "caller".
yeah, it is no longer syscall return value only now.

Stanislav Fomichev Aug. 17, 2022, 11:59 p.m. UTC | #4

On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > >
> > > > It's getting harder and harder to manage which helpers are exported
> > > > to which hooks. We now have the following call chains:
> > > >
> > > > - cg_skb_func_proto
> > > >   - sk_filter_func_proto
> > > >     - bpf_sk_base_func_proto
> > > >       - bpf_base_func_proto
> > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > It will be a useful doc to add to the uapi bpf.h for
> > > the bpf_set_retval() helper.
> >
> > I think it's the same case as the case without bpf_set_retval? I don't
> > think the flags can be exported via bpf_set_retval, it just lets the
> > users override EPERM.
> eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> is expecting after calling bpf_set_retval(-Exxxx) ?
> Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?

I think we used to have "return 0/1/2/3" to indicate different
conditions but then switched to "return 1/0" + flags.
So, technically, "return 3 + bpf_set_retval" is still fundamentally a
"return 3" api-wise.
I guess we can make bpf_set_retval override that but let me start by
trying to document what we currently have.
If it turns out to be super ugly, we can try to fix it. (not sure how
much of a uapi that is)



> > Let me verify and I can add a note to bpf_set_retval uapi definition
> > to mention that it just overrides EPERM. bpf_set_retval should
> > probably not talk about userspace/syscall and instead use the words
> > like "caller".
> yeah, it is no longer syscall return value only now.

Martin KaFai Lau Aug. 18, 2022, 12:21 a.m. UTC | #5

On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > >
> > > > > It's getting harder and harder to manage which helpers are exported
> > > > > to which hooks. We now have the following call chains:
> > > > >
> > > > > - cg_skb_func_proto
> > > > >   - sk_filter_func_proto
> > > > >     - bpf_sk_base_func_proto
> > > > >       - bpf_base_func_proto
> > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > It will be a useful doc to add to the uapi bpf.h for
> > > > the bpf_set_retval() helper.
> > >
> > > I think it's the same case as the case without bpf_set_retval? I don't
> > > think the flags can be exported via bpf_set_retval, it just lets the
> > > users override EPERM.
> > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > is expecting after calling bpf_set_retval(-Exxxx) ?
> > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> 
> I think we used to have "return 0/1/2/3" to indicate different
> conditions but then switched to "return 1/0" + flags.
For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.

> So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> "return 3" api-wise.
hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
"bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?

> I guess we can make bpf_set_retval override that but let me start by
> trying to document what we currently have.
To be clear, for cg_skb case, I meant to clear the ret_flags only if
run_ctx.retval is set.

> If it turns out to be super ugly, we can try to fix it. (not sure how
> much of a uapi that is)
sgtm.

> 
> 
> 
> > > Let me verify and I can add a note to bpf_set_retval uapi definition
> > > to mention that it just overrides EPERM. bpf_set_retval should
> > > probably not talk about userspace/syscall and instead use the words
> > > like "caller".
> > yeah, it is no longer syscall return value only now.

Stanislav Fomichev Aug. 18, 2022, 3:42 a.m. UTC | #6

On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > > >
> > > > > > It's getting harder and harder to manage which helpers are exported
> > > > > > to which hooks. We now have the following call chains:
> > > > > >
> > > > > > - cg_skb_func_proto
> > > > > >   - sk_filter_func_proto
> > > > > >     - bpf_sk_base_func_proto
> > > > > >       - bpf_base_func_proto
> > > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > > It will be a useful doc to add to the uapi bpf.h for
> > > > > the bpf_set_retval() helper.
> > > >
> > > > I think it's the same case as the case without bpf_set_retval? I don't
> > > > think the flags can be exported via bpf_set_retval, it just lets the
> > > > users override EPERM.
> > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > > is expecting after calling bpf_set_retval(-Exxxx) ?
> > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> >
> > I think we used to have "return 0/1/2/3" to indicate different
> > conditions but then switched to "return 1/0" + flags.
> For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
> I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.

Yes, right now that's that case. What I meant to say is that for the
BPF program itself, the api is still "return a set of predefined
values". We don't advertise the flags to the bpf programs. 'return 2'
is a perfectly valid return for cgroup/egress that will tx the packet
with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 <<
1)')

> > So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> > "return 3" api-wise.
> hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
> "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?

I think bpf_set_retval takes precedence and in this case bpf_prog_run
wrapper will return -EPERM to the caller.
Will try to document that as well.

> > I guess we can make bpf_set_retval override that but let me start by
> > trying to document what we currently have.
> To be clear, for cg_skb case, I meant to clear the ret_flags only if
> run_ctx.retval is set.

Are you suggesting something like the following?

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index fd113bd2f79c..c110cbe52001 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
        bpf_reset_run_ctx(old_run_ctx);
        rcu_read_unlock();
        migrate_enable();
+       if (IS_ERR_VALUE((long)run_ctx.retval))
+               *ret_flags = 0;
        return run_ctx.retval;
 }

I think this will break the 'return 2' case? But is it worth it doing
it more carefully like this? LMKWYT.

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index fd113bd2f79c..dcd25561f8d4 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -39,6 +39,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
        const struct bpf_prog_array *array;
        struct bpf_run_ctx *old_run_ctx;
        struct bpf_cg_run_ctx run_ctx;
+       bool seen_return0 = false;
        u32 func_ret;

        run_ctx.retval = retval;
@@ -54,13 +55,17 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
                        *(ret_flags) |= (func_ret >> 1);
                        func_ret &= 1;
                }
-               if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
+               if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) {
                        run_ctx.retval = -EPERM;
+                       seen_return0 = true;
+               }
                item++;
        }
        bpf_reset_run_ctx(old_run_ctx);
        rcu_read_unlock();
        migrate_enable();
+       if (IS_ERR_VALUE((long)run_ctx.retval) && !seen_return0)
+               *ret_flags = 0;
        return run_ctx.retval;
 }

> > If it turns out to be super ugly, we can try to fix it. (not sure how
> > much of a uapi that is)
> sgtm.
>
> >
> >
> >
> > > > Let me verify and I can add a note to bpf_set_retval uapi definition
> > > > to mention that it just overrides EPERM. bpf_set_retval should
> > > > probably not talk about userspace/syscall and instead use the words
> > > > like "caller".
> > > yeah, it is no longer syscall return value only now.

Martin KaFai Lau Aug. 18, 2022, 6:15 p.m. UTC | #7

On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote:
> On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > > > >
> > > > > > > It's getting harder and harder to manage which helpers are exported
> > > > > > > to which hooks. We now have the following call chains:
> > > > > > >
> > > > > > > - cg_skb_func_proto
> > > > > > >   - sk_filter_func_proto
> > > > > > >     - bpf_sk_base_func_proto
> > > > > > >       - bpf_base_func_proto
> > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > > > It will be a useful doc to add to the uapi bpf.h for
> > > > > > the bpf_set_retval() helper.
> > > > >
> > > > > I think it's the same case as the case without bpf_set_retval? I don't
> > > > > think the flags can be exported via bpf_set_retval, it just lets the
> > > > > users override EPERM.
> > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > > > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > > > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > > > is expecting after calling bpf_set_retval(-Exxxx) ?
> > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> > >
> > > I think we used to have "return 0/1/2/3" to indicate different
> > > conditions but then switched to "return 1/0" + flags.
> > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
> > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.
> 
> Yes, right now that's that case. What I meant to say is that for the
> BPF program itself, the api is still "return a set of predefined
> values". We don't advertise the flags to the bpf programs. 'return 2'
> is a perfectly valid return for cgroup/egress that will tx the packet
> with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 <<
> 1)')
> 
> > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> > > "return 3" api-wise.
> > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
> > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?
> 
> I think bpf_set_retval takes precedence and in this case bpf_prog_run
> wrapper will return -EPERM to the caller.
> Will try to document that as well.
> 
> > > I guess we can make bpf_set_retval override that but let me start by
> > > trying to document what we currently have.
> > To be clear, for cg_skb case, I meant to clear the ret_flags only if
> > run_ctx.retval is set.
> 
> Are you suggesting something like the following?
> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index fd113bd2f79c..c110cbe52001 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>         bpf_reset_run_ctx(old_run_ctx);
>         rcu_read_unlock();
>         migrate_enable();
> +       if (IS_ERR_VALUE((long)run_ctx.retval))
> +               *ret_flags = 0;
>         return run_ctx.retval;
>  }
> 
> I think this will break the 'return 2' case? But is it worth it doing
> it more carefully like this? LMKWYT.
The below should work. Not sure it is worth it
but before doing this...

During this discussion, I think I am not clear what is the use case
on bpf_{g,s}et_retval() for cg_skb.  Could you describe how it will be
used in your use case?  Is it for another tracing program to get
a different return value from (eg.) sk_filter_trim_cap or ip[6]_output?

Not meaning the helper should not be exposed.  It is easier
to think with some examples.

> 
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index fd113bd2f79c..dcd25561f8d4 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -39,6 +39,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>         const struct bpf_prog_array *array;
>         struct bpf_run_ctx *old_run_ctx;
>         struct bpf_cg_run_ctx run_ctx;
> +       bool seen_return0 = false;
>         u32 func_ret;
> 
>         run_ctx.retval = retval;
> @@ -54,13 +55,17 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>                         *(ret_flags) |= (func_ret >> 1);
>                         func_ret &= 1;
>                 }
> -               if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
> +               if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) {
>                         run_ctx.retval = -EPERM;
> +                       seen_return0 = true;
> +               }
>                 item++;
>         }
>         bpf_reset_run_ctx(old_run_ctx);
>         rcu_read_unlock();
>         migrate_enable();
> +       if (IS_ERR_VALUE((long)run_ctx.retval) && !seen_return0)
> +               *ret_flags = 0;
>         return run_ctx.retval;
>  }
> 
> > > If it turns out to be super ugly, we can try to fix it. (not sure how
> > > much of a uapi that is)
> > sgtm.
> >
> > >
> > >
> > >
> > > > > Let me verify and I can add a note to bpf_set_retval uapi definition
> > > > > to mention that it just overrides EPERM. bpf_set_retval should
> > > > > probably not talk about userspace/syscall and instead use the words
> > > > > like "caller".
> > > > yeah, it is no longer syscall return value only now.

Stanislav Fomichev Aug. 18, 2022, 6:30 p.m. UTC | #8

On Thu, Aug 18, 2022 at 11:16 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote:
> > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > > > > >
> > > > > > > > It's getting harder and harder to manage which helpers are exported
> > > > > > > > to which hooks. We now have the following call chains:
> > > > > > > >
> > > > > > > > - cg_skb_func_proto
> > > > > > > >   - sk_filter_func_proto
> > > > > > > >     - bpf_sk_base_func_proto
> > > > > > > >       - bpf_base_func_proto
> > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > > > > It will be a useful doc to add to the uapi bpf.h for
> > > > > > > the bpf_set_retval() helper.
> > > > > >
> > > > > > I think it's the same case as the case without bpf_set_retval? I don't
> > > > > > think the flags can be exported via bpf_set_retval, it just lets the
> > > > > > users override EPERM.
> > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > > > > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > > > > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > > > > is expecting after calling bpf_set_retval(-Exxxx) ?
> > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> > > >
> > > > I think we used to have "return 0/1/2/3" to indicate different
> > > > conditions but then switched to "return 1/0" + flags.
> > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
> > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.
> >
> > Yes, right now that's that case. What I meant to say is that for the
> > BPF program itself, the api is still "return a set of predefined
> > values". We don't advertise the flags to the bpf programs. 'return 2'
> > is a perfectly valid return for cgroup/egress that will tx the packet
> > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 <<
> > 1)')
> >
> > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> > > > "return 3" api-wise.
> > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
> > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?
> >
> > I think bpf_set_retval takes precedence and in this case bpf_prog_run
> > wrapper will return -EPERM to the caller.
> > Will try to document that as well.
> >
> > > > I guess we can make bpf_set_retval override that but let me start by
> > > > trying to document what we currently have.
> > > To be clear, for cg_skb case, I meant to clear the ret_flags only if
> > > run_ctx.retval is set.
> >
> > Are you suggesting something like the following?
> >
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index fd113bd2f79c..c110cbe52001 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> >         bpf_reset_run_ctx(old_run_ctx);
> >         rcu_read_unlock();
> >         migrate_enable();
> > +       if (IS_ERR_VALUE((long)run_ctx.retval))
> > +               *ret_flags = 0;
> >         return run_ctx.retval;
> >  }
> >
> > I think this will break the 'return 2' case? But is it worth it doing
> > it more carefully like this? LMKWYT.
> The below should work. Not sure it is worth it
> but before doing this...
>
> During this discussion, I think I am not clear what is the use case
> on bpf_{g,s}et_retval() for cg_skb.  Could you describe how it will be
> used in your use case?  Is it for another tracing program to get
> a different return value from (eg.) sk_filter_trim_cap or ip[6]_output?
>
> Not meaning the helper should not be exposed.  It is easier
> to think with some examples.

I don't really need them in cg_skb, I want them in cg_sock so I can
return a custom errno from socket() syscall.
You're probably right and it doesn't make sense to support them in
cg_skb. Most of the
BPF_CGROUP_RUN_PROG_INET_INGRESS/BPF_CGROUP_RUN_PROG_INET_EGRESS
callers don't seem to care about returned error code? (from my brief
grepping)
Let's maybe err on the safe side and special case cg_skb for now (in
cgroup_common_func_proto) and not expose retval helpers?

Martin KaFai Lau Aug. 18, 2022, 7:54 p.m. UTC | #9

On Thu, Aug 18, 2022 at 11:30:21AM -0700, Stanislav Fomichev wrote:
> On Thu, Aug 18, 2022 at 11:16 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> > > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > > > > > >
> > > > > > > > > It's getting harder and harder to manage which helpers are exported
> > > > > > > > > to which hooks. We now have the following call chains:
> > > > > > > > >
> > > > > > > > > - cg_skb_func_proto
> > > > > > > > >   - sk_filter_func_proto
> > > > > > > > >     - bpf_sk_base_func_proto
> > > > > > > > >       - bpf_base_func_proto
> > > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > > > > > It will be a useful doc to add to the uapi bpf.h for
> > > > > > > > the bpf_set_retval() helper.
> > > > > > >
> > > > > > > I think it's the same case as the case without bpf_set_retval? I don't
> > > > > > > think the flags can be exported via bpf_set_retval, it just lets the
> > > > > > > users override EPERM.
> > > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > > > > > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > > > > > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > > > > > is expecting after calling bpf_set_retval(-Exxxx) ?
> > > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> > > > >
> > > > > I think we used to have "return 0/1/2/3" to indicate different
> > > > > conditions but then switched to "return 1/0" + flags.
> > > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
> > > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.
> > >
> > > Yes, right now that's that case. What I meant to say is that for the
> > > BPF program itself, the api is still "return a set of predefined
> > > values". We don't advertise the flags to the bpf programs. 'return 2'
> > > is a perfectly valid return for cgroup/egress that will tx the packet
> > > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 <<
> > > 1)')
> > >
> > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> > > > > "return 3" api-wise.
> > > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
> > > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?
> > >
> > > I think bpf_set_retval takes precedence and in this case bpf_prog_run
> > > wrapper will return -EPERM to the caller.
> > > Will try to document that as well.
> > >
> > > > > I guess we can make bpf_set_retval override that but let me start by
> > > > > trying to document what we currently have.
> > > > To be clear, for cg_skb case, I meant to clear the ret_flags only if
> > > > run_ctx.retval is set.
> > >
> > > Are you suggesting something like the following?
> > >
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index fd113bd2f79c..c110cbe52001 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> > >         bpf_reset_run_ctx(old_run_ctx);
> > >         rcu_read_unlock();
> > >         migrate_enable();
> > > +       if (IS_ERR_VALUE((long)run_ctx.retval))
> > > +               *ret_flags = 0;
> > >         return run_ctx.retval;
> > >  }
> > >
> > > I think this will break the 'return 2' case? But is it worth it doing
> > > it more carefully like this? LMKWYT.
> > The below should work. Not sure it is worth it
> > but before doing this...
> >
> > During this discussion, I think I am not clear what is the use case
> > on bpf_{g,s}et_retval() for cg_skb.  Could you describe how it will be
> > used in your use case?  Is it for another tracing program to get
> > a different return value from (eg.) sk_filter_trim_cap or ip[6]_output?
> >
> > Not meaning the helper should not be exposed.  It is easier
> > to think with some examples.
> 
> I don't really need them in cg_skb, I want them in cg_sock so I can
> return a custom errno from socket() syscall.
bpf_{g,s}et_retval() will probably be useful for sock_addr too.
There is a BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE falling
into a similar bucket.  However, I think it should be fine
since the higher bit is only used when the bpf prog
returns OK.

> You're probably right and it doesn't make sense to support them in
> cg_skb. Most of the
> BPF_CGROUP_RUN_PROG_INET_INGRESS/BPF_CGROUP_RUN_PROG_INET_EGRESS
> callers don't seem to care about returned error code? (from my brief
> grepping)
> Let's maybe err on the safe side and special case cg_skb for now (in
> cgroup_common_func_proto) and not expose retval helpers?
Ah. I was under the wrong impression that you have a use case
on cg_skb because cg_skb_func_proto was in the stack
example above :)

Yep, lets disable the bpf_{g,s}et_retval() helper for CGROUP_SKB
and also SOCK_OP(S) for now.

[bpf-next,v2,0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks

Message

Comments