Message ID | 20220816201214.2489910-1-sdf@google.com (mailing list archive) |
---|---|
Headers | show |
Series | bpf: expose bpf_{g,s}et_retval to more cgroup hooks | expand |
On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > Apparently, only a small subset of cgroup hooks actually falls > back to cgroup_base_func_proto. This leads to unexpected result > where not all cgroup helpers have bpf_{g,s}et_retval. > > It's getting harder and harder to manage which helpers are exported > to which hooks. We now have the following call chains: > > - cg_skb_func_proto > - sk_filter_func_proto > - bpf_sk_base_func_proto > - bpf_base_func_proto Could you explain how bpf_set_retval() will work with cgroup prog that is not syscall and can return flags in the higher bit (e.g. cg_skb egress). It will be a useful doc to add to the uapi bpf.h for the bpf_set_retval() helper.
On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > Apparently, only a small subset of cgroup hooks actually falls > > back to cgroup_base_func_proto. This leads to unexpected result > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > It's getting harder and harder to manage which helpers are exported > > to which hooks. We now have the following call chains: > > > > - cg_skb_func_proto > > - sk_filter_func_proto > > - bpf_sk_base_func_proto > > - bpf_base_func_proto > Could you explain how bpf_set_retval() will work with cgroup prog that > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > It will be a useful doc to add to the uapi bpf.h for > the bpf_set_retval() helper. I think it's the same case as the case without bpf_set_retval? I don't think the flags can be exported via bpf_set_retval, it just lets the users override EPERM. Let me verify and I can add a note to bpf_set_retval uapi definition to mention that it just overrides EPERM. bpf_set_retval should probably not talk about userspace/syscall and instead use the words like "caller".
On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > Apparently, only a small subset of cgroup hooks actually falls > > > back to cgroup_base_func_proto. This leads to unexpected result > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > It's getting harder and harder to manage which helpers are exported > > > to which hooks. We now have the following call chains: > > > > > > - cg_skb_func_proto > > > - sk_filter_func_proto > > > - bpf_sk_base_func_proto > > > - bpf_base_func_proto > > Could you explain how bpf_set_retval() will work with cgroup prog that > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > It will be a useful doc to add to the uapi bpf.h for > > the bpf_set_retval() helper. > > I think it's the same case as the case without bpf_set_retval? I don't > think the flags can be exported via bpf_set_retval, it just lets the > users override EPERM. eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. What if the prog now returns 3 and also bpf_set_retval(-Exxxx). If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP instead of the -Exxxx. The -Exxxx is probably what the bpf prog is expecting after calling bpf_set_retval(-Exxxx) ? Thinking more about it, should __cgroup_bpf_run_filter_skb() always return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > Let me verify and I can add a note to bpf_set_retval uapi definition > to mention that it just overrides EPERM. bpf_set_retval should > probably not talk about userspace/syscall and instead use the words > like "caller". yeah, it is no longer syscall return value only now.
On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > to which hooks. We now have the following call chains: > > > > > > > > - cg_skb_func_proto > > > > - sk_filter_func_proto > > > > - bpf_sk_base_func_proto > > > > - bpf_base_func_proto > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > It will be a useful doc to add to the uapi bpf.h for > > > the bpf_set_retval() helper. > > > > I think it's the same case as the case without bpf_set_retval? I don't > > think the flags can be exported via bpf_set_retval, it just lets the > > users override EPERM. > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > is expecting after calling bpf_set_retval(-Exxxx) ? > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? I think we used to have "return 0/1/2/3" to indicate different conditions but then switched to "return 1/0" + flags. So, technically, "return 3 + bpf_set_retval" is still fundamentally a "return 3" api-wise. I guess we can make bpf_set_retval override that but let me start by trying to document what we currently have. If it turns out to be super ugly, we can try to fix it. (not sure how much of a uapi that is) > > Let me verify and I can add a note to bpf_set_retval uapi definition > > to mention that it just overrides EPERM. bpf_set_retval should > > probably not talk about userspace/syscall and instead use the words > > like "caller". > yeah, it is no longer syscall return value only now.
On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > - cg_skb_func_proto > > > > > - sk_filter_func_proto > > > > > - bpf_sk_base_func_proto > > > > > - bpf_base_func_proto > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > It will be a useful doc to add to the uapi bpf.h for > > > > the bpf_set_retval() helper. > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > think the flags can be exported via bpf_set_retval, it just lets the > > > users override EPERM. > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > is expecting after calling bpf_set_retval(-Exxxx) ? > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > I think we used to have "return 0/1/2/3" to indicate different > conditions but then switched to "return 1/0" + flags. For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > "return 3" api-wise. hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? > I guess we can make bpf_set_retval override that but let me start by > trying to document what we currently have. To be clear, for cg_skb case, I meant to clear the ret_flags only if run_ctx.retval is set. > If it turns out to be super ugly, we can try to fix it. (not sure how > much of a uapi that is) sgtm. > > > > > > Let me verify and I can add a note to bpf_set_retval uapi definition > > > to mention that it just overrides EPERM. bpf_set_retval should > > > probably not talk about userspace/syscall and instead use the words > > > like "caller". > > yeah, it is no longer syscall return value only now.
On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote: > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > > > - cg_skb_func_proto > > > > > > - sk_filter_func_proto > > > > > > - bpf_sk_base_func_proto > > > > > > - bpf_base_func_proto > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > > It will be a useful doc to add to the uapi bpf.h for > > > > > the bpf_set_retval() helper. > > > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > > think the flags can be exported via bpf_set_retval, it just lets the > > > > users override EPERM. > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > > is expecting after calling bpf_set_retval(-Exxxx) ? > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > > > I think we used to have "return 0/1/2/3" to indicate different > > conditions but then switched to "return 1/0" + flags. > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. Yes, right now that's that case. What I meant to say is that for the BPF program itself, the api is still "return a set of predefined values". We don't advertise the flags to the bpf programs. 'return 2' is a perfectly valid return for cgroup/egress that will tx the packet with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 << 1)') > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > > "return 3" api-wise. > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? I think bpf_set_retval takes precedence and in this case bpf_prog_run wrapper will return -EPERM to the caller. Will try to document that as well. > > I guess we can make bpf_set_retval override that but let me start by > > trying to document what we currently have. > To be clear, for cg_skb case, I meant to clear the ret_flags only if > run_ctx.retval is set. Are you suggesting something like the following? diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index fd113bd2f79c..c110cbe52001 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, bpf_reset_run_ctx(old_run_ctx); rcu_read_unlock(); migrate_enable(); + if (IS_ERR_VALUE((long)run_ctx.retval)) + *ret_flags = 0; return run_ctx.retval; } I think this will break the 'return 2' case? But is it worth it doing it more carefully like this? LMKWYT. diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index fd113bd2f79c..dcd25561f8d4 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -39,6 +39,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, const struct bpf_prog_array *array; struct bpf_run_ctx *old_run_ctx; struct bpf_cg_run_ctx run_ctx; + bool seen_return0 = false; u32 func_ret; run_ctx.retval = retval; @@ -54,13 +55,17 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, *(ret_flags) |= (func_ret >> 1); func_ret &= 1; } - if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) + if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) { run_ctx.retval = -EPERM; + seen_return0 = true; + } item++; } bpf_reset_run_ctx(old_run_ctx); rcu_read_unlock(); migrate_enable(); + if (IS_ERR_VALUE((long)run_ctx.retval) && !seen_return0) + *ret_flags = 0; return run_ctx.retval; } > > If it turns out to be super ugly, we can try to fix it. (not sure how > > much of a uapi that is) > sgtm. > > > > > > > > > > > Let me verify and I can add a note to bpf_set_retval uapi definition > > > > to mention that it just overrides EPERM. bpf_set_retval should > > > > probably not talk about userspace/syscall and instead use the words > > > > like "caller". > > > yeah, it is no longer syscall return value only now.
On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote: > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > > > > > - cg_skb_func_proto > > > > > > > - sk_filter_func_proto > > > > > > > - bpf_sk_base_func_proto > > > > > > > - bpf_base_func_proto > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > > > It will be a useful doc to add to the uapi bpf.h for > > > > > > the bpf_set_retval() helper. > > > > > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > > > think the flags can be exported via bpf_set_retval, it just lets the > > > > > users override EPERM. > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > > > is expecting after calling bpf_set_retval(-Exxxx) ? > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > > > > > I think we used to have "return 0/1/2/3" to indicate different > > > conditions but then switched to "return 1/0" + flags. > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. > > Yes, right now that's that case. What I meant to say is that for the > BPF program itself, the api is still "return a set of predefined > values". We don't advertise the flags to the bpf programs. 'return 2' > is a perfectly valid return for cgroup/egress that will tx the packet > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 << > 1)') > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > > > "return 3" api-wise. > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? > > I think bpf_set_retval takes precedence and in this case bpf_prog_run > wrapper will return -EPERM to the caller. > Will try to document that as well. > > > > I guess we can make bpf_set_retval override that but let me start by > > > trying to document what we currently have. > > To be clear, for cg_skb case, I meant to clear the ret_flags only if > > run_ctx.retval is set. > > Are you suggesting something like the following? > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index fd113bd2f79c..c110cbe52001 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > bpf_reset_run_ctx(old_run_ctx); > rcu_read_unlock(); > migrate_enable(); > + if (IS_ERR_VALUE((long)run_ctx.retval)) > + *ret_flags = 0; > return run_ctx.retval; > } > > I think this will break the 'return 2' case? But is it worth it doing > it more carefully like this? LMKWYT. The below should work. Not sure it is worth it but before doing this... During this discussion, I think I am not clear what is the use case on bpf_{g,s}et_retval() for cg_skb. Could you describe how it will be used in your use case? Is it for another tracing program to get a different return value from (eg.) sk_filter_trim_cap or ip[6]_output? Not meaning the helper should not be exposed. It is easier to think with some examples. > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index fd113bd2f79c..dcd25561f8d4 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -39,6 +39,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > const struct bpf_prog_array *array; > struct bpf_run_ctx *old_run_ctx; > struct bpf_cg_run_ctx run_ctx; > + bool seen_return0 = false; > u32 func_ret; > > run_ctx.retval = retval; > @@ -54,13 +55,17 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > *(ret_flags) |= (func_ret >> 1); > func_ret &= 1; > } > - if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) > + if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) { > run_ctx.retval = -EPERM; > + seen_return0 = true; > + } > item++; > } > bpf_reset_run_ctx(old_run_ctx); > rcu_read_unlock(); > migrate_enable(); > + if (IS_ERR_VALUE((long)run_ctx.retval) && !seen_return0) > + *ret_flags = 0; > return run_ctx.retval; > } > > > > If it turns out to be super ugly, we can try to fix it. (not sure how > > > much of a uapi that is) > > sgtm. > > > > > > > > > > > > > > > > Let me verify and I can add a note to bpf_set_retval uapi definition > > > > > to mention that it just overrides EPERM. bpf_set_retval should > > > > > probably not talk about userspace/syscall and instead use the words > > > > > like "caller". > > > > yeah, it is no longer syscall return value only now.
On Thu, Aug 18, 2022 at 11:16 AM Martin KaFai Lau <kafai@fb.com> wrote: > > On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote: > > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > > > > > > > - cg_skb_func_proto > > > > > > > > - sk_filter_func_proto > > > > > > > > - bpf_sk_base_func_proto > > > > > > > > - bpf_base_func_proto > > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > > > > It will be a useful doc to add to the uapi bpf.h for > > > > > > > the bpf_set_retval() helper. > > > > > > > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > > > > think the flags can be exported via bpf_set_retval, it just lets the > > > > > > users override EPERM. > > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > > > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > > > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > > > > is expecting after calling bpf_set_retval(-Exxxx) ? > > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > > > > > > > I think we used to have "return 0/1/2/3" to indicate different > > > > conditions but then switched to "return 1/0" + flags. > > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? > > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. > > > > Yes, right now that's that case. What I meant to say is that for the > > BPF program itself, the api is still "return a set of predefined > > values". We don't advertise the flags to the bpf programs. 'return 2' > > is a perfectly valid return for cgroup/egress that will tx the packet > > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 << > > 1)') > > > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > > > > "return 3" api-wise. > > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does > > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? > > > > I think bpf_set_retval takes precedence and in this case bpf_prog_run > > wrapper will return -EPERM to the caller. > > Will try to document that as well. > > > > > > I guess we can make bpf_set_retval override that but let me start by > > > > trying to document what we currently have. > > > To be clear, for cg_skb case, I meant to clear the ret_flags only if > > > run_ctx.retval is set. > > > > Are you suggesting something like the following? > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > index fd113bd2f79c..c110cbe52001 100644 > > --- a/kernel/bpf/cgroup.c > > +++ b/kernel/bpf/cgroup.c > > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > > bpf_reset_run_ctx(old_run_ctx); > > rcu_read_unlock(); > > migrate_enable(); > > + if (IS_ERR_VALUE((long)run_ctx.retval)) > > + *ret_flags = 0; > > return run_ctx.retval; > > } > > > > I think this will break the 'return 2' case? But is it worth it doing > > it more carefully like this? LMKWYT. > The below should work. Not sure it is worth it > but before doing this... > > During this discussion, I think I am not clear what is the use case > on bpf_{g,s}et_retval() for cg_skb. Could you describe how it will be > used in your use case? Is it for another tracing program to get > a different return value from (eg.) sk_filter_trim_cap or ip[6]_output? > > Not meaning the helper should not be exposed. It is easier > to think with some examples. I don't really need them in cg_skb, I want them in cg_sock so I can return a custom errno from socket() syscall. You're probably right and it doesn't make sense to support them in cg_skb. Most of the BPF_CGROUP_RUN_PROG_INET_INGRESS/BPF_CGROUP_RUN_PROG_INET_EGRESS callers don't seem to care about returned error code? (from my brief grepping) Let's maybe err on the safe side and special case cg_skb for now (in cgroup_common_func_proto) and not expose retval helpers?
On Thu, Aug 18, 2022 at 11:30:21AM -0700, Stanislav Fomichev wrote: > On Thu, Aug 18, 2022 at 11:16 AM Martin KaFai Lau <kafai@fb.com> wrote: > > > > On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote: > > > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > > > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > > > > > > > > > - cg_skb_func_proto > > > > > > > > > - sk_filter_func_proto > > > > > > > > > - bpf_sk_base_func_proto > > > > > > > > > - bpf_base_func_proto > > > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > > > > > It will be a useful doc to add to the uapi bpf.h for > > > > > > > > the bpf_set_retval() helper. > > > > > > > > > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > > > > > think the flags can be exported via bpf_set_retval, it just lets the > > > > > > > users override EPERM. > > > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > > > > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > > > > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > > > > > is expecting after calling bpf_set_retval(-Exxxx) ? > > > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > > > > > > > > > I think we used to have "return 0/1/2/3" to indicate different > > > > > conditions but then switched to "return 1/0" + flags. > > > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? > > > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. > > > > > > Yes, right now that's that case. What I meant to say is that for the > > > BPF program itself, the api is still "return a set of predefined > > > values". We don't advertise the flags to the bpf programs. 'return 2' > > > is a perfectly valid return for cgroup/egress that will tx the packet > > > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 << > > > 1)') > > > > > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > > > > > "return 3" api-wise. > > > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does > > > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? > > > > > > I think bpf_set_retval takes precedence and in this case bpf_prog_run > > > wrapper will return -EPERM to the caller. > > > Will try to document that as well. > > > > > > > > I guess we can make bpf_set_retval override that but let me start by > > > > > trying to document what we currently have. > > > > To be clear, for cg_skb case, I meant to clear the ret_flags only if > > > > run_ctx.retval is set. > > > > > > Are you suggesting something like the following? > > > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > > index fd113bd2f79c..c110cbe52001 100644 > > > --- a/kernel/bpf/cgroup.c > > > +++ b/kernel/bpf/cgroup.c > > > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > > > bpf_reset_run_ctx(old_run_ctx); > > > rcu_read_unlock(); > > > migrate_enable(); > > > + if (IS_ERR_VALUE((long)run_ctx.retval)) > > > + *ret_flags = 0; > > > return run_ctx.retval; > > > } > > > > > > I think this will break the 'return 2' case? But is it worth it doing > > > it more carefully like this? LMKWYT. > > The below should work. Not sure it is worth it > > but before doing this... > > > > During this discussion, I think I am not clear what is the use case > > on bpf_{g,s}et_retval() for cg_skb. Could you describe how it will be > > used in your use case? Is it for another tracing program to get > > a different return value from (eg.) sk_filter_trim_cap or ip[6]_output? > > > > Not meaning the helper should not be exposed. It is easier > > to think with some examples. > > I don't really need them in cg_skb, I want them in cg_sock so I can > return a custom errno from socket() syscall. bpf_{g,s}et_retval() will probably be useful for sock_addr too. There is a BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE falling into a similar bucket. However, I think it should be fine since the higher bit is only used when the bpf prog returns OK. > You're probably right and it doesn't make sense to support them in > cg_skb. Most of the > BPF_CGROUP_RUN_PROG_INET_INGRESS/BPF_CGROUP_RUN_PROG_INET_EGRESS > callers don't seem to care about returned error code? (from my brief > grepping) > Let's maybe err on the safe side and special case cg_skb for now (in > cgroup_common_func_proto) and not expose retval helpers? Ah. I was under the wrong impression that you have a use case on cg_skb because cg_skb_func_proto was in the stack example above :) Yep, lets disable the bpf_{g,s}et_retval() helper for CGROUP_SKB and also SOCK_OP(S) for now.