Message ID | 20230824143959.1134019-2-liujian56@huawei.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | add BPF_F_PERMANENT flag for sockmap skmsg redirect | expand |
On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote: > If the sockmap msg redirection function is used only to forward packets > and no other operation, the execution result of the BPF_SK_MSG_VERDICT > program is the same each time. In this case, the BPF program only needs to > be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and > bpf_msg_redirect_hash() to implement this ability. > > Then we can enable this function in the bpf program as follows: > bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); > > Test results using netperf TCP_STREAM mode: > for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then > netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m > done > > before: > 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 > after: > 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 > > Signed-off-by: Liu Jian <liujian56@huawei.com> > --- > include/linux/skmsg.h | 1 + > include/uapi/linux/bpf.h | 15 +++++++++++++-- > net/core/skmsg.c | 5 +++++ > net/core/sock_map.c | 4 ++-- > net/ipv4/tcp_bpf.c | 18 +++++++++++++----- > tools/include/uapi/linux/bpf.h | 15 +++++++++++++-- > 6 files changed, 47 insertions(+), 11 deletions(-) > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > index 054d7911bfc9..2f4e9811ff85 100644 > --- a/include/linux/skmsg.h > +++ b/include/linux/skmsg.h > @@ -82,6 +82,7 @@ struct sk_psock { > u32 cork_bytes; > u32 eval; > bool redir_ingress; /* undefined if sk_redir is null */ > + bool redir_permanent; > struct sk_msg *cork; > struct sk_psock_progs progs; > #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 70da85200695..f4de1ba390b4 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3004,7 +3004,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). There are some grammar mistakes here we need to fix. Hint - I find it helpful to run the text through an online grammar checker or an AI chatbot when unsure. Either way, let's reword so the flags are clearly listed out, since we now have two of them: The following *flags* are supported: **BPF_F_INGRESS** Both ingress and egress interfaces can be used for redirection. The **BPF_F_INGRESS** value in *flags* is used to make the distinction. Ingress path is selected if the flag is present, egress path otherwise. **BPF_F_PERMANENT** Indicates that redirect verdict and the target socket should be remembered. The verdict program will not be run for subsequent packets, unless an error occurs when forwarding packets. **BPF_F_PERMANENT** cannot be use together with **bpf_msg_apply_bytes**\ () and **bpf_msg_cork_bytes**\ (). If either has been called, the **BPF_F_PERMANENT** flag is ignored. Please check the formatting is correct with: ./scripts/bpf_doc.py --filename include/uapi/linux/bpf.h \ | rst2man /dev/stdin | man /dev/stdin That said, I'm not sure I like these semantics - flag being ignored under some circumstances. It leads to a silent failure. If I've asked for a permanent redirect, but it is cannot work in my BPF program, I'd rather get an error from bpf_msg_redirect_map/hash(), than be surprised that it is getting executed for every packet and have to troubleshoot why. > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -3276,7 +3281,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -5872,6 +5882,7 @@ enum { > /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ > enum { > BPF_F_INGRESS = (1ULL << 0), > + BPF_F_PERMANENT = (1ULL << 1), > }; > > /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index a29508e1ff35..df1443cf5fbd 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, > goto out; > } > psock->redir_ingress = sk_msg_to_ingress(msg); > + if (!msg->apply_bytes && !msg->cork_bytes) > + psock->redir_permanent = > + msg->flags & BPF_F_PERMANENT; > + else > + psock->redir_permanent = false; Above can be rewritten as: psock->redir_permanent = !msg->apply_bytes && !msg->cork_bytes && (msg->flags & BPF_F_PERMANENT); But as I wrote earlier, I don't think it's a good idea to ignore the flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* helper is called and return an error. Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need to be adjusted to return an error if BPF_F_PERMANENT has been set. > psock->sk_redir = msg->sk_redir; > sock_hold(psock->sk_redir); > } > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 08ab108206bf..35a361614f5e 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg, > { > struct sock *sk; > > - if (unlikely(flags & ~(BPF_F_INGRESS))) > + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) > return SK_DROP; > > sk = __sock_map_lookup_elem(map, key); > @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg, > { > struct sock *sk; > > - if (unlikely(flags & ~(BPF_F_INGRESS))) > + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) > return SK_DROP; > > sk = __sock_hash_lookup_elem(map, key); > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index 81f0dff69e0b..b53e356562a6 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c > @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > if (!psock->apply_bytes) { > /* Clean up before releasing the sock lock. */ > eval = psock->eval; > - psock->eval = __SK_NONE; > - psock->sk_redir = NULL; > + if (!psock->redir_permanent) { > + psock->eval = __SK_NONE; > + psock->sk_redir = NULL; > + } > } > if (psock->cork) { > cork = true; > @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, > msg, tosend, flags); > sent = origsize - msg->sg.size; > + /* disable the ability when something wrong */ > + if (unlikely(ret < 0)) > + psock->redir_permanent = 0; > > - if (eval == __SK_REDIRECT) > + if (!psock->redir_permanent && eval == __SK_REDIRECT) { I believe eval == __SK_REDIRECT is always true here, and the eval local variable is redundant. We will be in this switch branch only if psock->eval == __SK_REDIRECT. It's something we missed during the review of cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function"). > sock_put(sk_redir); > + psock->sk_redir = NULL; > + psock->eval = __SK_NONE; > + } > > lock_sock(sk); > if (unlikely(ret < 0)) { > @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > } > > if (likely(!ret)) { > - if (!psock->apply_bytes) { > - psock->eval = __SK_NONE; > + if (!psock->apply_bytes && !psock->redir_permanent) { > + psock->eval = __SK_NONE; > if (psock->sk_redir) { > sock_put(psock->sk_redir); > psock->sk_redir = NULL; > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 70da85200695..f4de1ba390b4 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -3004,7 +3004,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -3276,7 +3281,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -5872,6 +5882,7 @@ enum { > /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ > enum { > BPF_F_INGRESS = (1ULL << 0), > + BPF_F_PERMANENT = (1ULL << 1), > }; > > /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
Jakub Sitnicki wrote: > On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote: > > If the sockmap msg redirection function is used only to forward packets > > and no other operation, the execution result of the BPF_SK_MSG_VERDICT > > program is the same each time. In this case, the BPF program only needs to > > be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and > > bpf_msg_redirect_hash() to implement this ability. > > > > Then we can enable this function in the bpf program as follows: > > bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); > > > > Test results using netperf TCP_STREAM mode: > > for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then > > netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m > > done > > > > before: > > 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 > > after: > > 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 > > > > Signed-off-by: Liu Jian <liujian56@huawei.com> [...] > > /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > index a29508e1ff35..df1443cf5fbd 100644 > > --- a/net/core/skmsg.c > > +++ b/net/core/skmsg.c > > @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, > > goto out; > > } > > psock->redir_ingress = sk_msg_to_ingress(msg); > > + if (!msg->apply_bytes && !msg->cork_bytes) > > + psock->redir_permanent = > > + msg->flags & BPF_F_PERMANENT; > > + else > > + psock->redir_permanent = false; > > Above can be rewritten as: > > psock->redir_permanent = !msg->apply_bytes && > !msg->cork_bytes && > (msg->flags & BPF_F_PERMANENT); > > But as I wrote earlier, I don't think it's a good idea to ignore the > flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* > helper is called and return an error. > > Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need > to be adjusted to return an error if BPF_F_PERMANENT has been set. So far we've not really done much to protect a user from doing rather silly things. The following will all do something without errors, bpf_msg_apply_bytes() bpf_msg_apply_bytes() <- reset apply bytes bpf_msg_cork_bytes() bpf_msg_cork_bytes() <- resets cork byte also, bpf_msg_redirect(..., BPF_F_INGRESS); bpf_msg_redirect(..., 0); <- resets sk_redir and flags maybe there is some valid reason to even do above if further parsing identifies some reason to redirect to a alert socket or something. My original thinking was in the interest of not having a bunch of extra checks for performance reasons we shouldn't add guard rails unless something really unexpected might happen like a kernel panic or what not. This does feel a bit different though because before we didn't have calls that could impact other calls. My best idea is to just create a precedence and follow it. I would propose, 'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are ignored.' The other direction (what is above?) has a bit of an inconsistency where these two flows are different? bpf_apply_bytes() bpf_msg_redirect(..., BPF_F_PERMANENT) and bpf_msg_redirect(..., BPF_F_PERMANENT) bpf_apply_bytes() It would be best if order of operations doesn't change the outcome because that starts to get really hard to reason about. This avoids having to add checks all over the place and then if users want we could give some mechanisms to read apply and cork bytes so people could write macros over those if they really want the hard error. WDYT? [...] Thanks!
On 2023/8/25 21:04, Jakub Sitnicki wrote: > On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote: >> If the sockmap msg redirection function is used only to forward packets >> and no other operation, the execution result of the BPF_SK_MSG_VERDICT >> program is the same each time. In this case, the BPF program only needs to >> be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and >> bpf_msg_redirect_hash() to implement this ability. >> >> Then we can enable this function in the bpf program as follows: >> bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); >> >> Test results using netperf TCP_STREAM mode: >> for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then >> netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m >> done >> >> before: >> 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 >> after: >> 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 >> >> Signed-off-by: Liu Jian <liujian56@huawei.com> >> --- >> include/linux/skmsg.h | 1 + >> include/uapi/linux/bpf.h | 15 +++++++++++++-- >> net/core/skmsg.c | 5 +++++ >> net/core/sock_map.c | 4 ++-- >> net/ipv4/tcp_bpf.c | 18 +++++++++++++----- >> tools/include/uapi/linux/bpf.h | 15 +++++++++++++-- >> 6 files changed, 47 insertions(+), 11 deletions(-) >> >> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h >> index 054d7911bfc9..2f4e9811ff85 100644 >> --- a/include/linux/skmsg.h >> +++ b/include/linux/skmsg.h >> @@ -82,6 +82,7 @@ struct sk_psock { >> u32 cork_bytes; >> u32 eval; >> bool redir_ingress; /* undefined if sk_redir is null */ >> + bool redir_permanent; >> struct sk_msg *cork; >> struct sk_psock_progs progs; >> #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >> index 70da85200695..f4de1ba390b4 100644 >> --- a/include/uapi/linux/bpf.h >> +++ b/include/uapi/linux/bpf.h >> @@ -3004,7 +3004,12 @@ union bpf_attr { >> * egress interfaces can be used for redirection. The >> * **BPF_F_INGRESS** value in *flags* is used to make the >> * distinction (ingress path is selected if the flag is present, >> - * egress path otherwise). This is the only flag supported for now. >> + * egress path otherwise). The **BPF_F_PERMANENT** value in >> + * *flags* is used to indicates whether the eBPF result is >> + * permanently (please note that, BPF_F_PERMANENT does not work with >> + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or >> + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is >> + * automatically disabled). > > There are some grammar mistakes here we need to fix. Hint - I find it > helpful to run the text through an online grammar checker or an AI > chatbot when unsure. > > Either way, let's reword so the flags are clearly listed out, since we > now have two of them: > > The following *flags* are supported: > > **BPF_F_INGRESS** > Both ingress and egress interfaces can be used for redirection. > The **BPF_F_INGRESS** value in *flags* is used to make the > distinction. Ingress path is selected if the flag is present, > egress path otherwise. > **BPF_F_PERMANENT** > Indicates that redirect verdict and the target socket should be > remembered. The verdict program will not be run for subsequent > packets, unless an error occurs when forwarding packets. > > **BPF_F_PERMANENT** cannot be use together with > **bpf_msg_apply_bytes**\ () and **bpf_msg_cork_bytes**\ (). If > either has been called, the **BPF_F_PERMANENT** flag is ignored. > > Thanks a lot. > Please check the formatting is correct with: > > ./scripts/bpf_doc.py --filename include/uapi/linux/bpf.h \ > | rst2man /dev/stdin | man /dev/stdin > > That said, I'm not sure I like these semantics - flag being ignored > under some circumstances. It leads to a silent failure. > > If I've asked for a permanent redirect, but it is cannot work in my BPF > program, I'd rather get an error from bpf_msg_redirect_map/hash(), than > be surprised that it is getting executed for every packet and have to > troubleshoot why. > >> * Return >> * **SK_PASS** on success, or **SK_DROP** on error. >> * >> @@ -3276,7 +3281,12 @@ union bpf_attr { >> * egress interfaces can be used for redirection. The >> * **BPF_F_INGRESS** value in *flags* is used to make the >> * distinction (ingress path is selected if the flag is present, >> - * egress path otherwise). This is the only flag supported for now. >> + * egress path otherwise). The **BPF_F_PERMANENT** value in >> + * *flags* is used to indicates whether the eBPF result is >> + * permanently (please note that, BPF_F_PERMANENT does not work with >> + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or >> + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is >> + * automatically disabled). >> * Return >> * **SK_PASS** on success, or **SK_DROP** on error. >> * >> @@ -5872,6 +5882,7 @@ enum { >> /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ >> enum { >> BPF_F_INGRESS = (1ULL << 0), >> + BPF_F_PERMANENT = (1ULL << 1), >> }; >> >> /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ >> diff --git a/net/core/skmsg.c b/net/core/skmsg.c >> index a29508e1ff35..df1443cf5fbd 100644 >> --- a/net/core/skmsg.c >> +++ b/net/core/skmsg.c >> @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, >> goto out; >> } >> psock->redir_ingress = sk_msg_to_ingress(msg); >> + if (!msg->apply_bytes && !msg->cork_bytes) >> + psock->redir_permanent = >> + msg->flags & BPF_F_PERMANENT; >> + else >> + psock->redir_permanent = false; > > Above can be rewritten as: > > psock->redir_permanent = !msg->apply_bytes && > !msg->cork_bytes && > (msg->flags & BPF_F_PERMANENT); > Will change it in v4. Thanks. > But as I wrote earlier, I don't think it's a good idea to ignore the > flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* > helper is called and return an error. > > Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need > to be adjusted to return an error if BPF_F_PERMANENT has been set. OK. If so, is the configuration that is set first and has a higher priority? > >> psock->sk_redir = msg->sk_redir; >> sock_hold(psock->sk_redir); >> } >> diff --git a/net/core/sock_map.c b/net/core/sock_map.c >> index 08ab108206bf..35a361614f5e 100644 >> --- a/net/core/sock_map.c >> +++ b/net/core/sock_map.c >> @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg, >> { >> struct sock *sk; >> >> - if (unlikely(flags & ~(BPF_F_INGRESS))) >> + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) >> return SK_DROP; >> >> sk = __sock_map_lookup_elem(map, key); >> @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg, >> { >> struct sock *sk; >> >> - if (unlikely(flags & ~(BPF_F_INGRESS))) >> + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) >> return SK_DROP; >> >> sk = __sock_hash_lookup_elem(map, key); >> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c >> index 81f0dff69e0b..b53e356562a6 100644 >> --- a/net/ipv4/tcp_bpf.c >> +++ b/net/ipv4/tcp_bpf.c >> @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, >> if (!psock->apply_bytes) { >> /* Clean up before releasing the sock lock. */ >> eval = psock->eval; >> - psock->eval = __SK_NONE; >> - psock->sk_redir = NULL; >> + if (!psock->redir_permanent) { >> + psock->eval = __SK_NONE; >> + psock->sk_redir = NULL; >> + } >> } >> if (psock->cork) { >> cork = true; >> @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, >> ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, >> msg, tosend, flags); >> sent = origsize - msg->sg.size; >> + /* disable the ability when something wrong */ >> + if (unlikely(ret < 0)) >> + psock->redir_permanent = 0; >> >> - if (eval == __SK_REDIRECT) >> + if (!psock->redir_permanent && eval == __SK_REDIRECT) { > > I believe eval == __SK_REDIRECT is always true here, and the eval local > variable is redundant. We will be in this switch branch only if > psock->eval == __SK_REDIRECT. It's something we missed during the review > of cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the > tcp_bpf_send_verdict function"). > When apply_bytes is set, eval == __SK_REDIRECT is not true. >> sock_put(sk_redir); >> + psock->sk_redir = NULL; >> + psock->eval = __SK_NONE; >> + } >> >> lock_sock(sk); >> if (unlikely(ret < 0)) { >> @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, >> } >> >> if (likely(!ret)) { >> - if (!psock->apply_bytes) { >> - psock->eval = __SK_NONE; >> + if (!psock->apply_bytes && !psock->redir_permanent) { >> + psock->eval = __SK_NONE; >> if (psock->sk_redir) { >> sock_put(psock->sk_redir); >> psock->sk_redir = NULL; >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index 70da85200695..f4de1ba390b4 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -3004,7 +3004,12 @@ union bpf_attr { >> * egress interfaces can be used for redirection. The >> * **BPF_F_INGRESS** value in *flags* is used to make the >> * distinction (ingress path is selected if the flag is present, >> - * egress path otherwise). This is the only flag supported for now. >> + * egress path otherwise). The **BPF_F_PERMANENT** value in >> + * *flags* is used to indicates whether the eBPF result is >> + * permanently (please note that, BPF_F_PERMANENT does not work with >> + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or >> + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is >> + * automatically disabled). >> * Return >> * **SK_PASS** on success, or **SK_DROP** on error. >> * >> @@ -3276,7 +3281,12 @@ union bpf_attr { >> * egress interfaces can be used for redirection. The >> * **BPF_F_INGRESS** value in *flags* is used to make the >> * distinction (ingress path is selected if the flag is present, >> - * egress path otherwise). This is the only flag supported for now. >> + * egress path otherwise). The **BPF_F_PERMANENT** value in >> + * *flags* is used to indicates whether the eBPF result is >> + * permanently (please note that, BPF_F_PERMANENT does not work with >> + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or >> + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is >> + * automatically disabled). >> * Return >> * **SK_PASS** on success, or **SK_DROP** on error. >> * >> @@ -5872,6 +5882,7 @@ enum { >> /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ >> enum { >> BPF_F_INGRESS = (1ULL << 0), >> + BPF_F_PERMANENT = (1ULL << 1), >> }; >> >> /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ >
On 2023/8/26 9:32, John Fastabend wrote: > Jakub Sitnicki wrote: >> On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote: >>> If the sockmap msg redirection function is used only to forward packets >>> and no other operation, the execution result of the BPF_SK_MSG_VERDICT >>> program is the same each time. In this case, the BPF program only needs to >>> be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and >>> bpf_msg_redirect_hash() to implement this ability. >>> >>> Then we can enable this function in the bpf program as follows: >>> bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); >>> >>> Test results using netperf TCP_STREAM mode: >>> for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then >>> netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m >>> done >>> >>> before: >>> 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 >>> after: >>> 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 >>> >>> Signed-off-by: Liu Jian <liujian56@huawei.com> > > [...] > >>> /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ >>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c >>> index a29508e1ff35..df1443cf5fbd 100644 >>> --- a/net/core/skmsg.c >>> +++ b/net/core/skmsg.c >>> @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, >>> goto out; >>> } >>> psock->redir_ingress = sk_msg_to_ingress(msg); >>> + if (!msg->apply_bytes && !msg->cork_bytes) >>> + psock->redir_permanent = >>> + msg->flags & BPF_F_PERMANENT; >>> + else >>> + psock->redir_permanent = false; >> >> Above can be rewritten as: >> >> psock->redir_permanent = !msg->apply_bytes && >> !msg->cork_bytes && >> (msg->flags & BPF_F_PERMANENT); >> >> But as I wrote earlier, I don't think it's a good idea to ignore the >> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* >> helper is called and return an error. >> >> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need >> to be adjusted to return an error if BPF_F_PERMANENT has been set. > > So far we've not really done much to protect a user from doing > rather silly things. The following will all do something without > errors, > > bpf_msg_apply_bytes() > bpf_msg_apply_bytes() <- reset apply bytes > > bpf_msg_cork_bytes() > bpf_msg_cork_bytes() <- resets cork byte > > also, > > bpf_msg_redirect(..., BPF_F_INGRESS); > bpf_msg_redirect(..., 0); <- resets sk_redir and flags > > maybe there is some valid reason to even do above if further parsing > identifies some reason to redirect to a alert socket or something. > > My original thinking was in the interest of not having a bunch of > extra checks for performance reasons we shouldn't add guard rails > unless something really unexpected might happen like a kernel > panic or what not. > > This does feel a bit different though because before we > didn't have calls that could impact other calls. My best idea > is to just create a precedence and follow it. I would propose, > > 'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are > ignored.' > I think it's better. Both low-priority or high-priority are ok for me. But I think it's better that BPF_F_PERMANENT has a low priority. Because BPF_F_PERMANEN is only for performance, and apply_bytes or cork_bytes may be used to a user logic function. > The other direction (what is above?) has a bit of an inconsistency > where these two flows are different? > > bpf_apply_bytes() > bpf_msg_redirect(..., BPF_F_PERMANENT) > > and > > bpf_msg_redirect(..., BPF_F_PERMANENT) > bpf_apply_bytes() > > It would be best if order of operations doesn't change the > outcome because that starts to get really hard to reason about. > > This avoids having to add checks all over the place and then > if users want we could give some mechanisms to read apply > and cork bytes so people could write macros over those if > they really want the hard error. > > WDYT? > > [...] > > Thanks!
On Fri, Aug 25, 2023 at 06:32 PM -07, John Fastabend wrote: > Jakub Sitnicki wrote: [...] >> But as I wrote earlier, I don't think it's a good idea to ignore the >> flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* >> helper is called and return an error. >> >> Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need >> to be adjusted to return an error if BPF_F_PERMANENT has been set. > > So far we've not really done much to protect a user from doing > rather silly things. The following will all do something without > errors, > > bpf_msg_apply_bytes() > bpf_msg_apply_bytes() <- reset apply bytes > > bpf_msg_cork_bytes() > bpf_msg_cork_bytes() <- resets cork byte > > also, > > bpf_msg_redirect(..., BPF_F_INGRESS); > bpf_msg_redirect(..., 0); <- resets sk_redir and flags > > maybe there is some valid reason to even do above if further parsing > identifies some reason to redirect to a alert socket or something. > > My original thinking was in the interest of not having a bunch of > extra checks for performance reasons we shouldn't add guard rails > unless something really unexpected might happen like a kernel > panic or what not. > > This does feel a bit different though because before we > didn't have calls that could impact other calls. My best idea > is to just create a precedence and follow it. I would propose, > > 'If BPF_F_PERMANENT is set apply_bytes and cork_bytes are > ignored.' > > The other direction (what is above?) has a bit of an inconsistency > where these two flows are different? > > bpf_apply_bytes() > bpf_msg_redirect(..., BPF_F_PERMANENT) > > and > > bpf_msg_redirect(..., BPF_F_PERMANENT) > bpf_apply_bytes() > > It would be best if order of operations doesn't change the > outcome because that starts to get really hard to reason about. > > This avoids having to add checks all over the place and then > if users want we could give some mechanisms to read apply > and cork bytes so people could write macros over those if > they really want the hard error. > > WDYT? These semantics sound sane to me. Easy to explain: BPF_F_PERMANENT takes precedence over apply/cork_bytes. Good point about order of operations.
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 054d7911bfc9..2f4e9811ff85 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -82,6 +82,7 @@ struct sk_psock { u32 cork_bytes; u32 eval; bool redir_ingress; /* undefined if sk_redir is null */ + bool redir_permanent; struct sk_msg *cork; struct sk_psock_progs progs; #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 70da85200695..f4de1ba390b4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -3004,7 +3004,12 @@ union bpf_attr { * egress interfaces can be used for redirection. The * **BPF_F_INGRESS** value in *flags* is used to make the * distinction (ingress path is selected if the flag is present, - * egress path otherwise). This is the only flag supported for now. + * egress path otherwise). The **BPF_F_PERMANENT** value in + * *flags* is used to indicates whether the eBPF result is + * permanently (please note that, BPF_F_PERMANENT does not work with + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is + * automatically disabled). * Return * **SK_PASS** on success, or **SK_DROP** on error. * @@ -3276,7 +3281,12 @@ union bpf_attr { * egress interfaces can be used for redirection. The * **BPF_F_INGRESS** value in *flags* is used to make the * distinction (ingress path is selected if the flag is present, - * egress path otherwise). This is the only flag supported for now. + * egress path otherwise). The **BPF_F_PERMANENT** value in + * *flags* is used to indicates whether the eBPF result is + * permanently (please note that, BPF_F_PERMANENT does not work with + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is + * automatically disabled). * Return * **SK_PASS** on success, or **SK_DROP** on error. * @@ -5872,6 +5882,7 @@ enum { /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ enum { BPF_F_INGRESS = (1ULL << 0), + BPF_F_PERMANENT = (1ULL << 1), }; /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ diff --git a/net/core/skmsg.c b/net/core/skmsg.c index a29508e1ff35..df1443cf5fbd 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, goto out; } psock->redir_ingress = sk_msg_to_ingress(msg); + if (!msg->apply_bytes && !msg->cork_bytes) + psock->redir_permanent = + msg->flags & BPF_F_PERMANENT; + else + psock->redir_permanent = false; psock->sk_redir = msg->sk_redir; sock_hold(psock->sk_redir); } diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 08ab108206bf..35a361614f5e 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg, { struct sock *sk; - if (unlikely(flags & ~(BPF_F_INGRESS))) + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) return SK_DROP; sk = __sock_map_lookup_elem(map, key); @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg, { struct sock *sk; - if (unlikely(flags & ~(BPF_F_INGRESS))) + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) return SK_DROP; sk = __sock_hash_lookup_elem(map, key); diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 81f0dff69e0b..b53e356562a6 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, if (!psock->apply_bytes) { /* Clean up before releasing the sock lock. */ eval = psock->eval; - psock->eval = __SK_NONE; - psock->sk_redir = NULL; + if (!psock->redir_permanent) { + psock->eval = __SK_NONE; + psock->sk_redir = NULL; + } } if (psock->cork) { cork = true; @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, msg, tosend, flags); sent = origsize - msg->sg.size; + /* disable the ability when something wrong */ + if (unlikely(ret < 0)) + psock->redir_permanent = 0; - if (eval == __SK_REDIRECT) + if (!psock->redir_permanent && eval == __SK_REDIRECT) { sock_put(sk_redir); + psock->sk_redir = NULL; + psock->eval = __SK_NONE; + } lock_sock(sk); if (unlikely(ret < 0)) { @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, } if (likely(!ret)) { - if (!psock->apply_bytes) { - psock->eval = __SK_NONE; + if (!psock->apply_bytes && !psock->redir_permanent) { + psock->eval = __SK_NONE; if (psock->sk_redir) { sock_put(psock->sk_redir); psock->sk_redir = NULL; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 70da85200695..f4de1ba390b4 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -3004,7 +3004,12 @@ union bpf_attr { * egress interfaces can be used for redirection. The * **BPF_F_INGRESS** value in *flags* is used to make the * distinction (ingress path is selected if the flag is present, - * egress path otherwise). This is the only flag supported for now. + * egress path otherwise). The **BPF_F_PERMANENT** value in + * *flags* is used to indicates whether the eBPF result is + * permanently (please note that, BPF_F_PERMANENT does not work with + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is + * automatically disabled). * Return * **SK_PASS** on success, or **SK_DROP** on error. * @@ -3276,7 +3281,12 @@ union bpf_attr { * egress interfaces can be used for redirection. The * **BPF_F_INGRESS** value in *flags* is used to make the * distinction (ingress path is selected if the flag is present, - * egress path otherwise). This is the only flag supported for now. + * egress path otherwise). The **BPF_F_PERMANENT** value in + * *flags* is used to indicates whether the eBPF result is + * permanently (please note that, BPF_F_PERMANENT does not work with + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is + * automatically disabled). * Return * **SK_PASS** on success, or **SK_DROP** on error. * @@ -5872,6 +5882,7 @@ enum { /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ enum { BPF_F_INGRESS = (1ULL << 0), + BPF_F_PERMANENT = (1ULL << 1), }; /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
If the sockmap msg redirection function is used only to forward packets and no other operation, the execution result of the BPF_SK_MSG_VERDICT program is the same each time. In this case, the BPF program only needs to be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and bpf_msg_redirect_hash() to implement this ability. Then we can enable this function in the bpf program as follows: bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); Test results using netperf TCP_STREAM mode: for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m done before: 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 after: 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 Signed-off-by: Liu Jian <liujian56@huawei.com> --- include/linux/skmsg.h | 1 + include/uapi/linux/bpf.h | 15 +++++++++++++-- net/core/skmsg.c | 5 +++++ net/core/sock_map.c | 4 ++-- net/ipv4/tcp_bpf.c | 18 +++++++++++++----- tools/include/uapi/linux/bpf.h | 15 +++++++++++++-- 6 files changed, 47 insertions(+), 11 deletions(-)