Message ID | 161661958954.28508.16923012330549206770.stgit@john-Precision-5820-Tower (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | bpf, sockmap fixes | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for bpf |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | fail | 1 blamed authors not CCed: jakub@cloudflare.com; 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org kafai@fb.com ast@kernel.org songliubraving@fb.com davem@davemloft.net jakub@cloudflare.com kuba@kernel.org |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 1 this patch: 1 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 53 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1 |
netdev/header_inline | success | Link |
On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Incorrect accounting fwd_alloc can result in a warning when the socket > is torn down, > > [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230 > [...] > [18455.319543] Call Trace: > [18455.319556] inet_csk_destroy_sock+0xba/0x1f0 > [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380 > [18455.319593] ? lock_downgrade+0x3a0/0x3a0 > [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0 > [18455.319631] ? sk_reset_timer+0x15/0x70 > [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240 > [18455.319663] ? lock_release+0xb2/0x3f0 > [18455.319676] ? __release_sock+0x8a/0x1b0 > [18455.319690] ? lock_downgrade+0x3a0/0x3a0 > [18455.319704] ? lock_release+0x3f0/0x3f0 > [18455.319717] ? __tcp_close+0x2c6/0x790 > [18455.319736] ? tcp_v4_do_rcv+0x168/0x370 > [18455.319750] tcp_v4_do_rcv+0x168/0x370 > [18455.319767] __release_sock+0xbc/0x1b0 > [18455.319785] __tcp_close+0x2ee/0x790 > [18455.319805] tcp_close+0x20/0x80 > > This currently happens because on redirect case we do skb_set_owner_r() > with the original sock. This increments the fwd_alloc memory accounting > on the original sock. Then on redirect we may push this into the queue > of the psock we are redirecting to. When the skb is flushed from the > queue we give the memory back to the original sock. The problem is if > the original sock is destroyed/closed with skbs on another psocks queue > then the original sock will not have a way to reclaim the memory before > being destroyed. Then above warning will be thrown > > sockA sockB > > sk_psock_strp_read() > sk_psock_verdict_apply() > -- SK_REDIRECT -- > sk_psock_skb_redirect() > skb_queue_tail(psock_other->ingress_skb..) > > sk_close() > sock_map_unref() > sk_psock_put() > sk_psock_drop() > sk_psock_zap_ingress() > > At this point we have torn down our own psock, but have the outstanding > skb in psock_other. Note that SK_PASS doesn't have this problem because > the sk_psock_drop() logic releases the skb, its still associated with > our psock. > > To resolve lets only account for sockets on the ingress queue that are > still associated with the current socket. On the redirect case we will > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked > or received with sk_psock_skb_ingress memory will be claimed on psock_other. You mean sk_psock_skb_ingress(), right? > > Reported-by: Andrii Nakryiko <andrii@kernel.org> > Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self") > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- > net/core/skmsg.c | 13 ++++++------- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 1261512d6807..f150b5b63561 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb > if (unlikely(!msg)) > return -EAGAIN; > sk_msg_init(msg); > + skb_set_owner_r(skb, sk); > return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); > } > > @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int > { > switch (verdict) { > case __SK_REDIRECT: > - skb_set_owner_r(skb, sk); > sk_psock_skb_redirect(skb); > break; > case __SK_PASS: > @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb) > rcu_read_lock(); > prog = READ_ONCE(psock->progs.skb_verdict); > if (likely(prog)) { > - /* We skip full set_owner_r here because if we do a SK_PASS > - * or SK_DROP we can skip skb memory accounting and use the > - * TLS context. > - */ > skb->sk = psock->sk; > tcp_skb_bpf_redirect_clear(skb); > ret = sk_psock_bpf_run(psock, prog, skb); > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) > kfree_skb(skb); > goto out; > } > - skb_set_owner_r(skb, sk); > prog = READ_ONCE(psock->progs.skb_verdict); > if (likely(prog)) { > + skb->sk = psock->sk; Why is skb_orphan() not needed here? Nit: You can just use 'sk' here, so "skb->sk = sk". > tcp_skb_bpf_redirect_clear(skb); > ret = sk_psock_bpf_run(psock, prog, skb); > ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); > + skb->sk = NULL; Why do you want to set it to NULL here? Thanks.
Cong Wang wrote: > On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Incorrect accounting fwd_alloc can result in a warning when the socket > > is torn down, > > [...] > > To resolve lets only account for sockets on the ingress queue that are > > still associated with the current socket. On the redirect case we will > > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting > > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked > > or received with sk_psock_skb_ingress memory will be claimed on psock_other. ^^^^^^^^^^^^^^^^^^^^ > > You mean sk_psock_skb_ingress(), right? Yes. [...] > > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) > > kfree_skb(skb); > > goto out; > > } > > - skb_set_owner_r(skb, sk); > > prog = READ_ONCE(psock->progs.skb_verdict); > > if (likely(prog)) { > > + skb->sk = psock->sk; > > Why is skb_orphan() not needed here? These come from strparser which do not have skb->sk set. > > Nit: You can just use 'sk' here, so "skb->sk = sk". Sure that is a bit nicer, will respin with this. > > > > tcp_skb_bpf_redirect_clear(skb); > > ret = sk_psock_bpf_run(psock, prog, skb); > > ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); > > + skb->sk = NULL; > > Why do you want to set it to NULL here? So we don't cause the stack to throw other errors later if we were to call skb_orphan for example. Various places in the skb helpers expect both skb->sk and skb->destructor to be set together and here we are just using it as a mechanism to feed the sk into the BPF program side. The above skb_set_owner_r for example would likely BUG(). > > Thanks.
On Wed, Mar 24, 2021 at 7:46 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > Incorrect accounting fwd_alloc can result in a warning when the socket > > > is torn down, > > > > > [...] > > > > To resolve lets only account for sockets on the ingress queue that are > > > still associated with the current socket. On the redirect case we will > > > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting > > > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked > > > or received with sk_psock_skb_ingress memory will be claimed on psock_other. > ^^^^^^^^^^^^^^^^^^^^ > > > > You mean sk_psock_skb_ingress(), right? > > Yes. skb_send_sock_locked() actually allocates its own skb when sending, hence it uses a different skb for memory accounting. > > [...] > > > > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) > > > kfree_skb(skb); > > > goto out; > > > } > > > - skb_set_owner_r(skb, sk); > > > prog = READ_ONCE(psock->progs.skb_verdict); > > > if (likely(prog)) { > > > + skb->sk = psock->sk; > > > > Why is skb_orphan() not needed here? > > These come from strparser which do not have skb->sk set. Hmm, but sk_psock_verdict_recv() passes a clone too, like strparser, so either we need it for both, or not at all. Clones do not have skb->sk, so I think you can remove the one in sk_psock_verdict_recv() too. > > > > > Nit: You can just use 'sk' here, so "skb->sk = sk". > > Sure that is a bit nicer, will respin with this. > > > > > > > > tcp_skb_bpf_redirect_clear(skb); > > > ret = sk_psock_bpf_run(psock, prog, skb); > > > ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); > > > + skb->sk = NULL; > > > > Why do you want to set it to NULL here? > > So we don't cause the stack to throw other errors later if we > were to call skb_orphan for example. Various places in the skb > helpers expect both skb->sk and skb->destructor to be set together > and here we are just using it as a mechanism to feed the sk into > the BPF program side. The above skb_set_owner_r for example > would likely BUG(). Sounds reasonable. Thanks.
Cong Wang wrote: > On Wed, Mar 24, 2021 at 7:46 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > Incorrect accounting fwd_alloc can result in a warning when the socket > > > > is torn down, > > > > > > > > [...] > > > > > > To resolve lets only account for sockets on the ingress queue that are > > > > still associated with the current socket. On the redirect case we will > > > > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting > > > > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked > > > > or received with sk_psock_skb_ingress memory will be claimed on psock_other. > > ^^^^^^^^^^^^^^^^^^^^ > > > > > > You mean sk_psock_skb_ingress(), right? > > > > Yes. > > skb_send_sock_locked() actually allocates its own skb when sending, hence > it uses a different skb for memory accounting. > > > > > [...] > > > > > > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) > > > > kfree_skb(skb); > > > > goto out; > > > > } > > > > - skb_set_owner_r(skb, sk); > > > > prog = READ_ONCE(psock->progs.skb_verdict); > > > > if (likely(prog)) { > > > > + skb->sk = psock->sk; > > > > > > Why is skb_orphan() not needed here? > > > > These come from strparser which do not have skb->sk set. > > Hmm, but sk_psock_verdict_recv() passes a clone too, like > strparser, so either we need it for both, or not at all. Clones > do not have skb->sk, so I think you can remove the one in > sk_psock_verdict_recv() too. Agree skb_orphan can just be removed, I was being overly paranoid. Thanks.
diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 1261512d6807..f150b5b63561 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb if (unlikely(!msg)) return -EAGAIN; sk_msg_init(msg); + skb_set_owner_r(skb, sk); return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); } @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int { switch (verdict) { case __SK_REDIRECT: - skb_set_owner_r(skb, sk); sk_psock_skb_redirect(skb); break; case __SK_PASS: @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb) rcu_read_lock(); prog = READ_ONCE(psock->progs.skb_verdict); if (likely(prog)) { - /* We skip full set_owner_r here because if we do a SK_PASS - * or SK_DROP we can skip skb memory accounting and use the - * TLS context. - */ skb->sk = psock->sk; tcp_skb_bpf_redirect_clear(skb); ret = sk_psock_bpf_run(psock, prog, skb); @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb) kfree_skb(skb); goto out; } - skb_set_owner_r(skb, sk); prog = READ_ONCE(psock->progs.skb_verdict); if (likely(prog)) { + skb->sk = psock->sk; tcp_skb_bpf_redirect_clear(skb); ret = sk_psock_bpf_run(psock, prog, skb); ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); + skb->sk = NULL; } sk_psock_verdict_apply(psock, skb, ret); out: @@ -956,12 +953,14 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb, kfree_skb(skb); goto out; } - skb_set_owner_r(skb, sk); prog = READ_ONCE(psock->progs.skb_verdict); if (likely(prog)) { + skb_orphan(skb); + skb->sk = sk; tcp_skb_bpf_redirect_clear(skb); ret = sk_psock_bpf_run(psock, prog, skb); ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb)); + skb->sk = NULL; } sk_psock_verdict_apply(psock, skb, ret); out:
Incorrect accounting fwd_alloc can result in a warning when the socket is torn down, [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230 [...] [18455.319543] Call Trace: [18455.319556] inet_csk_destroy_sock+0xba/0x1f0 [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380 [18455.319593] ? lock_downgrade+0x3a0/0x3a0 [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0 [18455.319631] ? sk_reset_timer+0x15/0x70 [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240 [18455.319663] ? lock_release+0xb2/0x3f0 [18455.319676] ? __release_sock+0x8a/0x1b0 [18455.319690] ? lock_downgrade+0x3a0/0x3a0 [18455.319704] ? lock_release+0x3f0/0x3f0 [18455.319717] ? __tcp_close+0x2c6/0x790 [18455.319736] ? tcp_v4_do_rcv+0x168/0x370 [18455.319750] tcp_v4_do_rcv+0x168/0x370 [18455.319767] __release_sock+0xbc/0x1b0 [18455.319785] __tcp_close+0x2ee/0x790 [18455.319805] tcp_close+0x20/0x80 This currently happens because on redirect case we do skb_set_owner_r() with the original sock. This increments the fwd_alloc memory accounting on the original sock. Then on redirect we may push this into the queue of the psock we are redirecting to. When the skb is flushed from the queue we give the memory back to the original sock. The problem is if the original sock is destroyed/closed with skbs on another psocks queue then the original sock will not have a way to reclaim the memory before being destroyed. Then above warning will be thrown sockA sockB sk_psock_strp_read() sk_psock_verdict_apply() -- SK_REDIRECT -- sk_psock_skb_redirect() skb_queue_tail(psock_other->ingress_skb..) sk_close() sock_map_unref() sk_psock_put() sk_psock_drop() sk_psock_zap_ingress() At this point we have torn down our own psock, but have the outstanding skb in psock_other. Note that SK_PASS doesn't have this problem because the sk_psock_drop() logic releases the skb, its still associated with our psock. To resolve lets only account for sockets on the ingress queue that are still associated with the current socket. On the redirect case we will check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting until skb is actually enqueued. When the skb is sent via skb_send_sock_locked or received with sk_psock_skb_ingress memory will be claimed on psock_other. Reported-by: Andrii Nakryiko <andrii@kernel.org> Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self") Signed-off-by: John Fastabend <john.fastabend@gmail.com> --- net/core/skmsg.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-)