
[bpf,1/4] bpf, sockmap: Remove unhash handler for BPF sockmap usage

Message ID: 20211011191647.418704-2-john.fastabend@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: bpf, sockmap: fixes stress testing and regression

Checks

Context Check Description
bpf/vmtest-bpf-PR fail PR summary
netdev/cover_letter success Series has a cover letter
netdev/fixes_present success Fixes tag present in non-next series
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers warning 13 maintainers not CCed: andrii@kernel.org edumazet@google.com ast@kernel.org songliubraving@fb.com yoshfuji@linux-ipv6.org yhs@fb.com davem@davemloft.net kafai@fb.com jakub@cloudflare.com kuba@kernel.org dsahern@kernel.org lmb@cloudflare.com kpsingh@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Fixes tag looks correct
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 7 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success No static functions without inline keyword in header files
bpf/vmtest-bpf fail VM_Test

Commit Message

John Fastabend Oct. 11, 2021, 7:16 p.m. UTC
We do not need to handle unhash from the BPF side; we can simply wait for the
close to happen. The original concern was that a socket could transition from
ESTABLISHED state to a new state while the BPF hook was still attached. But
we convinced ourselves this is no longer possible, and we also improved BPF
sockmap to handle listen sockets, so this is no longer a problem.

More importantly, there are cases where unhash is called while data is in the
receive queue. The BPF unhash logic will flush this data, which is wrong. To
be correct it should keep the data in the receive queue and allow a receiving
application to continue reading it. This may happen when tcp_abort is
invoked, for example. Instead of complicating the unhash logic, simply moving
all of this to the tcp_close hook solves the problem.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c | 1 -
 1 file changed, 1 deletion(-)
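
For context, the tcp_abort() path the commit message refers to reaches unhash
roughly like this (paraphrased call chain for orientation, not part of the
patch):

  tcp_abort(sk, err)               /* e.g. via the inet_diag SOCK_DESTROY operation */
    tcp_done(sk)
      tcp_set_state(sk, TCP_CLOSE)
        sk->sk_prot->unhash(sk)    /* sock_map_unhash for sockmap sockets, before this patch */

Data already queued on sk->sk_receive_queue is left alone by the TCP side
here, which is why flushing it from the BPF unhash handler loses data a
reader could still consume.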

Comments

Jakub Sitnicki Oct. 19, 2021, 7:17 a.m. UTC | #1
On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> We do not need to handle unhash from BPF side we can simply wait for the
> close to happen. The original concern was a socket could transition from
> ESTABLISHED state to a new state while the BPF hook was still attached.
> But, we convinced ourself this is no longer possible and we also
> improved BPF sockmap to handle listen sockets so this is no longer a
> problem.
>
> More importantly though there are cases where unhash is called when data is
> in the receive queue. The BPF unhash logic will flush this data which is
> wrong. To be correct it should keep the data in the receive queue and allow
> a receiving application to continue reading the data. This may happen when
> tcp_abort is received for example. Instead of complicating the logic in
> unhash simply moving all this to tcp_close hook solves this.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Doesn't this open the possibility of having a TCP_CLOSE socket in
sockmap if I disconnect it, that is, call connect(AF_UNSPEC) instead of
closing it?
John Fastabend Oct. 20, 2021, 5:28 a.m. UTC | #2
Jakub Sitnicki wrote:
> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> > We do not need to handle unhash from BPF side we can simply wait for the
> > close to happen. The original concern was a socket could transition from
> > ESTABLISHED state to a new state while the BPF hook was still attached.
> > But, we convinced ourself this is no longer possible and we also
> > improved BPF sockmap to handle listen sockets so this is no longer a
> > problem.
> >
> > More importantly though there are cases where unhash is called when data is
> > in the receive queue. The BPF unhash logic will flush this data which is
> > wrong. To be correct it should keep the data in the receive queue and allow
> > a receiving application to continue reading the data. This may happen when
> > tcp_abort is received for example. Instead of complicating the logic in
> > unhash simply moving all this to tcp_close hook solves this.
> >
> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> 
> Doesn't this open the possibility of having a TCP_CLOSE socket in
> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> close it?

Correct, it means we may have a TCP_CLOSE socket in the map. I'm not
seeing any problem with this, though. A send on the socket would
fail the sk_state checks in the send hooks (tcp.c:1245). Receiving
from the TCP stack would fail the normal TCP stack checks.
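
For reference, the check being referred to is roughly the following
(paraphrased from tcp_sendmsg_locked() in net/ipv4/tcp.c, not quoted
verbatim):

/* Sending on a socket that is not ESTABLISHED or CLOSE_WAIT has to
 * wait for a connection here and fails if none materializes.
 */
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
    !tcp_passive_fastopen(sk)) {
	err = sk_stream_wait_connect(sk, &timeo);
	if (err != 0)
		goto do_error;
}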

Maybe we also want a check on redirect into ingress that the sock is in
ESTABLISHED state? I might push that as its own patch even though it is
related, but I think we should have that check there regardless of this
patch.

Did you happen to see any issues on the sock_map side for the close case?
It looks good to me.

.John
Jakub Sitnicki Oct. 20, 2021, 3:11 p.m. UTC | #3
On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
>> > We do not need to handle unhash from BPF side we can simply wait for the
>> > close to happen. The original concern was a socket could transition from
>> > ESTABLISHED state to a new state while the BPF hook was still attached.
>> > But, we convinced ourself this is no longer possible and we also
>> > improved BPF sockmap to handle listen sockets so this is no longer a
>> > problem.
>> >
>> > More importantly though there are cases where unhash is called when data is
>> > in the receive queue. The BPF unhash logic will flush this data which is
>> > wrong. To be correct it should keep the data in the receive queue and allow
>> > a receiving application to continue reading the data. This may happen when
>> > tcp_abort is received for example. Instead of complicating the logic in
>> > unhash simply moving all this to tcp_close hook solves this.
>> >
>> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>>
>> Doesn't this open the possibility of having a TCP_CLOSE socket in
>> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
>> close it?
>
> Correct it means we may have TCP_CLOSE socket in the map. I'm not
> seeing any problem with this though. A send on the socket would
> fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
> from the TCP stack would fail with normal TCP stack checks.
>
> Maybe we want a check on redirect into ingress if the sock is in
> ESTABLISHED state as well? I might push that in its own patch
> though it seems related, but I think we should have that there
> regardless of this patch.
>
> Did you happen to see any issues on the sock_map side for close case?
> It looks good to me.

OK, I didn't understand if that was an intended change or not.

If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
a few things come to mind:

1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
   won't allow it. However, with this change we will be able to have a
   TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
   inserting TCP sockets in TCP_CLOSE state should be allowed for
   consistency.

2) Checks in the bpf_sk_lookup_assign() helper need adjusting. Only TCP
   sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
   TCP_CLOSE state). Today we rely there on the fact that you can't
   insert a TCP_CLOSE socket.

3) Checks in the sk_select_reuseport() helper need adjusting as well. It's a
   similar case to bpf_sk_lookup_assign() (with the slight difference
   that reuseport allows dispatching to connected UDP sockets).

4) I don't know exactly how the checks in the sockmap redirect helpers would
   need to be tweaked. I recall that it can't be just the TCP_ESTABLISHED
   state that's allowed, due to a short window of opportunity that opens up
   when we transition from TCP_SYN_SENT to TCP_ESTABLISHED: the
   BPF_SOCK_OPS_STATE_CB callback happens just before the state is
   switched to TCP_ESTABLISHED.

   A TCP_CLOSE socket sure doesn't make sense as a redirect target. It would
   be nice to get an error from the redirect helper. If I understand
   correctly, when the TCP stack drops the packet after the BPF verdict has
   selected a socket, only the socket owner will know about it, by reading
   the error queue.

   OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
   either, but we don't seem to filter it out today, so the helper is
   not airtight.

All in all, this sounds like an API change when it comes to corner cases, in
addition to being a fix for the receive queue flush issue you explained in
the patch description. If possible, I would push it through bpf-next.
John Fastabend Oct. 20, 2021, 3:51 p.m. UTC | #4
Jakub Sitnicki wrote:
> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> >> > We do not need to handle unhash from BPF side we can simply wait for the
> >> > close to happen. The original concern was a socket could transition from
> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
> >> > But, we convinced ourself this is no longer possible and we also
> >> > improved BPF sockmap to handle listen sockets so this is no longer a
> >> > problem.
> >> >
> >> > More importantly though there are cases where unhash is called when data is
> >> > in the receive queue. The BPF unhash logic will flush this data which is
> >> > wrong. To be correct it should keep the data in the receive queue and allow
> >> > a receiving application to continue reading the data. This may happen when
> >> > tcp_abort is received for example. Instead of complicating the logic in
> >> > unhash simply moving all this to tcp_close hook solves this.
> >> >
> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >> > ---
> >>
> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> >> close it?
> >
> > Correct it means we may have TCP_CLOSE socket in the map. I'm not
> > seeing any problem with this though. A send on the socket would
> > fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
> > from the TCP stack would fail with normal TCP stack checks.
> >
> > Maybe we want a check on redirect into ingress if the sock is in
> > ESTABLISHED state as well? I might push that in its own patch
> > though it seems related, but I think we should have that there
> > regardless of this patch.
> >
> > Did you happen to see any issues on the sock_map side for close case?
> > It looks good to me.
> 
> OK, I didn't understand if that was an intended change or not.
> 

wrt bpf-next:
The problem is that this needs to be backported in a way that fixes the
case for stable kernels as well. We have applications that are throwing
errors when they hit this at the moment.

> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
> a few things come to mind:

I think what makes the most sense is to do the minimal work to fix the
described issue for the bpf tree without introducing new issues, and then
do the consistency and remaining improvements in bpf-next.

> 
> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
>    won't allow it. However, with this change we will be able to have a
>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
>    inserting TCP sockets in TCP_CLOSE state should be allowed for
>    consistency.

I agree, but I would hold this one off for bpf-next. I did miss points
2 and 3 in this series, though.
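
For reference, the insert-time gate being discussed in point 1 is roughly
the following (paraphrased from sock_map_sk_state_allowed() in
net/core/sock_map.c; the non-TCP handling is approximated):

static bool sock_map_sk_state_allowed(const struct sock *sk)
{
	if (sk_is_tcp(sk))
		return (1 << sk->sk_state) &
		       (TCPF_ESTABLISHED | TCPF_LISTEN);
	/* UDP and other protocols are checked separately */
	return true;
}

So a TCP socket already in TCP_CLOSE is rejected at insert time, even though
after this patch a socket can reach TCP_CLOSE while it sits in the map.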

> 
> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
>    TCP_CLOSE state). Today we rely on the fact there that you can't
>    insert a TCP_CLOSE socket.

This should be a minimal change: just change the logic to allow only
TCP_LISTEN.

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
                return -EINVAL;
        if (unlikely(sk && sk_is_refcounted(sk)))
                return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
-       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
+       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
                return -ESOCKTNOSUPPORT; /* reject connected sockets */
 
        /* Check if socket is suitable for packet L3/L4 protocol */


> 
> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
>    similar same case as with bpf_sk_lookup_assign() (with a slight
>    difference that reuseport allows dispatching to connected UDP
>    sockets).

Is a check needed here? There is no obvious check now. Is ESTABLISHED
state OK here at the moment?

> 
> 4) Don't know exactly how checks in sockmap redirect helpers would need
>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
>    that's allowed due to a short window of opportunity that opens up
>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
>    switched to TCP_ESTABLISHED.
> 
>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
>    be nice to get an error from the redirect helper. If I understand
>    correctly, if the TCP stack drops the packet after BPF verdict has
>    selected a socket, only the socket owner will know about by reading
>    the error queue.
> 
>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
>    either, but we don't seem to filter it out today, so the helper is
>    not airtight.

Right. At the moment, for sending we call do_tcp_sendpages() and this
has the normal ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) check, so we
would return an error. The missing case is ingress. We currently
let these redirects happen and would need a check there. I was thinking
of doing it in a separate patch, but I could tack it on to this
series for completeness.
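
A minimal sketch of what such an ingress-side check could look like
(hypothetical only; the exact hook point and final form are not settled in
this thread):

/* Before queueing a redirected skb onto the target socket's ingress
 * queue, reject sockets that can no longer receive data.
 */
if (!((1 << sk_other->sk_state) &
      (TCPF_ESTABLISHED | TCPF_LISTEN)))
	return SK_DROP;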

> 
> All in all, sounds like an API change when it comes to corner cases, in
> addition to being a fix for the receive queue flush issue which you
> explained in the patch description. If possible, would push it through
> bpf-next.

I think if we address 2, 3, and 4 then we can fix the described issue
without introducing new cases. And then 1 is great for consistency
but can go via bpf-next?

WDYT.

Thanks,
John
Jakub Sitnicki Oct. 20, 2021, 4:35 p.m. UTC | #5
On Wed, Oct 20, 2021 at 05:51 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
>> >> > We do not need to handle unhash from BPF side we can simply wait for the
>> >> > close to happen. The original concern was a socket could transition from
>> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
>> >> > But, we convinced ourself this is no longer possible and we also
>> >> > improved BPF sockmap to handle listen sockets so this is no longer a
>> >> > problem.
>> >> >
>> >> > More importantly though there are cases where unhash is called when data is
>> >> > in the receive queue. The BPF unhash logic will flush this data which is
>> >> > wrong. To be correct it should keep the data in the receive queue and allow
>> >> > a receiving application to continue reading the data. This may happen when
>> >> > tcp_abort is received for example. Instead of complicating the logic in
>> >> > unhash simply moving all this to tcp_close hook solves this.
>> >> >
>> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> >> > ---
>> >>
>> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
>> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
>> >> close it?
>> >
>> > Correct it means we may have TCP_CLOSE socket in the map. I'm not
>> > seeing any problem with this though. A send on the socket would
>> > fail the sk_state checks in the send hooks. (tcp.c:1245). Receiving
>> > from the TCP stack would fail with normal TCP stack checks.
>> >
>> > Maybe we want a check on redirect into ingress if the sock is in
>> > ESTABLISHED state as well? I might push that in its own patch
>> > though it seems related, but I think we should have that there
>> > regardless of this patch.
>> >
>> > Did you happen to see any issues on the sock_map side for close case?
>> > It looks good to me.
>>
>> OK, I didn't understand if that was an intended change or not.
>>
>
> wrt bpf-next:
> The problem is this needs to be backported in some way that fixes the
> case for stable kernels as well. We have applications that are throwing
> errors when they hit this at the moment.

Understood.

>> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
>> a few things come to mind:
>
> I think what makes most sense is to do the minimal work to fix the
> described issue for bpf tree without introducing new issues and
> then do the consistency/better cases in bpf-next.
>
>>
>> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
>>    won't allow it. However, with this change we will be able to have a
>>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
>>    inserting TCP sockets in TCP_CLOSE state should be allowed for
>>    consistency.
>
> I agree, but would hold off on this for bpf-next. I missed points
> 2,3 though in this series.

OK, that makes sense.

>>
>> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
>>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
>>    TCP_CLOSE state). Today we rely on the fact there that you can't
>>    insert a TCP_CLOSE socket.
>
> This should be minimal change, just change the logic to allow only
> TCP_LISTEN.
>
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>                 return -EINVAL;
>         if (unlikely(sk && sk_is_refcounted(sk)))
>                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> +       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
>                 return -ESOCKTNOSUPPORT; /* reject connected sockets */
>
>         /* Check if socket is suitable for packet L3/L4 protocol */
>
>

Yeah, it shouldn't be hard. But we need to cover UDP as well. Something
along the lines of:

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10402,8 +10402,10 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
                return -EINVAL;
        if (unlikely(sk && sk_is_refcounted(sk)))
                return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
-       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
-               return -ESOCKTNOSUPPORT; /* reject connected sockets */
+       if (unlikely(sk && sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN))
+               return -ESOCKTNOSUPPORT; /* reject closed TCP sockets */
+       if (unlikely(sk && sk_is_udp(sk) && sk->sk_state != TCP_CLOSE))
+               return -ESOCKTNOSUPPORT; /* reject connected UDP sockets */

        /* Check if socket is suitable for packet L3/L4 protocol */
        if (sk && sk->sk_protocol != ctx->protocol)

We aren't testing that error case in the sk_lookup test suite today,
because it wasn't possible to insert a TCP_CLOSE socket. So once that
gets in, I can add coverage.
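
A rough sketch of what that coverage could look like on the BPF side
(hypothetical program, map, and variable names, loosely modeled on the
existing sk_lookup selftests):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} redir_map SEC(".maps");

int assign_err;	/* read back by the userspace half of the test */

SEC("sk_lookup")
int assign_disconnected(struct bpf_sk_lookup *ctx)
{
	struct bpf_sock *sk;
	__u32 key = 0;

	/* The test would first disconnect the stored listener with
	 * connect(AF_UNSPEC), leaving it in TCP_CLOSE but still in the map.
	 */
	sk = bpf_map_lookup_elem(&redir_map, &key);
	if (!sk)
		return SK_DROP;

	/* With the adjusted check this is expected to fail with
	 * -ESOCKTNOSUPPORT instead of assigning the socket.
	 */
	assign_err = bpf_sk_assign(ctx, sk, 0);
	bpf_sk_release(sk);
	return SK_DROP;
}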

>>
>> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
>>    similar same case as with bpf_sk_lookup_assign() (with a slight
>>    difference that reuseport allows dispatching to connected UDP
>>    sockets).
>
> Is it needed here? There is no obvious check now.  Is ESTABLISHED
> state OK here now?

TCP ESTABLISHED sockets are not okay. They can't join the reuseport
group and will always hit the !reuse branch.

Re-reading the code, though, I think nothing needs to be done for the
sk_select_reuseport() helper. TCP sockets will be detached from the
reuseport group on unhash. Hence a TCP_CLOSE socket will also hit the
!reuse branch.

CC'ing Martin just in case he wants to double-check.
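
For reference, the branch in question looks roughly like this (paraphrased
from sk_select_reuseport() in net/core/filter.c):

reuse = rcu_dereference(selected_sk->sk_reuseport_cb);
if (!reuse)
	/* An unhashed TCP socket (including one disconnected into
	 * TCP_CLOSE) has already been detached from its reuseport group,
	 * so it can never be selected here; the helper returns an error
	 * instead.
	 */
	return is_sockarray ? -ENOENT : -EINVAL;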

>
>>
>> 4) Don't know exactly how checks in sockmap redirect helpers would need
>>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
>>    that's allowed due to a short window of opportunity that opens up
>>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
>>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
>>    switched to TCP_ESTABLISHED.
>>
>>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
>>    be nice to get an error from the redirect helper. If I understand
>>    correctly, if the TCP stack drops the packet after BPF verdict has
>>    selected a socket, only the socket owner will know about by reading
>>    the error queue.
>>
>>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
>>    either, but we don't seem to filter it out today, so the helper is
>>    not airtight.
>
> Right. At the moment for sending we call do_tcp_sendpages() and this
> has the normal check ~(TCPF_ESABLISHED | TCPF_CLOSE_WAIT) so we
> would return an error. The missing case is ingress. We currently
> let these happen and would need a check there. I was thinking
> of doing it in a separate patch, but could tack it on to this
> series for completeness.
>

Oh, yeah, right. I see now what you mean. No problem on egress.

So it's just an SK_DROP return code from bpf_sk_redirect_map() that
could be a potential improvement.

Your call if you want to add it to this series. Patching it up as a follow-up
works for me as well.

>>
>> All in all, sounds like an API change when it comes to corner cases, in
>> addition to being a fix for the receive queue flush issue which you
>> explained in the patch description. If possible, would push it through
>> bpf-next.
>
> I think if we address 2,3,4 then we can fix the described issue
> without introducing new cases. And then 1 is great for consistency
> but can go via bpf-next?

So (3) is out, reuseport+sockmap users should be unaffected by this.

If you could patch (2), that would be great. We rely on this, and I can't
assume that nobody is disconnecting their listener sockets for some
reason.

(4) and (1) can follow later, if you ask me.
John Fastabend Oct. 21, 2021, 7:24 p.m. UTC | #6
Jakub Sitnicki wrote:
> On Wed, Oct 20, 2021 at 05:51 PM CEST, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Wed, Oct 20, 2021 at 07:28 AM CEST, John Fastabend wrote:
> >> > Jakub Sitnicki wrote:
> >> >> On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> >> >> > We do not need to handle unhash from BPF side we can simply wait for the
> >> >> > close to happen. The original concern was a socket could transition from
> >> >> > ESTABLISHED state to a new state while the BPF hook was still attached.
> >> >> > But, we convinced ourself this is no longer possible and we also
> >> >> > improved BPF sockmap to handle listen sockets so this is no longer a
> >> >> > problem.
> >> >> >
> >> >> > More importantly though there are cases where unhash is called when data is
> >> >> > in the receive queue. The BPF unhash logic will flush this data which is
> >> >> > wrong. To be correct it should keep the data in the receive queue and allow
> >> >> > a receiving application to continue reading the data. This may happen when
> >> >> > tcp_abort is received for example. Instead of complicating the logic in
> >> >> > unhash simply moving all this to tcp_close hook solves this.
> >> >> >
> >> >> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> >> >> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >> >> > ---
> >> >>
> >> >> Doesn't this open the possibility of having a TCP_CLOSE socket in
> >> >> sockmap if I disconnect it, that is call connect(AF_UNSPEC), instead of
> >> >> close it?
> >> >

[...]

> >> If we're considering allowing TCP sockets in TCP_CLOSE state in sockmap,
> >> a few things come to mind:
> >
> > I think what makes most sense is to do the minimal work to fix the
> > described issue for bpf tree without introducing new issues and
> > then do the consistency/better cases in bpf-next.
> >
> >>
> >> 1) We can't insert TCP_CLOSE sockets today. sock_map_sk_state_allowed()
> >>    won't allow it. However, with this change we will be able to have a
> >>    TCP_CLOSE socket in sockmap by disconnecting it. If so, perhaps
> >>    inserting TCP sockets in TCP_CLOSE state should be allowed for
> >>    consistency.
> >
> > I agree, but would hold off on this for bpf-next. I missed points
> > 2,3 though in this series.
> 
> OK, that makes sense.
> 
> >>
> >> 2) Checks in bpf_sk_lookup_assign() helper need adjusting. Only TCP
> >>    sockets in TCP_LISTEN state make a valid choice (and UDP sockets in
> >>    TCP_CLOSE state). Today we rely on the fact there that you can't
> >>    insert a TCP_CLOSE socket.
> >
> > This should be minimal change, just change the logic to allow only
> > TCP_LISTEN.
> >
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -10402,7 +10402,7 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> >                 return -EINVAL;
> >         if (unlikely(sk && sk_is_refcounted(sk)))
> >                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> > -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> > +       if (unlikely(sk && sk->sk_state != TCP_LISTEN))
> >                 return -ESOCKTNOSUPPORT; /* reject connected sockets */
> >
> >         /* Check if socket is suitable for packet L3/L4 protocol */
> >
> >
> 
> Yeah, it shouldn't be hard. But we need to cover UDP as well. Something
> along the lines of:
> 
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -10402,8 +10402,10 @@ BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>                 return -EINVAL;
>         if (unlikely(sk && sk_is_refcounted(sk)))
>                 return -ESOCKTNOSUPPORT; /* reject non-RCU freed sockets */
> -       if (unlikely(sk && sk->sk_state == TCP_ESTABLISHED))
> -               return -ESOCKTNOSUPPORT; /* reject connected sockets */
> +       if (unlikely(sk && sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN))
> +               return -ESOCKTNOSUPPORT; /* reject closed TCP sockets */
> +       if (unlikely(sk && sk_is_udp(sk) && sk->sk_state != TCP_CLOSE))
> +               return -ESOCKTNOSUPPORT; /* reject connected UDP sockets */
> 
>         /* Check if socket is suitable for packet L3/L4 protocol */
>         if (sk && sk->sk_protocol != ctx->protocol)
> 
> We aren't testing today that that error case in sk_lookup test suite,
> because it wasn't possible to insert a TCP_CLOSE socket. So once that
> gets in, I can add coverage.
> 
> >>
> >> 3) Checks in sk_select_reuseport() helper need adjusting as well. It's a
> >>    similar same case as with bpf_sk_lookup_assign() (with a slight
> >>    difference that reuseport allows dispatching to connected UDP
> >>    sockets).
> >
> > Is it needed here? There is no obvious check now.  Is ESTABLISHED
> > state OK here now?
> 
> TCP ESTABLISHED sockets are not okay. They can't join the reuseport
> group and will always hit the !reuse branch.
> 
> Re-reading the code, though, I think nothing needs to be done for the
> sk_select_reuseport() helper. TCP sockets will be detached from
> reuseport group on unhash. Hence TCP_CLOSE socket will also hit the
> !reuse branch.
> 
> CC'ing Martin just in case he wants to double-check.
> 
> >
> >>
> >> 4) Don't know exactly how checks in sockmap redirect helpers would need
> >>    to be tweaked. I recall that it can't be just TCP_ESTABLISHED state
> >>    that's allowed due to a short window of opportunity that opens up
> >>    when we transition from TCP_SYN_SENT to TCP_ESTABLISHED.
> >>    BPF_SOCK_OPS_STATE_CB callback happens just before the state is
> >>    switched to TCP_ESTABLISHED.
> >>
> >>    TCP_CLOSE socket sure doesn't make sense as a redirect target. Would
> >>    be nice to get an error from the redirect helper. If I understand
> >>    correctly, if the TCP stack drops the packet after BPF verdict has
> >>    selected a socket, only the socket owner will know about by reading
> >>    the error queue.
> >>
> >>    OTOH, redirecting to a TCP_CLOSE_WAIT socket doesn't make sense
> >>    either, but we don't seem to filter it out today, so the helper is
> >>    not airtight.
> >
> > Right. At the moment for sending we call do_tcp_sendpages() and this
> > has the normal check ~(TCPF_ESABLISHED | TCPF_CLOSE_WAIT) so we
> > would return an error. The missing case is ingress. We currently
> > let these happen and would need a check there. I was thinking
> > of doing it in a separate patch, but could tack it on to this
> > series for completeness.
> >
> 
> Oh, yeah, right. I see now what you mean. No problem on egress.
> 
> So it's just an SK_DROP return code from bpf_sk_redirect_map() that
> could be a potential improvement.
> 
> Your call if you want to add it this series. Patching it up as a follow
> up works for me as well.
> 
> >>
> >> All in all, sounds like an API change when it comes to corner cases, in
> >> addition to being a fix for the receive queue flush issue which you
> >> explained in the patch description. If possible, would push it through
> >> bpf-next.
> >
> > I think if we address 2,3,4 then we can fix the described issue
> > without introducing new cases. And then 1 is great for consistency
> > but can go via bpf-next?
> 
> So (3) is out, reuseport+sockmap users should be unaffected by this.
> 
> If you could patch (2) that would be great. We rely on this, and I can't
> assume that nobody isn't disconnecting their listener sockets for some
> reason.

Yep, I'll roll a new version with a fix for this and leave the rest for
a follow-up.

> 
> (4) and (1) can follow later, if you ask me.

Agree.

Patch

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index d3e9386b493e..35dcfb04f53d 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -476,7 +476,6 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 				   struct proto *base)
 {
 	prot[TCP_BPF_BASE]			= *base;
-	prot[TCP_BPF_BASE].unhash		= sock_map_unhash;
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].stream_memory_read	= tcp_bpf_stream_read;