mbox series

[bpf-next,v7,00/13] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP

Message ID 20210328202013.29223-1-xiyou.wangcong@gmail.com (mailing list archive)
Headers show
Series sockmap: introduce BPF_SK_SKB_VERDICT and support UDP | expand

Message

Cong Wang March 28, 2021, 8:20 p.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets, after they are moved into VM, we have to
add a proxy to forward these communications from VM to host, because
rewriting thousands of them is not practical. This proxy uses an
AF_UNIX socket connected to services and a UDP socket to connect to
the host. It is inefficient because data is copied between kernel
space and user space twice, and we can not use splice() which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user-space at all (after the initial setup).

Currently sockmap only fully supports TCP, UDP is partially supported
as it is only allowed to add into sockmap. This patchset, as the second
part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

On the high level, ->read_sock() is required for each protocol to support
sockmap redirection, and in order to do sock proto update, a new ops
->psock_update_sk_prot() is introduced, which is also required. And the
BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
retrieve skmsg. To make life easier, we have to get rid of lock_sock()
in sk_psock_handle_skb(), otherwise we would have to implement
->sendmsg_locked() on top of ->sendmsg(), which is ugly.

Please see each patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
this patchset is also available:
https://github.com/congwang/linux/tree/sockmap2

---
v7: use work_mutex to protect psock->work
    return err in udp_read_sock()
    add patch 6/13
    clean up test case

v6: get rid of sk_psock_zap_ingress()
    add rcu work patch

v5: use INDIRECT_CALL_2() for function pointers
    use ingress_lock to fix a race condition found by Jacub
    rename two helper functions

v4: get rid of lock_sock() in sk_psock_handle_skb()
    get rid of udp_sendmsg_locked()
    remove an empty line
    update cover letter

v3: export tcp/udp_update_proto()
    rename sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (13):
  skmsg: lock ingress_skb when purging
  skmsg: introduce a spinlock to protect ingress_msg
  net: introduce skb_send_sock() for sock_map
  skmsg: avoid lock_sock() in sk_psock_backlog()
  skmsg: use rcu work for destroying psock
  skmsg: use GFP_KERNEL in sk_psock_create_ingress_msg()
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk->sk_prot->psock_update_sk_prot()
  udp: implement ->read_sock() for sockmap
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skbuff.h                        |   1 +
 include/linux/skmsg.h                         |  77 ++++++--
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |   3 +-
 include/net/udp.h                             |   3 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/skbuff.c                             |  55 +++++-
 net/core/skmsg.c                              | 177 ++++++++++++++----
 net/core/sock_map.c                           |  53 +++---
 net/ipv4/af_inet.c                            |   1 +
 net/ipv4/tcp_bpf.c                            | 130 +++----------
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  38 ++++
 net/ipv4/udp_bpf.c                            |  79 +++++++-
 net/ipv6/af_inet6.c                           |   1 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |   3 +
 net/tls/tls_sw.c                              |   4 +-
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 136 ++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 24 files changed, 601 insertions(+), 196 deletions(-)

Comments

Alexei Starovoitov March 28, 2021, 11:27 p.m. UTC | #1
On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> We have thousands of services connected to a daemon on every host
> via AF_UNIX dgram sockets, after they are moved into VM, we have to
> add a proxy to forward these communications from VM to host, because
> rewriting thousands of them is not practical. This proxy uses an
> AF_UNIX socket connected to services and a UDP socket to connect to
> the host. It is inefficient because data is copied between kernel
> space and user space twice, and we can not use splice() which only
> supports TCP. Therefore, we want to use sockmap to do the splicing
> without going to user-space at all (after the initial setup).
> 
> Currently sockmap only fully supports TCP, UDP is partially supported
> as it is only allowed to add into sockmap. This patchset, as the second
> part of the original large patchset, extends sockmap with:
> 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> 
> On the high level, ->read_sock() is required for each protocol to support
> sockmap redirection, and in order to do sock proto update, a new ops
> ->psock_update_sk_prot() is introduced, which is also required. And the
> BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> in sk_psock_handle_skb(), otherwise we would have to implement
> ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> 
> Please see each patch for more details.
> 
> To see the big picture, the original patchset is available here:
> https://github.com/congwang/linux/tree/sockmap
> this patchset is also available:
> https://github.com/congwang/linux/tree/sockmap2
> 
> ---
> v7: use work_mutex to protect psock->work
>     return err in udp_read_sock()
>     add patch 6/13
>     clean up test case

The feature looks great to me.
I think the selftest is a bit light in terms of coverage, but it's acceptable.
I'd like to see the final Acks from John/Daniel and Jakub/Lorenz before merging.
Folks,
please prioritize the review of these patches.
John Fastabend March 29, 2021, 3:03 p.m. UTC | #2
Alexei Starovoitov wrote:
> On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> > 
> > We have thousands of services connected to a daemon on every host
> > via AF_UNIX dgram sockets, after they are moved into VM, we have to
> > add a proxy to forward these communications from VM to host, because
> > rewriting thousands of them is not practical. This proxy uses an
> > AF_UNIX socket connected to services and a UDP socket to connect to
> > the host. It is inefficient because data is copied between kernel
> > space and user space twice, and we can not use splice() which only
> > supports TCP. Therefore, we want to use sockmap to do the splicing
> > without going to user-space at all (after the initial setup).
> > 
> > Currently sockmap only fully supports TCP, UDP is partially supported
> > as it is only allowed to add into sockmap. This patchset, as the second
> > part of the original large patchset, extends sockmap with:
> > 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> > 
> > On the high level, ->read_sock() is required for each protocol to support
> > sockmap redirection, and in order to do sock proto update, a new ops
> > ->psock_update_sk_prot() is introduced, which is also required. And the
> > BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> > retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> > in sk_psock_handle_skb(), otherwise we would have to implement
> > ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> > 
> > Please see each patch for more details.
> > 
> > To see the big picture, the original patchset is available here:
> > https://github.com/congwang/linux/tree/sockmap
> > this patchset is also available:
> > https://github.com/congwang/linux/tree/sockmap2
> > 
> > ---
> > v7: use work_mutex to protect psock->work
> >     return err in udp_read_sock()
> >     add patch 6/13
> >     clean up test case
> 
> The feature looks great to me.
> I think the selftest is a bit light in terms of coverage, but it's acceptable.

+1

> I'd like to see the final Acks from John/Daniel and Jakub/Lorenz before merging.
> Folks,
> please prioritize the review of these patches.

This is getting really close I'll take another pass over it today. Thanks
Cong Wang March 29, 2021, 4:57 p.m. UTC | #3
On Mon, Mar 29, 2021 at 8:03 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Alexei Starovoitov wrote:
> > On Sun, Mar 28, 2021 at 01:20:00PM -0700, Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > We have thousands of services connected to a daemon on every host
> > > via AF_UNIX dgram sockets, after they are moved into VM, we have to
> > > add a proxy to forward these communications from VM to host, because
> > > rewriting thousands of them is not practical. This proxy uses an
> > > AF_UNIX socket connected to services and a UDP socket to connect to
> > > the host. It is inefficient because data is copied between kernel
> > > space and user space twice, and we can not use splice() which only
> > > supports TCP. Therefore, we want to use sockmap to do the splicing
> > > without going to user-space at all (after the initial setup).
> > >
> > > Currently sockmap only fully supports TCP, UDP is partially supported
> > > as it is only allowed to add into sockmap. This patchset, as the second
> > > part of the original large patchset, extends sockmap with:
> > > 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.
> > >
> > > On the high level, ->read_sock() is required for each protocol to support
> > > sockmap redirection, and in order to do sock proto update, a new ops
> > > ->psock_update_sk_prot() is introduced, which is also required. And the
> > > BPF ->recvmsg() is also needed to replace the original ->recvmsg() to
> > > retrieve skmsg. To make life easier, we have to get rid of lock_sock()
> > > in sk_psock_handle_skb(), otherwise we would have to implement
> > > ->sendmsg_locked() on top of ->sendmsg(), which is ugly.
> > >
> > > Please see each patch for more details.
> > >
> > > To see the big picture, the original patchset is available here:
> > > https://github.com/congwang/linux/tree/sockmap
> > > this patchset is also available:
> > > https://github.com/congwang/linux/tree/sockmap2
> > >
> > > ---
> > > v7: use work_mutex to protect psock->work
> > >     return err in udp_read_sock()
> > >     add patch 6/13
> > >     clean up test case
> >
> > The feature looks great to me.
> > I think the selftest is a bit light in terms of coverage, but it's acceptable.
>
> +1

Well, the first half of this patchset still focuses on the existing code, which
is already covered by existing test cases. The second half adds
BPF_SK_SKB_VERDICT and UDP support, which are already covered by
my new test case. And apparently UDP will never support other sockmap
programs like TCP, for example, BPF_SK_SKB_STREAM_PARSER, hence
it of course has much less test cases than TCP. Cross-protocol test case
will be added in the next patchset when AF_UNIX comes in.

If I miss anything, please be specific. Just saying the test case is light does
not help me to understand what I need to add.

Thanks.