[bpf-next,00/19] sock_map: add non-TCP and cross-protocol support

Message ID 20210203041636.38555-1-xiyou.wangcong@gmail.com (mailing list archive)

Message

Cong Wang Feb. 3, 2021, 4:16 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

Currently sockmap fully supports only TCP. UDP is partially supported:
UDP sockets can only be added to a sockmap. This patchset extends sockmap
with: 1) full UDP support; 2) full AF_UNIX dgram support; 3) cross-
protocol support. Our goal is to allow socket splicing between AF_UNIX
dgram and UDP.

At a high level, each protocol must implement ->sendmsg_locked() and
->read_sock() to support sockmap redirection. To update the sock proto,
a new op, ->update_proto(), is introduced, and each protocol must
implement it as well. This is slightly harder for AF_UNIX, as it lacks
a full struct proto implementation and redirection support.
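
The per-protocol requirement above can be pictured with a small user-space
model. This is only a sketch, not kernel code: every type and function
name below (model_proto, model_sock, model_sockmap_add, and so on) is
hypothetical, and the signature given to the update_proto stand-in is an
assumption made for illustration. It shows a map insertion that is refused
unless all three ops exist, and an update_proto that swaps the ops and can
restore them.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; none of these are the kernel's types. */
struct model_sock;

struct model_proto {
	const char *name;
	/* Stand-ins for ->sendmsg_locked() and ->read_sock(). */
	int (*sendmsg_locked)(struct model_sock *sk, const void *buf, size_t len);
	int (*read_sock)(struct model_sock *sk, void *buf, size_t len);
	/* Stand-in for ->update_proto(); the signature is an assumption. */
	int (*update_proto)(struct model_sock *sk, int restore);
};

struct model_sock {
	const struct model_proto *prot;       /* ops currently in effect */
	const struct model_proto *saved_prot; /* original ops, kept for restore */
	char queue[64];                       /* holds a single queued message */
	size_t queued;
};

static int model_udp_update_proto(struct model_sock *sk, int restore);

/* Store one message in the socket's single-slot queue. */
static int model_send(struct model_sock *sk, const void *buf, size_t len)
{
	if (len > sizeof(sk->queue))
		return -1;
	memcpy(sk->queue, buf, len);
	sk->queued = len;
	return (int)len;
}

/* Copy the queued message out and mark the queue empty. */
static int model_read(struct model_sock *sk, void *buf, size_t len)
{
	size_t n = sk->queued < len ? sk->queued : len;

	memcpy(buf, sk->queue, n);
	sk->queued = 0;
	return (int)n;
}

static const struct model_proto model_udp_prot = {
	"udp", model_send, model_read, model_udp_update_proto
};

static const struct model_proto model_udp_bpf_prot = {
	"udp_bpf", model_send, model_read, model_udp_update_proto
};

/* Swap in the sockmap-aware ops, or put the saved ones back. */
static int model_udp_update_proto(struct model_sock *sk, int restore)
{
	if (restore) {
		if (!sk->saved_prot)
			return -1;
		sk->prot = sk->saved_prot;
		sk->saved_prot = NULL;
		return 0;
	}
	sk->saved_prot = sk->prot;
	sk->prot = &model_udp_bpf_prot;
	return 0;
}

/* Refuse the socket unless its protocol provides all three ops. */
static int model_sockmap_add(struct model_sock *sk)
{
	const struct model_proto *p = sk->prot;

	if (!p->sendmsg_locked || !p->read_sock || !p->update_proto)
		return -1;
	return p->update_proto(sk, 0);
}
```

In this model, a protocol that leaves any of the three pointers NULL
cannot be added to the map, which mirrors why each protocol in the series
gets its own implementation patches.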

To support cross-protocol redirection, we have to make the skb
independent of protocols, which is extremely hard given how creatively
UDP uses dev_scratch. Fortunately, we can pass an skmsg instead of an
skb when redirecting to ingress; the only thing we need to add is a new
->recvmsg() to retrieve the skmsg. On the egress side, a new skb is
allocated behind skb_send_sock_locked(), so it comes for free.
Another big barrier is the skb CB, which was hard-coded as
TCP_SKB_CB(); I switch it to skb ext to solve this problem. Please see
patch 3 for more details.

This patchset passes all tests, both the existing ones and the new ones
added within this patchset.

---

Cong Wang (19):
  bpf: rename BPF_STREAM_PARSER to BPF_SOCK_MAP
  skmsg: get rid of struct sk_psock_parser
  skmsg: use skb ext instead of TCP_SKB_CB
  sock_map: rename skb_parser and skb_verdict
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk_prot->update_proto()
  udp: implement ->sendmsg_locked()
  udp: implement ->read_sock() for sockmap
  udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  af_unix: implement ->sendmsg_locked for dgram socket
  af_unix: implement ->read_sock() for sockmap
  af_unix: implement ->update_proto()
  af_unix: set TCP_ESTABLISHED for datagram sockets too
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  af_unix: implement unix_dgram_bpf_recvmsg()
  sock_map: update sock type checks
  selftests/bpf: add test cases for unix and udp sockmap
  selftests/bpf: add test case for redirection between udp and unix

 MAINTAINERS                                   |   1 +
 include/linux/bpf.h                           |   4 +-
 include/linux/bpf_types.h                     |   2 +-
 include/linux/skbuff.h                        |   4 +
 include/linux/skmsg.h                         |  90 +++-
 include/net/af_unix.h                         |  13 +
 include/net/ipv6.h                            |   1 +
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |  33 +-
 include/net/udp.h                             |   9 +-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/Kconfig                                   |  14 +-
 net/core/Makefile                             |   2 +-
 net/core/filter.c                             |   3 +-
 net/core/skbuff.c                             |   7 +
 net/core/skmsg.c                              | 223 +++++---
 net/core/sock_map.c                           | 128 ++---
 net/ipv4/Makefile                             |   2 +-
 net/ipv4/af_inet.c                            |   2 +
 net/ipv4/tcp_bpf.c                            | 130 +----
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  68 ++-
 net/ipv4/udp_bpf.c                            |  78 ++-
 net/ipv6/af_inet6.c                           |   2 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |  30 +-
 net/tls/tls_sw.c                              |   4 +-
 net/unix/Makefile                             |   1 +
 net/unix/af_unix.c                            | 105 +++-
 net/unix/unix_bpf.c                           |  99 ++++
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 475 +++++++++++++++++-
 .../selftests/bpf/progs/test_sockmap_listen.c |  24 +-
 36 files changed, 1233 insertions(+), 335 deletions(-)
 create mode 100644 net/unix/unix_bpf.c

Comments

Alexei Starovoitov Feb. 3, 2021, 5:48 p.m. UTC | #1
On Tue, Feb 02, 2021 at 08:16:17PM -0800, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Currently sockmap only fully supports TCP, UDP is partially supported
> as it is only allowed to add into sockmap. This patch extends sockmap
> with: 1) full UDP support; 2) full AF_UNIX dgram support; 3) cross
> protocol support. Our goal is to allow socket splice between AF_UNIX
> dgram and UDP.

Please expand on the use case. The 'splice between af_unix and udp'
doesn't tell me much. The selftest doesn't help to understand the scope either.

Cong Wang Feb. 3, 2021, 7:22 p.m. UTC | #2
On Wed, Feb 3, 2021 at 9:48 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 02, 2021 at 08:16:17PM -0800, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Currently sockmap only fully supports TCP, UDP is partially supported
> > as it is only allowed to add into sockmap. This patch extends sockmap
> > with: 1) full UDP support; 2) full AF_UNIX dgram support; 3) cross
> > protocol support. Our goal is to allow socket splice between AF_UNIX
> > dgram and UDP.
>
> Please expand on the use case. The 'splice between af_unix and udp'
> doesn't tell me much. The selftest doesn't help to understand the scope either.

Sure. We have thousands of services connected to a daemon on every host
with UNIX dgram sockets. After they are moved into VMs, we have to add a proxy
to forward these communications from VM to host, because rewriting thousands
of them is not practical. This proxy uses a UNIX socket connected to the
services and a UDP socket connected to the host. It is inefficient because
data is copied between kernel space and user space twice, and we cannot use
splice(), which only supports TCP. Therefore, we want to use sockmap to do
the splicing without going to user space at all (after the initial setup).
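
For illustration, the per-datagram work of such a user-space proxy might
look roughly like the sketch below. It is not our actual proxy code:
relay_one_datagram() is a hypothetical helper, in_fd is assumed to be the
UNIX dgram socket facing the services, and out_fd is assumed to be an
already connected UDP socket facing the host.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Relay one datagram from in_fd to out_fd through a user-space buffer.
 * recv() copies kernel -> user and send() copies user -> kernel; these
 * are the two copies a sockmap redirect is meant to avoid.
 * Returns the number of bytes forwarded, or -1 on error.
 */
static ssize_t relay_one_datagram(int in_fd, int out_fd)
{
	char buf[65536];
	ssize_t n;

	n = recv(in_fd, buf, sizeof(buf), 0);
	if (n < 0)
		return -1;
	return send(out_fd, buf, (size_t)n, 0);
}
```

A real proxy would call something like this in a loop for each direction,
so every datagram crosses the kernel/user boundary twice.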

My colleague Jiang (already Cc'ed) is working on the sockmap support for
vsock so that we can move from UDP to vsock for host-VM communications.

If this is useful, I can add it to the cover letter in the next update.

Thanks.

John Fastabend Feb. 3, 2021, 8:29 p.m. UTC | #3
Cong Wang wrote:
> On Wed, Feb 3, 2021 at 9:48 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Feb 02, 2021 at 08:16:17PM -0800, Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > Currently sockmap only fully supports TCP, UDP is partially supported
> > > as it is only allowed to add into sockmap. This patch extends sockmap
> > > with: 1) full UDP support; 2) full AF_UNIX dgram support; 3) cross
> > > protocol support. Our goal is to allow socket splice between AF_UNIX
> > > dgram and UDP.
> >
> > Please expand on the use case. The 'splice between af_unix and udp'
> > doesn't tell me much. The selftest doesn't help to understand the scope either.
> 
> Sure. We have thousands of services connected to a daemon on every host
> with UNIX dgram sockets, after they are moved into VM, we have to add a proxy
> to forward these communications from VM to host, because rewriting thousands
> of them is not practical. This proxy uses a UNIX socket connected to services
> and uses a UDP socket to connect to the host. It is inefficient because data is
> copied between kernel space and user space twice, and we can not use
> splice() which only supports TCP. Therefore, we want to use sockmap to do
> the splicing without even going to user-space at all (after the initial setup).

Thanks for the details. We also have a use case, similar to the one for TCP
sockets, for applying policy/redirect to UDP sockets, so we will want
semantics similar to how TCP skmsg programs work on egress.

> 
> My colleague Jiang (already Cc'ed) is working on the sockmap support for
> vsock so that we can move from UDP to vsock for host-VM communications.

Great. The host-VM channel came up a few times in the initial sockmap work,
but I never got around to starting.

> 
> If this is useful, I can add it in this cover letter in the next update.
> 

Please add it to the cover letter. I'll review the series today or
tomorrow; I have a couple of things on the TODO list for today that
I need to get done first.

> Thanks.

Thanks for doing this work.