Message ID | cover.1623674025.git.lorenzo@kernel.org (mailing list archive)
---|---
Series | mvneta: introduce XDP multi-buffer support
Lorenzo Bianconi wrote:
> This series introduces XDP multi-buffer support. The mvneta driver is
> the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
> please focus on how these new types of xdp_{buff,frame} packets
> traverse the different layers and the layout design. It is on purpose
> that BPF-helpers are kept simple, as we don't want to expose the
> internal layout to allow later changes.
>
> For now, to keep the design simple and to maintain performance, the XDP
> BPF-prog (still) only has access to the first buffer. It is left for
> later (another patchset) to add payload access across multiple buffers.
> This patchset should still allow for these future extensions. The goal
> is to lift the MTU restriction that comes with XDP, but maintain the
> same performance as before.

At this point I don't think we can have a partial implementation. At
the moment we have packet capture applications and protocol parsers
running in production. If we allow this to go in staged, we are going
to break those applications that make the fundamental assumption that they
have access to all the data in the packet.

There will be no way to fix it when it happens. The teams running the
applications won't necessarily be able to change the network MTU. Now
it doesn't work, hard stop; this is better than it sort of working some
of the time. Worse, if we get into a situation where some drivers support
partial access and others support full access, the support matrix gets worse.

I think we need full support and access to all bytes. I believe
I said this earlier, but now we've deployed apps that really do need
access to the payloads, so it's not a theoretical concern anymore, but
rather a real one based on deployed BPF programs.

> The main idea for the new multi-buffer layout is to reuse the same
> layout used for non-linear SKBs. This relies on the "skb_shared_info"
> struct at the end of the first buffer to link together subsequent
> buffers.
> Keeping the layout compatible with SKBs is also done to ease
> and speed up creating an SKB from an xdp_{buff,frame}.
> Converting an xdp_frame to an SKB and delivering it to the network stack is shown
> in patch 07/14 (e.g. cpumaps).
>
> A multi-buffer bit (mb) has been introduced in the flags field of the xdp_{buff,frame}
> structure to notify the bpf/network layer whether this is an xdp multi-buffer frame
> (mb = 1) or not (mb = 0).
> The mb bit will be set by an xdp multi-buffer capable driver only for
> non-linear frames, maintaining the capability to receive linear frames
> without any extra cost, since the skb_shared_info structure at the end
> of the first buffer will be initialized only if mb is set.
> Moreover, the flags field in xdp_{buff,frame} will be reused for
> xdp rx csum offloading in a future series.
>
> Typical use cases for this series are:
> - Jumbo frames
> - Packet header split (please see Google's use-case @ NetDevConf 0x14, [0])
> - TSO
>
> A new bpf helper (bpf_xdp_get_buff_len) has been introduced in order to notify
> the eBPF layer about the total frame size (linear + paged parts).

Is it possible to make currently working programs continue to work?
For a simple packet capture example, a program might capture the
entire packet of bytes '(data_end - data_start)'. With the above implementation
the program will continue to run, but will no longer be capturing
all the bytes... so it's a silent failure. Otherwise I'll need to
backport fixes into my BPF programs and releases to ensure they
don't walk onto a new kernel with multi-buffer support enabled.
It's not ideal.

> bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take into
> account xdp multi-buff frames.
>
> More info about the main idea behind this approach can be found here [1][2].

Will read [1],[2].

Where did the perf data for the 40gbps NIC go? I think we want that
done again on this series with at least 40gbps NICs and better
yet 100gbps drivers.
If it's addressed in a patch commit message, I'm reading the series now.

> Changes since v8:
> - add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff
> - switch back to skb_shared_info implementation from previous xdp_shared_info one
> - avoid using a bitfield in xdp_buff/xdp_frame since it introduces performance
>   regressions. Tested now on 10G NIC (ixgbe) to verify there are no performance
>   penalties for regular codebase
> - add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx
> - add data_len field in skb_shared_info struct
>
> Changes since v7:
> - rebase on top of bpf-next
> - fix sparse warnings
> - improve comments for frame_length in include/net/xdp.h
>
> Changes since v6:
> - the main difference with respect to previous versions is the new approach proposed
>   by Eelco to pass the full length of the packet to the eBPF layer in the XDP context
> - reintroduce multi-buff support to eBPF kselftests
> - reintroduce multi-buff support to bpf_xdp_adjust_tail helper
> - introduce multi-buffer support to bpf_xdp_copy helper
> - rebase on top of bpf-next
>
> Changes since v5:
> - rebase on top of bpf-next
> - initialize mb bit in xdp_init_buff() and drop per-driver initialization
> - drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()
> - postpone introduction of frame_length field in XDP ctx to another series
> - minor changes
>
> Changes since v4:
> - rebase on top of bpf-next
> - introduce xdp_shared_info to build xdp multi-buff instead of using the
>   skb_shared_info struct
> - introduce frame_length in xdp ctx
> - drop previous bpf helpers
> - fix bpf_xdp_adjust_tail for xdp multi-buff
> - introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
> - fix xdp_return_frame_bulk for xdp multi-buff
>
> Changes since v3:
> - rebase on top of bpf-next
> - add patch 10/13 to copy back paged data from a xdp multi-buff frame to
>   userspace buffer for xdp multi-buff selftests
>
> Changes since v2:
> - add throughput measurements
> - drop bpf_xdp_adjust_mb_header bpf helper
> - introduce selftest for xdp multi-buffer
> - addressed comments on bpf_xdp_get_frags_count
> - introduce xdp multi-buff support to cpumaps
>
> Changes since v1:
> - fix use-after-free in xdp_return_{buff/frame}
> - introduce bpf helpers
> - introduce xdp_mb sample program
> - access skb_shared_info->nr_frags only on the last fragment
>
> Changes since RFC:
> - squash multi-buffer bit initialization in a single patch
> - add mvneta non-linear XDP buff support for tx side
>
> [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)
>
> Eelco Chaudron (3):
>   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
>   bpf: add multi-buffer support to xdp copy helpers
>   bpf: update xdp_adjust_tail selftest to include multi-buffer
>
> Lorenzo Bianconi (11):
>   net: skbuff: add data_len field to skb_shared_info
>   xdp: introduce flags field in xdp_buff/xdp_frame
>   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
>   xdp: add multi-buff support to xdp_return_{buff/frame}
>   net: mvneta: add multi buffer support to XDP_TX
>   net: mvneta: enable jumbo frames for XDP
>   net: xdp: add multi-buff support to xdp_build_skb_from_frame
>   bpf: introduce bpf_xdp_get_buff_len helper
>   bpf: move user_size out of bpf_test_init
>   bpf: introduce multibuff support to bpf_prog_test_run_xdp()
>   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
>
>  drivers/net/ethernet/marvell/mvneta.c       | 143 ++++++++++------
>  include/linux/skbuff.h                      |   5 +-
>  include/net/xdp.h                           |  56 ++++++-
>  include/uapi/linux/bpf.h                    |   7 +
>  kernel/trace/bpf_trace.c                    |   3 +
>  net/bpf/test_run.c                          | 108 +++++++++---
>  net/core/filter.c                           | 157 +++++++++++++++++-
>  net/core/xdp.c                              |  72 +++++++-
>  tools/include/uapi/linux/bpf.h              |   7 +
>  .../bpf/prog_tests/xdp_adjust_tail.c        | 105 ++++++++++++
>  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c  | 127 +++++++++-----
>  .../bpf/progs/test_xdp_adjust_tail_grow.c   |  10 +-
>  .../bpf/progs/test_xdp_adjust_tail_shrink.c |  32 +++-
>  .../selftests/bpf/progs/test_xdp_bpf2bpf.c  |   2 +-
>  14 files changed, 705 insertions(+), 129 deletions(-)
>
> --
> 2.31.1
On 6/22/21 5:18 PM, John Fastabend wrote:
> At this point I don't think we can have a partial implementation. At
> the moment we have packet capture applications and protocol parsers
> running in production. If we allow this to go in staged we are going
> to break those applications that make the fundamental assumption they
> have access to all the data in the packet.

What about cases like netgpu, where headers are accessible but data is
not (e.g., gpu memory)? If the API indicates limited buffer access, is
that sufficient?
David Ahern wrote:
> On 6/22/21 5:18 PM, John Fastabend wrote:
> > At this point I don't think we can have a partial implementation. At
> > the moment we have packet capture applications and protocol parsers
> > running in production. If we allow this to go in staged we are going
> > to break those applications that make the fundamental assumption they
> > have access to all the data in the packet.
>
> What about cases like netgpu where headers are accessible but data is
> not (e.g., gpu memory)? If the API indicates limited buffer access, is
> that sufficient?

I never considered netgpus, and I guess I don't fully understand the
architecture enough to say. But I would try to argue that an XDP API
should allow XDP to reach into the payload of these GPU packets as well.
Of course it might be slow.

I'm not really convinced just indicating it's a limited buffer is enough.
I think we want to be able to read/write any byte in the packet. I see
two ways to do it:

/* xdp_pull_data moves data and data_end pointers into the frag
 * containing the byte offset start.
 *
 * returns negative value on error otherwise returns offset of
 * data pointer into payload.
 */
int xdp_pull_data(int start)

This would be a helper call to push the xdp->data{_end} pointers into
the correct frag, and then normal verification should work. From my
side this works because I can always find the next frag by starting
at 'xdp_pull_data(xdp->data_end + 1)'. And by returning the offset we can
always figure out where we are in the payload. This is the easiest
thing I could come up with, and hopefully for _most_ cases the bytes
we need are in the initial data. Also, I don't see how extending the tail
works without something like this.

My other thought, but it requires some verifier work, would be to extend
'struct xdp_md' with a frags[] pointer:

struct xdp_md {
	__u32 data;
	__u32 data_end;
	__u32 data_meta;
	/* metadata stuff */
	struct _xdp_md frags[];
	__u32 frags_end;
};

Then an XDP program could read a frag like so:

if (i < xdp->frags_end) {
	frag = &xdp->frags[i];
	if (offset + hdr_size < frag->data_end)
		memcpy(dst, &frag->data[offset], hdr_size);
}

The nice bit about the above is you avoid the call, but maybe it doesn't
matter; if you are already looking into frags, pps is probably not at
64B sizes anyways.

My main concern here is we hit a case where the driver doesn't pull in
the bytes we need and then we are stuck without a workaround. The helper
looks fairly straightforward to me; could we try that?

Also, I thought we had another driver in the works? Any ideas where
that went...

Last, I'll add thanks for working on this everyone.

.John
On 6/22/21 11:48 PM, John Fastabend wrote:
> David Ahern wrote:
>> On 6/22/21 5:18 PM, John Fastabend wrote:
>>> At this point I don't think we can have a partial implementation. At
>>> the moment we have packet capture applications and protocol parsers
>>> running in production. If we allow this to go in staged we are going
>>> to break those applications that make the fundamental assumption they
>>> have access to all the data in the packet.
>>
>> What about cases like netgpu where headers are accessible but data is
>> not (e.g., gpu memory)? If the API indicates limited buffer access, is
>> that sufficient?
>
> I never considered netgpus and I guess I don't fully understand the
> architecture enough to say. But I would try to argue that an XDP API
> should allow XDP to reach into the payload of these GPU packets as well.
> Of course it might be slow.

AIUI, S/W on the host can not access gpu memory, so that is not a
possibility at all.

Another use case is DDP and ZC. Mellanox has a proposal for NVMe (with
intentions to extend it to iscsi) to do direct data placement. This is
really just an example of zerocopy (and netgpu has morphed into zctap,
with the current prototype working for host memory) which will become more
prominent. XDP programs accessing memory already mapped to user space
will be racy.

To me these proposals suggest a trend, and one that XDP APIs should be
ready to handle - like indicating limited access or specifying the length
that can be accessed.

> I'm not really convinced just indicating it's a limited buffer is enough.
> I think we want to be able to read/write any byte in the packet. I see
> two ways to do it:
>
> /* xdp_pull_data moves data and data_end pointers into the frag
>  * containing the byte offset start.
>  *
>  * returns negative value on error otherwise returns offset of
>  * data pointer into payload.
>  */
> int xdp_pull_data(int start)
>
> This would be a helper call to push the xdp->data{_end} pointers into
> the correct frag and then normal verification should work. From my
> side this works because I can always find the next frag by starting
> at 'xdp_pull_data(xdp->data_end + 1)'. And by returning the offset we can
> always figure out where we are in the payload. This is the easiest
> thing I could come up with. And hopefully for _most_ cases the bytes
> we need are in the initial data. Also I don't see how extending tail
> works without something like this.
>
> My other thought, but requires some verifier work, would be to extend
> 'struct xdp_md' with a frags[] pointer:
>
> struct xdp_md {
> 	__u32 data;
> 	__u32 data_end;
> 	__u32 data_meta;
> 	/* metadata stuff */
> 	struct _xdp_md frags[];
> 	__u32 frags_end;
> };
>
> Then an XDP program could read a frag like so:
>
> if (i < xdp->frags_end) {
> 	frag = &xdp->frags[i];
> 	if (offset + hdr_size < frag->data_end)
> 		memcpy(dst, &frag->data[offset], hdr_size);
> }
>
> The nice bit about the above is you avoid the call, but maybe it doesn't
> matter; if you are already looking into frags, pps is probably not at
> 64B sizes anyways.
>
> My main concern here is we hit a case where the driver doesn't pull in
> the bytes we need and then we are stuck without a workaround. The helper
> looks fairly straightforward to me; could we try that?
>
> Also I thought we had another driver in the works? Any ideas where
> that went...
>
> Last, I'll add thanks for working on this everyone.
>
> .John
David Ahern wrote:
> On 6/22/21 11:48 PM, John Fastabend wrote:
> > David Ahern wrote:
> >> On 6/22/21 5:18 PM, John Fastabend wrote:
> >>> At this point I don't think we can have a partial implementation. At
> >>> the moment we have packet capture applications and protocol parsers
> >>> running in production. If we allow this to go in staged we are going
> >>> to break those applications that make the fundamental assumption they
> >>> have access to all the data in the packet.
> >>
> >> What about cases like netgpu where headers are accessible but data is
> >> not (e.g., gpu memory)? If the API indicates limited buffer access, is
> >> that sufficient?
> >
> > I never considered netgpus and I guess I don't fully understand the
> > architecture enough to say. But I would try to argue that an XDP API
> > should allow XDP to reach into the payload of these GPU packets as well.
> > Of course it might be slow.
>
> AIUI S/W on the host can not access gpu memory, so that is not a
> possibility at all.

Interesting.

> Another use case is DDP and ZC. Mellanox has a proposal for NVMe (with
> intentions to extend it to iscsi) to do direct data placement. This is
> really just an example of zerocopy (and netgpu has morphed into zctap,
> with the current prototype working for host memory) which will become more
> prominent. XDP programs accessing memory already mapped to user space
> will be racy.

It's racy in the sense that if the application is reading data before
the driver flips some bit to tell the application new data is available,
XDP could write old data or read application-changed data? I think it
would still "work", same as AF_XDP?

If you allow DDP then you lose the ability to do L7 security as far as I
can tell. But that's a general comment, not specific to XDP.

> To me these proposals suggest a trend and one that XDP APIs should be
> ready to handle - like indicating limited access or specifying the length
> that can be accessed.

I still think the only case is this net-gpu, which we don't have in the
kernel at the moment, right? I think a bit or size or ... would make
sense if we had this hardware. And then for the other DDP/ZC case the
system owner would need to know what they are doing when they turn on
DDP or whatever.

.John
On Wed, Jun 23, 2021 at 1:19 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Lorenzo Bianconi wrote:
> > This series introduces XDP multi-buffer support. The mvneta driver is
> > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
> > please focus on how these new types of xdp_{buff,frame} packets
> > traverse the different layers and the layout design. It is on purpose
> > that BPF-helpers are kept simple, as we don't want to expose the
> > internal layout to allow later changes.
> >
> > For now, to keep the design simple and to maintain performance, the XDP
> > BPF-prog (still) only has access to the first buffer. It is left for
> > later (another patchset) to add payload access across multiple buffers.
> > This patchset should still allow for these future extensions. The goal
> > is to lift the MTU restriction that comes with XDP, but maintain the
> > same performance as before.
>
> At this point I don't think we can have a partial implementation. At
> the moment we have packet capture applications and protocol parsers
> running in production. If we allow this to go in staged, we are going
> to break those applications that make the fundamental assumption that they
> have access to all the data in the packet.
>
> There will be no way to fix it when it happens. The teams running the
> applications won't necessarily be able to change the network MTU. Now
> it doesn't work, hard stop; this is better than it sort of working some
> of the time. Worse, if we get into a situation where some drivers support
> partial access and others support full access, the support matrix gets worse.
>
> I think we need full support and access to all bytes. I believe
> I said this earlier, but now we've deployed apps that really do need
> access to the payloads, so it's not a theoretical concern anymore, but
> rather a real one based on deployed BPF programs.
>
> > The main idea for the new multi-buffer layout is to reuse the same
> > layout used for non-linear SKBs. This relies on the "skb_shared_info"
> > struct at the end of the first buffer to link together subsequent
> > buffers. Keeping the layout compatible with SKBs is also done to ease
> > and speed up creating an SKB from an xdp_{buff,frame}.
> > Converting an xdp_frame to an SKB and delivering it to the network stack is shown
> > in patch 07/14 (e.g. cpumaps).
> >
> > A multi-buffer bit (mb) has been introduced in the flags field of the xdp_{buff,frame}
> > structure to notify the bpf/network layer whether this is an xdp multi-buffer frame
> > (mb = 1) or not (mb = 0).
> > The mb bit will be set by an xdp multi-buffer capable driver only for
> > non-linear frames, maintaining the capability to receive linear frames
> > without any extra cost, since the skb_shared_info structure at the end
> > of the first buffer will be initialized only if mb is set.
> > Moreover, the flags field in xdp_{buff,frame} will be reused for
> > xdp rx csum offloading in a future series.
> >
> > Typical use cases for this series are:
> > - Jumbo frames
> > - Packet header split (please see Google's use-case @ NetDevConf 0x14, [0])
> > - TSO
> >
> > A new bpf helper (bpf_xdp_get_buff_len) has been introduced in order to notify
> > the eBPF layer about the total frame size (linear + paged parts).
>
> Is it possible to make currently working programs continue to work?
> For a simple packet capture example, a program might capture the
> entire packet of bytes '(data_end - data_start)'. With the above implementation
> the program will continue to run, but will no longer be capturing
> all the bytes... so it's a silent failure. Otherwise I'll need to
> backport fixes into my BPF programs and releases to ensure they
> don't walk onto a new kernel with multi-buffer support enabled.
> It's not ideal.
>
> > bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take into
> > account xdp multi-buff frames.
> >
> > More info about the main idea behind this approach can be found here [1][2].
>
> Will read [1],[2].
>
> Where did the perf data for the 40gbps NIC go? I think we want that
> done again on this series with at least 40gbps NICs and better
> yet 100gbps drivers. If it's addressed in a patch commit message,
> I'm reading the series now.

Here is the perf data for a 40 gbps i40e on my 2.1 GHz Cascade Lake server:

                     xdpsock -r       XDP_DROP  XDP_TX
Lorenzo's patches:   -2%/+1.5 cycles  -3%/+3    +2%/-6  (Yes, it gets better!)
 + i40e support:     -5.5%/+5         -8%/+9    -9%/+31

It seems that it is the driver support itself that hurts now. The
overhead of the base support has decreased substantially over time,
which is good.

> > Changes since v8:
> > - add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff
> > - switch back to skb_shared_info implementation from previous xdp_shared_info one
> > - avoid using a bitfield in xdp_buff/xdp_frame since it introduces performance
> >   regressions. Tested now on 10G NIC (ixgbe) to verify there are no performance
> >   penalties for regular codebase
> > - add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx
> > - add data_len field in skb_shared_info struct
> >
> > Changes since v7:
> > - rebase on top of bpf-next
> > - fix sparse warnings
> > - improve comments for frame_length in include/net/xdp.h
> >
> > Changes since v6:
> > - the main difference with respect to previous versions is the new approach proposed
> >   by Eelco to pass the full length of the packet to the eBPF layer in the XDP context
> > - reintroduce multi-buff support to eBPF kselftests
> > - reintroduce multi-buff support to bpf_xdp_adjust_tail helper
> > - introduce multi-buffer support to bpf_xdp_copy helper
> > - rebase on top of bpf-next
> >
> > Changes since v5:
> > - rebase on top of bpf-next
> > - initialize mb bit in xdp_init_buff() and drop per-driver initialization
> > - drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()
> > - postpone introduction of frame_length field in XDP ctx to another series
> > - minor changes
> >
> > Changes since v4:
> > - rebase on top of bpf-next
> > - introduce xdp_shared_info to build xdp multi-buff instead of using the
> >   skb_shared_info struct
> > - introduce frame_length in xdp ctx
> > - drop previous bpf helpers
> > - fix bpf_xdp_adjust_tail for xdp multi-buff
> > - introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
> > - fix xdp_return_frame_bulk for xdp multi-buff
> >
> > Changes since v3:
> > - rebase on top of bpf-next
> > - add patch 10/13 to copy back paged data from a xdp multi-buff frame to
> >   userspace buffer for xdp multi-buff selftests
> >
> > Changes since v2:
> > - add throughput measurements
> > - drop bpf_xdp_adjust_mb_header bpf helper
> > - introduce selftest for xdp multi-buffer
> > - addressed comments on bpf_xdp_get_frags_count
> > - introduce xdp multi-buff support to cpumaps
> >
> > Changes since v1:
> > - fix use-after-free in xdp_return_{buff/frame}
> > - introduce bpf helpers
> > - introduce xdp_mb sample program
> > - access skb_shared_info->nr_frags only on the last fragment
> >
> > Changes since RFC:
> > - squash multi-buffer bit initialization in a single patch
> > - add mvneta non-linear XDP buff support for tx side
> >
> > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> > [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)
> >
> > Eelco Chaudron (3):
> >   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
> >   bpf: add multi-buffer support to xdp copy helpers
> >   bpf: update xdp_adjust_tail selftest to include multi-buffer
> >
> > Lorenzo Bianconi (11):
> >   net: skbuff: add data_len field to skb_shared_info
> >   xdp: introduce flags field in xdp_buff/xdp_frame
> >   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
> >   xdp: add multi-buff support to xdp_return_{buff/frame}
> >   net: mvneta: add multi buffer support to XDP_TX
> >   net: mvneta: enable jumbo frames for XDP
> >   net: xdp: add multi-buff support to xdp_build_skb_from_frame
> >   bpf: introduce bpf_xdp_get_buff_len helper
> >   bpf: move user_size out of bpf_test_init
> >   bpf: introduce multibuff support to bpf_prog_test_run_xdp()
> >   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
> >
> >  drivers/net/ethernet/marvell/mvneta.c       | 143 ++++++++++------
> >  include/linux/skbuff.h                      |   5 +-
> >  include/net/xdp.h                           |  56 ++++++-
> >  include/uapi/linux/bpf.h                    |   7 +
> >  kernel/trace/bpf_trace.c                    |   3 +
> >  net/bpf/test_run.c                          | 108 +++++++++---
> >  net/core/filter.c                           | 157 +++++++++++++++++-
> >  net/core/xdp.c                              |  72 +++++++-
> >  tools/include/uapi/linux/bpf.h              |   7 +
> >  .../bpf/prog_tests/xdp_adjust_tail.c        | 105 ++++++++++++
> >  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c  | 127 +++++++++-----
> >  .../bpf/progs/test_xdp_adjust_tail_grow.c   |  10 +-
> >  .../bpf/progs/test_xdp_adjust_tail_shrink.c |  32 +++-
> >  .../selftests/bpf/progs/test_xdp_bpf2bpf.c  |   2 +-
> >  14 files changed, 705 insertions(+), 129 deletions(-)
> >
> > --
> > 2.31.1