
[v4,bpf-next,15/22] xsk: add multi-buffer documentation

Message ID 20230615172606.349557-16-maciej.fijalkowski@intel.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series xsk: multi-buffer support

Checks

Context Check Description
netdev/series_format fail Series longer than 15 patches (and no cover letter)
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers warning 9 maintainers not CCed: kuba@kernel.org hawk@kernel.org john.fastabend@gmail.com corbet@lwn.net davem@davemloft.net jonathan.lemon@gmail.com pabeni@redhat.com edumazet@google.com linux-doc@vger.kernel.org
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 189 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26 fail Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on s390x with gcc

Commit Message

Fijalkowski, Maciej June 15, 2023, 5:25 p.m. UTC
From: Magnus Karlsson <magnus.karlsson@intel.com>

Add AF_XDP multi-buffer support documentation including two
pseudo-code samples.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)

Comments

Toke Høiland-Jørgensen June 20, 2023, 5:34 p.m. UTC | #1
Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> From: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Add AF_XDP multi-buffer support documentation including two
> pseudo-code samples.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
>  1 file changed, 177 insertions(+)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index 247c6c4127e9..2b583f58967b 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
>  Gets options from an XDP socket. The only one supported so far is
>  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
>  
> +Multi-Buffer Support
> +--------------------
> +
> +With multi-buffer support, programs using AF_XDP sockets can receive
> +and transmit packets consisting of multiple buffers, both in copy and
> +zero-copy mode. For example, a packet can consist of two
> +frames/buffers, one with the header and the other with the data, or a
> +9K Ethernet jumbo frame can be constructed by chaining together three
> +4K frames.
> +
> +Some definitions:
> +
> +* A packet consists of one or more frames.
> +
> +* A descriptor in one of the AF_XDP rings always refers to a single
> +  frame. If the packet consists of a single frame, the descriptor
> +  refers to the whole packet.
> +
> +To enable multi-buffer support for an AF_XDP socket, use the new bind
> +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
> +will be dropped just as before. Note that the XDP program loaded also
> +needs to be in multi-buffer mode. This can be accomplished by using
> +"xdp.frags" as the section name of the XDP program used.
> +
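As a quick illustration of the two pieces involved (a hedged fragment, not a complete program: `cfg`, `xsk_prog` and the elided setup are illustrative; XDP_USE_SG and the "xdp.frags" section name are the ones described above):

```c
/* Userspace: request multi-buffer support when binding the socket. */
struct xsk_socket_config cfg = {
    .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
    .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
    .bind_flags = XDP_USE_SG, /* without this, multi-buffer packets are dropped */
};
/* ... pass &cfg to xsk_socket__create() as usual ... */

/* XDP program: the "xdp.frags" section name declares multi-buffer support. */
SEC("xdp.frags")
int xsk_prog(struct xdp_md *ctx)
{
    return XDP_PASS;
}
```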
> +To represent a packet consisting of multiple frames, a new flag called
> +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
> +descriptors. If it is true (1), the packet continues with the next
> +descriptor; if it is false (0), this is the last descriptor of the
> +packet. Why the reverse of the end-of-packet (eop) flag found in many
> +NICs? To preserve compatibility with non-multi-buffer applications,
> +which already have this bit set to false for all packets on Rx and set
> +the options field to zero for Tx, as anything else will be treated as
> +an invalid descriptor.
> +
> +These are the semantics for producing multi-frame packets onto the
> +AF_XDP Tx ring:
> +
> +* When an invalid descriptor is found, all the other
> +  descriptors/frames of this packet are marked as invalid and not
> +  completed. The next descriptor is treated as the start of a new
> +  packet, even if this was not the intent (because we cannot guess
> +  the intent). As before, if your program is producing invalid
> +  descriptors you have a bug that must be fixed.
> +
> +* Zero-length descriptors are treated as invalid descriptors.
> +
> +* For copy mode, the maximum supported number of frames in a packet is
> +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
> +  descriptors accumulated so far are dropped and treated as
> +  invalid. To produce an application that will work on any system
> +  regardless of this config setting, limit the number of frags to 18,
> +  as the minimum value of the config is 17.
> +
> +* For zero-copy mode, the limit is whatever the NIC HW supports,
> +  usually at least five on the NICs we have checked. We consciously
> +  chose not to enforce a rigid limit (such as CONFIG_MAX_SKB_FRAGS +
> +  1) for zero-copy mode, as it would have required copy operations
> +  under the hood to fit within the limit the NIC supports, which
> +  defeats the purpose of zero-copy mode.

How is an application supposed to discover the actual limit for a given
NIC/driver?

> +* The ZC batch API guarantees that it will provide a batch of Tx
> +  descriptors that ends with a full packet. Otherwise, ZC drivers
> +  would have to gather the full packet on their side. The approach
> +  we picked makes ZC drivers' lives much easier (at least on the
> +  Tx side).

This seems like it implies some constraint on how an application can use
the APIs, but it's not quite clear to me what those constraints are, nor
what happens if an application does something different. This should
probably be spelled out...

> +On the Rx path in copy-mode, the xsk core copies the XDP data into
> +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
> +detailed before. Zero-copy mode works the same, though the data is not
> +copied. When the application gets a descriptor with the XDP_PKT_CONTD
> +flag set to one, it means that the packet consists of multiple buffers
> +and it continues with the next buffer in the following
> +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
> +means that this is the last buffer of the packet. AF_XDP guarantees
> +that only a complete packet (all frames in the packet) is sent to the
> +application.

In light of the comment on batch size below, I think it would be useful
to spell out what this means exactly. IIUC correctly, it means that the
kernel will check the ringbuffer before starting to copy the data, and
if there are not enough descriptors available, it will drop the packet
instead of doing a partial copy, right? And this is the case for both ZC
and copy mode?

> +If the application reads a batch of descriptors, using for example the
> +libxdp interfaces, it is not guaranteed that the batch will end with a
> +full packet. It might end in the middle of a packet, and the rest of
> +the buffers of that packet will arrive at the beginning of the next
> +batch, since the libxdp interface does not read the whole ring (unless
> +you have an enormous batch size or a very small ring size).
> +
> +Example programs for Rx and Tx multi-buffer support can be found later
> +in this document.
> +
>  Usage
>  =====
>  
> @@ -532,6 +619,96 @@ like this:
>  But please use the libbpf functions as they are optimized and ready to
>  use. Will make your life easier.
>  
> +Usage Multi-Buffer Rx
> +=====================
> +
> +Here is a simple Rx path pseudo-code example (using libxdp interfaces
> +for simplicity). Error paths have been excluded to keep it short:
> +
> +.. code-block:: c
> +
> +    void rx_packets(struct xsk_socket_info *xsk)
> +    {
> +        static bool new_packet = true;
> +        u32 idx_rx = 0, idx_fq = 0;
> +        static char *pkt;
> +
> +        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
> +
> +        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> +
> +        for (int i = 0; i < rcvd; i++) {
> +            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
> +            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
> +            bool eop = !(desc->options & XDP_PKT_CONTD);
> +
> +        if (new_packet)
> +            pkt = frag;
> +        else
> +            add_frag_to_pkt(pkt, frag);
> +
> +        if (eop)
> +            process_pkt(pkt);
> +
> +        new_packet = eop;
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;

Indentation is off here...

-Toke
Magnus Karlsson June 21, 2023, 8:06 a.m. UTC | #2
On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>
> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >
> > Add AF_XDP multi-buffer support documentation including two
> > pseudo-code samples.
> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
> >  1 file changed, 177 insertions(+)
> >
> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> > index 247c6c4127e9..2b583f58967b 100644
> > --- a/Documentation/networking/af_xdp.rst
> > +++ b/Documentation/networking/af_xdp.rst
> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
> >  Gets options from an XDP socket. The only one supported so far is
> >  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
> >
> > +Multi-Buffer Support
> > +--------------------
> > +
> > +With multi-buffer support, programs using AF_XDP sockets can receive
> > +and transmit packets consisting of multiple buffers both in copy and
> > +zero-copy mode. For example, a packet can consist of two
> > +frames/buffers, one with the header and the other one with the data,
> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
> > +three 4K frames.
> > +
> > +Some definitions:
> > +
> > +* A packet consists of one or more frames
> > +
> > +* A descriptor in one of the AF_XDP rings always refers to a single
> > +  frame. In the case the packet consists of a single frame, the
> > +  descriptor refers to the whole packet.
> > +
> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
> > +will be dropped just as before. Note that the XDP program loaded also
> > +needs to be in multi-buffer mode. This can be accomplished by using
> > +"xdp.frags" as the section name of the XDP program used.
> > +
> > +To represent a packet consisting of multiple frames, a new flag called
> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
> > +descriptors. If it is true (1) the packet continues with the next
> > +descriptor and if it is false (0) it means this is the last descriptor
> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
> > +in many NICs? Just to preserve compatibility with non-multi-buffer
> > +applications that have this bit set to false for all packets on Rx,
> > +and the apps set the options field to zero for Tx, as anything else
> > +will be treated as an invalid descriptor.
> > +
> > +These are the semantics for producing packets onto AF_XDP Tx ring
> > +consisting of multiple frames:
> > +
> > +* When an invalid descriptor is found, all the other
> > +  descriptors/frames of this packet are marked as invalid and not
> > +  completed. The next descriptor is treated as the start of a new
> > +  packet, even if this was not the intent (because we cannot guess
> > +  the intent). As before, if your program is producing invalid
> > +  descriptors you have a bug that must be fixed.
> > +
> > +* Zero length descriptors are treated as invalid descriptors.
> > +
> > +* For copy mode, the maximum supported number of frames in a packet is
> > +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
> > +  descriptors accumulated so far are dropped and treated as
> > +  invalid. To produce an application that will work on any system
> > +  regardless of this config setting, limit the number of frags to 18,
> > +  as the minimum value of the config is 17.
> > +
> > +* For zero-copy mode, the limit is up to what the NIC HW
> > +  supports. Usually at least five on the NICs we have checked. We
> > +  consciously chose to not enforce a rigid limit (such as
> > +  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
> > +  resulted in copy actions under the hood to fit into what limit
> > +  the NIC supports. Kind of defeats the purpose of zero-copy mode.
>
> How is an application supposed to discover the actual limit for a given
> NIC/driver?

Thanks for your comments Toke. I will add an example here of how to
discover this. Basically, you can send a packet with N frags (starting
with N = 2) and check the error stats through the getsockopt. If there
is no invalid_tx_desc error, increase N by one and send a new packet.
Once you get an error, the max number of frags is N - 1.
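A sketch of that probe loop (pseudo-code: send_pkt_with_frags() and wait_for_tx_done() are hypothetical helpers; XDP_STATISTICS and the tx_invalid_descs counter in struct xdp_statistics are the existing getsockopt interface for these error stats):

```c
/* Probe the frag limit of a NIC via trial sends (pseudo-code). */
int probe_max_frags(int xsk_fd)
{
	struct xdp_statistics stats;
	socklen_t optlen = sizeof(stats);
	__u64 prev_invalid = 0;

	for (int n = 2; ; n++) {
		send_pkt_with_frags(xsk_fd, n); /* hypothetical helper */
		wait_for_tx_done(xsk_fd);       /* hypothetical helper */

		getsockopt(xsk_fd, SOL_XDP, XDP_STATISTICS, &stats, &optlen);
		if (stats.tx_invalid_descs > prev_invalid)
			return n - 1; /* n frags were rejected: limit is n - 1 */
		prev_invalid = stats.tx_invalid_descs;
	}
}
```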

> > +* The ZC batch API guarantees that it will provide a batch of Tx
> > +  descriptors that ends with full packet at the end. If not, ZC
> > +  drivers would have to gather the full packet on their side. The
> > +  approach we picked makes ZC drivers' life much easier (at least on
> > +  Tx side).
>
> This seems like it implies some constraint on how an application can use
> the APIs, but it's not quite clear to me what those constraints are, nor
> what happens if an application does something different. This should
> probably be spelled out...
>
> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
> > +detailed before. Zero-copy mode works the same, though the data is not
> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
> > +flag set to one, it means that the packet consists of multiple buffers
> > +and it continues with the next buffer in the following
> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
> > +means that this is the last buffer of the packet. AF_XDP guarantees
> > +that only a complete packet (all frames in the packet) is sent to the
> > +application.
>
> In light of the comment on batch size below, I think it would be useful
> to spell out what this means exactly. IIUC correctly, it means that the
> kernel will check the ringbuffer before starting to copy the data, and
> if there are not enough descriptors available, it will drop the packet
> instead of doing a partial copy, right? And this is the case for both ZC
> and copy mode?

I will make this paragraph and the previous one clearer. And yes, copy
mode and zc mode have the same behaviour.

> > +If application reads a batch of descriptors, using for example the libxdp
> > +interfaces, it is not guaranteed that the batch will end with a full
> > +packet. It might end in the middle of a packet and the rest of the
> > +buffers of that packet will arrive at the beginning of the next batch,
> > +since the libxdp interface does not read the whole ring (unless you
> > +have an enormous batch size or a very small ring size).
> > +
> > +An example program each for Rx and Tx multi-buffer support can be found
> > +later in this document.
> > +
> >  Usage
> >  =====
> >
> > @@ -532,6 +619,96 @@ like this:
> >  But please use the libbpf functions as they are optimized and ready to
> >  use. Will make your life easier.
> >
> > +Usage Multi-Buffer Rx
> > +=====================
> > +
> > +Here is a simple Rx path pseudo-code example (using libxdp interfaces
> > +for simplicity). Error paths have been excluded to keep it short:
> > +
> > +.. code-block:: c
> > +
> > +    void rx_packets(struct xsk_socket_info *xsk)
> > +    {
> > +        static bool new_packet = true;
> > +        u32 idx_rx = 0, idx_fq = 0;
> > +        static char *pkt;
> > +
> > +        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
> > +
> > +        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> > +
> > +        for (int i = 0; i < rcvd; i++) {
> > +            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
> > +            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
> > +            bool eop = !(desc->options & XDP_PKT_CONTD);
> > +
> > +        if (new_packet)
> > +            pkt = frag;
> > +        else
> > +            add_frag_to_pkt(pkt, frag);
> > +
> > +        if (eop)
> > +            process_pkt(pkt);
> > +
> > +        new_packet = eop;
> > +
> > +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
>
> Indentation is off here...

Will fix.

>
> -Toke
>
Toke Høiland-Jørgensen June 21, 2023, 1:30 p.m. UTC | #3
Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>>
>> > From: Magnus Karlsson <magnus.karlsson@intel.com>
>> >
>> > Add AF_XDP multi-buffer support documentation including two
>> > pseudo-code samples.
>> >
>> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>> > ---
>> >  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
>> >  1 file changed, 177 insertions(+)
>> >
>> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>> > index 247c6c4127e9..2b583f58967b 100644
>> > --- a/Documentation/networking/af_xdp.rst
>> > +++ b/Documentation/networking/af_xdp.rst
>> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
>> >  Gets options from an XDP socket. The only one supported so far is
>> >  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
>> >
>> > +Multi-Buffer Support
>> > +--------------------
>> > +
>> > +With multi-buffer support, programs using AF_XDP sockets can receive
>> > +and transmit packets consisting of multiple buffers both in copy and
>> > +zero-copy mode. For example, a packet can consist of two
>> > +frames/buffers, one with the header and the other one with the data,
>> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
>> > +three 4K frames.
>> > +
>> > +Some definitions:
>> > +
>> > +* A packet consists of one or more frames
>> > +
>> > +* A descriptor in one of the AF_XDP rings always refers to a single
>> > +  frame. In the case the packet consists of a single frame, the
>> > +  descriptor refers to the whole packet.
>> > +
>> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
>> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
>> > +will be dropped just as before. Note that the XDP program loaded also
>> > +needs to be in multi-buffer mode. This can be accomplished by using
>> > +"xdp.frags" as the section name of the XDP program used.
>> > +
>> > +To represent a packet consisting of multiple frames, a new flag called
>> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
>> > +descriptors. If it is true (1) the packet continues with the next
>> > +descriptor and if it is false (0) it means this is the last descriptor
>> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
>> > +in many NICs? Just to preserve compatibility with non-multi-buffer
>> > +applications that have this bit set to false for all packets on Rx,
>> > +and the apps set the options field to zero for Tx, as anything else
>> > +will be treated as an invalid descriptor.
>> > +
>> > +These are the semantics for producing packets onto AF_XDP Tx ring
>> > +consisting of multiple frames:
>> > +
>> > +* When an invalid descriptor is found, all the other
>> > +  descriptors/frames of this packet are marked as invalid and not
>> > +  completed. The next descriptor is treated as the start of a new
>> > +  packet, even if this was not the intent (because we cannot guess
>> > +  the intent). As before, if your program is producing invalid
>> > +  descriptors you have a bug that must be fixed.
>> > +
>> > +* Zero length descriptors are treated as invalid descriptors.
>> > +
>> > +* For copy mode, the maximum supported number of frames in a packet is
>> > +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
>> > +  descriptors accumulated so far are dropped and treated as
>> > +  invalid. To produce an application that will work on any system
>> > +  regardless of this config setting, limit the number of frags to 18,
>> > +  as the minimum value of the config is 17.
>> > +
>> > +* For zero-copy mode, the limit is up to what the NIC HW
>> > +  supports. Usually at least five on the NICs we have checked. We
>> > +  consciously chose to not enforce a rigid limit (such as
>> > +  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
>> > +  resulted in copy actions under the hood to fit into what limit
>> > +  the NIC supports. Kind of defeats the purpose of zero-copy mode.
>>
>> How is an application supposed to discover the actual limit for a given
>> NIC/driver?
>
> Thanks for your comments Toke. I will add an example here of how to
> discover this. Basically you can send a packet with N frags (N = 2 to
> start with), check the error stats through the getsockopt. If no
> invalid_tx_desc error, increase N with one and send this new packet.
> If you get an error, then the max number of frags is N-1.

Hmm, okay, that sounds pretty tedious :P

Also, it has side effects (you'll be putting frames on the wire while
testing, right?), so I guess this is not something you'll do on every
startup of your application? What are you expecting app developers to do
here in practice? Run the test while developing and expect the value to
stay constant for a given driver (does it?), or? Or will most people in
practice only need a few frags to get up to 9k MTU?

>> > +* The ZC batch API guarantees that it will provide a batch of Tx
>> > +  descriptors that ends with full packet at the end. If not, ZC
>> > +  drivers would have to gather the full packet on their side. The
>> > +  approach we picked makes ZC drivers' life much easier (at least on
>> > +  Tx side).
>>
>> This seems like it implies some constraint on how an application can use
>> the APIs, but it's not quite clear to me what those constraints are, nor
>> what happens if an application does something different. This should
>> probably be spelled out...
>>
>> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
>> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
>> > +detailed before. Zero-copy mode works the same, though the data is not
>> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
>> > +flag set to one, it means that the packet consists of multiple buffers
>> > +and it continues with the next buffer in the following
>> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
>> > +means that this is the last buffer of the packet. AF_XDP guarantees
>> > +that only a complete packet (all frames in the packet) is sent to the
>> > +application.
>>
>> In light of the comment on batch size below, I think it would be useful
>> to spell out what this means exactly. IIUC correctly, it means that the
>> kernel will check the ringbuffer before starting to copy the data, and
>> if there are not enough descriptors available, it will drop the packet
>> instead of doing a partial copy, right? And this is the case for both ZC
>> and copy mode?
>
> I will make this paragraph and the previous one clearer. And yes, copy
> mode and zc mode have the same behaviour.

Alright, great!

-Toke
Magnus Karlsson June 21, 2023, 2:15 p.m. UTC | #4
On Wed, 21 Jun 2023 at 15:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>
> > On Tue, 20 Jun 2023 at 19:34, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> >>
> >> > From: Magnus Karlsson <magnus.karlsson@intel.com>
> >> >
> >> > Add AF_XDP multi-buffer support documentation including two
> >> > pseudo-code samples.
> >> >
> >> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> >> > ---
> >> >  Documentation/networking/af_xdp.rst | 177 ++++++++++++++++++++++++++++
> >> >  1 file changed, 177 insertions(+)
> >> >
> >> > diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >> > index 247c6c4127e9..2b583f58967b 100644
> >> > --- a/Documentation/networking/af_xdp.rst
> >> > +++ b/Documentation/networking/af_xdp.rst
> >> > @@ -453,6 +453,93 @@ XDP_OPTIONS getsockopt
> >> >  Gets options from an XDP socket. The only one supported so far is
> >> >  XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
> >> >
> >> > +Multi-Buffer Support
> >> > +--------------------
> >> > +
> >> > +With multi-buffer support, programs using AF_XDP sockets can receive
> >> > +and transmit packets consisting of multiple buffers both in copy and
> >> > +zero-copy mode. For example, a packet can consist of two
> >> > +frames/buffers, one with the header and the other one with the data,
> >> > +or a 9K Ethernet jumbo frame can be constructed by chaining together
> >> > +three 4K frames.
> >> > +
> >> > +Some definitions:
> >> > +
> >> > +* A packet consists of one or more frames
> >> > +
> >> > +* A descriptor in one of the AF_XDP rings always refers to a single
> >> > +  frame. In the case the packet consists of a single frame, the
> >> > +  descriptor refers to the whole packet.
> >> > +
> >> > +To enable multi-buffer support for an AF_XDP socket, use the new bind
> >> > +flag XDP_USE_SG. If this is not provided, all multi-buffer packets
> >> > +will be dropped just as before. Note that the XDP program loaded also
> >> > +needs to be in multi-buffer mode. This can be accomplished by using
> >> > +"xdp.frags" as the section name of the XDP program used.
> >> > +
> >> > +To represent a packet consisting of multiple frames, a new flag called
> >> > +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
> >> > +descriptors. If it is true (1) the packet continues with the next
> >> > +descriptor and if it is false (0) it means this is the last descriptor
> >> > +of the packet. Why the reverse logic of end-of-packet (eop) flag found
> >> > +in many NICs? Just to preserve compatibility with non-multi-buffer
> >> > +applications that have this bit set to false for all packets on Rx,
> >> > +and the apps set the options field to zero for Tx, as anything else
> >> > +will be treated as an invalid descriptor.
> >> > +
> >> > +These are the semantics for producing packets onto AF_XDP Tx ring
> >> > +consisting of multiple frames:
> >> > +
> >> > +* When an invalid descriptor is found, all the other
> >> > +  descriptors/frames of this packet are marked as invalid and not
> >> > +  completed. The next descriptor is treated as the start of a new
> >> > +  packet, even if this was not the intent (because we cannot guess
> >> > +  the intent). As before, if your program is producing invalid
> >> > +  descriptors you have a bug that must be fixed.
> >> > +
> >> > +* Zero length descriptors are treated as invalid descriptors.
> >> > +
> >> > +* For copy mode, the maximum supported number of frames in a packet is
> >> > +  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
> >> > +  descriptors accumulated so far are dropped and treated as
> >> > +  invalid. To produce an application that will work on any system
> >> > +  regardless of this config setting, limit the number of frags to 18,
> >> > +  as the minimum value of the config is 17.
> >> > +
> >> > +* For zero-copy mode, the limit is up to what the NIC HW
> >> > +  supports. Usually at least five on the NICs we have checked. We
> >> > +  consciously chose to not enforce a rigid limit (such as
> >> > +  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
> >> > +  resulted in copy actions under the hood to fit into what limit
> >> > +  the NIC supports. Kind of defeats the purpose of zero-copy mode.
> >>
> >> How is an application supposed to discover the actual limit for a given
> >> NIC/driver?
> >
> > Thanks for your comments Toke. I will add an example here of how to
> > discover this. Basically you can send a packet with N frags (N = 2 to
> > start with), check the error stats through the getsockopt. If no
> > invalid_tx_desc error, increase N with one and send this new packet.
> > If you get an error, then the max number of frags is N-1.
>
> Hmm, okay, that sounds pretty tedious :P

Indeed, if you had to do it manually ;-). I do not think this max is
important though; see the next answer.

> Also, it has side effects (you'll be putting frames on the wire while
> testing, right?), so I guess this is not something you'll do on every
> startup of your application? What are you expecting app developers to do
> here in practice? Run the test while developing and expect the value to
> stay constant for a given driver (does it?), or? Or will most people in
> practice only need a few frags to get up to 9k MTU?

I believe that the question a developer wants to answer is not "what
is the max number of frags supported?" but "are 3 or 5 frags
supported?". These correspond to the 4K and 2K chunks used to send a
9K MTU. My guess is
that any driver that reports that it supports XDP multibuffer will
support at least 5 frags too. (Just note that I have not checked
around if there are NICs that only support up to 3 frags which would
mean that 4K chunks would have to be used.) This value will not change
for a NIC and the question is if there are XDP MB enabled NIC drivers
that do not support 5 frags? If not, then 5 is the lower bound for the
max number of frags in zc mode, at least for the current set of XDP MB
enabled drivers.

If an application would like to test whether 3 or 5 frags (or any
other value) are supported by a NIC when the app is launched, then we
would need to try to send something, as you say. I wonder if there is
a way to transmit a frame that does not reach the wire? What happens
if I send a frame with all zeroes, MAC addresses and the like? Will it
reach the wire? Another option would be to just send something you
really would like to send; if it does not complete and/or you get an
error in the getsockopt() stats call, you know that this number of
frags is not supported and you need to try again with fewer frags or
fall back to copy mode.

I do think there will be applications that require more than 5 frags.
I just do not know what they are due to a lack of imagination on my
part. They could use the algorithm from above, or just try the number
of frags they need depending on the flexibility of the code. A
fallback to copy-mode is always possible when the supported number of
frags is not there in the NIC HW.

> >> > +* The ZC batch API guarantees that it will provide a batch of Tx
> >> > +  descriptors that ends with full packet at the end. If not, ZC
> >> > +  drivers would have to gather the full packet on their side. The
> >> > +  approach we picked makes ZC drivers' life much easier (at least on
> >> > +  Tx side).
> >>
> >> This seems like it implies some constraint on how an application can use
> >> the APIs, but it's not quite clear to me what those constraints are, nor
> >> what happens if an application does something different. This should
> >> probably be spelled out...
> >>
> >> > +On the Rx path in copy-mode, the xsk core copies the XDP data into
> >> > +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
> >> > +detailed before. Zero-copy mode works the same, though the data is not
> >> > +copied. When the application gets a descriptor with the XDP_PKT_CONTD
> >> > +flag set to one, it means that the packet consists of multiple buffers
> >> > +and it continues with the next buffer in the following
> >> > +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
> >> > +means that this is the last buffer of the packet. AF_XDP guarantees
> >> > +that only a complete packet (all frames in the packet) is sent to the
> >> > +application.
> >>
> >> In light of the comment on batch size below, I think it would be useful
> >> to spell out what this means exactly. IIUC correctly, it means that the
> >> kernel will check the ringbuffer before starting to copy the data, and
> >> if there are not enough descriptors available, it will drop the packet
> >> instead of doing a partial copy, right? And this is the case for both ZC
> >> and copy mode?
> >
> > I will make this paragraph and the previous one clearer. And yes, copy
> > mode and zc mode have the same behaviour.
>
> Alright, great!
>
> -Toke
>
Jakub Kicinski June 21, 2023, 8:34 p.m. UTC | #5
On Wed, 21 Jun 2023 16:15:32 +0200 Magnus Karlsson wrote:
> > Hmm, okay, that sounds pretty tedious :P  
> 
> Indeed if you had to do it manually ;-). Do not think this max is
> important though, see next answer.

Can't we add max segs to Lorenzo's XDP info?
include/uapi/linux/netdev.h
Magnus Karlsson June 22, 2023, 8:24 a.m. UTC | #6
On Wed, 21 Jun 2023 at 22:34, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 21 Jun 2023 16:15:32 +0200 Magnus Karlsson wrote:
> > > Hmm, okay, that sounds pretty tedious :P
> >
> > Indeed if you had to do it manually ;-). Do not think this max is
> > important though, see next answer.
>
> Can't we add max segs to Lorenzo's XDP info?
> include/uapi/linux/netdev.h

That should be straightforward. I am just reluctant to add a user
interface that might not be necessary.

Maciej, how about changing your patch #13 so that we do not add a flag
for zc_mb supported or not, but instead we add a flag that gives the
user the max number of frags supported in zc mode? A 1 returned would
mean that max 1 frag is supported, i.e. mb is not supported. Any
number >1 would mean that mb is supported in zc mode for this device
and the returned number is the max number of frags supported. This way
we would not have to add one more user interface solely for getting
the max number of frags supported. What do you think?
Toke Høiland-Jørgensen June 22, 2023, 10:56 a.m. UTC | #7
Magnus Karlsson <magnus.karlsson@gmail.com> writes:

> On Wed, 21 Jun 2023 at 22:34, Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Wed, 21 Jun 2023 16:15:32 +0200 Magnus Karlsson wrote:
>> > > Hmm, okay, that sounds pretty tedious :P
>> >
>> > Indeed if you had to do it manually ;-). Do not think this max is
>> > important though, see next answer.
>>
>> Can't we add max segs to Lorenzo's XDP info?
>> include/uapi/linux/netdev.h
>
> That should be straightforward. I am just reluctant to add a user
> interface that might not be necessary.

Yeah, that was why I was asking what the expectations were before
suggesting adding this to the feature bits :)

However, given that the answer seems to be "it varies"...

> Maciej, how about changing your patch #13 so that we do not add a flag
> for zc_mb supported or not, but instead we add a flag that gives the
> user the max number of frags supported in zc mode? A 1 returned would
> mean that max 1 frag is supported, i.e. mb is not supported. Any
> number >1 would mean that mb is supported in zc mode for this device
> and the returned number is the max number of frags supported. This way
> we would not have to add one more user interface solely for getting
> the max number of frags supported. What do you think?

...I think it's a good idea to add the field, and this sounds like a
reasonable way of dealing with it (although it may need a bit more
plumbing on the netlink side?)

-Toke
Jakub Kicinski June 28, 2023, 8:28 p.m. UTC | #8
On Wed, 28 Jun 2023 20:35:06 +0200 Maciej Fijalkowski wrote:
> Okay, here's what I came up with, PTAL, it's on top of the current set but
> that should not matter a lot, you'll get the idea of it. I think it's
> better to post a diff here and if you guys find it alright then I'll
> include this in v5.

LGTM!
Toke Høiland-Jørgensen June 28, 2023, 9:02 p.m. UTC | #9
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index a4270fafdf11..b24244f768e3 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -19,6 +19,8 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
>  		return -EMSGSIZE;
>  
>  	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> +	    nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> +			netdev->xdp_zc_max_segs) ||

Should this be omitted if the driver doesn't support zero-copy at all?

-Toke
Fijalkowski, Maciej June 29, 2023, 8:28 p.m. UTC | #10
On Wed, Jun 28, 2023 at 11:02:06PM +0200, Toke Høiland-Jørgensen wrote:
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index a4270fafdf11..b24244f768e3 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -19,6 +19,8 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> >  		return -EMSGSIZE;
> >  
> >  	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> > +	    nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> > +			netdev->xdp_zc_max_segs) ||
> 
> Should this be omitted if the driver doesn't support zero-copy at all?

This is now set independently when allocating the net_device struct, so it can
be read without issues. Furthermore this value should not be used to find
out if underlying driver supports ZC or not - let us keep using
xdp_features for that.

Does it make sense?

> 
> -Toke
> 
>
Toke Høiland-Jørgensen June 29, 2023, 8:57 p.m. UTC | #11
Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> On Wed, Jun 28, 2023 at 11:02:06PM +0200, Toke Høiland-Jørgensen wrote:
>> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>> > index a4270fafdf11..b24244f768e3 100644
>> > --- a/net/core/netdev-genl.c
>> > +++ b/net/core/netdev-genl.c
>> > @@ -19,6 +19,8 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
>> >  		return -EMSGSIZE;
>> >  
>> >  	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
>> > +	    nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
>> > +			netdev->xdp_zc_max_segs) ||
>> 
>> Should this be omitted if the driver doesn't support zero-copy at all?
>
> This is now set independently when allocating the net_device struct, so it can
> be read without issues. Furthermore this value should not be used to find
> out if underlying driver supports ZC or not - let us keep using
> xdp_features for that.
>
> Does it make sense?

Yes, I agree we shouldn't use this field for that. However, I am not
sure I trust all userspace applications to get that right, so I fear
some will end up looking at the field even when the flag is not set,
which will lead to confused users. So why not just omit the property
entirely when the flag is not set? :)

-Toke
Fijalkowski, Maciej June 30, 2023, 6 p.m. UTC | #12
On Thu, Jun 29, 2023 at 10:57:05PM +0200, Toke Høiland-Jørgensen wrote:
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
> 
> > On Wed, Jun 28, 2023 at 11:02:06PM +0200, Toke Høiland-Jørgensen wrote:
> >> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> >> > index a4270fafdf11..b24244f768e3 100644
> >> > --- a/net/core/netdev-genl.c
> >> > +++ b/net/core/netdev-genl.c
> >> > @@ -19,6 +19,8 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> >> >  		return -EMSGSIZE;
> >> >  
> >> >  	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> >> > +	    nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> >> > +			netdev->xdp_zc_max_segs) ||
> >> 
> >> Should this be omitted if the driver doesn't support zero-copy at all?
> >
> > This is now set independently when allocating the net_device struct, so it can
> > be read without issues. Furthermore this value should not be used to find
> > out if underlying driver supports ZC or not - let us keep using
> > xdp_features for that.
> >
> > Does it make sense?
> 
> Yes, I agree we shouldn't use this field for that. However, I am not
> sure I trust all userspace applications to get that right, so I fear
> some will end up looking at the field even when the flag is not set,
> which will lead to confused users. So why not just omit the property
> entirely when the flag is not set? :)

I think that if you read anything other than the default 1 from this
field and your driver does not even support ZC, then your driver is wrong.
It's like reporting something via xdp_features and not supporting it. You
only overwrite this within your driver *if* you support ZC multi-buffer.

OTOH were you referring to omitting putting the u32 to netlink response at
all?

> 
> -Toke
>
Toke Høiland-Jørgensen July 1, 2023, 1:51 p.m. UTC | #13
Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> On Thu, Jun 29, 2023 at 10:57:05PM +0200, Toke Høiland-Jørgensen wrote:
>> Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:
>> 
>> > On Wed, Jun 28, 2023 at 11:02:06PM +0200, Toke Høiland-Jørgensen wrote:
>> >> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>> >> > index a4270fafdf11..b24244f768e3 100644
>> >> > --- a/net/core/netdev-genl.c
>> >> > +++ b/net/core/netdev-genl.c
>> >> > @@ -19,6 +19,8 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
>> >> >  		return -EMSGSIZE;
>> >> >  
>> >> >  	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
>> >> > +	    nla_put_u32(rsp, NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
>> >> > +			netdev->xdp_zc_max_segs) ||
>> >> 
>> >> Should this be omitted if the driver doesn't support zero-copy at all?
>> >
> >> > This is now set independently when allocating the net_device struct, so it can
>> > be read without issues. Furthermore this value should not be used to find
>> > out if underlying driver supports ZC or not - let us keep using
>> > xdp_features for that.
>> >
>> > Does it make sense?
>> 
>> Yes, I agree we shouldn't use this field for that. However, I am not
>> sure I trust all userspace applications to get that right, so I fear
>> some will end up looking at the field even when the flag is not set,
>> which will lead to confused users. So why not just omit the property
>> entirely when the flag is not set? :)
>
> I think that if you read anything other than the default 1 from this
> field and your driver does not even support ZC, then your driver is wrong.
> It's like reporting something via xdp_features and not supporting it. You
> only overwrite this within your driver *if* you support ZC multi-buffer.
>
> OTOH were you referring to omitting putting the u32 to netlink response at
> all?

Yes, the latter. I have no objection to the internal field being set to
1 by default or anything, I just think we should omit the netlink
attribute when it doesn't have a meaningful value, to avoid confusion -
being able to do that is one of the nice properties of netlink, after all :)

-Toke
Patch

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 247c6c4127e9..2b583f58967b 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -453,6 +453,93 @@  XDP_OPTIONS getsockopt
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
 
+Multi-Buffer Support
+--------------------
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. If the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1), the packet continues with the next
+descriptor; if it is false (0), this is the last descriptor of the
+packet. Why the reverse logic of the end-of-packet (eop) flag found
+in many NICs? To preserve compatibility with non-multi-buffer
+applications, which have this bit set to false for all packets on Rx
+and set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is whatever the NIC HW
+  supports, usually at least five on the NICs we have checked. We
+  consciously chose not to enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into whatever limit
+  the NIC supports, which would defeat the purpose of zero-copy mode.
+
+* The ZC batch API guarantees that it will provide a batch of Tx
+  descriptors that ends with a full packet. If not, ZC
+  drivers would have to gather the full packet on their side. The
+  approach we picked makes ZC drivers' life much easier (at least on
+  Tx side).
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application.
+
+If the application reads a batch of descriptors using, for example, the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
 =====
 
@@ -532,6 +619,96 @@  like this:
 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
 
+Usage Multi-Buffer Rx
+=====================
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+=====================
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity). It ignores that the umem is finite in size and that we
+will eventually run out of packets to send, and assumes that each
+pkts[i].addr points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
 Sample application
 ==================