
[RFC,bpf-next,0/8] XDP_REDIRECT_XSK and Batched AF_XDP Rx

Message ID: 20211116073742.7941-1-ciara.loftus@intel.com

Message

Ciara Loftus Nov. 16, 2021, 7:37 a.m. UTC
The common case for AF_XDP sockets (xsks) is a single xsk created on a queue for sending and
receiving frames, as this is analogous to HW packet steering through RSS and other classification
methods in the NIC. AF_XDP uses the XDP redirect infrastructure to direct packets to the socket.
That infrastructure was designed for the much more complicated case of DEVMAP redirects, which
direct traffic to another netdev and thus potentially another driver. In the xsk redirect case,
skipping the unnecessary parts of this common code significantly improves performance and paves
the way for batching in the driver. This RFC proposes one such way to simplify the infrastructure,
which yields a 27% increase in throughput and a reduction of 24 cycles per packet [1]. The goal of
this RFC is to start a discussion on how best to simplify the single-socket datapath, with one
method provided as an example.

Current approach:
1. XSK pointer: an xsk is created and a handle to the xsk is stored in the XSKMAP.
2. XDP program: the bpf_redirect_map helper triggers the XSKMAP lookup, which stores the result
(handle to the xsk) and the map type (XSKMAP) in the per-CPU bpf_redirect_info struct. The
XDP_REDIRECT action is returned. (A minimal program exercising this path is sketched after this
list.)
3. XDP_REDIRECT handling called by the driver: the map type (XSKMAP) is read from the 
bpf_redirect_info which selects the xsk_map_redirect path. The xsk pointer is retrieved from the
bpf_redirect_info and the XDP descriptor is pushed to the xsk's Rx ring. The socket is added to a
list for flushing later.
4. xdp_do_flush: iterate through the lists of all maps that can be used for redirect (CPUMAP, 
DEVMAP and XSKMAP). When XSKMAP is flushed, go through all xsks that had any traffic redirected to 
them and bump the Rx ring head pointer(s).
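
For reference, steps 1 and 2 in a minimal XDP program look roughly like the sketch below. This is
essentially what the libbpf default program installs; the map size and names are illustrative, and
user space is assumed to insert the xsk fd at the index of its Rx queue:

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch of the current XSKMAP redirect path. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
} xsks_map SEC(".maps");

SEC("xdp")
int xsk_redirect_prog(struct xdp_md *ctx)
{
	/* Steps 1-2 above: look up the xsk bound to this Rx queue. The result
	 * and the map type are cached in the per-CPU bpf_redirect_info and
	 * XDP_REDIRECT is returned (XDP_PASS if no socket is bound). */
	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";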

For the end goal of submitting the descriptor to the Rx ring and bumping the head pointer of that
ring, only some of these steps are needed; the rest is overhead. The bpf_redirect_map
infrastructure is needed for all other redirect operations, but not when redirecting to a single
AF_XDP socket. Similarly, flushing the list for every map type in step 4 is unnecessary when only
one socket needs to be flushed.

Proposed approach:
1. XSK pointer: an xsk is created and a handle to the xsk is stored both in the XSKMAP and in the
netdev_rx_queue struct.
2. XDP program: new bpf_redirect_xsk helper returns XDP_REDIRECT_XSK.
3. XDP_REDIRECT_XSK handling called by the driver: the xsk pointer is retrieved from the 
netdev_rx_queue struct and the XDP descriptor is pushed to the xsk's Rx ring.
4. xsk_flush: fetch the handle from the netdev_rx_queue and flush the xsk.

This fast path is taken on XDP_REDIRECT_XSK only if:
  (i) the AF_XDP socket has a SW Rx ring configured, and
 (ii) exactly one xsk is attached to the queue.
If either condition is not met, the behavior falls back to that of the original approach,
xdp_redirect_map. This is handled under the hood in the new bpf_redirect_xsk helper, so the user
does not need to be aware of these conditions. A sketch of how the new helper could be used from
an XDP program is shown below.
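
The program below is a hedged sketch only: bpf_redirect_xsk() exists only in this series, and the
(map, flags) signature used here, mirroring bpf_redirect_map(), is an assumption that may differ
from what patch 2 actually implements. It reuses the xsks_map definition from the earlier sketch.

/* Sketch only: bpf_redirect_xsk() is introduced by this series and is not in
 * mainline; its (map, flags) signature here is assumed. */
SEC("xdp")
int xsk_redirect_xsk_prog(struct xdp_md *ctx)
{
	/* Returns XDP_REDIRECT_XSK when exactly one xsk with an Rx ring is
	 * bound to this queue; otherwise falls back internally to the regular
	 * XSKMAP lookup, behaving like bpf_redirect_map(). */
	return bpf_redirect_xsk(&xsks_map, XDP_PASS);
}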

Batching:
With this new approach it becomes possible to optimize the driver: a whole batch of descriptors
can be submitted to the Rx ring in step 3 by simply verifying that the action returned by the
program run for every packet in the batch equals XDP_REDIRECT_XSK. With this action we know the
socket to redirect to is the same for each packet in the batch. This is not possible with
XDP_REDIRECT because the xsk pointer is stored in the bpf_redirect_info and is not guaranteed to
be the same for every packet in a batch. A conceptual sketch of such a batched Rx path follows.
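
The sketch below is conceptual only and is not the actual i40e implementation from patches 5-7;
run_xdp(), xsk_rcv_batch(), handle_verdict() and the surrounding structs are placeholder names:

/* Conceptual sketch of a batched Rx path enabled by XDP_REDIRECT_XSK.
 * All names below are placeholders, not real driver/xsk symbols. */
#define RX_BATCH 64

static void rx_process_batch(struct rx_ring *ring, struct xdp_buff **bufs, int n)
{
	int verdicts[RX_BATCH];	/* caller guarantees n <= RX_BATCH */
	bool all_xsk = true;
	int i;

	/* Run the XDP program once per frame and remember the verdicts. */
	for (i = 0; i < n; i++) {
		verdicts[i] = run_xdp(ring, bufs[i]);
		if (verdicts[i] != XDP_REDIRECT_XSK)
			all_xsk = false;
	}

	if (all_xsk) {
		/* Every frame is destined for the single xsk bound to this
		 * queue, so all descriptors can be pushed to its Rx ring in
		 * one call (a batched xsk_rcv, as in patch 5). */
		xsk_rcv_batch(ring->xsk, bufs, n);
		return;
	}

	/* Mixed verdicts: fall back to per-frame handling. */
	for (i = 0; i < n; i++)
		handle_verdict(ring, bufs[i], verdicts[i]);
}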

[1] Performance:
The benchmarks were performed on a VM running on a 2.4GHz Ice Lake host with an i40e device passed
through. The xdpsock app was run on a single core with busy polling and configured in 'rxonly' mode:
./xdpsock -i <iface> -r -B
The improvement in throughput when using the new bpf helper and XDP action was measured at ~13% for
scalar processing, with a reduction of ~13 cycles per packet. A further ~14% improvement in
throughput and a further reduction of ~11 cycles per packet were measured when the batched i40e
driver path was used, for a total improvement of ~27% in throughput and a reduction of ~24 cycles
per packet.

Other approaches considered:
Two other approaches were considered, the advantage of both being that neither involves introducing
a new XDP action. The first was to create a new map type, BPF_MAP_TYPE_XSKMAP_DIRECT. When the
XDP_REDIRECT action was returned, this map type could be checked and used as an indicator to skip
the map lookup and use the netdev_rx_queue xsk instead. The second approach was similar and
involved a new bpf_redirect_info flag used in the same fashion.
While both approaches yielded a performance improvement, it was only about half of that measured
for the approach outlined in this RFC. It seems that accessing bpf_redirect_info is too expensive.
Centralised processing of XDP actions was also investigated. This would involve porting all drivers
to a common interface for handling XDP actions, which would greatly simplify the work involved in
adding support for new XDP actions such as XDP_REDIRECT_XSK. However, at this point it was deemed
more complex than adding support for the new action to every driver. Should this series be
considered worth pursuing as a proper patch set, the intention would be to update each driver
individually.

Thank you to Magnus Karlsson and Maciej Fijalkowski for several suggestions and insight provided.

TODO:
* Add selftest(s)
* Add support for all copy and zero copy drivers
* Libxdp support

The series applies on commit e5043894b21f ("bpftool: Use libbpf_get_error() to check error")

Thanks,
Ciara

Ciara Loftus (8):
  xsk: add struct xdp_sock to netdev_rx_queue
  bpf: add bpf_redirect_xsk helper and XDP_REDIRECT_XSK action
  xsk: handle XDP_REDIRECT_XSK and expose xsk_rcv/flush
  i40e: handle the XDP_REDIRECT_XSK action
  xsk: implement a batched version of xsk_rcv
  i40e: isolate descriptor processing in separate function
  i40e: introduce batched XDP rx descriptor processing
  libbpf: use bpf_redirect_xsk in the default program

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  13 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h    |   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 285 +++++++++++++++---
 include/linux/netdevice.h                     |   2 +
 include/net/xdp_sock_drv.h                    |  49 +++
 include/net/xsk_buff_pool.h                   |  22 ++
 include/uapi/linux/bpf.h                      |  13 +
 kernel/bpf/verifier.c                         |   7 +-
 net/core/dev.c                                |  14 +
 net/core/filter.c                             |  26 ++
 net/xdp/xsk.c                                 |  69 ++++-
 net/xdp/xsk_queue.h                           |  31 ++
 tools/include/uapi/linux/bpf.h                |  13 +
 tools/lib/bpf/xsk.c                           |  50 ++-
 14 files changed, 551 insertions(+), 44 deletions(-)

Comments

Jesper Dangaard Brouer Nov. 16, 2021, 9:43 a.m. UTC | #1
On 16/11/2021 08.37, Ciara Loftus wrote:
> [... cover letter trimmed ...]
> 
> [1] Performance:
> The benchmarks were performed on VM running a 2.4GHz Ice Lake host with an i40e device passed
> through. The xdpsock app was run on a single core with busy polling and configured in 'rxonly' mode.
> ./xdpsock -i <iface> -r -B
> The improvement in throughput when using the new bpf helper and XDP action was measured at ~13% for
> scalar processing, with reduction in cycles per packet of ~13. A further ~14% improvement in
> throughput and reduction of ~11 cycles per packet was measured when the batched i40e driver path
> was used, for a total improvement of ~27% in throughput and reduction of ~24 cycles per packet.
> 
> Other approaches considered:
> Two other approaches were considered. The advantage of both being that neither involved introducing
> a new XDP action. The first alternative approach considered was to create a new map type
> BPF_MAP_TYPE_XSKMAP_DIRECT. When the XDP_REDIRECT action was returned, this map type could be
> checked and used as an indicator to skip the map lookup and use the netdev_rx_queue xsk instead.
> The second approach considered was similar and involved using a new bpf_redirect_info flag which
> could be used in a similar fashion.
> While both approaches yielded a performance improvement they were measured at about half of what
> was measured for the approach outlined in this RFC. It seems using bpf_redirect_info is too
> expensive.

I think it was Bjørn who discovered that accessing the per-CPU
bpf_redirect_info struct has an overhead of approx 2 ns (times 2.4GHz,
~4.8 cycles). Your reduction in cycles per packet was ~13, of which
~4.8 is a large share.

The code accesses this_cpu_ptr(&bpf_redirect_info) two times:
once in the BPF redirect helper call and a second time in xdp_do_redirect.
(Hint: xdp_redirect_map ends up calling __bpf_xdp_redirect_map.)

Thus, it seems (as you say) the bpf_redirect_info approach is too
expensive. Maybe we should look at storing bpf_redirect_info in a place
that doesn't require the this_cpu_ptr() lookup... or cache the lookup
per NAPI cycle.
Have you tried this?


> Also, centralised processing of XDP actions was investigated. This would involve porting all drivers
> to a common interface for handling XDP actions which would greatly simplify the work involved in
> adding support for new XDP actions such as XDP_REDIRECT_XSK. However it was deemed at this point to
> be more complex than adding support for the new action to every driver. Should this series be
> considered worth pursuing for a proper patch set, the intention would be to update each driver
> individually.

I'm fine with adding a new helper, but I don't like introducing a new 
XDP_REDIRECT_XSK action, which requires updating ALL the drivers.

With the XDP_REDIRECT infra we believed we didn't need to add more
XDP-action code to drivers, as we multiplex/add new features by
extending the bpf_redirect_info.
In this extreme performance case, it seems the this_cpu_ptr "lookup" of
bpf_redirect_info is the performance issue itself.

Could you experiment with different approaches that modify
xdp_do_redirect() to detect whether the new bpf_redirect_xsk helper was
called, prior to the this_cpu_ptr() call?
(Thus avoiding the introduction of a new XDP action.)

Ciara Loftus Nov. 16, 2021, 4:29 p.m. UTC | #2
> On 16/11/2021 08.37, Ciara Loftus wrote:
> > [... cover letter trimmed ...]
>
> I think it was Bjørn who discovered that accessing the per-CPU
> bpf_redirect_info struct has an overhead of approx 2 ns (times 2.4GHz,
> ~4.8 cycles). Your reduction in cycles per packet was ~13, of which
> ~4.8 is a large share.
>
> The code accesses this_cpu_ptr(&bpf_redirect_info) two times:
> once in the BPF redirect helper call and a second time in xdp_do_redirect.
> (Hint: xdp_redirect_map ends up calling __bpf_xdp_redirect_map.)
>
> Thus, it seems (as you say) the bpf_redirect_info approach is too
> expensive. Maybe we should look at storing bpf_redirect_info in a place
> that doesn't require the this_cpu_ptr() lookup... or cache the lookup
> per NAPI cycle.
> Have you tried this?
>
> > [... other approaches trimmed ...]
>
> I'm fine with adding a new helper, but I don't like introducing a new
> XDP_REDIRECT_XSK action, which requires updating ALL the drivers.
>
> With the XDP_REDIRECT infra we believed we didn't need to add more
> XDP-action code to drivers, as we multiplex/add new features by
> extending the bpf_redirect_info.
> In this extreme performance case, it seems the this_cpu_ptr "lookup" of
> bpf_redirect_info is the performance issue itself.
>
> Could you experiment with different approaches that modify
> xdp_do_redirect() to detect whether the new bpf_redirect_xsk helper was
> called, prior to the this_cpu_ptr() call?
> (Thus avoiding the introduction of a new XDP action.)

Thanks for your feedback Jesper.
I understand the hesitation around adding a new action. If we can achieve the same improvement
without introducing a new action I would be very happy!
Without the new action we'll need a new way to indicate that the bpf_redirect_xsk helper was
called. Maybe another new field in the netdev alongside the xsk_refcnt. Or else extend
bpf_redirect_info - if we can find a new home for it, since it is too costly to access where it is.
Thanks for your suggestions. I'll experiment as you suggested and report back.

Thanks,
Ciara

Toke Høiland-Jørgensen Nov. 17, 2021, 2:24 p.m. UTC | #3
"Loftus, Ciara" <ciara.loftus@intel.com> writes:

>> I'm fine with adding a new helper, but I don't like introducing a new
>> XDP_REDIRECT_XSK action, which requires updating ALL the drivers.
>> 
>> With XDP_REDIRECT infra we beleived we didn't need to add more
>> XDP-action code to drivers, as we multiplex/add new features by
>> extending the bpf_redirect_info.
>> In this extreme performance case, it seems the this_cpu_ptr "lookup" of
>> bpf_redirect_info is the performance issue itself.
>> 
>> Could you experiement with different approaches that modify
>> xdp_do_redirect() to handle if new helper bpf_redirect_xsk was called,
>> prior to this_cpu_ptr() call.
>> (Thus, avoiding to introduce a new XDP-action).
>
> Thanks for your feedback Jesper.
> I understand the hesitation of adding a new action. If we can achieve the same improvement without
> introducing a new action I would be very happy!
> Without new the action we'll need a new way to indicate that the bpf_redirect_xsk helper was
> called. Maybe another new field in the netdev alongside the xsk_refcnt. Or else extend
> bpf_redirect_info - if we find a new home for it that it's too costly to access.
> Thanks for your suggestions. I'll experiment as you suggested and
> report back.

I'll add a +1 to the "let's try to solve this without a new return code" :)

Also, I don't think we need a new helper either; the bpf_redirect()
helper takes a flags argument, so we could just use ifindex=0,
flags=DEV_XSK or something like that.
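
For illustration, such a program could be as simple as the sketch below (BPF_F_REDIRECT_XSK is a
made-up flag name and value, not an existing API):

/* Illustration only: reuse the existing bpf_redirect() helper with ifindex 0
 * and a new flag instead of a new helper/action. BPF_F_REDIRECT_XSK is a
 * made-up name/value; no such flag exists upstream. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define BPF_F_REDIRECT_XSK	(1ULL << 2)	/* placeholder */

SEC("xdp")
int xsk_flag_redirect(struct xdp_md *ctx)
{
	return bpf_redirect(0 /* ifindex unused */, BPF_F_REDIRECT_XSK);
}

char _license[] SEC("license") = "GPL";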

Also, I think the batching in the driver idea can be generalised: we
just need to generalise the idea of "are all these packets going to the
same place" and have a batched version of xdp_do_redirect(), no? The
other map types do batching internally already, though, so I'm wondering
why batching in the driver helps XSK?

-Toke
Alexei Starovoitov Nov. 18, 2021, 2:54 a.m. UTC | #4
On Tue, Nov 16, 2021 at 07:37:34AM +0000, Ciara Loftus wrote:
> [... cover letter trimmed ...]
> 
> This fast path is triggered on XDP_REDIRECT_XSK if:
>   (i) AF_XDP socket SW Rx ring configured
>  (ii) Exactly one xsk attached to the queue
> If any of these conditions are not met, fall back to the same behavior as the original approach: 
> xdp_redirect_map. This is handled under-the-hood in the new bpf_xdp_redirect_xsk helper so the user
> does not need to be aware of these conditions.

I don't think this micro-optimization for a specific use case warrants the addition of new APIs.
Please optimize it without adding new actions and new helpers.
Ciara Loftus Nov. 19, 2021, 3:48 p.m. UTC | #5
> "Loftus, Ciara" <ciara.loftus@intel.com> writes:
> 
> >> I'm fine with adding a new helper, but I don't like introducing a new
> >> XDP_REDIRECT_XSK action, which requires updating ALL the drivers.
> >>
> >> With XDP_REDIRECT infra we beleived we didn't need to add more
> >> XDP-action code to drivers, as we multiplex/add new features by
> >> extending the bpf_redirect_info.
> >> In this extreme performance case, it seems the this_cpu_ptr "lookup" of
> >> bpf_redirect_info is the performance issue itself.
> >>
> >> Could you experiement with different approaches that modify
> >> xdp_do_redirect() to handle if new helper bpf_redirect_xsk was called,
> >> prior to this_cpu_ptr() call.
> >> (Thus, avoiding to introduce a new XDP-action).
> >
> > Thanks for your feedback Jesper.
> > I understand the hesitation around adding a new action. If we can achieve
> > the same improvement without introducing a new action I would be very
> > happy!
> > Without the new action we'll need a new way to indicate that the
> > bpf_redirect_xsk helper was called. Maybe another new field in the netdev
> > alongside the xsk_refcnt. Or else extend bpf_redirect_info - if we can
> > find a new home for it, since it is too costly to access where it is.
> > Thanks for your suggestions. I'll experiment as you suggested and
> > report back.
> 
> I'll add a +1 to the "let's try to solve this without a new return code" :)
> 
> Also, I don't think we need a new helper either; the bpf_redirect()
> helper takes a flags argument, so we could just use ifindex=0,
> flags=DEV_XSK or something like that.

The advantage of a new helper is that we can access the netdev 
struct from it and check if there's a valid xsk stored in it, before
returning XDP_REDIRECT without the xskmap lookup. However,
I think your suggestion could work too. We would just
have to delay the check until xdp_do_redirect. At this point
though, if there isn't a valid xsk we might have to drop the packet
instead of falling back to the xskmap.

> 
> Also, I think the batching in the driver idea can be generalised: we
> just need to generalise the idea of "are all these packets going to the
> same place" and have a batched version of xdp_do_redirect(), no? The
> other map types do batching internally already, though, so I'm wondering
> why batching in the driver helps XSK?
> 

With the current infrastructure, figuring out if "all the packets are going
to the same place" looks like an expensive operation which could undo
the benefits of the batching that would come after it. We would need
to run the program N=batch_size times, store the actions and
bpf_redirect_info for each run and perform a series of compares. The new
action really helped here because it could easily indicate whether all the
packets in a batch were going to the same place. But I understand it's
not an option. Maybe if we can mitigate the cost of accessing the
bpf_redirect_info as Jesper suggested, we can use a flag in it to signal
what the new action was signalling.
I'm not familiar with how the other map types handle batching, so I will
look into that.

Appreciate your feedback. I have a few avenues to explore.

Ciara

> -Toke
Toke Høiland-Jørgensen Nov. 22, 2021, 2:38 p.m. UTC | #6
"Loftus, Ciara" <ciara.loftus@intel.com> writes:

>> "Loftus, Ciara" <ciara.loftus@intel.com> writes:
>> 
>> >> I'm fine with adding a new helper, but I don't like introducing a new
>> >> XDP_REDIRECT_XSK action, which requires updating ALL the drivers.
>> >>
>> >> With XDP_REDIRECT infra we beleived we didn't need to add more
>> >> XDP-action code to drivers, as we multiplex/add new features by
>> >> extending the bpf_redirect_info.
>> >> In this extreme performance case, it seems the this_cpu_ptr "lookup" of
>> >> bpf_redirect_info is the performance issue itself.
>> >>
>> >> Could you experiement with different approaches that modify
>> >> xdp_do_redirect() to handle if new helper bpf_redirect_xsk was called,
>> >> prior to this_cpu_ptr() call.
>> >> (Thus, avoiding to introduce a new XDP-action).
>> >
>> > Thanks for your feedback Jesper.
>> > I understand the hesitation around adding a new action. If we can
>> > achieve the same improvement without introducing a new action I would
>> > be very happy!
>> > Without the new action we'll need a new way to indicate that the
>> > bpf_redirect_xsk helper was called. Maybe another new field in the
>> > netdev alongside the xsk_refcnt. Or else extend bpf_redirect_info - if
>> > we can find a new home for it, since it is too costly to access where
>> > it is.
>> > Thanks for your suggestions. I'll experiment as you suggested and
>> > report back.
>> 
>> I'll add a +1 to the "let's try to solve this without a new return code" :)
>> 
>> Also, I don't think we need a new helper either; the bpf_redirect()
>> helper takes a flags argument, so we could just use ifindex=0,
>> flags=DEV_XSK or something like that.
>
> The advantage of a new helper is that we can access the netdev 
> struct from it and check if there's a valid xsk stored in it, before
> returning XDP_REDIRECT without the xskmap lookup. However,
> I think your suggestion could work too. We would just
> have to delay the check until xdp_do_redirect. At this point
> though, if there isn't a valid xsk we might have to drop the packet
> instead of falling back to the xskmap.

I think it's OK to require the user to make sure that there is such a
socket loaded before using that flag...

>> Also, I think the batching in the driver idea can be generalised: we
>> just need to generalise the idea of "are all these packets going to the
>> same place" and have a batched version of xdp_do_redirect(), no? The
>> other map types do batching internally already, though, so I'm wondering
>> why batching in the driver helps XSK?
>> 
>
> With the current infrastructure figuring out if "all the packets are going
> to the same place" looks like an expensive operation which could undo
> the benefits of the batching that would come after it. We would need
> to run the program N=batch_size times, store the actions and
> bpf_redirect_info for each run and perform a series of compares. The new
> action really helped here because it could easily indicate if all the
> packets in a batch were going to the same place. But I understand it's
> not an option. Maybe if we can mitigate the cost of accessing the
> bpf_redirect_info as Jesper suggested, we can use a flag in it to signal
> what the new action was signalling.

Yes, it would probably require comparing the contents of the whole
bpf_redirect_info, at least as it is now; but assuming we can find a way
to mitigate the cost of accessing that structure, this may not be so
bad, and I believe there is some potential for compressing the state
further so we can get down to a single, or maybe two, 64-bit compares...

-Toke