
[bpf-next,0/3] XDP bonding support

Message ID 20210609135537.1460244-1-joamaki@gmail.com (mailing list archive)

Message

Jussi Maki June 9, 2021, 1:55 p.m. UTC
This patchset introduces XDP support to the bonding driver.

Patch 1 contains the implementation, including support for
the recently introduced EXCLUDE_INGRESS flag. Patch 2 contains a
performance fix for the round-robin mode, switching rr_tx_counter
to be per-CPU. Patch 3 contains the test suite for the implementation,
using a pair of veth devices.

vmtest.sh is modified to enable the bonding module and to install
kernel modules into the rootfs. The config change should probably be
done in the libbpf repository. Andrii: what would be the proper way to
do this?

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses (see patch 2/3).
The statistics were collected using "sar -n dev -u 1 10".
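
For illustration, here is a minimal kernel-style sketch of the per-CPU
counter idea from patch 2/3 (the struct and function names are
illustrative only, not the actual patch):

  #include <linux/percpu.h>
  #include <linux/types.h>

  /* Illustrative only: a single shared counter bounces one cacheline
   * between all CPUs on every transmit, while a per-CPU counter lets
   * each CPU advance its own sequence, keeping the round-robin spread
   * without cross-CPU contention. The counter would be allocated with
   * alloc_percpu() when the bond device is set up. */
  struct rr_state {
          u32 __percpu *rr_tx_counter;    /* was a single shared counter */
  };

  static u32 rr_next_slave_id(struct rr_state *s)
  {
          return this_cpu_inc_return(*s->rr_tx_counter);
  }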

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

---

Jussi Maki (3):
  net: bonding: Add XDP support to the bonding driver
  net: bonding: Use per-cpu rr_tx_counter
  selftests/bpf: Add tests for XDP bonding

 drivers/net/bonding/bond_main.c               | 459 +++++++++++++++---
 include/linux/filter.h                        |  13 +-
 include/linux/netdevice.h                     |   5 +
 include/net/bonding.h                         |   3 +-
 kernel/bpf/devmap.c                           |  34 +-
 net/core/filter.c                             |  37 +-
 .../selftests/bpf/prog_tests/xdp_bonding.c    | 342 +++++++++++++
 tools/testing/selftests/bpf/vmtest.sh         |  30 +-
 8 files changed, 843 insertions(+), 80 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_bonding.c

Comments

Andrii Nakryiko June 10, 2021, 5:24 p.m. UTC | #1
On Wed, Jun 9, 2021 at 6:55 AM Jussi Maki <joamaki@gmail.com> wrote:
>
> This patchset introduces XDP support to the bonding driver.
>
> Patch 1 contains the implementation, including support for
> the recently introduced EXCLUDE_INGRESS. Patch 2 contains a
> performance fix to the roundrobin mode which switches rr_tx_counter
> to be per-cpu. Patch 3 contains the test suite for the implementation
> using a pair of veth devices.
>
> The vmtest.sh is modified to enable the bonding module and install
> modules. The config change should probably be done in the libbpf
> repository. Andrii: How would you like this done properly?

I think vmtest.sh and the CI setup don't support modules (not easily at
least). Can we just compile that driver in? Then you can submit a PR
against the libbpf GitHub repo to adjust the config. We also have a kernel
CI repo where we'll need to make this change.

>
> The motivation for this change is to enable use of bonding (and
> 802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
> XDP and also to transparently support bond devices for projects that
> use XDP given most modern NICs have dual port adapters.  An alternative
> to this approach would be to implement 802.3ad in user-space and
> implement the bonding load-balancing in the XDP program itself, but
> is rather a cumbersome endeavor in terms of slave device management
> (e.g. by watching netlink) and requires separate programs for native
> vs bond cases for the orchestrator. A native in-kernel implementation
> overcomes these issues and provides more flexibility.
>
> Below are benchmark results done on two machines with 100Gbit
> Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
> 16-core 3950X on receiving machine. 64 byte packets were sent with
> pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
> ice driver, so the tests were performed with iommu=off and patch [2]
> applied. Additionally the bonding round robin algorithm was modified
> to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
> of cache misses were caused by the shared rr_tx_counter (see patch
> 2/3). The statistics were collected using "sar -n dev -u 1 10".
>
>  -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
>  without patch (1 dev):
>    XDP_DROP:              3.15%      48.6Mpps
>    XDP_TX:                3.12%      18.3Mpps     18.3Mpps
>    XDP_DROP (RSS):        9.47%      116.5Mpps
>    XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
>  -----------------------
>  with patch, bond (1 dev):
>    XDP_DROP:              3.14%      46.7Mpps
>    XDP_TX:                3.15%      13.9Mpps     13.9Mpps
>    XDP_DROP (RSS):        10.33%     117.2Mpps
>    XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
>  -----------------------
>  with patch, bond (2 devs):
>    XDP_DROP:              6.27%      92.7Mpps
>    XDP_TX:                6.26%      17.6Mpps     17.5Mpps
>    XDP_DROP (RSS):       11.38%      117.2Mpps
>    XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
>  --------------------------------------------------------------
>
> RSS: Receive Side Scaling, e.g. the packets were sent to a range of
> destination IPs.
>
> [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
> [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
> [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
>
> ---
>
> Jussi Maki (3):
>   net: bonding: Add XDP support to the bonding driver
>   net: bonding: Use per-cpu rr_tx_counter
>   selftests/bpf: Add tests for XDP bonding
>
>  drivers/net/bonding/bond_main.c               | 459 +++++++++++++++---
>  include/linux/filter.h                        |  13 +-
>  include/linux/netdevice.h                     |   5 +
>  include/net/bonding.h                         |   3 +-
>  kernel/bpf/devmap.c                           |  34 +-
>  net/core/filter.c                             |  37 +-
>  .../selftests/bpf/prog_tests/xdp_bonding.c    | 342 +++++++++++++
>  tools/testing/selftests/bpf/vmtest.sh         |  30 +-
>  8 files changed, 843 insertions(+), 80 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_bonding.c
>
> --
> 2.30.2
>
Jussi Maki June 14, 2021, 12:25 p.m. UTC | #2
On Thu, Jun 10, 2021 at 7:24 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 9, 2021 at 6:55 AM Jussi Maki <joamaki@gmail.com> wrote:
> >
> > This patchset introduces XDP support to the bonding driver.
> >
> > Patch 1 contains the implementation, including support for
> > the recently introduced EXCLUDE_INGRESS. Patch 2 contains a
> > performance fix to the roundrobin mode which switches rr_tx_counter
> > to be per-cpu. Patch 3 contains the test suite for the implementation
> > using a pair of veth devices.
> >
> > The vmtest.sh is modified to enable the bonding module and install
> > modules. The config change should probably be done in the libbpf
> > repository. Andrii: How would you like this done properly?
>
> I think vmtest.sh and CI setup doesn't support modules (not easily at
> least). Can we just compile that driver in? Then you can submit a PR
> against libbpf Github repo to adjust the config. We have also kernel
> CI repo where we'll need to make this change.

Unfortunately the mode and xmit_hash_policy options of the bonding driver
are module params, so it'll need to be a module so that the different modes
can be tested. I already modified vmtest.sh [1] to "make
modules_install" into the rootfs and enable the bonding module via
scripts/config, but a cleaner approach would probably be to, as you
suggested, update latest.config in the libbpf repo and get the
"modules_install" change into vmtest.sh separately (if you're happy
with this approach). What do you think?

[1] https://lore.kernel.org/netdev/20210609135537.1460244-1-joamaki@gmail.com/T/#maaf15ecd6b7c3af764558589118a3c6213e0af81
Jay Vosburgh June 14, 2021, 3:37 p.m. UTC | #3
Jussi Maki <joamaki@gmail.com> wrote:

>On Thu, Jun 10, 2021 at 7:24 PM Andrii Nakryiko
><andrii.nakryiko@gmail.com> wrote:
>>
>> On Wed, Jun 9, 2021 at 6:55 AM Jussi Maki <joamaki@gmail.com> wrote:
>> >
>> > This patchset introduces XDP support to the bonding driver.
>> >
>> > Patch 1 contains the implementation, including support for
>> > the recently introduced EXCLUDE_INGRESS. Patch 2 contains a
>> > performance fix to the roundrobin mode which switches rr_tx_counter
>> > to be per-cpu. Patch 3 contains the test suite for the implementation
>> > using a pair of veth devices.
>> >
>> > The vmtest.sh is modified to enable the bonding module and install
>> > modules. The config change should probably be done in the libbpf
>> > repository. Andrii: How would you like this done properly?
>>
>> I think vmtest.sh and CI setup doesn't support modules (not easily at
>> least). Can we just compile that driver in? Then you can submit a PR
>> against libbpf Github repo to adjust the config. We have also kernel
>> CI repo where we'll need to make this change.
>
>Unfortunately the mode and xmit_policy options of the bonding driver
>are module params, so it'll need to be a module so the different modes
>can be tested. I already modified vmtest.sh [1] to "make
>module_install" into the rootfs and enable the bonding module via
>scripts/config, but a cleaner approach would probably be to, as you
>suggested, update latest.config in libbpf repo and probably get the
>"modules_install" change into vmtest.sh separately (if you're happy
>with this approach). What do you think?

	The bonding mode and xmit_hash_policy (and any other option) can
be changed via "ip link"; no module parameter needed, e.g.,

ip link set dev bond0 type bond xmit_hash_policy layer2

	-J

>[1] https://lore.kernel.org/netdev/20210609135537.1460244-1-joamaki@gmail.com/T/#maaf15ecd6b7c3af764558589118a3c6213e0af81

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Andrii Nakryiko June 15, 2021, 5:34 a.m. UTC | #4
On Mon, Jun 14, 2021 at 5:25 AM Jussi Maki <joamaki@gmail.com> wrote:
>
> On Thu, Jun 10, 2021 at 7:24 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jun 9, 2021 at 6:55 AM Jussi Maki <joamaki@gmail.com> wrote:
> > >
> > > This patchset introduces XDP support to the bonding driver.
> > >
> > > Patch 1 contains the implementation, including support for
> > > the recently introduced EXCLUDE_INGRESS. Patch 2 contains a
> > > performance fix to the roundrobin mode which switches rr_tx_counter
> > > to be per-cpu. Patch 3 contains the test suite for the implementation
> > > using a pair of veth devices.
> > >
> > > The vmtest.sh is modified to enable the bonding module and install
> > > modules. The config change should probably be done in the libbpf
> > > repository. Andrii: How would you like this done properly?
> >
> > I think vmtest.sh and CI setup doesn't support modules (not easily at
> > least). Can we just compile that driver in? Then you can submit a PR
> > against libbpf Github repo to adjust the config. We have also kernel
> > CI repo where we'll need to make this change.
>
> Unfortunately the mode and xmit_policy options of the bonding driver
> are module params, so it'll need to be a module so the different modes
> can be tested. I already modified vmtest.sh [1] to "make
> module_install" into the rootfs and enable the bonding module via
> scripts/config, but a cleaner approach would probably be to, as you
> suggested, update latest.config in libbpf repo and probably get the
> "modules_install" change into vmtest.sh separately (if you're happy
> with this approach). What do you think?

If we can make modules work in vmtest.sh then that's great, regardless
of whether you still need it or not. It's not supported right now because
no one has done the work to support modules, not because we explicitly
didn't want modules in CI.

>
> [1] https://lore.kernel.org/netdev/20210609135537.1460244-1-joamaki@gmail.com/T/#maaf15ecd6b7c3af764558589118a3c6213e0af81
Jussi Maki June 24, 2021, 9:18 a.m. UTC | #5
From: Jussi Maki <joamaki@gmail.com>

This patchset introduces XDP support to the bonding driver.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses. A fix for this
has already been merged into net-next. The statistics were collected
using "sar -n dev -u 1 10".

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

Patch 1 prepares bond_xmit_hash for hashing xdp_buff's.
Patch 2 adds hooks to implement redirection after bpf prog run.
Patch 3 implements the hooks in the bonding driver.
Patch 4 modifies devmap to properly handle EXCLUDE_INGRESS with a slave device.
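
For context on patch 4, here is a minimal BPF-C sketch (not part of this
series) of an XDP program that broadcasts frames through a devmap with
EXCLUDE_INGRESS; with a bond slave as the ingress device, the devmap
change makes the exclusion also cover the bond master:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Devmap holding the egress interfaces to broadcast to. */
  struct {
          __uint(type, BPF_MAP_TYPE_DEVMAP);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
          __uint(max_entries, 8);
  } tx_ports SEC(".maps");

  SEC("xdp")
  int xdp_broadcast(struct xdp_md *ctx)
  {
          /* Send to every device in the map except the one the frame
           * arrived on; the key is ignored when BPF_F_BROADCAST is set. */
          return bpf_redirect_map(&tx_ports, 0,
                                  BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
  }

  char _license[] SEC("license") = "GPL";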

v1->v2:
- Split up into smaller easier to review patches and address cosmetic 
  review comments.
- Drop the INDIRECT_CALL optimization as it showed little improvement in tests.
- Drop the rr_tx_counter patch as that has already been merged into net-next.
- Separate the test suite into another patch set. This will follow later once a
  patch set from Magnus Karlsson is merged and provides test utilities that can
  be reused for XDP bonding tests. v2 contains no major functional changes and
  was tested with the test suite included in v1.
  (https://lore.kernel.org/bpf/202106221509.kwNvAAZg-lkp@intel.com/T/#m464146d47299125d5868a08affd6d6ce526dfad1)

---

Jussi Maki (4):
  net: bonding: Refactor bond_xmit_hash for use with xdp_buff
  net: core: Add support for XDP redirection to slave device
  net: bonding: Add XDP support to the bonding driver
  devmap: Exclude XDP broadcast to master device

 drivers/net/bonding/bond_main.c | 431 +++++++++++++++++++++++++++-----
 include/linux/filter.h          |  13 +-
 include/linux/netdevice.h       |   5 +
 include/net/bonding.h           |   1 +
 kernel/bpf/devmap.c             |  34 ++-
 net/core/filter.c               |  25 ++
 6 files changed, 445 insertions(+), 64 deletions(-)
Jussi Maki July 7, 2021, 11:25 a.m. UTC | #6
This patchset introduces XDP support to the bonding driver.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses. A fix for this
has already been merged into net-next. The statistics were collected
using "sar -n dev -u 1 10".

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

Patch 1 prepares bond_xmit_hash for hashing xdp_buff's.
Patch 2 adds hooks to implement redirection after bpf prog run.
Patch 3 implements the hooks in the bonding driver. 
Patch 4 modifies devmap to properly handle EXCLUDE_INGRESS with a slave device.
Patch 5 fixes an issue related to recent cleanup of rcu_read_lock in XDP context.

v2->v3:
- Address Jay's comment to properly exclude upper devices with EXCLUDE_INGRESS
  when deeper nesting is involved. Now all upper devices are excluded.
- Refuse to enslave devices that already have XDP programs loaded, and refuse to
  load XDP programs onto slave devices. Previously one could have an XDP program
  loaded and, after enslaving the device and loading another program onto the bond
  device, the xdp_state of the enslaved device would still point at the old program.
- Adapt netdev_lower_get_next_private_rcu so it can be called in the XDP context.

v1->v2:
- Split up into smaller easier to review patches and address cosmetic 
  review comments.
- Drop the INDIRECT_CALL optimization as it showed little improvement in tests.
- Drop the rr_tx_counter patch as that has already been merged into net-next.
- Separate the test suite into another patch set. This will follow later once a
  patch set from Magnus Karlsson is merged and provides test utilities that can
  be reused for XDP bonding tests. v2 contains no major functional changes and
  was tested with the test suite included in v1.
  (https://lore.kernel.org/bpf/202106221509.kwNvAAZg-lkp@intel.com/T/#m464146d47299125d5868a08affd6d6ce526dfad1)

---

Jussi Maki (5):
  net: bonding: Refactor bond_xmit_hash for use with xdp_buff
  net: core: Add support for XDP redirection to slave device
  net: bonding: Add XDP support to the bonding driver
  devmap: Exclude XDP broadcast to master device
  net: core: Allow netdev_lower_get_next_private_rcu in bh context

 drivers/net/bonding/bond_main.c | 450 ++++++++++++++++++++++++++++----
 include/linux/filter.h          |  13 +-
 include/linux/netdevice.h       |   6 +
 include/net/bonding.h           |   1 +
 kernel/bpf/devmap.c             |  67 ++++-
 net/core/dev.c                  |  11 +-
 net/core/filter.c               |  25 ++
 7 files changed, 504 insertions(+), 69 deletions(-)
Jussi Maki July 28, 2021, 11:43 p.m. UTC | #7
This patchset introduces XDP support to the bonding driver.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses. A fix for this
has already been merged into net-next. The statistics were collected
using "sar -n dev -u 1 10".

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

Patch 1 prepares bond_xmit_hash for hashing xdp_buff's.
Patch 2 adds hooks to implement redirection after bpf prog run.
Patch 3 implements the hooks in the bonding driver. 
Patch 4 modifies devmap to properly handle EXCLUDE_INGRESS with a slave device.
Patch 5 fixes an issue related to recent cleanup of rcu_read_lock in XDP context.
Patch 6 adds tests.

v3->v4:
- Add back the test suite, while removing the vmtest.sh modifications to the kernel
  config now that CONFIG_BONDING=y is set. Discussed with Magnus Karlsson that
  it makes sense right now not to reuse the code from xdpxceiver.c for testing
  XDP bonding.

v2->v3:
- Address Jay's comment to properly exclude upper devices with EXCLUDE_INGRESS
  when deeper nesting is involved. Now all upper devices are excluded.
- Refuse to enslave devices that already have XDP programs loaded, and refuse to
  load XDP programs onto slave devices. Previously one could have an XDP program
  loaded and, after enslaving the device and loading another program onto the bond
  device, the xdp_state of the enslaved device would still point at the old program.
- Adapt netdev_lower_get_next_private_rcu so it can be called in the XDP context.

v1->v2:
- Split up into smaller easier to review patches and address cosmetic 
  review comments.
- Drop the INDIRECT_CALL optimization as it showed little improvement in tests.
- Drop the rr_tx_counter patch as that has already been merged into net-next.
- Separate the test suite into another patch set. This will follow later once a
  patch set from Magnus Karlsson is merged and provides test utilities that can
  be reused for XDP bonding tests. v2 contains no major functional changes and
  was tested with the test suite included in v1.
  (https://lore.kernel.org/bpf/202106221509.kwNvAAZg-lkp@intel.com/T/#m464146d47299125d5868a08affd6d6ce526dfad1)

---
Jussi Maki July 30, 2021, 6:18 a.m. UTC | #8
This patchset introduces XDP support to the bonding driver.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses. A fix for this
has already been merged into net-next. The statistics were collected
using "sar -n dev -u 1 10".

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

Patch 1 prepares bond_xmit_hash for hashing xdp_buff's.
Patch 2 adds hooks to implement redirection after bpf prog run.
Patch 3 implements the hooks in the bonding driver.
Patch 4 modifies devmap to properly handle EXCLUDE_INGRESS with a slave device.
Patch 5 fixes an issue related to recent cleanup of rcu_read_lock in XDP context.
Patch 6 fixes loading of xdp_tx.o by renaming its section name (see the sketch below).
Patch 7 adds tests.
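
As a sketch of what the patch 6 rename amounts to (assuming the original
section name was a custom one that libbpf does not map to a program
type), the test program only needs a recognized section prefix:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* The generated BPF skeleton infers the program type from the ELF
   * section name, so a recognized prefix such as "xdp" is required;
   * a custom section name fails to load as an unknown program type. */
  SEC("xdp")
  int xdp_tx(struct xdp_md *xdp)
  {
          return XDP_TX;
  }

  char _license[] SEC("license") = "GPL";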

v4->v5:
- As pointed out by Andrii, use the generated BPF skeletons rather than libbpf
  directly.
- Renamed the section name in progs/xdp_tx.c, as the BPF skeleton wouldn't load it
  otherwise due to an unknown program type.
- Daniel Borkmann noted that to retain backwards compatibility and allow some
  use cases, we should allow attaching XDP programs to a slave device when the
  master does not have a program loaded. Modified the logic to allow this and
  added tests for the different combinations of attaching a program.

v3->v4:
- Add back the test suite, while removing the vmtest.sh modifications to the kernel
  config now that CONFIG_BONDING=y is set. Discussed with Magnus Karlsson that
  it makes sense right now not to reuse the code from xdpxceiver.c for testing
  XDP bonding.

v2->v3:
- Address Jay's comment to properly exclude upper devices with EXCLUDE_INGRESS
  when deeper nesting is involved. Now all upper devices are excluded.
- Refuse to enslave devices that already have XDP programs loaded, and refuse to
  load XDP programs onto slave devices. Previously one could have an XDP program
  loaded and, after enslaving the device and loading another program onto the bond
  device, the xdp_state of the enslaved device would still point at the old program.
- Adapt netdev_lower_get_next_private_rcu so it can be called in the XDP context.

v1->v2:
- Split up into smaller easier to review patches and address cosmetic
  review comments.
- Drop the INDIRECT_CALL optimization as it showed little improvement in tests.
- Drop the rr_tx_counter patch as that has already been merged into net-next.
- Separate the test suite into another patch set. This will follow later once a
  patch set from Magnus Karlsson is merged and provides test utilities that can
  be reused for XDP bonding tests. v2 contains no major functional changes and
  was tested with the test suite included in v1.
  (https://lore.kernel.org/bpf/202106221509.kwNvAAZg-lkp@intel.com/T/#m464146d47299125d5868a08affd6d6ce526dfad1)

---
Jussi Maki July 31, 2021, 5:57 a.m. UTC | #9
This patchset introduces XDP support to the bonding driver.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters.  An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
that is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally, the bonding round-robin algorithm was modified
to use per-CPU tx counters, since the shared rr_tx_counter caused high
CPU load (50% vs 10%) and a high rate of cache misses. A fix for this
has already been merged into net-next. The statistics were collected
using "sar -n dev -u 1 10".

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, i.e. the packets were sent to a range of
destination IPs.

[1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
[2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
[3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/

Patch 1 prepares bond_xmit_hash for hashing xdp_buff's.
Patch 2 adds hooks to implement redirection after bpf prog run.
Patch 3 implements the hooks in the bonding driver.
Patch 4 modifies devmap to properly handle EXCLUDE_INGRESS with a slave device.
Patch 5 fixes an issue related to recent cleanup of rcu_read_lock in XDP context.
Patch 6 fixes loading of xdp_tx.o by renaming its section name.
Patch 7 adds tests.

v5->v6:
- Address Andrii's comments about the tests.

v4->v5:
- As pointed out by Andrii, use the generated BPF skeletons rather than libbpf
  directly.
- Renamed the section name in progs/xdp_tx.c, as the BPF skeleton wouldn't load it
  otherwise due to an unknown program type.
- Daniel Borkmann noted that to retain backwards compatibility and allow some
  use cases, we should allow attaching XDP programs to a slave device when the
  master does not have a program loaded. Modified the logic to allow this and
  added tests for the different combinations of attaching a program.

v3->v4:
- Add back the test suite, while removing the vmtest.sh modifications to the kernel
  config now that CONFIG_BONDING=y is set. Discussed with Magnus Karlsson that
  it makes sense right now not to reuse the code from xdpxceiver.c for testing
  XDP bonding.

v2->v3:
- Address Jay's comment to properly exclude upper devices with EXCLUDE_INGRESS
  when deeper nesting is involved. Now all upper devices are excluded.
- Refuse to enslave devices that already have XDP programs loaded, and refuse to
  load XDP programs onto slave devices. Previously one could have an XDP program
  loaded and, after enslaving the device and loading another program onto the bond
  device, the xdp_state of the enslaved device would still point at the old program.
- Adapt netdev_lower_get_next_private_rcu so it can be called in the XDP context.

v1->v2:
- Split up into smaller easier to review patches and address cosmetic
  review comments.
- Drop the INDIRECT_CALL optimization as it showed little improvement in tests.
- Drop the rr_tx_counter patch as that has already been merged into net-next.
- Separate the test suite into another patch set. This will follow later once a
  patch set from Magnus Karlsson is merged and provides test utilities that can
  be reused for XDP bonding tests. v2 contains no major functional changes and
  was tested with the test suite included in v1.
  (https://lore.kernel.org/bpf/202106221509.kwNvAAZg-lkp@intel.com/T/#m464146d47299125d5868a08affd6d6ce526dfad1)

---