[RFC,00/17] xdp: Add packet queueing and scheduling capabilities

Message ID 20220713111430.134810-1-toke@redhat.com (mailing list archive)

Message

Toke Høiland-Jørgensen July 13, 2022, 11:14 a.m. UTC
Packet forwarding is an important use case for XDP, which offers
significant performance improvements compared to forwarding using the
regular networking stack. However, XDP currently offers no mechanism to
delay, queue or schedule packets, which limits the practical uses for
XDP-based forwarding to those where the capacity of input and output links
always match each other (i.e., no rate transitions or many-to-one
forwarding). It also prevents an XDP-based router from doing any kind of
traffic shaping or reordering to enforce policy.

This series represents a first RFC of our attempt to remedy this lack. The
code in these patches is functional, but needs additional testing and
polishing before being considered for merging. I'm posting it here as an
RFC to get some early feedback on the API and overall design of the
feature.

DESIGN

The design consists of three components: A new map type for storing XDP
frames, a new 'dequeue' program type that will run in the TX softirq to
provide the stack with packets to transmit, and a set of helpers to dequeue
packets from the map, optionally drop them, and to schedule an interface
for transmission.

The new map type is modelled on the PIFO data structure proposed in the
literature[0][1]. It represents a priority queue where packets can be
enqueued at any priority, but are always dequeued from the head. From the
XDP side, the map is simply used as a target for the bpf_redirect_map()
helper, where the target index is the desired priority.
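
As an illustration of the enqueue side, an XDP program could look roughly
like the sketch below. The map-definition fields here are assumptions made
for the example (patches 4-5 define the actual map UAPI), and patch 2 is
what widens the key argument of bpf_redirect_map() to u64:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* PIFO map holding XDP frames, indexed by priority (rank). */
  struct {
          __uint(type, BPF_MAP_TYPE_PIFO_XDP);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
          __uint(max_entries, 4096);      /* number of priority buckets */
  } pifo SEC(".maps");

  SEC("xdp")
  int enqueue_prog(struct xdp_md *ctx)
  {
          __u64 prio = 0; /* e.g., computed from the packet headers */

          /* The map index is the desired priority; the frame is stored
           * in the PIFO instead of being transmitted immediately. */
          return bpf_redirect_map(&pifo, prio, 0);
  }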

The dequeue program type is a new BPF program type that is attached to an
interface; when an interface is scheduled for transmission, the stack will
execute the attached dequeue program and, if it returns a packet to
transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
driver function.

The dequeue program can obtain packets by pulling them out of a PIFO map
using the new bpf_packet_dequeue() helper. This returns a pointer to an
xdp_md structure, which can be dereferenced to obtain packet data and
data_meta pointers like in an XDP program. The returned packets are also
reference counted, meaning the verifier enforces that the dequeue program
either drops the packet (with the bpf_packet_drop() helper), or returns it
for transmission. Finally, a helper is added that can be used to actually
schedule an interface for transmission using the dequeue program type; this
helper can be called from both XDP and dequeue programs.
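
As an example, a minimal dequeue program could look roughly like the
sketch below, reusing the 'pifo' map from the enqueue sketch above. The
section name, context type and helper signatures are assumptions based on
the description in this cover letter; the selftests added in patches 13-16
show the actual usage:

  /* Hypothetical context type for dequeue programs; the real definition
   * is added in patch 6. */
  struct dequeue_ctx;

  SEC("dequeue")
  void *dequeue_prog(struct dequeue_ctx *ctx)
  {
          __u64 rank = 0;
          struct xdp_md *pkt;
          void *data, *data_end;

          /* Pull the packet at the head of the PIFO (lowest rank). */
          pkt = (void *)bpf_packet_dequeue(ctx, &pifo, 0, &rank);
          if (!pkt)
                  return NULL;    /* queue empty, nothing to transmit */

          /* Packet data can be dereferenced like in an XDP program. */
          data = (void *)(long)pkt->data;
          data_end = (void *)(long)pkt->data_end;

          /* The verifier enforces that the packet is either dropped... */
          if (data + 14 > data_end) {     /* e.g., drop runt packets */
                  bpf_packet_drop(ctx, pkt);
                  return NULL;
          }

          /* ...or returned, in which case the stack transmits it through
           * ndo_xdp_xmit() on the interface being scheduled. */
          return pkt;
  }

An XDP or dequeue program would additionally call the interface-scheduling
helper added in patch 12 so that the dequeue program actually gets run;
that helper is not shown in the sketch.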

PERFORMANCE

Preliminary performance tests indicate about 50ns overhead of adding
queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
overhead (but still 2x the forwarding performance of the netstack):

xdp_fwd :     4.7 Mpps  (213 ns /pkt)
xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
netstack:       2 Mpps  (500 ns /pkt)

RELATION TO BPF QDISC

Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
particular the use of a map to store packets. This is no accident, as we've
had ongoing discussions for a while now. I have no great hope that we can
completely converge the two efforts into a single BPF-based queueing
API (as has been discussed before[3], consolidating the SKB and XDP paths
is challenging). Rather, I'm hoping that we can converge the designs enough
that we can share BPF code between XDP and qdisc layers using common
functions, like it's possible to do with XDP and TC-BPF today. This would
imply agreeing on the map type and API, and possibly on the set of helpers
available to the BPF programs.

PATCH STRUCTURE

This series consists of a total of 17 patches, as follows:

Patches 1-3 are smaller preparatory refactoring patches used by subsequent
patches.

Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
program type.

Patches 7-10 add the dequeue helpers and the verifier features needed to
recognise packet pointers, reference count them, and allow dereferencing
them to obtain packet data pointers.

Patches 11 and 12 add the dequeue program hook to the TX path, and the
helpers to schedule an interface.

Patches 13-16 add libbpf support for the new types, and selftests for the
new features.

Finally, patch 17 adds queueing support to the xdp_fwd program in
samples/bpf to provide an easy-to-use way of testing the feature; this is
for illustrative purposes for the RFC only, and will not be included in the
final submission.

SUPPLEMENTARY MATERIAL

A (WiP) test harness for implementing and unit-testing scheduling
algorithms using this framework (and the bpf_prog_run() hook) is available
as part of the bpf-examples repository[4]. We plan to expand this with more
test algorithms to smoke-test the API, and also add ready-to-use queueing
algorithms for use for forwarding (to replace the xdp_fwd patch included as
part of this RFC submission).

The work represented in this series was done in collaboration with several
people. Thanks to Kumar Kartikeya Dwivedi for writing the verifier
enhancements in this series, to Frey Alfredsson for his work on the testing
harness in [4], and to Jesper Brouer, Per Hurtig and Anna Brunstrom for
their valuable input on the design of the queueing APIs.

This series is also available as a git tree on git.kernel.org[5].

NOTES

[0] http://web.mit.edu/pifo/
[1] https://arxiv.org/abs/1810.03060
[2] https://lore.kernel.org/r/20220602041028.95124-1-xiyou.wangcong@gmail.com
[3] https://lore.kernel.org/r/b4ff6a2b-1478-89f8-ea9f-added498c59f@gmail.com
[4] https://github.com/xdp-project/bpf-examples/pull/40
[5] https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-queueing-06

Kumar Kartikeya Dwivedi (5):
  bpf: Use 64-bit return value for bpf_prog_run
  bpf: Teach the verifier about referenced packets returned from dequeue
    programs
  bpf: Introduce pkt_uid member for PTR_TO_PACKET
  bpf: Implement direct packet access in dequeue progs
  selftests/bpf: Add verifier tests for dequeue prog

Toke Høiland-Jørgensen (12):
  dev: Move received_rps counter next to RPS members in softnet data
  bpf: Expand map key argument of bpf_redirect_map to u64
  bpf: Add a PIFO priority queue map type
  pifomap: Add queue rotation for continuously increasing rank mode
  xdp: Add dequeue program type for getting packets from a PIFO
  bpf: Add helpers to dequeue from a PIFO map
  dev: Add XDP dequeue hook
  bpf: Add helper to schedule an interface for TX dequeue
  libbpf: Add support for dequeue program type and PIFO map type
  libbpf: Add support for querying dequeue programs
  selftests/bpf: Add test for XDP queueing through PIFO maps
  samples/bpf: Add queueing support to xdp_fwd sample

 include/linux/bpf-cgroup.h                    |  12 +-
 include/linux/bpf.h                           |  64 +-
 include/linux/bpf_types.h                     |   4 +
 include/linux/bpf_verifier.h                  |  14 +-
 include/linux/filter.h                        |  63 +-
 include/linux/netdevice.h                     |   8 +-
 include/net/xdp.h                             |  16 +-
 include/uapi/linux/bpf.h                      |  50 +-
 include/uapi/linux/if_link.h                  |   4 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/cgroup.c                           |  12 +-
 kernel/bpf/core.c                             |  14 +-
 kernel/bpf/cpumap.c                           |   4 +-
 kernel/bpf/devmap.c                           |  92 ++-
 kernel/bpf/offload.c                          |   4 +-
 kernel/bpf/pifomap.c                          | 635 ++++++++++++++++++
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         | 148 +++-
 net/bpf/test_run.c                            |  54 +-
 net/core/dev.c                                | 109 +++
 net/core/dev.h                                |   2 +
 net/core/filter.c                             | 307 ++++++++-
 net/core/rtnetlink.c                          |  30 +-
 net/packet/af_packet.c                        |   7 +-
 net/xdp/xskmap.c                              |   4 +-
 samples/bpf/xdp_fwd_kern.c                    |  65 +-
 samples/bpf/xdp_fwd_user.c                    | 200 ++++--
 tools/include/uapi/linux/bpf.h                |  48 ++
 tools/include/uapi/linux/if_link.h            |   4 +-
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   5 +
 tools/lib/bpf/netlink.c                       |   8 +
 .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++
 .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 +++++
 tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++
 .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++
 tools/testing/selftests/bpf/test_verifier.c   |  29 +-
 .../testing/selftests/bpf/verifier/dequeue.c  | 160 +++++
 39 files changed, 2426 insertions(+), 200 deletions(-)
 create mode 100644 kernel/bpf/pifomap.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
 create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
 create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c

Comments

Stanislav Fomichev July 13, 2022, 6:36 p.m. UTC | #1
On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Packet forwarding is an important use case for XDP, which offers
> significant performance improvements compared to forwarding using the
> regular networking stack. However, XDP currently offers no mechanism to
> delay, queue or schedule packets, which limits the practical uses for
> XDP-based forwarding to those where the capacity of input and output links
> always match each other (i.e., no rate transitions or many-to-one
> forwarding). It also prevents an XDP-based router from doing any kind of
> traffic shaping or reordering to enforce policy.
>
> This series represents a first RFC of our attempt to remedy this lack. The
> code in these patches is functional, but needs additional testing and
> polishing before being considered for merging. I'm posting it here as an
> RFC to get some early feedback on the API and overall design of the
> feature.
>
> DESIGN
>
> The design consists of three components: A new map type for storing XDP
> frames, a new 'dequeue' program type that will run in the TX softirq to
> provide the stack with packets to transmit, and a set of helpers to dequeue
> packets from the map, optionally drop them, and to schedule an interface
> for transmission.
>
> The new map type is modelled on the PIFO data structure proposed in the
> literature[0][1]. It represents a priority queue where packets can be
> enqueued in any priority, but is always dequeued from the head. From the
> XDP side, the map is simply used as a target for the bpf_redirect_map()
> helper, where the target index is the desired priority.

I have the same question I asked on the series from Cong:
Any considerations for existing carousel/edt-like models?
Can we make the map flexible enough to implement different qdisc policies?

> The dequeue program type is a new BPF program type that is attached to an
> interface; when an interface is scheduled for transmission, the stack will
> execute the attached dequeue program and, if it returns a packet to
> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> driver function.
>
> The dequeue program can obtain packets by pulling them out of a PIFO map
> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> xdp_md structure, which can be dereferenced to obtain packet data and
> data_meta pointers like in an XDP program. The returned packets are also
> reference counted, meaning the verifier enforces that the dequeue program
> either drops the packet (with the bpf_packet_drop() helper), or returns it
> for transmission. Finally, a helper is added that can be used to actually
> schedule an interface for transmission using the dequeue program type; this
> helper can be called from both XDP and dequeue programs.
>
> PERFORMANCE
>
> Preliminary performance tests indicate about 50ns overhead of adding
> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> overhead (but still 2x the forwarding performance of the netstack):
>
> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> netstack:       2 Mpps  (500 ns /pkt)
>
> RELATION TO BPF QDISC
>
> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> particular the use of a map to store packets. This is no accident, as we've
> had ongoing discussions for a while now. I have no great hope that we can
> completely converge the two efforts into a single BPF-based queueing
> API (as has been discussed before[3], consolidating the SKB and XDP paths
> is challenging). Rather, I'm hoping that we can converge the designs enough
> that we can share BPF code between XDP and qdisc layers using common
> functions, like it's possible to do with XDP and TC-BPF today. This would
> imply agreeing on the map type and API, and possibly on the set of helpers
> available to the BPF programs.

What would be the big difference for the map wrt xdp_frame vs sk_buff
excluding all obvious stuff like locking/refcnt?

> PATCH STRUCTURE
>
> This series consists of a total of 17 patches, as follows:
>
> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> patches.

Seems like these can go separately without holding the rest?

> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> program type.

[...]

> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> recognise packet pointers, reference count them, and allow dereferencing
> them to obtain packet data pointers.

Have you considered using kfuncs for these instead of introducing new
hooks/contexts/etc?

> [...]
Toke Høiland-Jørgensen July 13, 2022, 9:52 p.m. UTC | #2
Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Packet forwarding is an important use case for XDP, which offers
>> significant performance improvements compared to forwarding using the
>> regular networking stack. However, XDP currently offers no mechanism to
>> delay, queue or schedule packets, which limits the practical uses for
>> XDP-based forwarding to those where the capacity of input and output links
>> always match each other (i.e., no rate transitions or many-to-one
>> forwarding). It also prevents an XDP-based router from doing any kind of
>> traffic shaping or reordering to enforce policy.
>>
>> This series represents a first RFC of our attempt to remedy this lack. The
>> code in these patches is functional, but needs additional testing and
>> polishing before being considered for merging. I'm posting it here as an
>> RFC to get some early feedback on the API and overall design of the
>> feature.
>>
>> DESIGN
>>
>> The design consists of three components: A new map type for storing XDP
>> frames, a new 'dequeue' program type that will run in the TX softirq to
>> provide the stack with packets to transmit, and a set of helpers to dequeue
>> packets from the map, optionally drop them, and to schedule an interface
>> for transmission.
>>
>> The new map type is modelled on the PIFO data structure proposed in the
>> literature[0][1]. It represents a priority queue where packets can be
>> enqueued in any priority, but is always dequeued from the head. From the
>> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> helper, where the target index is the desired priority.
>
> I have the same question I asked on the series from Cong:
> Any considerations for existing carousel/edt-like models?

Well, the reason for the addition in patch 5 (continuously increasing
priorities) is exactly to be able to implement EDT-like behaviour, where
the priority is used as time units to clock out packets.
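
Concretely, the enqueue side of an EDT scheme boils down to something like
this (just a sketch; 'pifo' is the PIFO map and 'delay_ns' is whatever
pacing delay the program computes):

  __u64 tstamp = bpf_ktime_get_ns() + delay_ns;  /* departure time */
  return bpf_redirect_map(&pifo, tstamp, 0);     /* rank == time in ns */

and the dequeue program then just pulls packets off the head of the PIFO,
which hands them back in departure-time order.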

> Can we make the map flexible enough to implement different qdisc
> policies?

That's one of the things we want to be absolutely sure about. We are
starting out with the PIFO map type because the literature makes a good
case that it is flexible enough to implement all conceivable policies.
The goal of the test harness linked as note [4] is to actually examine
this; Frey is our PhD student working on this bit.

Thus far we haven't hit any limitations on this, but we'll need to add
more policies before we are done with this. Another consideration is
performance, of course, so we're also planning to do a comparison with a
more traditional "bunch of FIFO queues" type data structure for at least
a subset of the algorithms. Kartikeya also had an idea for an
alternative way to implement a priority queue using (semi-)lockless
skiplists, which may turn out to perform better.

If there's any particular policy/algorithm you'd like to see included in
this evaluation, please do let us know, BTW! :)

>> The dequeue program type is a new BPF program type that is attached to an
>> interface; when an interface is scheduled for transmission, the stack will
>> execute the attached dequeue program and, if it returns a packet to
>> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> driver function.
>>
>> The dequeue program can obtain packets by pulling them out of a PIFO map
>> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> xdp_md structure, which can be dereferenced to obtain packet data and
>> data_meta pointers like in an XDP program. The returned packets are also
>> reference counted, meaning the verifier enforces that the dequeue program
>> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> for transmission. Finally, a helper is added that can be used to actually
>> schedule an interface for transmission using the dequeue program type; this
>> helper can be called from both XDP and dequeue programs.
>>
>> PERFORMANCE
>>
>> Preliminary performance tests indicate about 50ns overhead of adding
>> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> overhead (but still 2x the forwarding performance of the netstack):
>>
>> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
>> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
>> netstack:       2 Mpps  (500 ns /pkt)
>>
>> RELATION TO BPF QDISC
>>
>> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> particular the use of a map to store packets. This is no accident, as we've
>> had ongoing discussions for a while now. I have no great hope that we can
>> completely converge the two efforts into a single BPF-based queueing
>> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> is challenging). Rather, I'm hoping that we can converge the designs enough
>> that we can share BPF code between XDP and qdisc layers using common
>> functions, like it's possible to do with XDP and TC-BPF today. This would
>> imply agreeing on the map type and API, and possibly on the set of helpers
>> available to the BPF programs.
>
> What would be the big difference for the map wrt xdp_frame vs sk_buff
> excluding all obvious stuff like locking/refcnt?

I expect it would be quite straight-forward to just add a second subtype
of the PIFO map in this series that holds skbs. In fact, I think that
from the BPF side, the whole model implemented here would be possible to
carry over to the qdisc layer more or less wholesale. Some other
features of the qdisc layer, like locking, classes, and
multi-CPU/multi-queue management may be trickier, but I'm not sure how
much of that we should expose in a BPF qdisc anyway (as you may have
noticed I commented on Cong's series to this effect regarding the
classful qdiscs).

>> PATCH STRUCTURE
>>
>> This series consists of a total of 17 patches, as follows:
>>
>> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> patches.
>
> Seems like these can go separately without holding the rest?

Yeah, guess so? They don't really provide much benefit without the users
later in the series, though, so not sure there's much point in sending
them separately?

>> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> program type.
>
> [...]
>
>> Patches 7-10 adds the dequeue helpers and the verifier features needed to
>> recognise packet pointers, reference count them, and allow dereferencing
>> them to obtain packet data pointers.
>
> Have you considered using kfuncs for these instead of introducing new
> hooks/contexts/etc?

I did, but I'm not sure it's such a good fit? In particular, the way the
direct packet access is implemented for dequeue programs (where you can
get an xdp_md pointer and deref that to get data and data_end pointers)
is done this way so programs can share utility functions between XDP and
dequeue programs. And having a new program type for the dequeue progs
seem like the obvious thing to do since they're doing something new?

Maybe I'm missing something, though; could you elaborate on how you'd
use kfuncs instead?

-Toke
Stanislav Fomichev July 13, 2022, 10:56 p.m. UTC | #3
On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Ah, ok, I didn't read the patches closely enough. I saw some limits
for the ranges and assumed that it wasn't capable of efficiently
storing 64-bit timestamps..

> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>
> If there's any particular policy/algorithm you'd like to see included in
> this evaluation, please do let us know, BTW! :)

I honestly am not sure what the bar for accepting this should be. But
on Cong's series I mentioned Martin's CC bpf work as a great
example of what we should be trying to do for qdisc-like maps. Having
a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
very convincing :-)

> >> The dequeue program type is a new BPF program type that is attached to an
> >> interface; when an interface is scheduled for transmission, the stack will
> >> execute the attached dequeue program and, if it returns a packet to
> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> >> driver function.
> >>
> >> The dequeue program can obtain packets by pulling them out of a PIFO map
> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> >> xdp_md structure, which can be dereferenced to obtain packet data and
> >> data_meta pointers like in an XDP program. The returned packets are also
> >> reference counted, meaning the verifier enforces that the dequeue program
> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
> >> for transmission. Finally, a helper is added that can be used to actually
> >> schedule an interface for transmission using the dequeue program type; this
> >> helper can be called from both XDP and dequeue programs.
> >>
> >> PERFORMANCE
> >>
> >> Preliminary performance tests indicate about 50ns overhead of adding
> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> >> overhead (but still 2x the forwarding performance of the netstack):
> >>
> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> >> netstack:       2 Mpps  (500 ns /pkt)
> >>
> >> RELATION TO BPF QDISC
> >>
> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> >> particular the use of a map to store packets. This is no accident, as we've
> >> had ongoing discussions for a while now. I have no great hope that we can
> >> completely converge the two efforts into a single BPF-based queueing
> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
> >> is challenging). Rather, I'm hoping that we can converge the designs enough
> >> that we can share BPF code between XDP and qdisc layers using common
> >> functions, like it's possible to do with XDP and TC-BPF today. This would
> >> imply agreeing on the map type and API, and possibly on the set of helpers
> >> available to the BPF programs.
> >
> > What would be the big difference for the map wrt xdp_frame vs sk_buff
> > excluding all obvious stuff like locking/refcnt?
>
> I expect it would be quite straight-forward to just add a second subtype
> of the PIFO map in this series that holds skbs. In fact, I think that
> from the BPF side, the whole model implemented here would be possible to
> carry over to the qdisc layer more or less wholesale. Some other
> features of the qdisc layer, like locking, classes, and
> multi-CPU/multi-queue management may be trickier, but I'm not sure how
> much of that we should expose in a BPF qdisc anyway (as you may have
> noticed I commented on Cong's series to this effect regarding the
> classful qdiscs).

Maybe a related question here: with the way you do
BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard would it
be to have support for storing xdp_frames/skb in any map? Let's say we
have generic BPF_MAP_TYPE_RBTREE, where the key is
priority/timestamp/whatever, can we, based on the value's btf_id,
figure out the rest? (that the value is kernel structure and needs
special care and more constraints - can't be looked up from user space
and so on)

Seems like we really need to have two special cases: where we transfer
ownership of xdp_frame/skb to/from the map, any other big
complications?

That way we can maybe untangle the series a bit: we can talk about
efficient data structures for storing frames/skbs independently of
some generic support for storing them in the maps. Any major
complications with that approach?
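
(To make that concrete, a purely hypothetical declaration of such a map
could look like:

  struct {
          __uint(type, BPF_MAP_TYPE_RBTREE);       /* hypothetical */
          __type(key, __u64);                      /* prio/timestamp/... */
          __type(value, struct xdp_frame *);       /* kernel type, via BTF */
          __uint(max_entries, 4096);
  } pq SEC(".maps");

where the map/verifier code figures out the special ownership rules from
the value's btf_id.)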

> >> PATCH STRUCTURE
> >>
> >> This series consists of a total of 17 patches, as follows:
> >>
> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> >> patches.
> >
> > Seems like these can go separately without holding the rest?
>
> Yeah, guess so? They don't really provide much benefit without the users
> alter in the series, though, so not sure there's much point in sending
> them separately?
>
> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> >> program type.
> >
> > [...]
> >
> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> >> recognise packet pointers, reference count them, and allow dereferencing
> >> them to obtain packet data pointers.
> >
> > Have you considered using kfuncs for these instead of introducing new
> > hooks/contexts/etc?
>
> I did, but I'm not sure it's such a good fit? In particular, the way the
> direct packet access is implemented for dequeue programs (where you can
> get an xdp_md pointer and deref that to get data and data_end pointers)
> is done this way so programs can share utility functions between XDP and
> dequeue programs. And having a new program type for the dequeue progs
> seem like the obvious thing to do since they're doing something new?
>
> Maybe I'm missing something, though; could you elaborate on how you'd
> use kfuncs instead?

I was thinking about the approach in general. In networking bpf, we've
been adding new program types, new contexts and new explicit hooks.
This all requires a ton of boiler plate (converting from uapi ctx to
the kernel, exposing hook points, etc, etc). And looking at Benjamin's
HID series, it's so much more elegant: there is no uapi, just kernel
function that allows it to be overridden and a bunch of kfuncs
exposed. No uapi, no helpers, no fake contexts.

For networking and xdp the ship might have sailed, but I was wondering
whether we should be still stuck in that 'old' boilerplate world or we
have a chance to use new nice shiny things :-)

(but it might be all moot if we'd like to have stable uapis?)
Kumar Kartikeya Dwivedi July 14, 2022, 6:34 a.m. UTC | #4
On Wed, 13 Jul 2022 at 23:52, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.
>
> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>

There's also code to go with the idea, just to show it can work :)
https://github.com/kkdwivedi/linux/commits/skiplist
Lookups are fully lockless; updates only contend when they share the same
pred/succ nodes. Still needs a lot of testing though. It's meant to be
a generic ordered map, but can be repurposed as a priority queue.
Toke Høiland-Jørgensen July 14, 2022, 10:46 a.m. UTC | #5
Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Packet forwarding is an important use case for XDP, which offers
>> >> significant performance improvements compared to forwarding using the
>> >> regular networking stack. However, XDP currently offers no mechanism to
>> >> delay, queue or schedule packets, which limits the practical uses for
>> >> XDP-based forwarding to those where the capacity of input and output links
>> >> always match each other (i.e., no rate transitions or many-to-one
>> >> forwarding). It also prevents an XDP-based router from doing any kind of
>> >> traffic shaping or reordering to enforce policy.
>> >>
>> >> This series represents a first RFC of our attempt to remedy this lack. The
>> >> code in these patches is functional, but needs additional testing and
>> >> polishing before being considered for merging. I'm posting it here as an
>> >> RFC to get some early feedback on the API and overall design of the
>> >> feature.
>> >>
>> >> DESIGN
>> >>
>> >> The design consists of three components: A new map type for storing XDP
>> >> frames, a new 'dequeue' program type that will run in the TX softirq to
>> >> provide the stack with packets to transmit, and a set of helpers to dequeue
>> >> packets from the map, optionally drop them, and to schedule an interface
>> >> for transmission.
>> >>
>> >> The new map type is modelled on the PIFO data structure proposed in the
>> >> literature[0][1]. It represents a priority queue where packets can be
>> >> enqueued in any priority, but is always dequeued from the head. From the
>> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> >> helper, where the target index is the desired priority.
>> >
>> > I have the same question I asked on the series from Cong:
>> > Any considerations for existing carousel/edt-like models?
>>
>> Well, the reason for the addition in patch 5 (continuously increasing
>> priorities) is exactly to be able to implement EDT-like behaviour, where
>> the priority is used as time units to clock out packets.
>
> Ah, ok, I didn't read the patches closely enough. I saw some limits
> for the ranges and assumed that it wasn't capable of efficiently
> storing 64-bit timestamps..

The goal is definitely to support full 64-bit priorities. Right now you
have to start out at 0 but can go on for a full 64 bits, but that's a
bit of an API wart that I'd like to get rid of eventually...

>> > Can we make the map flexible enough to implement different qdisc
>> > policies?
>>
>> That's one of the things we want to be absolutely sure about. We are
>> starting out with the PIFO map type because the literature makes a good
>> case that it is flexible enough to implement all conceivable policies.
>> The goal of the test harness linked as note [4] is to actually examine
>> this; Frey is our PhD student working on this bit.
>>
>> Thus far we haven't hit any limitations on this, but we'll need to add
>> more policies before we are done with this. Another consideration is
>> performance, of course, so we're also planning to do a comparison with a
>> more traditional "bunch of FIFO queues" type data structure for at least
>> a subset of the algorithms. Kartikeya also had an idea for an
>> alternative way to implement a priority queue using (semi-)lockless
>> skiplists, which may turn out to perform better.
>>
>> If there's any particular policy/algorithm you'd like to see included in
>> this evaluation, please do let us know, BTW! :)
>
> I honestly am not sure what the bar for accepting this should be. But
> on the Cong's series I mentioned Martin's CC bpf work as a great
> example of what we should be trying to do for qdisc-like maps. Having
> a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> very convincing :-)

Just doing flow queueing is quite straight forward with PIFOs. We're
working on fq_codel. Personally I also want to implement something that
has feature parity with sch_cake (which includes every feature and the
kitchen sink already) :)

>> >> The dequeue program type is a new BPF program type that is attached to an
>> >> interface; when an interface is scheduled for transmission, the stack will
>> >> execute the attached dequeue program and, if it returns a packet to
>> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> >> driver function.
>> >>
>> >> The dequeue program can obtain packets by pulling them out of a PIFO map
>> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> >> xdp_md structure, which can be dereferenced to obtain packet data and
>> >> data_meta pointers like in an XDP program. The returned packets are also
>> >> reference counted, meaning the verifier enforces that the dequeue program
>> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> >> for transmission. Finally, a helper is added that can be used to actually
>> >> schedule an interface for transmission using the dequeue program type; this
>> >> helper can be called from both XDP and dequeue programs.
>> >>
>> >> PERFORMANCE
>> >>
>> >> Preliminary performance tests indicate about 50ns overhead of adding
>> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> >> overhead (but still 2x the forwarding performance of the netstack):
>> >>
>> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
>> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
>> >> netstack:       2 Mpps  (500 ns /pkt)
>> >>
>> >> RELATION TO BPF QDISC
>> >>
>> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> >> particular the use of a map to store packets. This is no accident, as we've
>> >> had ongoing discussions for a while now. I have no great hope that we can
>> >> completely converge the two efforts into a single BPF-based queueing
>> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> >> is challenging). Rather, I'm hoping that we can converge the designs enough
>> >> that we can share BPF code between XDP and qdisc layers using common
>> >> functions, like it's possible to do with XDP and TC-BPF today. This would
>> >> imply agreeing on the map type and API, and possibly on the set of helpers
>> >> available to the BPF programs.
>> >
>> > What would be the big difference for the map wrt xdp_frame vs sk_buff
>> > excluding all obvious stuff like locking/refcnt?
>>
>> I expect it would be quite straight-forward to just add a second subtype
>> of the PIFO map in this series that holds skbs. In fact, I think that
>> from the BPF side, the whole model implemented here would be possible to
>> carry over to the qdisc layer more or less wholesale. Some other
>> features of the qdisc layer, like locking, classes, and
>> multi-CPU/multi-queue management may be trickier, but I'm not sure how
>> much of that we should expose in a BPF qdisc anyway (as you may have
>> noticed I commented on Cong's series to this effect regarding the
>> classful qdiscs).
>
> Maybe a related question here: with the way you do
> BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
> be have support for storing xdp_frames/skb in any map? Let's say we
> have generic BPF_MAP_TYPE_RBTREE, where the key is
> priority/timestamp/whatever, can we, based on the value's btf_id,
> figure out the rest? (that the value is kernel structure and needs
> special care and more constraints - can't be looked up from user space
> and so on)
>
> Seems like we really need to have two special cases: where we transfer
> ownership of xdp_frame/skb to/from the map, any other big
> complications?
>
> That way we can maybe untangle the series a bit: we can talk about
> efficient data structures for storing frames/skbs independently of
> some generic support for storing them in the maps. Any major
> complications with that approach?

I've had discussions with Kartikeya on this already (based on his 'kptr
in map' work). That may well end up being feasible, which would be
fantastic. The reason we didn't use it for this series is that there's
still some work to do on the generic verifier/infrastructure support
side of this (the PIFO map is the oldest part of this series), and I
didn't want to hold up the rest of the queueing work until that landed.

Now that we have a functional prototype I expect that iterating on the
data structure will be the next step. One complication with XDP is that
we probably want to keep using XDP_REDIRECT to place packets into the
map because that gets us bulking which is important for performance;
however, in general I like the idea of using BTF to designate the map
value type, and if we can figure out a way to make it completely generic
even for packets I'm all for that! :)

>> >> PATCH STRUCTURE
>> >>
>> >> This series consists of a total of 17 patches, as follows:
>> >>
>> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> >> patches.
>> >
>> > Seems like these can go separately without holding the rest?
>>
>> Yeah, guess so? They don't really provide much benefit without the users
>> alter in the series, though, so not sure there's much point in sending
>> them separately?
>>
>> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> >> program type.
>> >
>> > [...]
>> >
>> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
>> >> recognise packet pointers, reference count them, and allow dereferencing
>> >> them to obtain packet data pointers.
>> >
>> > Have you considered using kfuncs for these instead of introducing new
>> > hooks/contexts/etc?
>>
>> I did, but I'm not sure it's such a good fit? In particular, the way the
>> direct packet access is implemented for dequeue programs (where you can
>> get an xdp_md pointer and deref that to get data and data_end pointers)
>> is done this way so programs can share utility functions between XDP and
>> dequeue programs. And having a new program type for the dequeue progs
>> seem like the obvious thing to do since they're doing something new?
>>
>> Maybe I'm missing something, though; could you elaborate on how you'd
>> use kfuncs instead?
>
> I was thinking about the approach in general. In networking bpf, we've
> been adding new program types, new contexts and new explicit hooks.
> This all requires a ton of boiler plate (converting from uapi ctx to
> the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> HID series, it's so much more elegant: there is no uapi, just kernel
> function that allows it to be overridden and a bunch of kfuncs
> exposed. No uapi, no helpers, no fake contexts.
>
> For networking and xdp the ship might have sailed, but I was wondering
> whether we should be still stuck in that 'old' boilerplate world or we
> have a chance to use new nice shiny things :-)
>
> (but it might be all moot if we'd like to have stable upis?)

Right, I see what you mean. My immediate feeling is that having an
explicit stable UAPI for XDP has served us well. We do all kinds of
rewrite tricks behind the scenes (things like switching between xdp_buff
and xdp_frame, bulking, direct packet access, reading ifindexes by
pointer walking txq->dev, etc) which are important ways to improve
performance without exposing too many nitty-gritty details into the API.

There's also consistency to consider: I think the addition of queueing
should work as a natural extension of the existing programming model for
XDP. So I feel like this is more a case of "if we were starting from
scratch today we might do things differently (like the HID series), but
when extending things let's keep it consistent"?

-Toke
Jamal Hadi Salim July 14, 2022, 2:05 p.m. UTC | #6
I think what would be really interesting is to see the performance numbers when
you have multiple producers/consumers (translation: multiple
threads/softirqs) in play
targeting the same queues. Does PIFO alleviate the synchronization challenge
when you have multiple concurrent readers/writers? Or maybe for your use case
this would not be a common occurrence or not something you care about?

As I mentioned previously, I think this is what Cong's approach gets for free.

cheers,
jamal

On Wed, Jul 13, 2022 at 7:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> [...]
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 148 +++-
>  net/bpf/test_run.c                            |  54 +-
>  net/core/dev.c                                | 109 +++
>  net/core/dev.h                                |   2 +
>  net/core/filter.c                             | 307 ++++++++-
>  net/core/rtnetlink.c                          |  30 +-
>  net/packet/af_packet.c                        |   7 +-
>  net/xdp/xskmap.c                              |   4 +-
>  samples/bpf/xdp_fwd_kern.c                    |  65 +-
>  samples/bpf/xdp_fwd_user.c                    | 200 ++++--
>  tools/include/uapi/linux/bpf.h                |  48 ++
>  tools/include/uapi/linux/if_link.h            |   4 +-
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf.h                        |   1 +
>  tools/lib/bpf/libbpf_probes.c                 |   5 +
>  tools/lib/bpf/netlink.c                       |   8 +
>  .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++
>  .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 +++++
>  tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++
>  .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++
>  tools/testing/selftests/bpf/test_verifier.c   |  29 +-
>  .../testing/selftests/bpf/verifier/dequeue.c  | 160 +++++
>  39 files changed, 2426 insertions(+), 200 deletions(-)
>  create mode 100644 kernel/bpf/pifomap.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
>  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
>
> --
> 2.37.0
>
Dave Taht July 14, 2022, 2:56 p.m. UTC | #7
In general I feel a programmable packet pacing approach is the right
way forward for the internet as a whole.

It lends itself more easily and accurately to offloading in an age
where it is difficult to do anything sane within a ms on the host
CPU, especially in virtualized environments, and across the enormous
dynamic range of kbits/ms to gbits/ms between host and potential
recipient [1].

So considerations about what is easier to offload going forward,
versus central CPU costs, should be part of this conversation.

[1] I'm kind of on a campaign to get people to stop thinking in
mbits/sec and to start thinking in terms of intervals well below human
perception: thus gbits/ms, or packets/ns!
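
For readers less used to these units, here is a rough conversion of
link rate into a per-packet time budget for minimum-size Ethernet
frames (editor's illustration, not numbers from this thread; a 64-byte
frame plus preamble and inter-frame gap is 84 bytes, i.e. 672 bits on
the wire):

#include <stdio.h>

int main(void)
{
	const double wire_bits = 672.0;	/* min-size frame incl. preamble + IFG */
	const double rates_gbps[] = { 1, 10, 100 };

	for (int i = 0; i < 3; i++) {
		double pps = rates_gbps[i] * 1e9 / wire_bits;
		printf("%3.0f Gbit/s: %6.2f Mpps, %6.1f ns per packet\n",
		       rates_gbps[i], pps / 1e6, 1e9 / pps);
	}
	return 0;
}

At 100 Gbit/s the per-packet budget is well under 10 ns, which is the
kind of dynamic range being referred to here.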
Jamal Hadi Salim July 14, 2022, 3:33 p.m. UTC | #8
On Thu, Jul 14, 2022 at 10:56 AM Dave Taht <dave.taht@gmail.com> wrote:
>
> In general I feel a programmable packet pacing approach is the right
> way forward for the internet as a whole.
>
> It lends itself more easily and accurately to offloading in an age
> where it is difficult to do anything sane within a ms on the host
> cpu, especially in virtualized environments, in the enormous dynamic
> range of kbits/ms to gbits/ms between host an potential recipient [1]
>
> So considerations about what is easier to offload moving forward vs
> central cpu costs should be in this conversation.
>

If you know your hardware can offload, there is a lot less to worry about:
you can let the synchronization be handled by hardware. For example,
if your hardware can do strict-priority scheduling/queueing, you really
should bypass the kernel layer, set appropriate metadata (skb priority)
and let the hardware handle it; see the HTB offload from Nvidia.
OTOH, EDT-based approaches are the best fit for something lightweight
that takes advantage of simple hardware
features (like timestamps, etc).
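
As a concrete illustration of the two models above (editor's sketch,
not code from either series; the class selection is a placeholder, and
a real EDT pacer would add a per-flow delay rather than just clamping
to "now"):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* clsact egress program: skb->priority hands scheduling to capable
 * hardware (strict-priority / HTB offload style), while writing an
 * earliest departure time into skb->tstamp implements EDT pacing,
 * enforced by sch_fq underneath. */
SEC("tc")
int egress_sched(struct __sk_buff *skb)
{
	__u64 now = bpf_ktime_get_ns();

	skb->priority = 1;		/* placeholder traffic class */
	if (skb->tstamp < now)
		skb->tstamp = now;	/* EDT: never schedule in the past */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";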

cheers,
jamal
Toke Høiland-Jørgensen July 14, 2022, 4:21 p.m. UTC | #9
Jamal Hadi Salim <jhs@mojatatu.com> writes:

> I think what would be really interesting is to see the performance numbers when
> you have multiple producers/consumers(translation multiple
> threads/softirqs) in play
> targeting the same queues. Does PIFO alleviate the synchronization challenge
> when you have multiple concurrent readers/writers? Or maybe for your use case
> this would not be a common occurrence or not something you care about?

Right, this is definitely one of the areas we want to flesh out some
more and benchmark. I think a PIFO-based algorithm *can* be an
improvement here because you can compute the priority without holding
any lock and only grab a lock for inserting the packet, which can be
made even better with a (partially) lockless data structure and/or
batching.
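
In pseudo-C, the enqueue split described above might look like this
(editor's sketch, not code from the series; pifo_insert() and the lock
granularity are assumptions):

/* The rank computation only reads packet/flow state, so it can run on
 * any CPU without synchronization; only the insertion into the shared
 * PIFO structure takes the lock, keeping the critical section short. */
u64 rank = compute_rank(pkt);		/* lock-free */

spin_lock(&pifo->lock);
pifo_insert(pifo, pkt, rank);		/* short critical section */
spin_unlock(&pifo->lock);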

In any case we *have* to do a certain amount of re-inventing for XDP
because we can't reuse the qdisc infrastructure anyway. Ultimately, I
expect it will be possible to write both really well-performing
algorithms, and really badly-performing ones. Such is the power of BPF,
after all, and as long as we can provide an existence proof of the
former, that's fine with me :)

> As I mentioned previously, I think this is what Cong's approach gets
> for free.

Yes, but it also retains the global qdisc lock; my (naive, perhaps?)
hope is that since we have to do things differently in XDP land anyway,
that work can translate into something that is amenable to being
lockless in qdisc land as well...

-Toke
Stanislav Fomichev July 14, 2022, 5:24 p.m. UTC | #10
On Thu, Jul 14, 2022 at 3:46 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Packet forwarding is an important use case for XDP, which offers
> >> >> significant performance improvements compared to forwarding using the
> >> >> regular networking stack. However, XDP currently offers no mechanism to
> >> >> delay, queue or schedule packets, which limits the practical uses for
> >> >> XDP-based forwarding to those where the capacity of input and output links
> >> >> always match each other (i.e., no rate transitions or many-to-one
> >> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> >> traffic shaping or reordering to enforce policy.
> >> >>
> >> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> >> code in these patches is functional, but needs additional testing and
> >> >> polishing before being considered for merging. I'm posting it here as an
> >> >> RFC to get some early feedback on the API and overall design of the
> >> >> feature.
> >> >>
> >> >> DESIGN
> >> >>
> >> >> The design consists of three components: A new map type for storing XDP
> >> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> >> packets from the map, optionally drop them, and to schedule an interface
> >> >> for transmission.
> >> >>
> >> >> The new map type is modelled on the PIFO data structure proposed in the
> >> >> literature[0][1]. It represents a priority queue where packets can be
> >> >> enqueued in any priority, but is always dequeued from the head. From the
> >> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> >> helper, where the target index is the desired priority.
> >> >
> >> > I have the same question I asked on the series from Cong:
> >> > Any considerations for existing carousel/edt-like models?
> >>
> >> Well, the reason for the addition in patch 5 (continuously increasing
> >> priorities) is exactly to be able to implement EDT-like behaviour, where
> >> the priority is used as time units to clock out packets.
> >
> > Ah, ok, I didn't read the patches closely enough. I saw some limits
> > for the ranges and assumed that it wasn't capable of efficiently
> > storing 64-bit timestamps..
>
> The goal is definitely to support full 64-bit priorities. Right now you
> have to start out at 0 but can go on for a full 64 bits, but that's a
> bit of an API wart that I'd like to get rid of eventually...
>
> >> > Can we make the map flexible enough to implement different qdisc
> >> > policies?
> >>
> >> That's one of the things we want to be absolutely sure about. We are
> >> starting out with the PIFO map type because the literature makes a good
> >> case that it is flexible enough to implement all conceivable policies.
> >> The goal of the test harness linked as note [4] is to actually examine
> >> this; Frey is our PhD student working on this bit.
> >>
> >> Thus far we haven't hit any limitations on this, but we'll need to add
> >> more policies before we are done with this. Another consideration is
> >> performance, of course, so we're also planning to do a comparison with a
> >> more traditional "bunch of FIFO queues" type data structure for at least
> >> a subset of the algorithms. Kartikeya also had an idea for an
> >> alternative way to implement a priority queue using (semi-)lockless
> >> skiplists, which may turn out to perform better.
> >>
> >> If there's any particular policy/algorithm you'd like to see included in
> >> this evaluation, please do let us know, BTW! :)
> >
> > I honestly am not sure what the bar for accepting this should be. But
> > on the Cong's series I mentioned Martin's CC bpf work as a great
> > example of what we should be trying to do for qdisc-like maps. Having
> > a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> > very convincing :-)
>
> Just doing flow queueing is quite straight forward with PIFOs. We're
> working on fq_codel. Personally I also want to implement something that
> has feature parity with sch_cake (which includes every feature and the
> kitchen sink already) :)

Yeah, sch_cake works too.
Alexei Starovoitov July 15, 2022, 1:12 a.m. UTC | #11
On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
> >
> > Maybe a related question here: with the way you do
> > BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
> > be have support for storing xdp_frames/skb in any map? Let's say we
> > have generic BPF_MAP_TYPE_RBTREE, where the key is
> > priority/timestamp/whatever, can we, based on the value's btf_id,
> > figure out the rest? (that the value is kernel structure and needs
> > special care and more constraints - can't be looked up from user space
> > and so on)
> >
> > Seems like we really need to have two special cases: where we transfer
> > ownership of xdp_frame/skb to/from the map, any other big
> > complications?
> >
> > That way we can maybe untangle the series a bit: we can talk about
> > efficient data structures for storing frames/skbs independently of
> > some generic support for storing them in the maps. Any major
> > complications with that approach?
> 
> I've had discussions with Kartikeya on this already (based on his 'kptr
> in map' work). That may well end up being feasible, which would be
> fantastic. The reason we didn't use it for this series is that there's
> still some work to do on the generic verifier/infrastructure support
> side of this (the PIFO map is the oldest part of this series), and I
> didn't want to hold up the rest of the queueing work until that landed.

Certainly makes sense for an RFC to be sent out early,
but Stan's point stands: we have to avoid type-specific maps when a
generic one will do. The kptr infra is getting close to being that answer.

> >> Maybe I'm missing something, though; could you elaborate on how you'd
> >> use kfuncs instead?
> >
> > I was thinking about the approach in general. In networking bpf, we've
> > been adding new program types, new contexts and new explicit hooks.
> > This all requires a ton of boiler plate (converting from uapi ctx to
> > the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> > HID series, it's so much more elegant: there is no uapi, just kernel
> > function that allows it to be overridden and a bunch of kfuncs
> > exposed. No uapi, no helpers, no fake contexts.
> >
> > For networking and xdp the ship might have sailed, but I was wondering
> > whether we should be still stuck in that 'old' boilerplate world or we
> > have a chance to use new nice shiny things :-)
> >
> > (but it might be all moot if we'd like to have stable upis?)
> 
> Right, I see what you mean. My immediate feeling is that having an
> explicit stable UAPI for XDP has served us well. We do all kinds of
> rewrite tricks behind the scenes (things like switching between xdp_buff
> and xdp_frame, bulking, direct packet access, reading ifindexes by
> pointer walking txq->dev, etc) which are important ways to improve
> performance without exposing too many nitty-gritty details into the API.
> 
> There's also consistency to consider: I think the addition of queueing
> should work as a natural extension of the existing programming model for
> XDP. So I feel like this is more a case of "if we were starting from
> scratch today we might do things differently (like the HID series), but
> when extending things let's keep it consistent"?

"consistent" makes sense when new feature follows established path.
The programmable packet scheduling in TX is just as revolutionary as
XDP in RX was years ago :)
This feature can be done similar to hid-bpf without cast-in-stone uapi
and hooks. Such patches would be much easier to land and iterate on top.
The amount of bike shedding will be 10 times less.
No need for new program type, no new hooks, no new FDs and attach uapi-s.
Even libbpf won't need any changes.
Add few kfuncs and __weak noinline "hooks" in TX path.
Only new map type would be necessary.
If it can be made with kptr then it will be the only uapi exposure that
will be heavily scrutinized.

It doesn't mean that it will stay unstable-api forever. Once it demonstrates
that it is on par with fq/fq_codel/cake feature-wise we can bake it into uapi.
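
To make that suggestion concrete, a minimal sketch of what such a hook
and its kfuncs could look like, loosely after hid-bpf (editor's
illustration only; none of these names exist in either series, and the
attachment/override mechanics are assumptions):

/* A __weak noinline function in the TX path that a BPF program can
 * attach to and override (fmod_ret style), plus kfuncs the attached
 * program can call. No new program type, context struct or attach
 * uapi is needed for this style of hook. */
__weak noinline int xdp_tx_dequeue_hook(struct net_device *dev, u32 budget)
{
	return 0;	/* default: no BPF scheduler attached, nothing to send */
}

/* hypothetical kfuncs available to the program overriding the hook */
struct xdp_frame *bpf_pifo_pop(struct bpf_map *map);
int bpf_xdp_tx_frame(struct net_device *dev, struct xdp_frame *frm);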
Toke Høiland-Jørgensen July 15, 2022, 12:55 p.m. UTC | #12
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
>> >
>> > Maybe a related question here: with the way you do
>> > BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
>> > be have support for storing xdp_frames/skb in any map? Let's say we
>> > have generic BPF_MAP_TYPE_RBTREE, where the key is
>> > priority/timestamp/whatever, can we, based on the value's btf_id,
>> > figure out the rest? (that the value is kernel structure and needs
>> > special care and more constraints - can't be looked up from user space
>> > and so on)
>> >
>> > Seems like we really need to have two special cases: where we transfer
>> > ownership of xdp_frame/skb to/from the map, any other big
>> > complications?
>> >
>> > That way we can maybe untangle the series a bit: we can talk about
>> > efficient data structures for storing frames/skbs independently of
>> > some generic support for storing them in the maps. Any major
>> > complications with that approach?
>> 
>> I've had discussions with Kartikeya on this already (based on his 'kptr
>> in map' work). That may well end up being feasible, which would be
>> fantastic. The reason we didn't use it for this series is that there's
>> still some work to do on the generic verifier/infrastructure support
>> side of this (the PIFO map is the oldest part of this series), and I
>> didn't want to hold up the rest of the queueing work until that landed.
>
> Certainly makes sense for RFC to be sent out earlier,
> but Stan's point stands. We have to avoid type-specific maps when
> generic will do. kptr infra is getting close to be that answer.

ACK, I'll iterate on the map types and see how far we can get with the
kptr approach.

Do people feel a generic priority queue type would be generally useful?
Because in that case I can split out this work into a separate series...

>> >> Maybe I'm missing something, though; could you elaborate on how you'd
>> >> use kfuncs instead?
>> >
>> > I was thinking about the approach in general. In networking bpf, we've
>> > been adding new program types, new contexts and new explicit hooks.
>> > This all requires a ton of boiler plate (converting from uapi ctx to
>> > the kernel, exposing hook points, etc, etc). And looking at Benjamin's
>> > HID series, it's so much more elegant: there is no uapi, just kernel
>> > function that allows it to be overridden and a bunch of kfuncs
>> > exposed. No uapi, no helpers, no fake contexts.
>> >
>> > For networking and xdp the ship might have sailed, but I was wondering
>> > whether we should be still stuck in that 'old' boilerplate world or we
>> > have a chance to use new nice shiny things :-)
>> >
>> > (but it might be all moot if we'd like to have stable upis?)
>> 
>> Right, I see what you mean. My immediate feeling is that having an
>> explicit stable UAPI for XDP has served us well. We do all kinds of
>> rewrite tricks behind the scenes (things like switching between xdp_buff
>> and xdp_frame, bulking, direct packet access, reading ifindexes by
>> pointer walking txq->dev, etc) which are important ways to improve
>> performance without exposing too many nitty-gritty details into the API.
>> 
>> There's also consistency to consider: I think the addition of queueing
>> should work as a natural extension of the existing programming model for
>> XDP. So I feel like this is more a case of "if we were starting from
>> scratch today we might do things differently (like the HID series), but
>> when extending things let's keep it consistent"?
>
> "consistent" makes sense when new feature follows established path.
> The programmable packet scheduling in TX is just as revolutionary as
> XDP in RX was years ago :)
> This feature can be done similar to hid-bpf without cast-in-stone uapi
> and hooks. Such patches would be much easier to land and iterate on top.
> The amount of bike shedding will be 10 times less.
> No need for new program type, no new hooks, no new FDs and attach uapi-s.
> Even libbpf won't need any changes.
> Add few kfuncs and __weak noinline "hooks" in TX path.
> Only new map type would be necessary.
> If it can be made with kptr then it will be the only uapi exposure that
> will be heavily scrutinized.

I'm not quite convinced it's that obvious that it can be implemented
this way; but I don't mind trying it out either; if nothing else, it'll
give us something to compare against...

> It doesn't mean that it will stay unstable-api forever. Once it demonstrates
> that it is on par with fq/fq_codel/cake feature-wise we can bake it into uapi.

In any case, I don't think we should merge anything until we've shown
this :)

-Toke
Cong Wang July 17, 2022, 5:46 p.m. UTC | #13
On Wed, Jul 13, 2022 at 01:14:08PM +0200, Toke Høiland-Jørgensen wrote:
> Packet forwarding is an important use case for XDP, which offers
> significant performance improvements compared to forwarding using the
> regular networking stack. However, XDP currently offers no mechanism to
> delay, queue or schedule packets, which limits the practical uses for
> XDP-based forwarding to those where the capacity of input and output links
> always match each other (i.e., no rate transitions or many-to-one
> forwarding). It also prevents an XDP-based router from doing any kind of
> traffic shaping or reordering to enforce policy.
> 

Sorry for forgetting to respond to your email on my patchset.

The most important question from you is actually why I gave up on PIFO.
Its limitation is already in its name: Push In First Out already says
clearly that it only allows dequeueing the first element. Still
confusing?

You can take a look at your pifo_map_pop_elem(), which is the
implementation for bpf_map_pop_elem(), which is:

       long bpf_map_pop_elem(struct bpf_map *map, void *value)

Clearly, there is not even a 'key' in its parameter list. If you just
compare it to mine:

	BPF_CALL_2(bpf_skb_map_pop, struct bpf_map *, map, u64, key)

Is their difference now 100% clear? :)

The next question is why this is important (actually it is the most
important one): because we (I mean eBPF Qdisc users, not sure about you)
want the programmability, which I have been emphasizing since V1...

Clearly it is already too late to fix bpf_map_pop_elem(); we don't want
to repeat that mistake again.

More importantly, the latter can easily implement the former, as shown below:

bpf_stack_for_min; // auxiliary min-stack (just a BPF_MAP_TYPE_STACK)

push(map, key, value)
{
  // track the smallest key seen so far; treat top() as +infinity
  // when the stack is empty
  bpf_stack_for_min.push(min(key, bpf_stack_for_min.top()));
  // insert the key/value pair into the keyed map here
}

pop_the_first(map, value)
{
  // the top of the min-stack is the smallest key in the map,
  // i.e. the head element of the emulated PIFO
  val = pop_any(map, bpf_stack_for_min.top());
  *value = val;
  bpf_stack_for_min.pop();
}


BTW, what is _your_ use case for the skb map and the user-space PIFO
map? I am sure you have use cases for XDP; it is unclear what you have
for the other cases. Please don't piggyback use cases you don't have,
we all have to justify all use cases. :)

Thanks.
Cong Wang July 17, 2022, 6:17 p.m. UTC | #14
On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
> 
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Are you sure? I seriously doubt your patch can do this at all...

Since your patch relies on bpf_map_push_elem(), which has no room for a
'key', you reuse 'flags' for it, but you also reserve 4 bits there... How
could a tstamp be packed in with 4 bits reserved??

To answer Stanislav's question, this is how my code could handle EDT:

// BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
skb->tstamp = XXX;
bpf_skb_map_push(map, skb, skb->tstamp);

(Please refer to another reply from me for how to get the min when
popping, which is essentially just a popular interview coding problem.)

Actually, if we look into the in-kernel EDT implementation (net/sched/sch_etf.c),
it is also based on rbtree rather than PIFO. ;-)

Thanks.
Kumar Kartikeya Dwivedi July 17, 2022, 6:41 p.m. UTC | #15
On Sun, 17 Jul 2022 at 20:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Packet forwarding is an important use case for XDP, which offers
> > >> significant performance improvements compared to forwarding using the
> > >> regular networking stack. However, XDP currently offers no mechanism to
> > >> delay, queue or schedule packets, which limits the practical uses for
> > >> XDP-based forwarding to those where the capacity of input and output links
> > >> always match each other (i.e., no rate transitions or many-to-one
> > >> forwarding). It also prevents an XDP-based router from doing any kind of
> > >> traffic shaping or reordering to enforce policy.
> > >>
> > >> This series represents a first RFC of our attempt to remedy this lack. The
> > >> code in these patches is functional, but needs additional testing and
> > >> polishing before being considered for merging. I'm posting it here as an
> > >> RFC to get some early feedback on the API and overall design of the
> > >> feature.
> > >>
> > >> DESIGN
> > >>
> > >> The design consists of three components: A new map type for storing XDP
> > >> frames, a new 'dequeue' program type that will run in the TX softirq to
> > >> provide the stack with packets to transmit, and a set of helpers to dequeue
> > >> packets from the map, optionally drop them, and to schedule an interface
> > >> for transmission.
> > >>
> > >> The new map type is modelled on the PIFO data structure proposed in the
> > >> literature[0][1]. It represents a priority queue where packets can be
> > >> enqueued in any priority, but is always dequeued from the head. From the
> > >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> > >> helper, where the target index is the desired priority.
> > >
> > > I have the same question I asked on the series from Cong:
> > > Any considerations for existing carousel/edt-like models?
> >
> > Well, the reason for the addition in patch 5 (continuously increasing
> > priorities) is exactly to be able to implement EDT-like behaviour, where
> > the priority is used as time units to clock out packets.
>
> Are you sure? I seriouly doubt your patch can do this at all...
>
> Since your patch relies on bpf_map_push_elem(), which has no room for
> 'key' hence you reuse 'flags' but you also reserve 4 bits there... How
> could tstamp be packed with 4 reserved bits??
>
> To answer Stanislav's question, this is how my code could handle EDT:
>
> // BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
> skb->tstamp = XXX;
> bpf_skb_map_push(map, skb, skb->tstamp);

It is also possible here; if we could not push into the map with a
certain key it wouldn't be a PIFO.
Please look at patch 16/17 for an example (test_xdp_pifo.c); it's just
that the interface is different (bpf_redirect_map), and the key has
been expanded to 64 bits to accommodate such use cases. It is also
possible in a future version of the patch to amortize the cost of
taking the lock for each enqueue by doing batching, similar to what
the cpumap/devmap implementations do.
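
For illustration, the XDP-side enqueue with a 64-bit priority looks
roughly like this (editor's sketch against the interface proposed in
this RFC, not mainline; the map definition details and the EDT-style
rank are assumptions, and since the series currently expects ranks to
start near zero a real program might use a relative timestamp):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);	/* map type from this RFC */
	__uint(key_size, sizeof(__u64));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
} pifo_map SEC(".maps");

SEC("xdp")
int xdp_enqueue(struct xdp_md *ctx)
{
	/* rank doubles as the earliest departure time in nanoseconds */
	__u64 prio = bpf_ktime_get_ns();

	return bpf_redirect_map(&pifo_map, prio, 0);
}

char _license[] SEC("license") = "GPL";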

>
> (Please refer another reply from me for how to get the min when poping,
> which is essentially just a popular interview coding problem.)
>
> Actually, if we look into the in-kernel EDT implementation (net/sched/sch_etf.c),
> it is also based on rbtree rather than PIFO. ;-)
>
> Thanks.
Cong Wang July 17, 2022, 7:12 p.m. UTC | #16
On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Packet forwarding is an important use case for XDP, which offers
> >> >> significant performance improvements compared to forwarding using the
> >> >> regular networking stack. However, XDP currently offers no mechanism to
> >> >> delay, queue or schedule packets, which limits the practical uses for
> >> >> XDP-based forwarding to those where the capacity of input and output links
> >> >> always match each other (i.e., no rate transitions or many-to-one
> >> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> >> traffic shaping or reordering to enforce policy.
> >> >>
> >> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> >> code in these patches is functional, but needs additional testing and
> >> >> polishing before being considered for merging. I'm posting it here as an
> >> >> RFC to get some early feedback on the API and overall design of the
> >> >> feature.
> >> >>
> >> >> DESIGN
> >> >>
> >> >> The design consists of three components: A new map type for storing XDP
> >> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> >> packets from the map, optionally drop them, and to schedule an interface
> >> >> for transmission.
> >> >>
> >> >> The new map type is modelled on the PIFO data structure proposed in the
> >> >> literature[0][1]. It represents a priority queue where packets can be
> >> >> enqueued in any priority, but is always dequeued from the head. From the
> >> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> >> helper, where the target index is the desired priority.
> >> >
> >> > I have the same question I asked on the series from Cong:
> >> > Any considerations for existing carousel/edt-like models?
> >>
> >> Well, the reason for the addition in patch 5 (continuously increasing
> >> priorities) is exactly to be able to implement EDT-like behaviour, where
> >> the priority is used as time units to clock out packets.
> >
> > Ah, ok, I didn't read the patches closely enough. I saw some limits
> > for the ranges and assumed that it wasn't capable of efficiently
> > storing 64-bit timestamps..
> 
> The goal is definitely to support full 64-bit priorities. Right now you
> have to start out at 0 but can go on for a full 64 bits, but that's a
> bit of an API wart that I'd like to get rid of eventually...
> 
> >> > Can we make the map flexible enough to implement different qdisc
> >> > policies?
> >>
> >> That's one of the things we want to be absolutely sure about. We are
> >> starting out with the PIFO map type because the literature makes a good
> >> case that it is flexible enough to implement all conceivable policies.
> >> The goal of the test harness linked as note [4] is to actually examine
> >> this; Frey is our PhD student working on this bit.
> >>
> >> Thus far we haven't hit any limitations on this, but we'll need to add
> >> more policies before we are done with this. Another consideration is
> >> performance, of course, so we're also planning to do a comparison with a
> >> more traditional "bunch of FIFO queues" type data structure for at least
> >> a subset of the algorithms. Kartikeya also had an idea for an
> >> alternative way to implement a priority queue using (semi-)lockless
> >> skiplists, which may turn out to perform better.
> >>
> >> If there's any particular policy/algorithm you'd like to see included in
> >> this evaluation, please do let us know, BTW! :)
> >
> > I honestly am not sure what the bar for accepting this should be. But
> > on the Cong's series I mentioned Martin's CC bpf work as a great
> > example of what we should be trying to do for qdisc-like maps. Having
> > a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> > very convincing :-)
> 
> Just doing flow queueing is quite straight forward with PIFOs. We're
> working on fq_codel. Personally I also want to implement something that
> has feature parity with sch_cake (which includes every feature and the
> kitchen sink already) :)

And how exactly would you plan to implement Least Slack Time First with
PIFOs? See https://www.usenix.org/system/files/nsdi20-paper-sharma.pdf.
Can you be as specific as possible, ideally with pseudo code?

BTW, this is very easy to do with my approach, as it has no First-Out
limitation.

Thanks!
Cong Wang July 17, 2022, 7:23 p.m. UTC | #17
On Sun, Jul 17, 2022 at 08:41:10PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sun, 17 Jul 2022 at 20:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev <sdf@google.com> writes:
> > >
> > > > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >>
> > > >> Packet forwarding is an important use case for XDP, which offers
> > > >> significant performance improvements compared to forwarding using the
> > > >> regular networking stack. However, XDP currently offers no mechanism to
> > > >> delay, queue or schedule packets, which limits the practical uses for
> > > >> XDP-based forwarding to those where the capacity of input and output links
> > > >> always match each other (i.e., no rate transitions or many-to-one
> > > >> forwarding). It also prevents an XDP-based router from doing any kind of
> > > >> traffic shaping or reordering to enforce policy.
> > > >>
> > > >> This series represents a first RFC of our attempt to remedy this lack. The
> > > >> code in these patches is functional, but needs additional testing and
> > > >> polishing before being considered for merging. I'm posting it here as an
> > > >> RFC to get some early feedback on the API and overall design of the
> > > >> feature.
> > > >>
> > > >> DESIGN
> > > >>
> > > >> The design consists of three components: A new map type for storing XDP
> > > >> frames, a new 'dequeue' program type that will run in the TX softirq to
> > > >> provide the stack with packets to transmit, and a set of helpers to dequeue
> > > >> packets from the map, optionally drop them, and to schedule an interface
> > > >> for transmission.
> > > >>
> > > >> The new map type is modelled on the PIFO data structure proposed in the
> > > >> literature[0][1]. It represents a priority queue where packets can be
> > > >> enqueued in any priority, but is always dequeued from the head. From the
> > > >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> > > >> helper, where the target index is the desired priority.
> > > >
> > > > I have the same question I asked on the series from Cong:
> > > > Any considerations for existing carousel/edt-like models?
> > >
> > > Well, the reason for the addition in patch 5 (continuously increasing
> > > priorities) is exactly to be able to implement EDT-like behaviour, where
> > > the priority is used as time units to clock out packets.
> >
> > Are you sure? I seriouly doubt your patch can do this at all...
> >
> > Since your patch relies on bpf_map_push_elem(), which has no room for
> > 'key' hence you reuse 'flags' but you also reserve 4 bits there... How
> > could tstamp be packed with 4 reserved bits??
> >
> > To answer Stanislav's question, this is how my code could handle EDT:
> >
> > // BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
> > skb->tstamp = XXX;
> > bpf_skb_map_push(map, skb, skb->tstamp);
> 
> It is also possible here, if we could not push into the map with a
> certain key it wouldn't be a PIFO.
> Please look at patch 16/17 for an example (test_xdp_pifo.c), it's just
> that the interface is different (bpf_redirect_map),


Sorry for repeating that I don't care about the XDP case at all. Please
let me know how this works for eBPF Qdisc. This is what I found in 16/17:

+ ret = bpf_map_push_elem(&pifo_map, &val, flags);


> the key has been expanded to 64 bits to accommodate such use cases. It
> is also possible in a future version of the patch to amortize the cost
> of taking the lock for each enqueue by doing batching, similar to what
> cpumap/devmap implementations do.

How about the 4 reserved bits?

 ret = bpf_map_push_elem(&pifo_map, &val, flags);

which leads to:

+#define BPF_PIFO_PRIO_MASK	(~0ULL >> 4)
...
+static int pifo_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	struct bpf_pifo_element *dst;
+	unsigned long irq_flags;
+	u64 prio;
+	int ret;
+
+	/* Check if any of the actual flag bits are set */
+	if (flags & ~BPF_PIFO_PRIO_MASK)
+		return -EINVAL;
+
+	prio = flags & BPF_PIFO_PRIO_MASK;


Please let me know how you calculate 64 bits while I only count 60
bits (for the skb case, obviously)?

Wait a second: since BPF_EXIST already takes up a bit, I think you
actually have 59 bits here...

Thanks!
Toke Høiland-Jørgensen July 18, 2022, 12:12 p.m. UTC | #18
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Packet forwarding is an important use case for XDP, which offers
>> >> significant performance improvements compared to forwarding using the
>> >> regular networking stack. However, XDP currently offers no mechanism to
>> >> delay, queue or schedule packets, which limits the practical uses for
>> >> XDP-based forwarding to those where the capacity of input and output links
>> >> always match each other (i.e., no rate transitions or many-to-one
>> >> forwarding). It also prevents an XDP-based router from doing any kind of
>> >> traffic shaping or reordering to enforce policy.
>> >>
>> >> This series represents a first RFC of our attempt to remedy this lack. The
>> >> code in these patches is functional, but needs additional testing and
>> >> polishing before being considered for merging. I'm posting it here as an
>> >> RFC to get some early feedback on the API and overall design of the
>> >> feature.
>> >>
>> >> DESIGN
>> >>
>> >> The design consists of three components: A new map type for storing XDP
>> >> frames, a new 'dequeue' program type that will run in the TX softirq to
>> >> provide the stack with packets to transmit, and a set of helpers to dequeue
>> >> packets from the map, optionally drop them, and to schedule an interface
>> >> for transmission.
>> >>
>> >> The new map type is modelled on the PIFO data structure proposed in the
>> >> literature[0][1]. It represents a priority queue where packets can be
>> >> enqueued in any priority, but is always dequeued from the head. From the
>> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> >> helper, where the target index is the desired priority.
>> >
>> > I have the same question I asked on the series from Cong:
>> > Any considerations for existing carousel/edt-like models?
>> 
>> Well, the reason for the addition in patch 5 (continuously increasing
>> priorities) is exactly to be able to implement EDT-like behaviour, where
>> the priority is used as time units to clock out packets.
>
> Are you sure? I seriouly doubt your patch can do this at all...
>
> Since your patch relies on bpf_map_push_elem(), which has no room for
> 'key' hence you reuse 'flags' but you also reserve 4 bits there... How
> could tstamp be packed with 4 reserved bits??

Well, my point was that the *data structure* itself supports 64-bit
priorities, and that's what we use from bpf_redirect_map() in XDP. The
choice of reserving four bits was a somewhat arbitrary one on my
part. I actually figured 60 bits would be plenty to represent timestamps
in themselves, but I guess I miscalculated a bit for nanosecond
timestamps (60 bits only gets you 36 years of range there).

We could lower that to 2 reserved bits, which gets you a range of 146
years using 62 bits; or users could just right-shift the value by a
couple of bits before putting them in the map (scheduling with
single-nanosecond precision is not possible anyway, so losing a few bits
of precision is no big deal); or we could add a new helper instead of
reusing the existing one.
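
A quick sketch of the right-shift workaround from the enqueueing
program's side (editor's illustration; it assumes the RFC's convention
of carrying the priority in the flags argument of bpf_map_push_elem()):

/* With 4 bits of the flags argument reserved, shift the nanosecond
 * clock right by 4 so the departure time fits in the remaining 60
 * bits; 16 ns granularity is still far finer than any achievable
 * scheduling precision, and 60 bits of 16 ns units cover centuries. */
__u64 tstamp = bpf_ktime_get_ns() >> 4;

ret = bpf_map_push_elem(&pifo_map, &val, tstamp);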

> Actually, if we look into the in-kernel EDT implementation
> (net/sched/sch_etf.c), it is also based on rbtree rather than PIFO.

The main reason I eschewed the existing rbtree code is that I don't
believe it's sufficiently performant, mainly due to the rebalancing.
This is a hunch, though, and as I mentioned in a different reply I'm
planning to go back and revisit the data structure, including
benchmarking different implementations against each other.

-Toke
Toke Høiland-Jørgensen July 18, 2022, 12:25 p.m. UTC | #19
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> >>
>> >> >> Packet forwarding is an important use case for XDP, which offers
>> >> >> significant performance improvements compared to forwarding using the
>> >> >> regular networking stack. However, XDP currently offers no mechanism to
>> >> >> delay, queue or schedule packets, which limits the practical uses for
>> >> >> XDP-based forwarding to those where the capacity of input and output links
>> >> >> always match each other (i.e., no rate transitions or many-to-one
>> >> >> forwarding). It also prevents an XDP-based router from doing any kind of
>> >> >> traffic shaping or reordering to enforce policy.
>> >> >>
>> >> >> This series represents a first RFC of our attempt to remedy this lack. The
>> >> >> code in these patches is functional, but needs additional testing and
>> >> >> polishing before being considered for merging. I'm posting it here as an
>> >> >> RFC to get some early feedback on the API and overall design of the
>> >> >> feature.
>> >> >>
>> >> >> DESIGN
>> >> >>
>> >> >> The design consists of three components: A new map type for storing XDP
>> >> >> frames, a new 'dequeue' program type that will run in the TX softirq to
>> >> >> provide the stack with packets to transmit, and a set of helpers to dequeue
>> >> >> packets from the map, optionally drop them, and to schedule an interface
>> >> >> for transmission.
>> >> >>
>> >> >> The new map type is modelled on the PIFO data structure proposed in the
>> >> >> literature[0][1]. It represents a priority queue where packets can be
>> >> >> enqueued in any priority, but is always dequeued from the head. From the
>> >> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> >> >> helper, where the target index is the desired priority.
>> >> >
>> >> > I have the same question I asked on the series from Cong:
>> >> > Any considerations for existing carousel/edt-like models?
>> >>
>> >> Well, the reason for the addition in patch 5 (continuously increasing
>> >> priorities) is exactly to be able to implement EDT-like behaviour, where
>> >> the priority is used as time units to clock out packets.
>> >
>> > Ah, ok, I didn't read the patches closely enough. I saw some limits
>> > for the ranges and assumed that it wasn't capable of efficiently
>> > storing 64-bit timestamps..
>> 
>> The goal is definitely to support full 64-bit priorities. Right now you
>> have to start out at 0 but can go on for a full 64 bits, but that's a
>> bit of an API wart that I'd like to get rid of eventually...
>> 
>> >> > Can we make the map flexible enough to implement different qdisc
>> >> > policies?
>> >>
>> >> That's one of the things we want to be absolutely sure about. We are
>> >> starting out with the PIFO map type because the literature makes a good
>> >> case that it is flexible enough to implement all conceivable policies.
>> >> The goal of the test harness linked as note [4] is to actually examine
>> >> this; Frey is our PhD student working on this bit.
>> >>
>> >> Thus far we haven't hit any limitations on this, but we'll need to add
>> >> more policies before we are done with this. Another consideration is
>> >> performance, of course, so we're also planning to do a comparison with a
>> >> more traditional "bunch of FIFO queues" type data structure for at least
>> >> a subset of the algorithms. Kartikeya also had an idea for an
>> >> alternative way to implement a priority queue using (semi-)lockless
>> >> skiplists, which may turn out to perform better.
>> >>
>> >> If there's any particular policy/algorithm you'd like to see included in
>> >> this evaluation, please do let us know, BTW! :)
>> >
>> > I honestly am not sure what the bar for accepting this should be. But
>> > on the Cong's series I mentioned Martin's CC bpf work as a great
>> > example of what we should be trying to do for qdisc-like maps. Having
>> > a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
>> > very convincing :-)
>> 
>> Just doing flow queueing is quite straight forward with PIFOs. We're
>> working on fq_codel. Personally I also want to implement something that
>> has feature parity with sch_cake (which includes every feature and the
>> kitchen sink already) :)
>
> And how exactly would you plan to implement Least Slack Time First with
> PIFOs?  See https://www.usenix.org/system/files/nsdi20-paper-sharma.pdf.
> Can you be as specific as possible ideally with a pesudo code?

By sticking flows into the PIFO instead of individual packets.
Basically:

enqueue:

flow_id = hash_pkt(pkt);
flow_pifo = &flows[flow_id];
pifo_enqueue(flow_pifo, pkt, 0); // always enqueue at rank 0, so effectively a FIFO
pifo_enqueue(toplevel_pifo, flow_id, compute_rank(flow_id));

dequeue:

flow_id = pifo_dequeue(toplevel_pifo);
flow_pifo = &flows[flow_id];
pkt = pifo_dequeue(flow_pifo);
pifo_enqueue(toplevel_pifo, flow_id, compute_rank(flow_id)); // re-enqueue (if flow still backlogged)
return pkt;


We have not gotten around to doing a full implementation of this yet,
but SRPT/LSTF is on our list of algorithms to add :)
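
As a hedged sketch of how compute_rank() could encode least slack time
first in the scheme above (editor's illustration, not from the series;
the per-flow state, the rate constant and the helper names are all
assumptions):

/* Rank a flow by its current slack: time until its deadline minus the
 * time needed to transmit its backlog at the configured rate. Smaller
 * slack gives a smaller rank, so the top-level PIFO dequeues the most
 * urgent flow first. */
static __u64 compute_rank(__u32 flow_id)
{
	struct flow_state *fs = &flow_state[flow_id];	/* hypothetical per-flow map */
	__u64 now = bpf_ktime_get_ns();
	__u64 xmit_time = fs->backlog_bytes * NSEC_PER_BYTE;

	if (fs->deadline <= now + xmit_time)
		return 0;			/* already late: most urgent */
	return fs->deadline - now - xmit_time;	/* slack in nanoseconds */
}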

> BTW, this is very easy to do with my approach as no FO limitations.

How does being able to dequeue out-of-order actually help with this
particular scheme? On dequeue you still process things in priority
order?

-Toke
Toke Høiland-Jørgensen July 18, 2022, 12:45 p.m. UTC | #20
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 13, 2022 at 01:14:08PM +0200, Toke Høiland-Jørgensen wrote:
>> Packet forwarding is an important use case for XDP, which offers
>> significant performance improvements compared to forwarding using the
>> regular networking stack. However, XDP currently offers no mechanism to
>> delay, queue or schedule packets, which limits the practical uses for
>> XDP-based forwarding to those where the capacity of input and output links
>> always match each other (i.e., no rate transitions or many-to-one
>> forwarding). It also prevents an XDP-based router from doing any kind of
>> traffic shaping or reordering to enforce policy.
>> 
>
> Sorry for forgetting to respond to your email to my patchset.
>
> The most important question from you is actually why I give up on PIFO.
> Actually its limitation is already in its name, its name Push In First
> Out already says clearly that it only allows to dequeue the first one.
> Still confusing?
>
> You can take a look at your pifo_map_pop_elem(), which is the
> implementation for bpf_map_pop_elem(), which is:
>
>        long bpf_map_pop_elem(struct bpf_map *map, void *value)
>
> Clearly, there is no even 'key' in its parameter list. If you just
> compare it to mine:
>
> 	BPF_CALL_2(bpf_skb_map_pop, struct bpf_map *, map, u64, key)
>
> Is their difference now 100% clear? :)
>
> The next question is why this is important (actually it is the most
> important)? Because we (I mean for eBPF Qdisc users, not sure about you)
> want the programmability, which I have been emphasizing since V1...

Right, I realise that in a strictly abstract sense, only being able to
dequeue at the head is a limitation. However, what I'm missing is what
concrete thing that limitation prevents you from implementing (see my
reply to your other email about LSTF)? I'm really not trying to be
disingenuous; I have no interest in ending up with a map primitive that
turns out to be limiting down the road...

> BTW, what is _your_ use case for skb map and user-space PIFO map? I am
> sure you have uses case for XDP, it is unclear what you have for other
> cases. Please don't piggy back use cases you don't have, we all have
> to justify all use cases. :)

You keep talking about the SKB and XDP paths as though they're
completely separate things. They're not: we're adding a general
capability to the kernel to implement programmable packet queueing using
BPF. One is for packets forwarded from the XDP hook, the other is for
packets coming from the stack. In an ideal world we'd only need one hook
that could handle both paths, but as the discussion I linked to in my
cover letter shows, that is probably not going to be feasible.

So we'll most likely end up with two hooks, but as far as use cases are
concerned they are the same: I want to make sure that the primitives we
add are expressive enough to implement every conceivable queueing and
scheduling algorithm. I don't view the two efforts to be in competition
with each other either; we're really trying to achieve the same thing
here, so let's work together to make sure we end up with something that
works for both the XDP and qdisc layers? :)

The reason I mention the SKB path in particular in this series is that I
want to make sure we make the two as compatible as we can, so as not to
unnecessarily fragment the ecosystem. Sharing primitives is the obvious
way to do that.

-Toke