
[RFC,v5,0/5] net_sched: introduce eBPF based Qdisc

Message ID 20220602041028.95124-1-xiyou.wangcong@gmail.com (mailing list archive)
Series net_sched: introduce eBPF based Qdisc

Message

Cong Wang June 2, 2022, 4:10 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

This *incomplete* patchset introduces a programmable Qdisc with eBPF.

There are a few use cases:

1. Allow customizing Qdiscs in an easier way, so that people don't
   have to write a complete Qdisc kernel module just to experiment
   with some new queuing theory.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
   runs before enqueue, so it is impossible to adjust those "tokens"
   after packets get dropped during enqueue. With an eBPF Qdisc, this
   is easy to solve with a map shared between clsact and sch_bpf (see
   the sketch after this list).

3. Replace qevents, as now the user gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chains and blocks to reuse TC filters, but they are too complicated
   to understand. With the eBPF helper bpf_skb_tc_classify(), we can
   invoke TC filters on _any_ Qdisc (even on a different netdev) to do
   the classification.

5. Potentially pave a way for ingress to queue packets, although the
   current implementation is still egress-only.

6. Possibly pave a way for handling the TCP protocol in TC, as the
   rbtree is already used by TCP to handle retransmission.
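
To make use case 2 concrete, here is an untested sketch of the shared-map
idea: the clsact program and the sch_bpf enqueue program pin the same
array map and adjust a single token counter, so a drop at enqueue time
can refund what clsact already charged. Only standard map helpers are
used here; how the enqueue side is attached is left to the later patches.

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  /* Pinned by name so the clsact and sch_bpf programs share it. */
  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __s64);            /* tokens, in bytes */
          __uint(pinning, LIBBPF_PIN_BY_NAME);
  } edt_tokens SEC(".maps");

  /* clsact egress side: charge the packet before it is enqueued. */
  SEC("tc")
  int edt_charge(struct __sk_buff *skb)
  {
          __u32 key = 0;
          __s64 *tokens = bpf_map_lookup_elem(&edt_tokens, &key);

          if (tokens)
                  __sync_fetch_and_add(tokens, -(__s64)skb->len);
          return TC_ACT_OK;
  }

  /* Called from the sch_bpf enqueue program when it drops a packet,
   * to give the charge back. */
  static __always_inline void edt_refund(__u32 len)
  {
          __u32 key = 0;
          __s64 *tokens = bpf_map_lookup_elem(&edt_tokens, &key);

          if (tokens)
                  __sync_fetch_and_add(tokens, (__s64)len);
  }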

The goal here is to make this Qdisc as programmable as possible, that
is, to replace as many existing Qdiscs as we can, whether in tree or
out of tree. This is why I gave up on PIFO, which has serious
limitations on programmability.

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program a
   Qdisc with that approach: literally all of struct Qdisc_ops and
   struct Qdisc_class_ops would need to be implemented, which is almost
   as hard as writing a Qdisc kernel module.

2. Introduce an skb map, which will allow other eBPF programs to store
   skbs too.

   a) As eBPF maps are not directly visible to the kernel, we have to
   dump the stats via the eBPF map APIs instead of netlink.

   b) User space is not allowed to read the entire packet; only the
   __sk_buff itself is readable, because we don't have such a use case
   yet and it would require a different API to read the data, as map
   values have a fixed length.

   c) Two eBPF helpers are introduced for skb map operations:
   bpf_skb_map_push() and bpf_skb_map_pop(). Normal map updates are
   not allowed (see the sketch after this list).

   d) Multi-queue support is implemented via map-in-map, in a similar
   push/pop fashion.

   e) Use the netdevice notifier to reset the packets inside the skb map
   upon a NETDEV_DOWN event.

3. Integrate with the existing TC infra. For example, if the user doesn't
   want to implement her own filters (e.g. a flow dissector), she should
   be able to reuse the existing TC filters. Another helper,
   bpf_skb_tc_classify(), is introduced for this purpose.
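
To give a rough feel for how 2c) and 3) are meant to be used from the
program side, here is a sketch of a trivial FIFO. Everything below that
is not literally named in this cover letter (the map type enum, the
program context, the section names, the return codes, and the exact
helper signatures) is a placeholder for illustration only; the real
definitions are in patches 3-5.

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* skb priority queue from patch 3; the enum name is a guess. */
  struct {
          __uint(type, BPF_MAP_TYPE_SKB_MAP);
          __uint(max_entries, 1024);
  } pq SEC(".maps");

  SEC("sch_bpf/enqueue")                      /* section name guessed */
  int fifo_enqueue(struct sch_bpf_ctx *ctx)   /* ctx layout guessed   */
  {
          /* Rank by arrival time, i.e. plain FIFO behaviour. */
          __u64 rank = bpf_ktime_get_ns();

          if (bpf_skb_map_push(&pq, ctx->skb, rank) < 0)
                  return SCH_BPF_DROP;        /* return codes guessed */
          return SCH_BPF_QUEUED;
  }

  SEC("sch_bpf/dequeue")
  int fifo_dequeue(struct sch_bpf_ctx *ctx)
  {
          /* Pop the lowest-rank skb back into the context, or report
           * an empty queue. */
          if (bpf_skb_map_pop(&pq, &ctx->skb) < 0)
                  return SCH_BPF_NONE;
          return SCH_BPF_DEQUEUED;
  }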

Any high-level feedback is welcome. Please kindly do not review any coding
details until the RFC tag is removed.

TODO:
1. actually test it
2. write a document for this Qdisc
3. add test cases and sample code

Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Cong Wang (5):
  net: introduce skb_rbtree_walk_safe()
  bpf: move map in map declarations to bpf.h
  bpf: introduce skb map and flow map
  net_sched: introduce eBPF based Qdisc
  net_sched: introduce helper bpf_skb_tc_classify()

 include/linux/bpf.h            |   6 +
 include/linux/bpf_types.h      |   4 +
 include/linux/skbuff.h         |   9 +-
 include/uapi/linux/bpf.h       |  23 ++
 include/uapi/linux/pkt_sched.h |  17 ++
 kernel/bpf/arraymap.c          |   2 -
 kernel/bpf/hashtab.c           |   1 -
 kernel/bpf/map_in_map.c        |   2 -
 kernel/bpf/map_in_map.h        |  19 --
 kernel/bpf/verifier.c          |  10 +
 net/core/Makefile              |   1 +
 net/core/filter.c              |  39 +++
 net/core/skb_map.c             | 520 +++++++++++++++++++++++++++++++++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/cls_api.c            |  69 +++++
 net/sched/sch_bpf.c            | 485 ++++++++++++++++++++++++++++++
 17 files changed, 1198 insertions(+), 25 deletions(-)
 delete mode 100644 kernel/bpf/map_in_map.h
 create mode 100644 net/core/skb_map.c
 create mode 100644 net/sched/sch_bpf.c

Comments

Toke Høiland-Jørgensen June 24, 2022, 4:52 p.m. UTC | #1
Cong Wang <xiyou.wangcong@gmail.com> writes:

> From: Cong Wang <cong.wang@bytedance.com>
>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

Sorry for the delay in looking at this; a few comments below:

[...]

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability.

Could you elaborate on the limitations of the PIFO? It looks to me like
the SKB map you're proposing is very close to a PIFO (it's a priority
queue that SKBs can be inserted into at an arbitrary position)?

> Here is a summary of design decisions I made:
>
> 1. Avoid eBPF struct_ops, as it would be really hard to program
>    a Qdisc with this approach, literally all the struct Qdisc_ops
>    and struct Qdisc_class_ops are needed to implement. This is almost
>    as hard as programming a Qdisc kernel module.

I agree that implementing the full Qdisc_class_ops as BPF functions is
going to be prohibitive; let's use this opportunity to define a simpler
API!

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>    too.
>
>    a) As eBPF maps are not directly visible to the kernel, we have to
>    dump the stats via eBPF map API's instead of netlink.

Why not do a hybrid thing where the kernel side of the qdisc keeps some
basic stats (packet counter for number of packets queued/dropped for
instance) and define a separate BPF callback that can return more stats
to the kernel for translation into netlink (e.g., "how many packets are
currently in the queue")? And then if a particular implementation wants
to do more custom stats, they can use BPF APIs for that?
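
A rough sketch of that split, with every name below made up purely for
illustration (no such attach point exists in this series), would be a
small program that only reports queue occupancy and leaves the basic
counters to the kernel:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Kept up to date by the enqueue/dequeue programs. */
  __u32 queued_pkts;
  __u32 queued_bytes;

  struct sch_bpf_stats {        /* made-up layout */
          __u32 qlen;
          __u32 backlog;
  };

  SEC("sch_bpf/dump_stats")     /* hypothetical attach point */
  int stats_cb(struct sch_bpf_stats *st)
  {
          st->qlen    = queued_pkts;
          st->backlog = queued_bytes;
          return 0;
  }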

>    b) The user-space is not allowed to read the entire packets, only
>    __sk_buff itself is readable, because we don't have such a use case
>    yet and it would require a different API to read the data, as map
>    values have fixed length.

I agree there's not any need to make the packet contents directly
accessible to userspace via reading the map.

>    c) Two eBPF helpers are introduced for skb map operations:
>    bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>    not allowed.

So with kptr support in the map this could conceivably be done via the
existing map_push() etc helpers?

>    d) Multi-queue support is implemented via map-in-map, in a similar
>    push/pop fashion.

Not sure I understand what you mean by this? Will the qdisc
automatically attach itself to all HWQs of an interface (like sch_mq
does)? Surely it will be up to the BPF program whether it'll use
map-in-map or something else?

>    e) Use the netdevice notifier to reset the packets inside skb map upon
>    NETDEV_DOWN event.

Is it really necessary to pull out SKBs from under the BPF program? If
the qdisc is removed and the BPF program goes away, so will the map and
all SKBs stored in it?

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>    to implement her own filters (e.g. a flow dissector), she should be able
>    to re-use the existing TC filters. Another helper bpf_skb_tc_classify() is
>    introduced for this purpose.

This seems a bit convoluted? If a BPF program wants to reuse an existing
BPF TC filter it can just do that at the code level. As for the
in-kernel filters, are there really any of them that would be worth
reusing from a BPF qdisc other than the flow dissector? In which case, why not
just expose that instead of taking all the TC filter infrastructure with
you?


Bringing in some text from patch 4 as well to comment on it all in one go:

> Introduce a new Qdisc which is completely managed by eBPF program
> of type BPF_PROG_TYPE_SCHED_QDISC. It accepts two eBPF programs of the
> same type, but one for enqueue and the other for dequeue.
> 
> And it interacts with Qdisc layer in two ways:
> 1) It relies on Qdisc watchdog to handle throttling;

Having a way for BPF to schedule the qdisc watchdog wakeup is probably
needed. For the XDP queueing side I'm planning to use BPF timers for
(the equivalent of) this, but since we already have a watchdog mechanism
on the qdisc side maybe just exposing that is fine...
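
For reference, the bpf_timer plumbing for that kind of wakeup would look
roughly like this (a minimal, untested sketch; it also assumes the qdisc
program type would be allowed to use timers at all):

  #include <linux/bpf.h>
  #include <linux/time.h>
  #include <bpf/bpf_helpers.h>

  struct wd_elem {
          struct bpf_timer t;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct wd_elem);
  } watchdog SEC(".maps");

  static int wd_fire(void *map, int *key, struct wd_elem *val)
  {
          /* Wake the dequeue side here, via whatever hook the qdisc
           * ends up exposing for that. */
          return 0;
  }

  /* Arm the wakeup @delay_ns from now, e.g. from the enqueue program
   * when the packet is not yet eligible to be sent. */
  static __always_inline void wd_arm(__u64 delay_ns)
  {
          __u32 key = 0;
          struct wd_elem *val = bpf_map_lookup_elem(&watchdog, &key);

          if (!val)
                  return;
          bpf_timer_init(&val->t, &watchdog, CLOCK_MONOTONIC);
          bpf_timer_set_callback(&val->t, wd_fire);
          bpf_timer_start(&val->t, delay_ns, 0);
  }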

> 2) It could pass the skb enqueue/dequeue down to child classes

I'm not sure I think it's a good idea to keep the qdisc hierarchy. If
someone wants to build a classification hierarchy they could just do
this directly in BPF by whichever mechanism they want? I.e., it can just
instantiate multiple skb maps in a single program to do classful
queueing and select the right one however it wants?

Keeping the qdisc itself simple as, basically:

enqueue: pass skb to BPF program to do with as it will (enqueue into a
map, drop, etc)

dequeue: execute BPF program, transmit pkt if it returns one
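
i.e. roughly this shape on the kernel side (pseudo-code: the
run_*_prog() helpers and the QUEUED verdict are made up, only
qdisc_drop() and the ops signatures are the real ones):

  #include <net/sch_generic.h>

  static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                             struct sk_buff **to_free)
  {
          /* Hand the skb to the BPF program; it may stash it in a map
           * or tell us to drop it. */
          if (run_enqueue_prog(sch, skb) == QUEUED)
                  return NET_XMIT_SUCCESS;
          return qdisc_drop(skb, sch, to_free);
  }

  static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
  {
          /* Whatever skb the program hands back gets transmitted;
           * NULL means the queue is empty (or throttled). */
          return run_dequeue_prog(sch);
  }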

-Toke
Dave Taht June 24, 2022, 5:35 p.m. UTC | #2
In my world, it's inbound shaping that's the biggest problem, and my
dream has long been to somehow have a way to do:

tc qdisc add dev eth0 ingress cake bandwidth 800mbit

and have that work across multiple cores, without act_mirred and
without any filter magic. Getting there via eBPF as in
https://github.com/rchac/LibreQoS is working for ISPs (plea: anyone
got time to make ipv6 work there?), but what's killing us is running
out of CPU for inbound shaping on the current act_mirred path we use.
Stanislav Fomichev June 24, 2022, 8:51 p.m. UTC | #3
On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability.

> Here is a summary of design decisions I made:

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>     too.

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only
>     __sk_buff itself is readable, because we don't have such a use case
>     yet and it would require a different API to read the data, as map
>     values have fixed length.

>     c) Two eBPF helpers are introduced for skb map operations:
>     bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>     not allowed.

>     d) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fashion.

>     e) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event.

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
>     to implement her own filters (e.g. a flow dissector), she should be
>     able to re-use the existing TC filters. Another helper
>     bpf_skb_tc_classify() is introduced for this purpose.

> Any high-level feedback is welcome. Please kindly do not review any coding
> details until RFC tag is removed.

> TODO:
> 1. actually test it

Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even as an RFC) should also have a good
example to show that a BPF qdisc is on par and can be used to at least
implement existing policies. fq/fq_codel/cake are good candidates.