[RFC,v6,0/5] net_sched: introduce eBPF based Qdisc

Message ID	20221005171709.150520-1-xiyou.wangcong@gmail.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Cong Wang <xiyou.wangcong@gmail.com> To: netdev@vger.kernel.org Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang <cong.wang@bytedance.com> Subject: [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc Date: Wed, 5 Oct 2022 10:17:04 -0700 Message-Id: <20221005171709.150520-1-xiyou.wangcong@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	net_sched: introduce eBPF based Qdisc \| expand [RFC,v6,0/5] net_sched: introduce eBPF based Qdisc [RFC,v6,1/5] bpf: Introduce rbtree map [RFC,v6,2/5] bpf: Add map in map support to rbtree [RFC,v6,3/5] net_sched: introduce eBPF based Qdisc [RFC,v6,4/5] net_sched: Add kfuncs for storing skb [RFC,v6,5/5] net_sched: Introduce helper bpf_skb_tc_classify()

Message ID

20221005171709.150520-1-xiyou.wangcong@gmail.com (mailing list archive)

Headers

From: Cong Wang <xiyou.wangcong@gmail.com>
To: netdev@vger.kernel.org
Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com,
        jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com,
        Cong Wang <cong.wang@bytedance.com>
Subject: [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc
Date: Wed,  5 Oct 2022 10:17:04 -0700
Message-Id: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

net_sched: introduce eBPF based Qdisc | expand

Message

Cong Wang Oct. 5, 2022, 5:17 p.m. UTC

From: Cong Wang <cong.wang@bytedance.com>

This *incomplete* patchset introduces a programmable Qdisc with eBPF.

There are a few use cases:

1. Allow customizing Qdisc's in an easier way. So that people don't
   have to write a complete Qdisc kernel module just to experiment
   some new queuing theory.

2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
   is before enqueue, it is impossible to adjust those "tokens" after
   packets get dropped in enqueue. With eBPF Qdisc, it is easy to
   be solved with a shared map between clsact and sch_bpf.

3. Replace qevents, as now the user gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chain and block to reuse the TC filters, but they are too complicated
   to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
   TC filters on _any_ Qdisc (even on a different netdev) to do the
   classification.

5. Potentially pave a way for ingress to queue packets, although
   current implementation is still only for egress.

6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
   is already used by TCP to handle TCP retransmission.

7. Potentially pave a way for implementing eBPF based CPU scheduler,
   because we already have task kptr and CFS is clearly based on rbtree
   too.

The goal here is to make this Qdisc as programmable as possible,
that is, to replace as many existing Qdisc's as we can, no matter
in tree or out of tree. This is why I give up on PIFO which has
serious limitations on the programmablity. More importantly, with
rbtree, it is easy to implement PIFO logic too.

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program
   a Qdisc with this approach, literally all the struct Qdisc_ops
   and struct Qdisc_class_ops are needed to implement. This is almost
   as hard as programming a Qdisc kernel module.

2. Introduce a generic rbtree map, which supports kptr, so that skb's
   can be stored inside.

   a) As eBPF maps are not directly visible to the kernel, we have to
   dump the stats via eBPF map API's instead of netlink.

   b) The user-space is not allowed to read the entire packets, only __sk_buff
   itself is readable, because we don't have such a use case yet and it would
   require a different API to read the data, as map values have fixed length.

   c) Multi-queue support is implemented via map-in-map, in a similar
   push/pop fasion based on rbtree too.

   d) Use the netdevice notifier to reset the packets inside skb map upon
   NETDEV_DOWN event. (TODO: need to recognize kptr type)

3. Integrate with existing TC infra. For example, if the user doesn't want
   to implement her own filters (e.g. a flow dissector), she should be able
   to re-use the existing TC filters. Another helper bpf_skb_tc_classify() is
   introduced for this purpose.

Any high-level feedback is welcome. Please kindly ignore any coding details
until RFC tag is removed.

TODO:
1. actually test it
2. write a document for this Qdisc
3. add test cases and sample code

Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Cong Wang (5):
  bpf: Introduce rbtree map
  bpf: Add map in map support to rbtree
  net_sched: introduce eBPF based Qdisc
  net_sched: Add kfuncs for storing skb
  net_sched: Introduce helper bpf_skb_tc_classify()

 include/linux/bpf.h            |   4 +
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |  19 ++
 include/uapi/linux/pkt_sched.h |  17 +
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/rbtree.c            | 603 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |   7 +
 net/core/filter.c              |  27 ++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/cls_api.c            |  69 ++++
 net/sched/sch_bpf.c            | 564 ++++++++++++++++++++++++++++++
 12 files changed, 1331 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/rbtree.c
 create mode 100644 net/sched/sch_bpf.c

Comments

Stanislav Fomichev Oct. 7, 2022, 5:27 p.m. UTC | #1

On 10/05, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> 7. Potentially pave a way for implementing eBPF based CPU scheduler,
>     because we already have task kptr and CFS is clearly based on rbtree
>     too.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I give up on PIFO which has
> serious limitations on the programmablity. More importantly, with
> rbtree, it is easy to implement PIFO logic too.

> Here is a summary of design decisions I made:

[..]

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

Can you expand more on what's exactly 'hard'? We need to override
enqueue/dequeue mostly and that's it?

Does it make sense to at least do, say, sch_struct_opts? Maybe we can
have some higher level struct_opts with enqueue/dequeue only for
bpf to override?

> 2. Introduce a generic rbtree map, which supports kptr, so that skb's
>     can be stored inside.

There has been several rbtree proposals. Do we really need an rbtree
here (ignoring performance)? Can we demonstrate these new qdisc
capabilities with a hashtable?

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only  
> __sk_buff
>     itself is readable, because we don't have such a use case yet and it  
> would
>     require a different API to read the data, as map values have fixed  
> length.

[..]

>     c) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fasion based on rbtree too.

I took a brief look at sch_bpf.c and I don't see how that would work.
Can you expand more on that (+locking)?

Also, in sch_bpf.c, you care about classid, why? Isn't the whole purpose
of the series is to expand tc capabilities? Why are we still dragging
this classid?


>     d) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event. (TODO: need to recognize kptr type)

[..]

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
> able
>     to re-use the existing TC filters. Another helper  
> bpf_skb_tc_classify() is
>     introduced for this purpose.

Supposedly, the idea is that bpf can implement all existing policies +
more, so why are we trying to re-use existing filters?

> Any high-level feedback is welcome. Please kindly ignore any coding  
> details
> until RFC tag is removed.

> TODO:
> 1. actually test it
> 2. write a document for this Qdisc

[..]

> 3. add test cases and sample code

I'd start with that; otherwise, it's not clear what are the current
capabilities.

> Cc: Toke H�iland-J�rgensen <toke@redhat.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
> v6: switch to kptr based approach

> v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
>      implement flow map as map-in-map
>      rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
>      clean up eBPF qdisc program context

> v4: get rid of PIFO, use rbtree directly

> v3: move priority queue from sch_bpf to skb map
>      introduce skb map and its helpers
>      introduce bpf_skb_classify()
>      use netdevice notifier to reset skb's
>      Rebase on latest bpf-next

> v2: Rebase on latest net-next
>      Make the code more complete (but still incomplete)

> Cong Wang (5):
>    bpf: Introduce rbtree map
>    bpf: Add map in map support to rbtree
>    net_sched: introduce eBPF based Qdisc
>    net_sched: Add kfuncs for storing skb
>    net_sched: Introduce helper bpf_skb_tc_classify()

>   include/linux/bpf.h            |   4 +
>   include/linux/bpf_types.h      |   4 +
>   include/uapi/linux/bpf.h       |  19 ++
>   include/uapi/linux/pkt_sched.h |  17 +
>   kernel/bpf/Makefile            |   2 +-
>   kernel/bpf/rbtree.c            | 603 +++++++++++++++++++++++++++++++++
>   kernel/bpf/syscall.c           |   7 +
>   net/core/filter.c              |  27 ++
>   net/sched/Kconfig              |  15 +
>   net/sched/Makefile             |   1 +
>   net/sched/cls_api.c            |  69 ++++
>   net/sched/sch_bpf.c            | 564 ++++++++++++++++++++++++++++++
>   12 files changed, 1331 insertions(+), 1 deletion(-)
>   create mode 100644 kernel/bpf/rbtree.c
>   create mode 100644 net/sched/sch_bpf.c

> --
> 2.34.1