diff mbox series

[RFC,net-next] net_sched: introduce eBPF based Qdisc

Message ID 20210821010240.10373-1-xiyou.wangcong@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC,net-next] net_sched: introduce eBPF based Qdisc | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 19 maintainers not CCed: andrii@kernel.org kafai@fb.com nogikh@google.com yhs@fb.com elver@google.com jonathan.lemon@gmail.com davem@davemloft.net memxor@gmail.com songliubraving@fb.com daniel@iogearbox.net willemb@google.com ilias.apalodimas@linaro.org kuba@kernel.org alobakin@pm.me haokexin@gmail.com pabeni@redhat.com kpsingh@kernel.org john.fastabend@gmail.com ast@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit fail Errors and warnings before: 12711 this patch: 11828
netdev/kdoc fail Errors and warnings before: 0 this patch: 1
netdev/verify_fixes success Link
netdev/checkpatch warning CHECK: Comparison to NULL could be written "node" CHECK: No space is necessary after a cast WARNING: Improper SPDX comment style for 'include/linux/priority_queue.h', please use '/*' instead WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/build_allmodconfig_warn fail Errors and warnings before: 12232 this patch: 3200
netdev/header_inline success Link

Commit Message

Cong Wang Aug. 21, 2021, 1:02 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

This *incomplete* patch introduces a programmable Qdisc with
eBPF.  The goal is to make Qdisc as programmable as possible,
that is, to replace as many existing Qdisc's as we can. ;)

The design was discussed during last LPC:
https://linuxplumbersconf.org/event/7/contributions/679/attachments/520/1188/sch_bpf.pdf

Here is a summary of design decisions I made:

1. Avoid eBPF struct_ops, as it would be really hard to program
   a Qdisc with this approach.

2. Avoid exposing skb's to user-space, which means we can't introduce
   a map to store skb's. Instead, store them in kernel without exposure
   to user-space.

So I choose to use priority queues to store skb's inside a
flow and to store flows inside a Qdisc, and let eBPF programs
decide the *relative* position of the skb within the flow and the
*relative* order of the flows too, upon each enqueue and dequeue.
Each flow is also exposed to user as a TC class, like many other
classful Qdisc's.

Although the biggest limitation is obviously that users can
not traverse the packets or flows inside the Qdisc, I think
at least they could store those global information of interest
inside their own map and map can be shared between enqueue and
dequeue. For example, users could use skb pointer as key and
rank as a value to find out the absolute order.

One of the challeges is how to interact with existing TC infra,
for instance, if users install TC filters on this Qdisc, should
we respect this by ignoring or rejecting eBPF enqueue program
attached or vice versa? Should we allow users to replace each
priority queue of a class with a regular Qdisc?

Any high-level feedbacks are welcome. Please do not review any
coding details until RFC tag is removed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 include/linux/bpf_types.h      |   2 +
 include/linux/priority_queue.h |  90 +++++++
 include/linux/skbuff.h         |   2 +
 include/uapi/linux/bpf.h       |  20 ++
 include/uapi/linux/pkt_sched.h |  15 ++
 net/sched/Kconfig              |  15 ++
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 590 +++++++++++++++++++++++++++++++++++++++++
 8 files changed, 735 insertions(+)

Comments

Martin KaFai Lau Aug. 24, 2021, 11:47 p.m. UTC | #1
On Fri, Aug 20, 2021 at 06:02:40PM -0700, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> This *incomplete* patch introduces a programmable Qdisc with
> eBPF.  The goal is to make Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can. ;)
> 
> The design was discussed during last LPC:
> https://linuxplumbersconf.org/event/7/contributions/679/attachments/520/1188/sch_bpf.pdf 
> 
> Here is a summary of design decisions I made:
> 
> 1. Avoid eBPF struct_ops, as it would be really hard to program
>    a Qdisc with this approach.
Please explain more on this.  What is currently missing
to make qdisc in struct_ops possible?

> 2. Avoid exposing skb's to user-space, which means we can't introduce
>    a map to store skb's. Instead, store them in kernel without exposure
>    to user-space.
> 
> So I choose to use priority queues to store skb's inside a
> flow and to store flows inside a Qdisc, and let eBPF programs
> decide the *relative* position of the skb within the flow and the
> *relative* order of the flows too, upon each enqueue and dequeue.
> Each flow is also exposed to user as a TC class, like many other
> classful Qdisc's.
> 
> Although the biggest limitation is obviously that users can
> not traverse the packets or flows inside the Qdisc, I think
> at least they could store those global information of interest
> inside their own map and map can be shared between enqueue and
> dequeue. For example, users could use skb pointer as key and
> rank as a value to find out the absolute order.
> 
> One of the challeges is how to interact with existing TC infra,
> for instance, if users install TC filters on this Qdisc, should
> we respect this by ignoring or rejecting eBPF enqueue program
> attached or vice versa? Should we allow users to replace each
> priority queue of a class with a regular Qdisc?
> 
> Any high-level feedbacks are welcome. Please do not review any
> coding details until RFC tag is removed.
> 
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Cc: Jiri Pirko <jiri@resnulli.us>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Cong Wang Sept. 1, 2021, 4:39 a.m. UTC | #2
On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> Please explain more on this.  What is currently missing
> to make qdisc in struct_ops possible?

I think you misunderstand this point. The reason why I avoid it is
_not_ anything is missing, quite oppositely, it is because it requires
a lot of work to implement a Qdisc with struct_ops approach, literally
all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
WIth current approach, programmers only need to implement two
eBPF programs (enqueue and dequeue).

Thanks.
John Fastabend Sept. 1, 2021, 5:45 a.m. UTC | #3
Cong Wang wrote:
> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > Please explain more on this.  What is currently missing
> > to make qdisc in struct_ops possible?
> 
> I think you misunderstand this point. The reason why I avoid it is
> _not_ anything is missing, quite oppositely, it is because it requires
> a lot of work to implement a Qdisc with struct_ops approach, literally
> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
> WIth current approach, programmers only need to implement two
> eBPF programs (enqueue and dequeue).
> 
> Thanks.

Another idea. Rather than work with qdisc objects which creates all
these issues with how to work with existing interfaces, filters, etc.
Why not create an sk_buff map? Then this can be used from the existing
egress/ingress hooks independent of the actual qdisc being used.

You mention skb should not be exposed to userspace? Why? Whats the
reason for this? Anyways we can make kernel only maps if we want or
scrub the data before passing it to userspace. We do this already in
some cases.

IMO it seems cleaner and more general to allow sk_buffs
to be stored in maps and pulled back out later for enqueue/dequeue.

I think one trick might be how to trigger the dequeue event on
transition from stopped to running net_device or other events like
this, but that could be solved with another program attached to
those events to kick the dequeue logic.

.John
Toke Høiland-Jørgensen Sept. 1, 2021, 10:42 a.m. UTC | #4
John Fastabend <john.fastabend@gmail.com> writes:

> Cong Wang wrote:
>> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> > Please explain more on this.  What is currently missing
>> > to make qdisc in struct_ops possible?
>> 
>> I think you misunderstand this point. The reason why I avoid it is
>> _not_ anything is missing, quite oppositely, it is because it requires
>> a lot of work to implement a Qdisc with struct_ops approach, literally
>> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
>> WIth current approach, programmers only need to implement two
>> eBPF programs (enqueue and dequeue).
>> 
>> Thanks.
>
> Another idea. Rather than work with qdisc objects which creates all
> these issues with how to work with existing interfaces, filters, etc.
> Why not create an sk_buff map? Then this can be used from the existing
> egress/ingress hooks independent of the actual qdisc being used.

I agree. In fact, I'm working on doing just this for XDP, and I see no
reason why the map type couldn't be reused for skbs as well. Doing it
this way has a couple of benefits:

- It leaves more flexibility to BPF: want a simple FIFO queue? just
  implement that with a single queue map. Or do you want to build a full
  hierarchical queueing structure? Just instantiate as many queue maps
  as you need to achieve this. Etc.

- The behaviour is defined entirely by BPF program behaviour, and does
  not require setting up a qdisc hierarchy in addition to writing BPF
  code.

- It should be possible to structure the hooks in a way that allows
  reusing queueing algorithm implementations between the qdisc and XDP
  layers.

> You mention skb should not be exposed to userspace? Why? Whats the
> reason for this? Anyways we can make kernel only maps if we want or
> scrub the data before passing it to userspace. We do this already in
> some cases.

Yup, that's my approach as well.

> IMO it seems cleaner and more general to allow sk_buffs
> to be stored in maps and pulled back out later for enqueue/dequeue.

FWIW there's some gnarly details here (for instance, we need to make
sure the BPF program doesn't leak packet references after they are
dequeued from the map). My idea is to use a scheme similar to what we do
for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
actually remove the packet from the queue until the BPF program exits
(so the kernel can make sure things are accounted correctly).

> I think one trick might be how to trigger the dequeue event on
> transition from stopped to running net_device or other events like
> this, but that could be solved with another program attached to
> those events to kick the dequeue logic.

This is actually easy in the qdisc case, I think: there's already a
qdisc_dequeue() operation, which just needs to execute a BPF program
that picks which packet to dequeue (by pulling it off a queue map). For
XDP we do need a new hook, on driver TX completion or something like
that. Details TBD. Also, we need a way to BPF to kick an idle interface
and make it start transmitting; that way we can implement a traffic
shaper (that delays packets) by using BPF timers :)

-Toke
Martin KaFai Lau Sept. 1, 2021, 5:45 p.m. UTC | #5
On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > Cong Wang wrote:
> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> > Please explain more on this.  What is currently missing
> >> > to make qdisc in struct_ops possible?
> >> 
> >> I think you misunderstand this point. The reason why I avoid it is
> >> _not_ anything is missing, quite oppositely, it is because it requires
> >> a lot of work to implement a Qdisc with struct_ops approach, literally
> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
> >> WIth current approach, programmers only need to implement two
> >> eBPF programs (enqueue and dequeue).
_if_ it is using as a qdisc object/interface,
the patch "looks" easier because it obscures some of the ops/interface
from the bpf user.  The user will eventually ask for more flexibility
and then an on-par interface as the kernel's qdisc.  If there are some
common 'ops', the common bpf code can be shared as a library in userspace
or there is also kfunc call to call into the kernel implementation.
For existing kernel qdisc author,  it will be easier to use the same
interface also.

> > Another idea. Rather than work with qdisc objects which creates all
> > these issues with how to work with existing interfaces, filters, etc.
> > Why not create an sk_buff map? Then this can be used from the existing
> > egress/ingress hooks independent of the actual qdisc being used.
> 
> I agree. In fact, I'm working on doing just this for XDP, and I see no
> reason why the map type couldn't be reused for skbs as well. Doing it
> this way has a couple of benefits:
> 
> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>   implement that with a single queue map. Or do you want to build a full
>   hierarchical queueing structure? Just instantiate as many queue maps
>   as you need to achieve this. Etc.
Agree.  Regardless how the interface may look like,
I even think being able to queue/dequeue an skb into different bpf maps
should be the first thing to do here.  Looking forward to your patches.

> 
> - The behaviour is defined entirely by BPF program behaviour, and does
>   not require setting up a qdisc hierarchy in addition to writing BPF
>   code.
Interesting idea.  If it does not need to use the qdisc object/interface
and be able to do the qdisc hierarchy setup in a programmable way, it may
be nice.  It will be useful for the future patches to come with some
bpf prog examples to do that.

> 
> - It should be possible to structure the hooks in a way that allows
>   reusing queueing algorithm implementations between the qdisc and XDP
>   layers.
> 
> > You mention skb should not be exposed to userspace? Why? Whats the
> > reason for this? Anyways we can make kernel only maps if we want or
> > scrub the data before passing it to userspace. We do this already in
> > some cases.
> 
> Yup, that's my approach as well.
> 
> > IMO it seems cleaner and more general to allow sk_buffs
> > to be stored in maps and pulled back out later for enqueue/dequeue.
> 
> FWIW there's some gnarly details here (for instance, we need to make
> sure the BPF program doesn't leak packet references after they are
> dequeued from the map). My idea is to use a scheme similar to what we do
> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
> actually remove the packet from the queue until the BPF program exits
> (so the kernel can make sure things are accounted correctly).
The verifier is tracking the sk's references.  Can it be reused to
track the skb's reference?

> 
> > I think one trick might be how to trigger the dequeue event on
> > transition from stopped to running net_device or other events like
> > this, but that could be solved with another program attached to
> > those events to kick the dequeue logic.
> 
> This is actually easy in the qdisc case, I think: there's already a
> qdisc_dequeue() operation, which just needs to execute a BPF program
> that picks which packet to dequeue (by pulling it off a queue map). For
> XDP we do need a new hook, on driver TX completion or something like
> that. Details TBD. Also, we need a way to BPF to kick an idle interface
> and make it start transmitting; that way we can implement a traffic
> shaper (that delays packets) by using BPF timers :)
> 
> -Toke
>
Alexei Starovoitov Sept. 1, 2021, 6:03 p.m. UTC | #6
On Wed, Sep 1, 2021 at 10:46 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > Another idea. Rather than work with qdisc objects which creates all
> > > these issues with how to work with existing interfaces, filters, etc.
> > > Why not create an sk_buff map? Then this can be used from the existing
> > > egress/ingress hooks independent of the actual qdisc being used.
> >
> > I agree. In fact, I'm working on doing just this for XDP, and I see no
> > reason why the map type couldn't be reused for skbs as well. Doing it
> > this way has a couple of benefits:
> >
> > - It leaves more flexibility to BPF: want a simple FIFO queue? just
> >   implement that with a single queue map. Or do you want to build a full
> >   hierarchical queueing structure? Just instantiate as many queue maps
> >   as you need to achieve this. Etc.
> Agree.  Regardless how the interface may look like,
> I even think being able to queue/dequeue an skb into different bpf maps
> should be the first thing to do here.  Looking forward to your patches.
>
> >
> > - The behaviour is defined entirely by BPF program behaviour, and does
> >   not require setting up a qdisc hierarchy in addition to writing BPF
> >   code.
> Interesting idea.  If it does not need to use the qdisc object/interface
> and be able to do the qdisc hierarchy setup in a programmable way, it may
> be nice.  It will be useful for the future patches to come with some
> bpf prog examples to do that.

Wow. When core developers think along the same lines and
build/refine the idea together it's simply awesome.
Toke Høiland-Jørgensen Sept. 2, 2021, 4:57 p.m. UTC | #7
Martin KaFai Lau <kafai@fb.com> writes:

> On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > Cong Wang wrote:
>> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > Please explain more on this.  What is currently missing
>> >> > to make qdisc in struct_ops possible?
>> >> 
>> >> I think you misunderstand this point. The reason why I avoid it is
>> >> _not_ anything is missing, quite oppositely, it is because it requires
>> >> a lot of work to implement a Qdisc with struct_ops approach, literally
>> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
>> >> WIth current approach, programmers only need to implement two
>> >> eBPF programs (enqueue and dequeue).
> _if_ it is using as a qdisc object/interface,
> the patch "looks" easier because it obscures some of the ops/interface
> from the bpf user.  The user will eventually ask for more flexibility
> and then an on-par interface as the kernel's qdisc.  If there are some
> common 'ops', the common bpf code can be shared as a library in userspace
> or there is also kfunc call to call into the kernel implementation.
> For existing kernel qdisc author,  it will be easier to use the same
> interface also.

The question is if it's useful to provide the full struct_ops for
qdiscs? Having it would allow a BPF program to implement that interface
towards userspace (things like statistics, classes etc), but the
question is if anyone is going to bother with that given the wealth of
BPF-specific introspection tools already available?

My hope is that we can (longer term) develop some higher-level tools to
express queueing policies that can then generate the BPF code needed to
implement them. Or as a start just some libraries to make this easier,
which I think is also what you're hinting at here? :)

>> > Another idea. Rather than work with qdisc objects which creates all
>> > these issues with how to work with existing interfaces, filters, etc.
>> > Why not create an sk_buff map? Then this can be used from the existing
>> > egress/ingress hooks independent of the actual qdisc being used.
>> 
>> I agree. In fact, I'm working on doing just this for XDP, and I see no
>> reason why the map type couldn't be reused for skbs as well. Doing it
>> this way has a couple of benefits:
>> 
>> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>>   implement that with a single queue map. Or do you want to build a full
>>   hierarchical queueing structure? Just instantiate as many queue maps
>>   as you need to achieve this. Etc.
> Agree.  Regardless how the interface may look like,
> I even think being able to queue/dequeue an skb into different bpf maps
> should be the first thing to do here.  Looking forward to your patches.

Thanks! Guess I should go work on them, then :D

>> - The behaviour is defined entirely by BPF program behaviour, and does
>>   not require setting up a qdisc hierarchy in addition to writing BPF
>>   code.
> Interesting idea.  If it does not need to use the qdisc object/interface
> and be able to do the qdisc hierarchy setup in a programmable way, it may
> be nice.  It will be useful for the future patches to come with some
> bpf prog examples to do that.

Absolutely; we plan to include example algorithm implementations as well!

>> - It should be possible to structure the hooks in a way that allows
>>   reusing queueing algorithm implementations between the qdisc and XDP
>>   layers.
>> 
>> > You mention skb should not be exposed to userspace? Why? Whats the
>> > reason for this? Anyways we can make kernel only maps if we want or
>> > scrub the data before passing it to userspace. We do this already in
>> > some cases.
>> 
>> Yup, that's my approach as well.
>> 
>> > IMO it seems cleaner and more general to allow sk_buffs
>> > to be stored in maps and pulled back out later for enqueue/dequeue.
>> 
>> FWIW there's some gnarly details here (for instance, we need to make
>> sure the BPF program doesn't leak packet references after they are
>> dequeued from the map). My idea is to use a scheme similar to what we do
>> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
>> actually remove the packet from the queue until the BPF program exits
>> (so the kernel can make sure things are accounted correctly).
> The verifier is tracking the sk's references.  Can it be reused to
> track the skb's reference?

I was vaguely aware that it does this, but have not looked at the
details. Would be great if this was possible; will see how far I get
with it, and iterate from there (with your help, hopefully :))

-Toke
John Fastabend Sept. 2, 2021, 8:40 p.m. UTC | #8
Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <kafai@fb.com> writes:
> 
> > On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
> >> John Fastabend <john.fastabend@gmail.com> writes:
> >> 
> >> > Cong Wang wrote:
> >> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> > Please explain more on this.  What is currently missing
> >> >> > to make qdisc in struct_ops possible?
> >> >> 
> >> >> I think you misunderstand this point. The reason why I avoid it is
> >> >> _not_ anything is missing, quite oppositely, it is because it requires
> >> >> a lot of work to implement a Qdisc with struct_ops approach, literally
> >> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
> >> >> WIth current approach, programmers only need to implement two
> >> >> eBPF programs (enqueue and dequeue).
> > _if_ it is using as a qdisc object/interface,
> > the patch "looks" easier because it obscures some of the ops/interface
> > from the bpf user.  The user will eventually ask for more flexibility
> > and then an on-par interface as the kernel's qdisc.  If there are some
> > common 'ops', the common bpf code can be shared as a library in userspace
> > or there is also kfunc call to call into the kernel implementation.
> > For existing kernel qdisc author,  it will be easier to use the same
> > interface also.
> 
> The question is if it's useful to provide the full struct_ops for
> qdiscs? Having it would allow a BPF program to implement that interface
> towards userspace (things like statistics, classes etc), but the
> question is if anyone is going to bother with that given the wealth of
> BPF-specific introspection tools already available?

If its a map value then you get all the goodness with normal map
inspection.

> 
> My hope is that we can (longer term) develop some higher-level tools to
> express queueing policies that can then generate the BPF code needed to
> implement them. Or as a start just some libraries to make this easier,
> which I think is also what you're hinting at here? :)

The P4 working group has thought about QOS and queuing from P4 side if
you want to think in terms of a DSL. Might be interesting and have
some benefits if you want to drop into hardware offload side. For example
compile to XDP for fast CPU architectures, Altera/Xilinx backend for FPGA or
switch silicon for others. This was always the dream on my side maybe
we've finally got close to actualizing it, 10 years later ;)

> 
> >> > Another idea. Rather than work with qdisc objects which creates all
> >> > these issues with how to work with existing interfaces, filters, etc.
> >> > Why not create an sk_buff map? Then this can be used from the existing
> >> > egress/ingress hooks independent of the actual qdisc being used.
> >> 
> >> I agree. In fact, I'm working on doing just this for XDP, and I see no
> >> reason why the map type couldn't be reused for skbs as well. Doing it
> >> this way has a couple of benefits:
> >> 
> >> - It leaves more flexibility to BPF: want a simple FIFO queue? just
> >>   implement that with a single queue map. Or do you want to build a full
> >>   hierarchical queueing structure? Just instantiate as many queue maps
> >>   as you need to achieve this. Etc.
> > Agree.  Regardless how the interface may look like,
> > I even think being able to queue/dequeue an skb into different bpf maps
> > should be the first thing to do here.  Looking forward to your patches.
> 
> Thanks! Guess I should go work on them, then :D

Happy to review any RFCs.

> 
> >> - The behaviour is defined entirely by BPF program behaviour, and does
> >>   not require setting up a qdisc hierarchy in addition to writing BPF
> >>   code.
> > Interesting idea.  If it does not need to use the qdisc object/interface
> > and be able to do the qdisc hierarchy setup in a programmable way, it may
> > be nice.  It will be useful for the future patches to come with some
> > bpf prog examples to do that.
> 
> Absolutely; we plan to include example algorithm implementations as well!

A weighted round robin queue setup might be a useful example and easy
to implement/understand, but slightly more interesting than a pfifo. Also
would force understanding multiple cpus and timer issues.

> 
> >> - It should be possible to structure the hooks in a way that allows
> >>   reusing queueing algorithm implementations between the qdisc and XDP
> >>   layers.
> >> 
> >> > You mention skb should not be exposed to userspace? Why? Whats the
> >> > reason for this? Anyways we can make kernel only maps if we want or
> >> > scrub the data before passing it to userspace. We do this already in
> >> > some cases.
> >> 
> >> Yup, that's my approach as well.

Having something reported back to userspace as the value might be helpful
for debugging/tracing. Maybe the skb->hash? Then you could set this and
then track a skb through the stack even when its in a bpf skb queue.

> >> 
> >> > IMO it seems cleaner and more general to allow sk_buffs
> >> > to be stored in maps and pulled back out later for enqueue/dequeue.
> >> 
> >> FWIW there's some gnarly details here (for instance, we need to make
> >> sure the BPF program doesn't leak packet references after they are
> >> dequeued from the map). My idea is to use a scheme similar to what we do
> >> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
> >> actually remove the packet from the queue until the BPF program exits
> >> (so the kernel can make sure things are accounted correctly).
> > The verifier is tracking the sk's references.  Can it be reused to
> > track the skb's reference?
> 
> I was vaguely aware that it does this, but have not looked at the
> details. Would be great if this was possible; will see how far I get
> with it, and iterate from there (with your help, hopefully :))

Also might need to drop any socket references from the networking side
so an enqueued sock can't hold a socket open.

> 
> -Toke
>
Toke Høiland-Jørgensen Sept. 2, 2021, 10:27 p.m. UTC | #9
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Martin KaFai Lau <kafai@fb.com> writes:
>> 
>> > On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
>> >> John Fastabend <john.fastabend@gmail.com> writes:
>> >> 
>> >> > Cong Wang wrote:
>> >> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> > Please explain more on this.  What is currently missing
>> >> >> > to make qdisc in struct_ops possible?
>> >> >> 
>> >> >> I think you misunderstand this point. The reason why I avoid it is
>> >> >> _not_ anything is missing, quite oppositely, it is because it requires
>> >> >> a lot of work to implement a Qdisc with struct_ops approach, literally
>> >> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
>> >> >> WIth current approach, programmers only need to implement two
>> >> >> eBPF programs (enqueue and dequeue).
>> > _if_ it is using as a qdisc object/interface,
>> > the patch "looks" easier because it obscures some of the ops/interface
>> > from the bpf user.  The user will eventually ask for more flexibility
>> > and then an on-par interface as the kernel's qdisc.  If there are some
>> > common 'ops', the common bpf code can be shared as a library in userspace
>> > or there is also kfunc call to call into the kernel implementation.
>> > For existing kernel qdisc author,  it will be easier to use the same
>> > interface also.
>> 
>> The question is if it's useful to provide the full struct_ops for
>> qdiscs? Having it would allow a BPF program to implement that interface
>> towards userspace (things like statistics, classes etc), but the
>> question is if anyone is going to bother with that given the wealth of
>> BPF-specific introspection tools already available?
>
> If its a map value then you get all the goodness with normal map
> inspection.

Yup, exactly, so why bother with struct_ops to implement all the other
qdisc ops (apart from enqueue/dequeue)?

>> My hope is that we can (longer term) develop some higher-level tools to
>> express queueing policies that can then generate the BPF code needed to
>> implement them. Or as a start just some libraries to make this easier,
>> which I think is also what you're hinting at here? :)
>
> The P4 working group has thought about QOS and queuing from P4 side if
> you want to think in terms of a DSL. Might be interesting and have
> some benefits if you want to drop into hardware offload side. For example
> compile to XDP for fast CPU architectures, Altera/Xilinx backend for FPGA or
> switch silicon for others. This was always the dream on my side maybe
> we've finally got close to actualizing it, 10 years later ;)

Yup, would love to see this! Let's just hope it doesn't take another
decade ;)

>> >> > Another idea. Rather than work with qdisc objects which creates all
>> >> > these issues with how to work with existing interfaces, filters, etc.
>> >> > Why not create an sk_buff map? Then this can be used from the existing
>> >> > egress/ingress hooks independent of the actual qdisc being used.
>> >> 
>> >> I agree. In fact, I'm working on doing just this for XDP, and I see no
>> >> reason why the map type couldn't be reused for skbs as well. Doing it
>> >> this way has a couple of benefits:
>> >> 
>> >> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>> >>   implement that with a single queue map. Or do you want to build a full
>> >>   hierarchical queueing structure? Just instantiate as many queue maps
>> >>   as you need to achieve this. Etc.
>> > Agree.  Regardless how the interface may look like,
>> > I even think being able to queue/dequeue an skb into different bpf maps
>> > should be the first thing to do here.  Looking forward to your patches.
>> 
>> Thanks! Guess I should go work on them, then :D
>
> Happy to review any RFCs.
>
>> 
>> >> - The behaviour is defined entirely by BPF program behaviour, and does
>> >>   not require setting up a qdisc hierarchy in addition to writing BPF
>> >>   code.
>> > Interesting idea.  If it does not need to use the qdisc object/interface
>> > and be able to do the qdisc hierarchy setup in a programmable way, it may
>> > be nice.  It will be useful for the future patches to come with some
>> > bpf prog examples to do that.
>> 
>> Absolutely; we plan to include example algorithm implementations as well!
>
> A weighted round robin queue setup might be a useful example and easy
> to implement/understand, but slightly more interesting than a pfifo. Also
> would force understanding multiple cpus and timer issues.

Yup, some sort of RR queueing is definitely on the list!

>> >> - It should be possible to structure the hooks in a way that allows
>> >>   reusing queueing algorithm implementations between the qdisc and XDP
>> >>   layers.
>> >> 
>> >> > You mention skb should not be exposed to userspace? Why? Whats the
>> >> > reason for this? Anyways we can make kernel only maps if we want or
>> >> > scrub the data before passing it to userspace. We do this already in
>> >> > some cases.
>> >> 
>> >> Yup, that's my approach as well.
>
> Having something reported back to userspace as the value might be helpful
> for debugging/tracing. Maybe the skb->hash? Then you could set this and
> then track a skb through the stack even when its in a bpf skb queue.

Yeah. I've just been using the pointer value for my initial testing.
That's not a good solution, of course, but having a visible identifier
would be neat. skb->hash makes sense for the qdisc layer, but not for
XDP...

>> >> 
>> >> > IMO it seems cleaner and more general to allow sk_buffs
>> >> > to be stored in maps and pulled back out later for enqueue/dequeue.
>> >> 
>> >> FWIW there's some gnarly details here (for instance, we need to make
>> >> sure the BPF program doesn't leak packet references after they are
>> >> dequeued from the map). My idea is to use a scheme similar to what we do
>> >> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
>> >> actually remove the packet from the queue until the BPF program exits
>> >> (so the kernel can make sure things are accounted correctly).
>> > The verifier is tracking the sk's references.  Can it be reused to
>> > track the skb's reference?
>> 
>> I was vaguely aware that it does this, but have not looked at the
>> details. Would be great if this was possible; will see how far I get
>> with it, and iterate from there (with your help, hopefully :))
>
> Also might need to drop any socket references from the networking side
> so an enqueued sock can't hold a socket open.

Not sure I'm following you here?

-Toke
Martin KaFai Lau Sept. 2, 2021, 11:35 p.m. UTC | #10
On Fri, Sep 03, 2021 at 12:27:52AM +0200, Toke Høiland-Jørgensen wrote:
> >> The question is if it's useful to provide the full struct_ops for
> >> qdiscs? Having it would allow a BPF program to implement that interface
> >> towards userspace (things like statistics, classes etc), but the
> >> question is if anyone is going to bother with that given the wealth of
> >> BPF-specific introspection tools already available?
Instead of bpftool can only introspect bpf qdisc and the existing tc
can only introspect kernel qdisc,  it will be nice to have bpf
qdisc work as other qdisc and showing details together with others
in tc.  e.g. a bpf qdisc export its data/stats with its btf-id
to tc and have tc print it out in a generic way?
Toke Høiland-Jørgensen Sept. 3, 2021, 2:44 p.m. UTC | #11
Martin KaFai Lau <kafai@fb.com> writes:

> On Fri, Sep 03, 2021 at 12:27:52AM +0200, Toke Høiland-Jørgensen wrote:
>> >> The question is if it's useful to provide the full struct_ops for
>> >> qdiscs? Having it would allow a BPF program to implement that interface
>> >> towards userspace (things like statistics, classes etc), but the
>> >> question is if anyone is going to bother with that given the wealth of
>> >> BPF-specific introspection tools already available?
> Instead of bpftool can only introspect bpf qdisc and the existing tc
> can only introspect kernel qdisc,  it will be nice to have bpf
> qdisc work as other qdisc and showing details together with others
> in tc.  e.g. a bpf qdisc export its data/stats with its btf-id
> to tc and have tc print it out in a generic way?

I'm not opposed to the idea, certainly. I just wonder if people who go
to the trouble of writing a custom qdisc in BPF will feel it's worth it
to do the extra work to make this available via a second API. We could
certainly encourage it, and some things are easy (drop and pkt counters,
etc), but other things (like class stats) will depend on the semantics
of the qdisc being implemented, so will require extra work from the BPF
qdisc developer...

-Toke
Jamal Hadi Salim Sept. 3, 2021, 3:33 p.m. UTC | #12
On 2021-09-03 10:44 a.m., Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <kafai@fb.com> writes:
> 
>> On Fri, Sep 03, 2021 at 12:27:52AM +0200, Toke Høiland-Jørgensen wrote:
>>>>> The question is if it's useful to provide the full struct_ops for
>>>>> qdiscs? Having it would allow a BPF program to implement that interface
>>>>> towards userspace (things like statistics, classes etc), but the
>>>>> question is if anyone is going to bother with that given the wealth of
>>>>> BPF-specific introspection tools already available?
>> Instead of bpftool can only introspect bpf qdisc and the existing tc
>> can only introspect kernel qdisc,  it will be nice to have bpf
>> qdisc work as other qdisc and showing details together with others
>> in tc.  e.g. a bpf qdisc export its data/stats with its btf-id
>> to tc and have tc print it out in a generic way?
> 
> I'm not opposed to the idea, certainly. I just wonder if people who go
> to the trouble of writing a custom qdisc in BPF will feel it's worth it
> to do the extra work to make this available via a second API. We could
> certainly encourage it, and some things are easy (drop and pkt counters,
> etc), but other things (like class stats) will depend on the semantics
> of the qdisc being implemented, so will require extra work from the BPF
> qdisc developer...

The idea of using btf to overcome the domain difference is _very_
appealing but sounds like a lot of work? Havent delved enough
into btf - but wondering if the same could be stated for filters
and actions...Note:
Aside from current existing tooling being well understood,
challenges  you will be faced with is reinventing all the
infrastructure that tc qdiscs have taken care of over the years,
example:
the proper integrations with softirqs and multiprocessor protections,
irqs, timers etc which take care of smooth triggering of
enqueue/dequeue, taking care of defering things when the target
device/hw is busy, hierarchies, etc, etc;
not saying it is the most perfect or performant but it is one of
those 'day 3' deployments i.e a lot of corner cases taken care of.
I noticed you mentioned some of those things in one of your emails.
For this reason - Cong's approach looks appealing because it
reuses said infra. Main thing that needs to have extensibility is
the de/enqueue ops as ebpf progs. Allowing enq/deq to be ebpf specific
sounds like will allow one scheme that works for both tc and XDP
(with enq/deq taking care of the buffer contextual differences).
I admit XDP is a little harder than plain tc....

cheers,
jamal
Cong Wang Sept. 4, 2021, 1:05 a.m. UTC | #13
On Tue, Aug 31, 2021 at 10:45 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > Please explain more on this.  What is currently missing
> > > to make qdisc in struct_ops possible?
> >
> > I think you misunderstand this point. The reason why I avoid it is
> > _not_ anything is missing, quite oppositely, it is because it requires
> > a lot of work to implement a Qdisc with struct_ops approach, literally
> > all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
> > WIth current approach, programmers only need to implement two
> > eBPF programs (enqueue and dequeue).
> >
> > Thanks.
>
> Another idea. Rather than work with qdisc objects which creates all
> these issues with how to work with existing interfaces, filters, etc.
> Why not create an sk_buff map? Then this can be used from the existing
> egress/ingress hooks independent of the actual qdisc being used.

Because it is pointless to expose them to user-space in this context.
For example, what is the point of dropping a packet from user-space
in the context of Qdisc?

And I don't think there is a way for user-space to read those skb's inside
a map, which makes it more pointless.

>
> You mention skb should not be exposed to userspace? Why? Whats the
> reason for this? Anyways we can make kernel only maps if we want or
> scrub the data before passing it to userspace. We do this already in
> some cases.

I am not aware of any kernel-only map. For starters, we can't create
a map from kernel-space.

>
> IMO it seems cleaner and more general to allow sk_buffs
> to be stored in maps and pulled back out later for enqueue/dequeue.

Which exact map are you referring to? The queue map? It would only
provide FIFO. We want to give users as much freedom to order the
skbs as we can, I doubt hashmap could offer such freedom.

>
> I think one trick might be how to trigger the dequeue event on
> transition from stopped to running net_device or other events like
> this, but that could be solved with another program attached to
> those events to kick the dequeue logic.

I think we can still use current enqueue/dequeue eBPF program
except we need to transfer skb ownership and storage to map.

Thanks.
Cong Wang Sept. 4, 2021, 1:09 a.m. UTC | #14
On Wed, Sep 1, 2021 at 10:45 AM Martin KaFai Lau <kafai@fb.com> wrote:
> _if_ it is using as a qdisc object/interface,
> the patch "looks" easier because it obscures some of the ops/interface
> from the bpf user.  The user will eventually ask for more flexibility
> and then an on-par interface as the kernel's qdisc.  If there are some
> common 'ops', the common bpf code can be shared as a library in userspace
> or there is also kfunc call to call into the kernel implementation.
> For existing kernel qdisc author,  it will be easier to use the same
> interface also.

Thanks for showing the advantages of a kernel module. And no, we
are not writing kernel modules in eBPF.

And kfunc call really sucks, it does not even guarantee a stable ABI, it
is a serious mistake you made for eBPF.

Thanks.
Cong Wang Sept. 4, 2021, 1:30 a.m. UTC | #15
On Wed, Sep 1, 2021 at 3:42 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> John Fastabend <john.fastabend@gmail.com> writes:
>
> > Cong Wang wrote:
> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> > Please explain more on this.  What is currently missing
> >> > to make qdisc in struct_ops possible?
> >>
> >> I think you misunderstand this point. The reason why I avoid it is
> >> _not_ anything is missing, quite oppositely, it is because it requires
> >> a lot of work to implement a Qdisc with struct_ops approach, literally
> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
> >> WIth current approach, programmers only need to implement two
> >> eBPF programs (enqueue and dequeue).
> >>
> >> Thanks.
> >
> > Another idea. Rather than work with qdisc objects which creates all
> > these issues with how to work with existing interfaces, filters, etc.
> > Why not create an sk_buff map? Then this can be used from the existing
> > egress/ingress hooks independent of the actual qdisc being used.
>
> I agree. In fact, I'm working on doing just this for XDP, and I see no
> reason why the map type couldn't be reused for skbs as well. Doing it
> this way has a couple of benefits:

I do see a lot of reasons, for starters, struct skb_buff is very different
from struct xdp_buff, any specialized map can not be reused. I guess you
are using a generic one, how do you handle the refcnt at least for skb?

>
> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>   implement that with a single queue map. Or do you want to build a full
>   hierarchical queueing structure? Just instantiate as many queue maps
>   as you need to achieve this. Etc.

Please give an example without a queue. ;) Queue is too simple, show us
something more useful please. How do you plan to re-implement EDT with
just queues?

>
> - The behaviour is defined entirely by BPF program behaviour, and does
>   not require setting up a qdisc hierarchy in addition to writing BPF
>   code.

I have no idea why you call this a benefit, because my goal is to replace
Qdisc's, not to replace any other things. You know there are plenty of Qdisc's
which are not implemented in Linux kernel.

>
> - It should be possible to structure the hooks in a way that allows
>   reusing queueing algorithm implementations between the qdisc and XDP
>   layers.

XDP has no skb but xdp_buff, no? And again, why only queues?

Thanks.
Toke Høiland-Jørgensen Sept. 6, 2021, 11:45 a.m. UTC | #16
Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Sep 1, 2021 at 3:42 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> John Fastabend <john.fastabend@gmail.com> writes:
>>
>> > Cong Wang wrote:
>> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > Please explain more on this.  What is currently missing
>> >> > to make qdisc in struct_ops possible?
>> >>
>> >> I think you misunderstand this point. The reason why I avoid it is
>> >> _not_ anything is missing, quite oppositely, it is because it requires
>> >> a lot of work to implement a Qdisc with struct_ops approach, literally
>> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
>> >> WIth current approach, programmers only need to implement two
>> >> eBPF programs (enqueue and dequeue).
>> >>
>> >> Thanks.
>> >
>> > Another idea. Rather than work with qdisc objects which creates all
>> > these issues with how to work with existing interfaces, filters, etc.
>> > Why not create an sk_buff map? Then this can be used from the existing
>> > egress/ingress hooks independent of the actual qdisc being used.
>>
>> I agree. In fact, I'm working on doing just this for XDP, and I see no
>> reason why the map type couldn't be reused for skbs as well. Doing it
>> this way has a couple of benefits:
>
> I do see a lot of reasons, for starters, struct skb_buff is very different
> from struct xdp_buff, any specialized map can not be reused. I guess you
> are using a generic one, how do you handle the refcnt at least for skb?

Well, you can't keep XDP frames and skbs in the same map instance, but
you can create a map type that can be instantiated to hold either type
and otherwise keep the same semantics. The map can just inc/dec the
refcnt as skbs are added/removed from it.

>> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>>   implement that with a single queue map. Or do you want to build a full
>>   hierarchical queueing structure? Just instantiate as many queue maps
>>   as you need to achieve this. Etc.
>
> Please give an example without a queue. ;) Queue is too simple, show us
> something more useful please. How do you plan to re-implement EDT with
> just queues?

I'm using 'queue' as a shorthand for any queueing/scheduling algorithm
implementable by a qdisc. We need to cover them all, obviously, not just
FIFO queues (in fact I think we should actively be discouraging those,
but that's a different story :) )

For EDT it would be something like:

- On enqueue, stick frames into the map with a rank corresponding to
  their transmission time (the map implements the PIFO queue, just like
  your patch).

- (re-)arm a BPF timer to fire at the time of the next transmission
  event, and have that timer trigger interface TX.

The first bit is straight-forward, and that last bit needs a new helper
or something like it. For qdiscs I guess we could just expose
qdisc_watchdog()? For XDP we'd need something new...


>> - The behaviour is defined entirely by BPF program behaviour, and does
>>   not require setting up a qdisc hierarchy in addition to writing BPF
>>   code.
>
> I have no idea why you call this a benefit, because my goal is to
> replace Qdisc's, not to replace any other things. You know there are
> plenty of Qdisc's which are not implemented in Linux kernel.

It's a benefit because it means you can keep everything together. I.e.,
you don't need to *both* write BPF code implementing your qdisc, *and* a
setup script to build the qdisc hierarchy. That simplifies deployment.

I suppose we could support inserting BPF qdiscs into a qdisc hierarchy
as well if needed. I don't personally see much use for that, but if
there's a use case, sure, why not?

>> - It should be possible to structure the hooks in a way that allows
>>   reusing queueing algorithm implementations between the qdisc and XDP
>>   layers.
>
> XDP has no skb but xdp_buff, no? And again, why only queues?

See above :)

-Toke
Martin KaFai Lau Sept. 10, 2021, 6:55 a.m. UTC | #17
On Fri, Sep 03, 2021 at 04:44:04PM +0200, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <kafai@fb.com> writes:
> 
> > On Fri, Sep 03, 2021 at 12:27:52AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> The question is if it's useful to provide the full struct_ops for
> >> >> qdiscs? Having it would allow a BPF program to implement that interface
> >> >> towards userspace (things like statistics, classes etc), but the
> >> >> question is if anyone is going to bother with that given the wealth of
> >> >> BPF-specific introspection tools already available?
> > Instead of bpftool can only introspect bpf qdisc and the existing tc
> > can only introspect kernel qdisc,  it will be nice to have bpf
> > qdisc work as other qdisc and showing details together with others
> > in tc.  e.g. a bpf qdisc export its data/stats with its btf-id
> > to tc and have tc print it out in a generic way?
> 
> I'm not opposed to the idea, certainly. I just wonder if people who go
> to the trouble of writing a custom qdisc in BPF will feel it's worth it
> to do the extra work to make this available via a second API. We could
> certainly encourage it, and some things are easy (drop and pkt counters,
> etc), but other things (like class stats) will depend on the semantics
> of the qdisc being implemented, so will require extra work from the BPF
> qdisc developer...
Right, different qdisc has different stats, I think it is currently
stored in qdisc_priv()?  When a qdisc is created, a separate priv is
created together.

Yes, the bpf qdisc prog can store its stats to a bpf map, but then when the
same prog attached to different qdiscs, it has to create different stats maps?

Also, instead of ->enqueue() itself is a bpf prog,
having an ->enqueue() preparing a bpf ctx (zeroing, assigning...etc) and
then make another call to a bpf prog will all add some costs.

That said, I still think it needs a bpf skb map that can queue/dequeue
skb first.  Then it will become possible to prototype different interface
ideas.
Toke Høiland-Jørgensen Sept. 10, 2021, 11:31 a.m. UTC | #18
Martin KaFai Lau <kafai@fb.com> writes:

> On Fri, Sep 03, 2021 at 04:44:04PM +0200, Toke Høiland-Jørgensen wrote:
>> Martin KaFai Lau <kafai@fb.com> writes:
>> 
>> > On Fri, Sep 03, 2021 at 12:27:52AM +0200, Toke Høiland-Jørgensen wrote:
>> >> >> The question is if it's useful to provide the full struct_ops for
>> >> >> qdiscs? Having it would allow a BPF program to implement that interface
>> >> >> towards userspace (things like statistics, classes etc), but the
>> >> >> question is if anyone is going to bother with that given the wealth of
>> >> >> BPF-specific introspection tools already available?
>> > Instead of bpftool can only introspect bpf qdisc and the existing tc
>> > can only introspect kernel qdisc,  it will be nice to have bpf
>> > qdisc work as other qdisc and showing details together with others
>> > in tc.  e.g. a bpf qdisc export its data/stats with its btf-id
>> > to tc and have tc print it out in a generic way?
>> 
>> I'm not opposed to the idea, certainly. I just wonder if people who go
>> to the trouble of writing a custom qdisc in BPF will feel it's worth it
>> to do the extra work to make this available via a second API. We could
>> certainly encourage it, and some things are easy (drop and pkt counters,
>> etc), but other things (like class stats) will depend on the semantics
>> of the qdisc being implemented, so will require extra work from the BPF
>> qdisc developer...
> Right, different qdisc has different stats, I think it is currently
> stored in qdisc_priv()?  When a qdisc is created, a separate priv is
> created together.
>
> Yes, the bpf qdisc prog can store its stats to a bpf map, but then
> when the same prog attached to different qdiscs, it has to create
> different stats maps?

Hmm, yeah, I guess it would. But if it's storing the packets in a map it
would need to have separate instances of those as well. I was kinda
assuming that a separate instance of the BPF program would be loaded
into the kernel for each qdisc instance, with its own instance of all
maps etc.

> Also, instead of ->enqueue() itself is a bpf prog, having an
> ->enqueue() preparing a bpf ctx (zeroing, assigning...etc) and then
> make another call to a bpf prog will all add some costs.

Hmm, yeah, I guess, but I kinda doubt we can avoid having *some* kind of
setup to get the right semantics for the BPF program, which might as
well be in the qdisc enqueue() func. But let's see, happy to be proved
wrong on this :)

> That said, I still think it needs a bpf skb map that can queue/dequeue
> skb first.  Then it will become possible to prototype different interface
> ideas.

Agreed!

-Toke
Martin KaFai Lau Sept. 17, 2021, 4:19 a.m. UTC | #19
On Fri, Sep 03, 2021 at 06:09:46PM -0700, Cong Wang wrote:
> On Wed, Sep 1, 2021 at 10:45 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > _if_ it is using as a qdisc object/interface,
> > the patch "looks" easier because it obscures some of the ops/interface
> > from the bpf user.  The user will eventually ask for more flexibility
> > and then an on-par interface as the kernel's qdisc.  If there are some
> > common 'ops', the common bpf code can be shared as a library in userspace
> > or there is also kfunc call to call into the kernel implementation.
> > For existing kernel qdisc author,  it will be easier to use the same
> > interface also.
> 
> Thanks for showing the advantages of a kernel module. And no, we
> are not writing kernel modules in eBPF.
The line is very blurry between a bpf_prog and kernel module,
especially with the advancement of bpf, btf, and CO-RE.

Both bpf_prog.o (struct_ops or not) and some_native_kern_mod.ko are attaching
to some kernel hooks to be called.  If writing bpf and attaching it to a hook
does not work for you, bpf does not fit your case.

> And kfunc call really sucks, it does not even guarantee a stable ABI, it
> is a serious mistake you made for eBPF.
Not ture.  It depends on what is allowed to be called by bpf.
Needless to say I cannot agree with the "sucks" description.

This kind of dismissive discussion is worse than unproductive
and not the best way to use the mailing list time.
diff mbox series

Patch

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index ae3ac3a2018c..b3023e41fe8a 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -8,6 +8,8 @@  BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_CLS, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
+//BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_QDISC, tc_cls_act,
+//	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp,
 	      struct xdp_md, struct xdp_buff)
 #ifdef CONFIG_CGROUP_BPF
diff --git a/include/linux/priority_queue.h b/include/linux/priority_queue.h
new file mode 100644
index 000000000000..08177517977f
--- /dev/null
+++ b/include/linux/priority_queue.h
@@ -0,0 +1,90 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  A priority queue implementation based on rbtree
+ *
+ *   Copyright (C) 2021, Bytedance, Cong Wang <cong.wang@bytedance.com>
+ */
+
+#ifndef	_LINUX_PRIORITY_QUEUE_H
+#define	_LINUX_PRIORITY_QUEUE_H
+
+#include <linux/rbtree.h>
+
+struct pq_node {
+	struct rb_node rb_node;
+};
+
+struct pq_root {
+	struct rb_root_cached rb_root;
+	bool (*cmp)(struct pq_node *l, struct pq_node *r);
+};
+
+static inline void pq_root_init(struct pq_root *root,
+				bool (*cmp)(struct pq_node *l, struct pq_node *r))
+{
+	root->rb_root = RB_ROOT_CACHED;
+	root->cmp = cmp;
+}
+
+static inline void pq_push(struct pq_root *root, struct pq_node *node)
+{
+	struct rb_node **link = &root->rb_root.rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct pq_node *entry;
+	bool leftmost = true;
+
+	/*
+	 * Find the right place in the rbtree:
+	 */
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct pq_node, rb_node);
+		/*
+		 * We dont care about collisions. Nodes with
+		 * the same key stay together.
+		 */
+		if (root->cmp(entry, node)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = false;
+		}
+	}
+
+	rb_link_node(&node->rb_node, parent, link);
+	rb_insert_color_cached(&node->rb_node, &root->rb_root, leftmost);
+}
+
+static inline struct pq_node *pq_top(struct pq_root *root)
+{
+	struct rb_node *left = rb_first_cached(&root->rb_root);
+
+	if (!left)
+		return NULL;
+	return rb_entry(left, struct pq_node, rb_node);
+}
+
+static inline struct pq_node *pq_pop(struct pq_root *root)
+{
+	struct pq_node *t = pq_top(root);
+
+	if (t)
+		rb_erase_cached(&t->rb_node, &root->rb_root);
+	return t;
+}
+
+static inline void pq_flush(struct pq_root *root, void (*destroy)(struct pq_node *))
+{
+	struct rb_node *node, *next;
+
+	for (node = rb_first(&root->rb_root.rb_root);
+	     next = node ? rb_next(node) : NULL, node != NULL;
+	     node = next) {
+		struct pq_node *pqe;
+
+		pqe = rb_entry(node, struct pq_node, rb_node);
+		if (destroy)
+			destroy(pqe);
+	}
+}
+#endif	/* _LINUX_PRIORITY_QUEUE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6bdb0db3e825..800d76e480da 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -36,6 +36,7 @@ 
 #include <linux/splice.h>
 #include <linux/in6.h>
 #include <linux/if_packet.h>
+#include <linux/priority_queue.h>
 #include <net/flow.h>
 #include <net/page_pool.h>
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
@@ -735,6 +736,7 @@  struct sk_buff {
 			};
 		};
 		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
+		struct pq_node		pqnode; /* used in ebpf qdisc */
 		struct list_head	list;
 	};
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2db6925e04f4..251d749d66df 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -949,6 +949,7 @@  enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_SCHED_QDISC,
 };
 
 enum bpf_attach_type {
@@ -6226,4 +6227,23 @@  enum {
 	BTF_F_ZERO	=	(1ULL << 3),
 };
 
+struct sch_bpf_ctx {
+	/* Input */
+	struct __sk_buff *skb;
+	u32 nr_flows;
+	u32 nr_packets;
+
+	/* Output */
+	u32 classid;
+	u64 rank;
+	u64 delay;
+};
+
+enum {
+	SCH_BPF_OK,
+	SCH_BPF_REQUEUE,
+	SCH_BPF_DROP,
+	SCH_BPF_THROTTLE,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 79a699f106b1..d9cc1989ab36 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1263,4 +1263,19 @@  enum {
 
 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
 
+enum {
+	TCA_SCH_BPF_UNSPEC,
+	TCA_SCH_BPF_ENQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_ENQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_DEQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	__TCA_SCH_BPF_MAX,
+};
+
+#define TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 1e8ab4749c6c..19f68aac79b1 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -439,6 +439,21 @@  config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	tristate "eBPF based programmable queue discipline"
+	help
+	  This eBPF based queue discipline offers a way to program your
+	  own packet scheduling algorithm. This is a classful qdisc which
+	  also allows you to decide the hierarchy.
+
+	  Say Y here if you want to use the eBPF based programmable queue
+	  discipline.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_bpf.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index dd14ef413fda..9ef0d579f5ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -65,6 +65,7 @@  obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= sch_bpf.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
new file mode 100644
index 000000000000..b6ddd0cfa99c
--- /dev/null
+++ b/net/sched/sch_bpf.c
@@ -0,0 +1,590 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Programmable Qdisc with eBPF
+ *
+ * Copyright (C) 2021, Bytedance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <linux/priority_queue.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+#define ACT_BPF_NAME_LEN	256
+
+struct sch_bpf_prog {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+struct sch_bpf_class {
+	struct Qdisc_class_common common;
+	struct pq_node node;
+	u32 rank;
+
+	struct Qdisc *qdisc;
+	struct pq_root pq;
+
+	unsigned int drops;
+	unsigned int overlimits;
+	struct gnet_stats_basic_packed bstats;
+};
+
+struct sch_bpf_qdisc {
+	struct tcf_proto __rcu *filter_list; /* optional external classifier */
+	struct tcf_block *block;
+	struct Qdisc_class_hash clhash;
+	struct sch_bpf_prog enqueue_prog;
+	struct sch_bpf_prog dequeue_prog;
+
+	struct pq_root flows;
+	struct pq_root default_queue;
+	struct qdisc_watchdog watchdog;
+};
+
+struct sch_bpf_skb_cb {
+	u64 rank;
+};
+
+static struct sch_bpf_skb_cb *sch_bpf_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct sch_bpf_skb_cb));
+	return (struct sch_bpf_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
+			     int name, int id, int tag)
+{
+	struct nlattr *nla;
+
+	if (prog->name &&
+	    nla_put_string(skb, name, prog->name))
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, id, prog->prog->aux->id))
+		return -EMSGSIZE;
+
+	nla = nla_reserve(skb, tag, sizeof(prog->prog->tag));
+	if (!nla)
+		return -EMSGSIZE;
+
+	memcpy(nla_data(nla), prog->prog->tag, nla_len(nla));
+	return 0;
+}
+
+static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG))
+		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	return 0;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (!clc)
+		return NULL;
+	return container_of(clc, struct sch_bpf_class, common);
+}
+
+static struct sch_bpf_class *sch_bpf_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	struct tcf_proto *tcf;
+	struct tcf_result res;
+	int result;
+
+	tcf = rcu_dereference_bh(q->filter_list);
+	if (!tcf)
+		return NULL;
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	result = tcf_classify(skb, NULL, tcf, &res, false);
+	if (result  >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_QUEUED:
+		case TC_ACT_STOLEN:
+		case TC_ACT_TRAP:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+			fallthrough;
+		case TC_ACT_SHOT:
+			return NULL;
+		}
+#endif
+		cl = (void *)res.class;
+		if (!cl) {
+			cl = sch_bpf_find(sch, res.classid);
+			if (!cl)
+				return NULL;
+		}
+	}
+
+	return cl;
+}
+
+static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			   struct sk_buff **to_free)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	unsigned int len = qdisc_pkt_len(skb);
+	struct sch_bpf_ctx ctx = {};
+	struct sch_bpf_class *cl;
+	int res;
+
+	cl = sch_bpf_classify(skb, sch, &res);
+	if (!cl) {
+		struct bpf_prog *enqueue;
+
+		enqueue = rcu_dereference(q->enqueue_prog.prog);
+		bpf_compute_data_pointers(skb);
+
+		ctx.skb = (struct __sk_buff *)skb;
+		ctx.nr_flows = q->clhash.hashelems;
+		ctx.nr_packets = sch->q.qlen;
+		res = BPF_PROG_RUN(enqueue, &ctx);
+		if (res == SCH_BPF_DROP) {
+			__qdisc_drop(skb, to_free);
+			return NET_XMIT_DROP;
+		}
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (!cl) {
+			if (res & __NET_XMIT_BYPASS)
+				qdisc_qstats_drop(sch);
+			__qdisc_drop(skb, to_free);
+			return res;
+		}
+	}
+
+	if (cl->qdisc) {
+		res = qdisc_enqueue(skb, cl->qdisc, to_free);
+		if (res != NET_XMIT_SUCCESS) {
+			if (net_xmit_drop_count(res)) {
+				qdisc_qstats_drop(sch);
+				cl->drops++;
+			}
+			return res;
+		}
+	} else {
+		sch_bpf_skb_cb(skb)->rank = ctx.rank;
+		pq_push(&cl->pq, &skb->pqnode);
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return res;
+}
+
+static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sk_buff *skb, *ret = NULL;
+	struct sch_bpf_ctx ctx = {};
+	struct bpf_prog *dequeue;
+	struct sch_bpf_class *cl;
+	struct pq_node *flow;
+	s64 now;
+	int res;
+
+requeue:
+	flow = pq_pop(&q->flows);
+	if (!flow)
+		return NULL;
+
+	cl = container_of(flow, struct sch_bpf_class, node);
+	if (cl->qdisc) {
+		skb = cl->qdisc->dequeue(cl->qdisc);
+		ctx.classid = cl->common.classid;
+	} else {
+		struct pq_node *p = pq_pop(&cl->pq);
+
+		if (!p)
+			return NULL;
+		skb = container_of(p, struct sk_buff, pqnode);
+		ctx.classid = cl->rank;
+	}
+	ctx.skb = (struct __sk_buff *) skb;
+	ctx.nr_flows = q->clhash.hashelems;
+	ctx.nr_packets = sch->q.qlen;
+
+	dequeue = rcu_dereference(q->dequeue_prog.prog);
+	bpf_compute_data_pointers(skb);
+	res = BPF_PROG_RUN(dequeue, &ctx);
+	switch (res) {
+	case SCH_BPF_OK:
+		ret = skb;
+		break;
+	case SCH_BPF_REQUEUE:
+		sch_bpf_skb_cb(skb)->rank = ctx.rank;
+		cl->rank = ctx.classid;
+		pq_push(&cl->pq, &skb->pqnode);
+		bstats_update(&cl->bstats, skb);
+		pq_push(&q->flows, &cl->node);
+		goto requeue;
+	case SCH_BPF_THROTTLE:
+		now = ktime_get_ns();
+		qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay);
+		qdisc_qstats_overlimit(sch);
+		cl->overlimits++;
+		return NULL;
+	default:
+		kfree_skb(skb);
+		ret = NULL;
+	}
+
+	if (pq_top(&cl->pq))
+		pq_push(&q->flows, &cl->node);
+	return ret;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+			 struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	if (new)
+		*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+	return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->block;
+}
+
+static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
+	[TCA_SCH_BPF_ENQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+};
+
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+{
+	char *prog_name = NULL;
+	struct bpf_prog *fp;
+	u32 bpf_fd;
+
+	if (!fd)
+		return -EINVAL;
+	bpf_fd = nla_get_u32(fd);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_QDISC);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (name) {
+		prog_name = nla_memdup(name, GFP_KERNEL);
+		if (!prog_name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	prog->name = prog_name;
+	prog->prog = fp;
+	return 0;
+}
+
+static void bpf_cleanup_prog(struct sch_bpf_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+	kfree(prog->name);
+}
+
+static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_SCH_BPF_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt,
+					  sch_bpf_policy, NULL);
+	if (err < 0)
+		return err;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+	if (err)
+		return err;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+	return err;
+}
+
+static bool skb_rank(struct pq_node *l, struct pq_node *r)
+{
+	struct sk_buff *lskb, *rskb;
+
+	lskb = container_of(l, struct sk_buff, pqnode);
+	rskb = container_of(r, struct sk_buff, pqnode);
+
+	return sch_bpf_skb_cb(lskb)->rank < sch_bpf_skb_cb(rskb)->rank;
+}
+
+static void skb_flush(struct pq_node *n)
+{
+	struct sk_buff *skb = container_of(n, struct sk_buff, pqnode);
+
+	kfree_skb(skb);
+}
+
+static bool flow_rank(struct pq_node *l, struct pq_node *r)
+{
+	struct sch_bpf_class *lflow, *rflow;
+
+	lflow = container_of(l, struct sch_bpf_class, node);
+	rflow = container_of(r, struct sch_bpf_class, node);
+
+	return lflow->rank < rflow->rank;
+}
+
+static void flow_flush(struct pq_node *n)
+{
+	struct sch_bpf_class *cl = container_of(n, struct sch_bpf_class, node);
+
+	pq_flush(&cl->pq, skb_flush);
+}
+
+static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	int err;
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	if (opt) {
+		err = sch_bpf_change(sch, opt, extack);
+		if (err)
+			return err;
+	}
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	pq_root_init(&q->flows, flow_rank);
+	pq_root_init(&q->default_queue, skb_rank);
+	return qdisc_class_hash_init(&q->clhash);
+}
+
+static void sch_bpf_reset(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	pq_flush(&q->flows, flow_flush);
+	pq_flush(&q->default_queue, skb_flush);
+}
+
+static void sch_bpf_destroy(struct Qdisc *sch)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	tcf_block_put(q->block);
+	qdisc_class_hash_destroy(&q->clhash);
+	sch_bpf_reset(sch);
+	bpf_cleanup_prog(&q->enqueue_prog);
+	bpf_cleanup_prog(&q->dequeue_prog);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+				u32 parentid, struct nlattr **tca,
+				unsigned long *arg,
+				struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+
+	if (!cl) {
+		cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+		if (!cl)
+			return -ENOBUFS;
+		pq_root_init(&cl->pq, skb_rank);
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
+	}
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+	if (cl->qdisc)
+		qdisc_put(cl->qdisc);
+	else
+		pq_flush(&cl->pq, skb_flush);
+	return 0;
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+			      struct sk_buff *skb, struct tcmsg *tcm)
+{
+	return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+		.overlimits = cl->overlimits,
+	};
+	__u32 qlen = 0;
+
+	if (cl->qdisc)
+		qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+	else
+		qlen = 0;
+
+	if (gnet_stats_copy_basic(qdisc_root_sleeping_running(sch),
+				  d, NULL, &cl->bstats) < 0 ||
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct sch_bpf_qdisc *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+	.graft		=	sch_bpf_graft,
+	.leaf		=	sch_bpf_leaf,
+	.find		=	sch_bpf_search,
+	.change		=	sch_bpf_change_class,
+	.delete		=	sch_bpf_delete,
+	.tcf_block	=	sch_bpf_tcf_block,
+	.bind_tcf	=	sch_bpf_bind,
+	.unbind_tcf	=	sch_bpf_unbind,
+	.dump		=	sch_bpf_dump_class,
+	.dump_stats	=	sch_bpf_dump_class_stats,
+	.walk		=	sch_bpf_walk,
+};
+
+static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
+	.cl_ops		=	&sch_bpf_class_ops,
+	.id		=	"bpf",
+	.priv_size	=	sizeof(struct sch_bpf_qdisc),
+	.enqueue	=	sch_bpf_enqueue,
+	.dequeue	=	sch_bpf_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	sch_bpf_init,
+	.reset		=	sch_bpf_reset,
+	.destroy	=	sch_bpf_destroy,
+	.change		=	sch_bpf_change,
+	.dump		=	sch_bpf_dump,
+	.dump_stats	=	sch_bpf_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init sch_bpf_mod_init(void)
+{
+	return register_qdisc(&sch_bpf_qdisc_ops);
+}
+
+static void __exit sch_bpf_mod_exit(void)
+{
+	unregister_qdisc(&sch_bpf_qdisc_ops);
+}
+
+module_init(sch_bpf_mod_init)
+module_exit(sch_bpf_mod_exit)
+MODULE_AUTHOR("Cong Wang");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("eBPF queue discipline");