Message ID | 20250311001447.114579-4-chia-yu.chang@nokia-bell-labs.com |
---|---|
State | New |
Series | DUALPI2 patch |
On Mon, Mar 10, 2025 at 7:15 PM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
>
> DualPI2 provides L4S-type low latency and low loss to traffic that uses
> a scalable congestion controller (e.g. TCP-Prague, DCTCP) without
> degrading the performance of 'classic' traffic (e.g. Reno, Cubic etc.).
> It is intended to be the reference implementation of the IETF RFC 9332
> DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332).
>
> Note that creating two independent queues cannot meet the goal of
> DualPI2 stated in RFC 9332: "...to preserve fairness between
> ECN-capable and non-ECN-capable traffic." It could even lead to
> starvation of Classic traffic, which is also inconsistent with the
> requirements in RFC 9332: "...although priority MUST be bounded in
> order not to starve Classic traffic." DualPI2 is therefore designed to
> maintain approximate per-flow fairness between the L-queue and the
> C-queue by forming a single qdisc that couples the two queues through
> the coupling factor and the scheduler.
>
> The qdisc provides two queues, called low latency and classic. It
> classifies packets based on the ECN field in the IP headers. By
> default it directs non-ECN and ECT(0) packets into the classic queue,
> and ECT(1) and CE packets into the low latency queue, as per the IETF
> spec.
>
> Each queue runs its own AQM:
> * The classic AQM is called PI2; it is similar to the PIE AQM but
>   more responsive and simpler. Classic traffic requires a sizeable
>   target queue (default 15 ms for Internet deployment) to fully
>   utilize the link and to avoid high drop rates.
> * The low latency AQM is, by default, a very shallow ECN-marking
>   threshold (1 ms), similar to that used for DCTCP.
>
> The DualQ isolates the low queuing delay of the low latency queue
> from the larger delay of the classic queue. From a bandwidth
> perspective, however, flows in either queue share the link capacity
> as if there were just a single queue. This bandwidth-pooling effect
> is achieved by coupling together the drop and ECN-marking
> probabilities of the two AQMs.
>
> The PI2 AQM has two main parameters in addition to its target delay.
> The integral gain factor alpha slowly corrects any persistent
> standing-queue error relative to the target delay, while the
> proportional gain factor beta quickly compensates for queue changes
> (growth or shrinkage). Alpha and beta are either given directly as
> parameters, or tc can calculate them from alternative typical and
> maximum RTT parameters.
>
> Internally, the output of a linear Proportional Integral (PI)
> controller is used for both queues. This output is squared to
> calculate the drop or ECN-marking probability of the classic queue.
> This counterbalances the square-root rate equation of Reno/Cubic,
> which is the trick that balances flow rates across the queues. For
> the ECN-marking probability of the low latency queue, the output of
> the base AQM is multiplied by a coupling factor, which determines the
> balance between the flow rates in each queue. The default setting
> makes the flow rates roughly equal, which should be generally
> applicable.
>
> If the DUALPI2 AQM detects overload (due to excessive non-responsive
> traffic in either queue), it switches to signaling congestion solely
> using drops, irrespective of the ECN field. Alternatively, it can be
> configured to limit the drop probability and let the queue grow and
> eventually overflow (like tail-drop).
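
The squaring and coupling described above boil down to very little code. Below is a minimal, self-contained userspace sketch of the two marking decisions, using floating-point probabilities and invented helper names; it illustrates the scheme as described in the cover letter, not the kernel implementation itself.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative userspace sketch, not the kernel code: probabilities
 * are kept as doubles in [0, 1] instead of 32-bit fixed point.
 */
static bool roll(double prob)
{
	return drand48() < prob;
}

/* Classic queue (PI2): signal congestion with probability p * p, i.e.
 * two independent rolls of the base PI output p. The squaring
 * counterbalances the square-root rate equation of Reno/Cubic.
 */
static bool classic_signal(double p)
{
	return roll(p) && roll(p);
}

/* Low latency queue: ECN-mark with the coupled probability k * p,
 * capped at 100%. The default k = 2 makes flow rates in the two
 * queues roughly equal for equal RTTs.
 */
static bool l4s_signal(double p, double k)
{
	double coupled = k * p;

	return roll(coupled > 1.0 ? 1.0 : coupled);
}

int main(void)
{
	double p = 0.05;	/* example base PI probability: 5% */
	int c = 0, l = 0;

	for (int i = 0; i < 1000000; i++) {
		c += classic_signal(p);
		l += l4s_signal(p, 2.0);
	}
	/* Expect roughly p^2 = 0.25% classic and k*p = 10% L4S signals */
	printf("classic: %.3f%%  l4s: %.3f%%\n", c / 10000.0, l / 10000.0);
	return 0;
}

With a 5% base probability this prints roughly 0.25% classic signals versus 10% L4S marks, i.e. the classic queue sees p squared while the L4S queue sees the coupled probability k * p.
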
>
> GSO splitting in DUALPI2 is configurable from userspace, while the
> default behavior is to split GSO packets. When running DUALPI2 on an
> unshaped 10 GigE link with a 4-download-stream test, splitting GSO
> apart results in halving the latency with no loss in throughput:
>
> Summary of tcp_4down run 'no_split_gso':
>                           avg       median  # data pts
>  Ping (ms) ICMP   :      0.53         0.30  ms       350
>  TCP download avg :   2326.86          N/A  Mbits/s  350
>  TCP download sum :   9307.42          N/A  Mbits/s  350
>  TCP download::1  :   2672.99      2568.73  Mbits/s  350
>  TCP download::2  :   2586.96      2570.51  Mbits/s  350
>  TCP download::3  :   1786.26      1798.82  Mbits/s  350
>  TCP download::4  :   2261.21      2309.49  Mbits/s  350
>
> Summary of tcp_4down run 'split_gso':
>                           avg       median  # data pts
>  Ping (ms) ICMP   :      0.22         0.23  ms       350
>  TCP download avg :   2335.02          N/A  Mbits/s  350
>  TCP download sum :   9340.09          N/A  Mbits/s  350
>  TCP download::1  :   2335.30      2334.22  Mbits/s  350
>  TCP download::2  :   2334.72      2334.20  Mbits/s  350
>  TCP download::3  :   2335.28      2334.58  Mbits/s  350
>  TCP download::4  :   2334.79      2334.39  Mbits/s  350
>
> A similar result is observed when running DUALPI2 on an unshaped
> 1 GigE link with a single download stream:
>
> Summary of tcp_1down run 'no_split_gso':
>                           avg       median  # data pts
>  Ping (ms) ICMP   :      1.13         1.25  ms       350
>  TCP download     :    941.41       941.46  Mbits/s  350
>
> Summary of tcp_1down run 'split_gso':
>                           avg       median  # data pts
>  Ping (ms) ICMP   :      0.51         0.55  ms       350
>  TCP download     :    941.41       941.45  Mbits/s  350
>
> Additional details can be found in RFC 9332:
> https://datatracker.ietf.org/doc/html/rfc9332
>
> Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
> Co-developed-by: Olga Albisser <olga@albisser.org>
> Signed-off-by: Olga Albisser <olga@albisser.org>
> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Co-developed-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  include/linux/netdevice.h      |    1 +
>  include/uapi/linux/pkt_sched.h |   38 ++
>  net/sched/Kconfig              |   12 +
>  net/sched/Makefile             |    1 +
>  net/sched/sch_dualpi2.c        | 1082 ++++++++++++++++++++++++++++++++
>  5 files changed, 1134 insertions(+)
>  create mode 100644 net/sched/sch_dualpi2.c
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d206c9592b60..3e74938285d7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -30,6 +30,7 @@
>  #include <asm/byteorder.h>
>  #include <asm/local.h>
>
> +#include <linux/netdev_features.h>
>  #include <linux/percpu.h>
>  #include <linux/rculist.h>
>  #include <linux/workqueue.h>
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 25a9a47001cd..69ffe68118b6 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1210,4 +1210,42 @@ enum {
>
>  #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
>
> +/* DUALPI2 */
> +enum {
> +	TCA_DUALPI2_UNSPEC,
> +	TCA_DUALPI2_LIMIT,		/* Packets */
> +	TCA_DUALPI2_MEMORY_LIMIT,	/* Bytes */
> +	TCA_DUALPI2_TARGET,		/* us */
> +	TCA_DUALPI2_TUPDATE,		/* us */
> +	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
> +	TCA_DUALPI2_BETA,		/* Hz scaled up by 256 */
> +	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
> +	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
> +	TCA_DUALPI2_COUPLING,
/* Coupling factor between queues */ > + TCA_DUALPI2_DROP_OVERLOAD, /* Whether to drop on overload */ > + TCA_DUALPI2_DROP_EARLY, /* Whether to drop on enqueue */ > + TCA_DUALPI2_C_PROTECTION, /* Percentage */ > + TCA_DUALPI2_ECN_MASK, /* L4S queue classification mask */ > + TCA_DUALPI2_SPLIT_GSO, /* Split GSO packets at enqueue */ > + TCA_DUALPI2_PAD, > + __TCA_DUALPI2_MAX > +}; > + > +#define TCA_DUALPI2_MAX (__TCA_DUALPI2_MAX - 1) > + > +struct tc_dualpi2_xstats { > + __u32 prob; /* current probability */ > + __u32 delay_c; /* current delay in C queue */ > + __u32 delay_l; /* current delay in L queue */ > + __s32 credit; /* current c_protection credit */ > + __u32 packets_in_c; /* number of packets enqueued in C queue */ > + __u32 packets_in_l; /* number of packets enqueued in L queue */ > + __u32 maxq; /* maximum queue size */ > + __u32 ecn_mark; /* packets marked with ecn*/ > + __u32 step_marks; /* ECN marks due to the step AQM */ > + __u32 memory_used; /* Meory used of both queues */ > + __u32 max_memory_used; /* Maximum used memory */ > + __u32 memory_limit; /* Memory limit of both queues */ > +}; > + > #endif > diff --git a/net/sched/Kconfig b/net/sched/Kconfig > index 8180d0c12fce..f00b5ad92ce2 100644 > --- a/net/sched/Kconfig > +++ b/net/sched/Kconfig > @@ -403,6 +403,18 @@ config NET_SCH_ETS > > If unsure, say N. > > +config NET_SCH_DUALPI2 > + tristate "Dual Queue PI Square (DUALPI2) scheduler" > + help > + Say Y here if you want to use the Dual Queue Proportional Integral > + Controller Improved with a Square scheduling algorithm. > + For more information, please see https://tools.ietf.org/html/rfc9332 > + > + To compile this driver as a module, choose M here: the module > + will be called sch_dualpi2. > + > + If unsure, say N. > + > menuconfig NET_SCH_DEFAULT > bool "Allow override default queue discipline" > help > diff --git a/net/sched/Makefile b/net/sched/Makefile > index 82c3f78ca486..1abb06554057 100644 > --- a/net/sched/Makefile > +++ b/net/sched/Makefile > @@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o > obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o > obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o > obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o > +obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o > > obj-$(CONFIG_NET_CLS_U32) += cls_u32.o > obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o > diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c > new file mode 100644 > index 000000000000..e5d4c3bac093 > --- /dev/null > +++ b/net/sched/sch_dualpi2.c > @@ -0,0 +1,1082 @@ > +// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause > +/* Copyright (C) 2024 Nokia > + * > + * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> > + * Author: Olga Albisser <olga@albisser.org> > + * Author: Henrik Steen <henrist@henrist.net> > + * Author: Olivier Tilmans <olivier.tilmans@nokia.com> > + * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > + * > + * DualPI Improved with a Square (dualpi2): > + * - Supports congestion controls that comply with the Prague requirements > + * in RFC9331 (e.g. TCP-Prague) > + * - Supports coupled dual-queue with PI2 as defined in RFC9332 > + * - Supports ECN L4S-identifier (IP.ECN==0b*1) > + * > + * note: Although DCTCP and BBRv3 can use shallow-threshold ECN marks, > + * they do not meet the 'Prague L4S Requirements' listed in RFC 9331 > + * Section 4, so they can only be used with DualPI2 in a datacenter > + * context. 
> + * > + * References: > + * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332 > + * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and > + * scalable TCP." in proc. ACM CoNEXT'16, 2016. > + */ > + > +#include <linux/errno.h> > +#include <linux/hrtimer.h> > +#include <linux/if_vlan.h> > +#include <linux/kernel.h> > +#include <linux/limits.h> > +#include <linux/module.h> > +#include <linux/skbuff.h> > +#include <linux/types.h> > + > +#include <net/gso.h> > +#include <net/inet_ecn.h> > +#include <net/pkt_cls.h> > +#include <net/pkt_sched.h> > + > +/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets > + * i.e., twice the maximal snd_cwnd. > + * MAX_PROB must be consistent with the RNG in dualpi2_roll(). > + */ > +#define MAX_PROB U32_MAX > + > +/* alpha/beta values exchanged over netlink are in units of 256ns */ > +#define ALPHA_BETA_SHIFT 8 > + > +/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later > + * computations. Consequently (see and dualpi2_scale_alpha_beta()), their > + * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1 > + * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to > + * control flows whose maximal RTTs can be in usec up to few secs. > + */ > +#define ALPHA_BETA_MAX ((1U << 31) - 1) > + > +/* Internal alpha/beta are in units of 64ns. > + * This enables to use all alpha/beta values in the allowed range without loss > + * of precision due to rounding when scaling them internally, e.g., > + * scale_alpha_beta(1) will not round down to 0. > + */ > +#define ALPHA_BETA_GRANULARITY 6 > + > +#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY) > + > +/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */ > +#define MAX_WC 100 > + > +struct dualpi2_sched_data { > + struct Qdisc *l_queue; /* The L4S Low latency queue (L-queue) */ > + struct Qdisc *sch; /* The Classic queue (C-queue) */ > + > + /* Registered tc filters */ > + struct { > + struct tcf_proto __rcu *filters; > + struct tcf_block *block; > + } tcf; > + > + struct { /* PI2 parameters */ > + u64 target; /* Target delay in nanoseconds */ > + u32 tupdate;/* Timer frequency in nanoseconds */ > + u32 prob; /* Base PI probability */ > + u32 alpha; /* Gain factor for the integral rate response */ > + u32 beta; /* Gain factor for the proportional response */ > + struct hrtimer timer; /* prob update timer */ > + } pi2; > + > + struct { /* Step AQM (L-queue only) parameters */ > + u32 thresh; /* Step threshold */ > + bool in_packets;/* Whether the step is in packets or time */ > + } step; > + > + struct { /* C-queue starvation protection */ > + s32 credit; /* Credit (sign indicates which queue) */ > + s32 init; /* Reset value of the credit */ > + u8 wc; /* C-queue weight (between 0 and MAX_WC) */ > + u8 wl; /* L-queue weight (MAX_WC - wc) */ > + } c_protection; > + > + /* General dualQ parameters */ > + u32 memory_limit; /* Memory limit of both queues */ > + u8 coupling_factor;/* Coupling factor (k) between both queues */ > + u8 ecn_mask; /* Mask to match packets into L-queue */ > + bool drop_early; /* Drop at enqueue instead of dequeue if true */ > + bool drop_overload; /* Drop (1) on overload, or overflow (0) */ > + bool split_gso; /* Split aggregated skb (1) or leave as is */ > + > + /* Statistics */ > + u64 c_head_ts; /* Enqueue timestamp of the C-queue head */ > + u64 l_head_ts; /* Enqueue timestamp of the L-queue head */ > + u64 last_qdelay; /* Q delay val at the last probability update */ 
> + u32 packets_in_c; /* Enqueue packet counter of the C-queue */ > + u32 packets_in_l; /* Enqueue packet counter of the L-queue */ > + u32 maxq; /* Maximum queue size of the C-queue */ > + u32 ecn_mark; /* ECN mark pkt counter due to PI probability */ > + u32 step_marks; /* ECN mark pkt counter due to step AQM */ > + u32 memory_used; /* Memory used of both queues */ > + u32 max_memory_used;/* Maximum used memory */ > + > + struct { /* Deferred drop statistics */ > + u32 cnt; /* Packets dropped */ > + u32 len; /* Bytes dropped */ > + } deferred_drops; > +}; > + > +struct dualpi2_skb_cb { > + u64 ts; /* Timestamp at enqueue */ > + u8 apply_step:1, /* Can we apply the step threshold */ > + classified:2, /* Packet classification results */ > + ect:2; /* Packet ECT codepoint */ > +}; > + > +enum dualpi2_classification_results { > + DUALPI2_C_CLASSIC = 0, /* C-queue */ > + DUALPI2_C_L4S = 1, /* L-queue (scale mark/classic drop) */ > + DUALPI2_C_LLLL = 2, /* L-queue (no drops/marks) */ > + __DUALPI2_C_MAX /* Keep last*/ > +}; > + > +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb) > +{ > + qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb)); > + return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data; > +} > + > +static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference) > +{ > + return reference - dualpi2_skb_cb(skb)->ts; > +} > + > +static u64 head_enqueue_time(struct Qdisc *q) > +{ > + struct sk_buff *skb = qdisc_peek_head(q); > + > + return skb ? dualpi2_skb_cb(skb)->ts : 0; > +} > + > +static u32 dualpi2_scale_alpha_beta(u32 param) > +{ > + u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING); > + > + do_div(tmp, NSEC_PER_SEC); > + return tmp; > +} > + > +static u32 dualpi2_unscale_alpha_beta(u32 param) > +{ > + u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING); > + > + do_div(tmp, MAX_PROB); > + return tmp; > +} > + > +static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q) > +{ > + return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate); > +} > + > +static bool skb_is_l4s(struct sk_buff *skb) > +{ > + return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S; > +} > + > +static bool skb_in_l_queue(struct sk_buff *skb) > +{ > + return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC; > +} > + > +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb) > +{ > + if (INET_ECN_set_ce(skb)) { > + q->ecn_mark++; > + return true; > + } > + return false; > +} > + > +static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q) > +{ > + q->c_protection.credit = q->c_protection.init; > +} > + > +/* This computes the initial credit value and WRR weight for the L queue (wl) > + * from the weight of the C queue (wc). > + * If wl > wc, the scheduler will start with the L queue when reset. > + */ > +static void dualpi2_calculate_c_protection(struct Qdisc *sch, > + struct dualpi2_sched_data *q, u32 wc) > +{ > + q->c_protection.wc = wc; > + q->c_protection.wl = MAX_WC - wc; > + q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) * > + ((int)q->c_protection.wc - (int)q->c_protection.wl); > + dualpi2_reset_c_protection(q); > +} > + > +static bool dualpi2_roll(u32 prob) > +{ > + return get_random_u32() <= prob; > +} > + > +/* Packets in the C-queue are subject to a marking probability pC, which is the > + * square of the internal PI probability (i.e., have an overall lower mark/drop > + * probability). If the qdisc is overloaded, ignore ECT values and only drop. 
> + * > + * Note that this marking scheme is also applied to L4S packets during overload. > + * Return true if packet dropping is required in C queue > + */ > +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q, > + struct sk_buff *skb, u32 prob, > + bool overload) > +{ > + if (dualpi2_roll(prob) && dualpi2_roll(prob)) { > + if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT) > + return true; > + dualpi2_mark(q, skb); > + } > + return false; > +} > + > +/* Packets in the L-queue are subject to a marking probability pL given by the > + * internal PI probability scaled by the coupling factor. > + * > + * On overload (i.e., @local_l_prob is >= 100%): > + * - if the qdisc is configured to trade losses to preserve latency (i.e., > + * @q->drop_overload), apply classic drops first before marking. > + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing > + * delay, eventually resulting in taildrop behavior once sch->limit is > + * reached. > + * Return true if packet dropping is required in L queue > + */ > +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q, > + struct sk_buff *skb, > + u64 local_l_prob, u32 prob, > + bool overload) > +{ > + if (overload) { > + /* Apply classic drop */ > + if (!q->drop_overload || > + !(dualpi2_roll(prob) && dualpi2_roll(prob))) > + goto mark; > + return true; > + } > + > + /* We can safely cut the upper 32b as overload==false */ > + if (dualpi2_roll(local_l_prob)) { > + /* Non-ECT packets could have classified as L4S by filters. */ > + if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT) > + return true; > +mark: > + dualpi2_mark(q, skb); > + } > + return false; > +} > + > +/* Decide whether a given packet must be dropped (or marked if ECT), according > + * to the PI2 probability. > + * > + * Never mark/drop if we have a standing queue of less than 2 MTUs. > + */ > +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q, > + struct sk_buff *skb) > +{ > + u64 local_l_prob; > + u32 prob; > + bool overload; > + > + if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch))) > + return false; > + > + prob = READ_ONCE(q->pi2.prob); > + local_l_prob = (u64)prob * q->coupling_factor; > + overload = local_l_prob > MAX_PROB; > + > + switch (dualpi2_skb_cb(skb)->classified) { > + case DUALPI2_C_CLASSIC: > + return dualpi2_classic_marking(q, skb, prob, overload); > + case DUALPI2_C_L4S: > + return dualpi2_scalable_marking(q, skb, local_l_prob, prob, > + overload); > + default: /* DUALPI2_C_LLLL */ > + return false; > + } > +} > + > +static void dualpi2_read_ect(struct sk_buff *skb) > +{ > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb); > + int wlen = skb_network_offset(skb); > + > + switch (skb_protocol(skb, true)) { > + case htons(ETH_P_IP): > + wlen += sizeof(struct iphdr); > + if (!pskb_may_pull(skb, wlen) || > + skb_try_make_writable(skb, wlen)) > + goto not_ecn; > + > + cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK; > + break; > + case htons(ETH_P_IPV6): > + wlen += sizeof(struct ipv6hdr); > + if (!pskb_may_pull(skb, wlen) || > + skb_try_make_writable(skb, wlen)) > + goto not_ecn; > + > + cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK; > + break; > + default: > + goto not_ecn; > + } > + return; > + > +not_ecn: > + /* Non pullable/writable packets can only be dropped hence are > + * classified as not ECT. 
> + */ > + cb->ect = INET_ECN_NOT_ECT; > +} > + > +static int dualpi2_skb_classify(struct dualpi2_sched_data *q, > + struct sk_buff *skb) > +{ > + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb); > + struct tcf_result res; > + struct tcf_proto *fl; > + int result; > + > + dualpi2_read_ect(skb); > + if (cb->ect & q->ecn_mask) { > + cb->classified = DUALPI2_C_L4S; > + return NET_XMIT_SUCCESS; > + } > + > + if (TC_H_MAJ(skb->priority) == q->sch->handle && > + TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) { > + cb->classified = TC_H_MIN(skb->priority); > + return NET_XMIT_SUCCESS; > + } > + > + fl = rcu_dereference_bh(q->tcf.filters); > + if (!fl) { > + cb->classified = DUALPI2_C_CLASSIC; > + return NET_XMIT_SUCCESS; > + } > + > + result = tcf_classify(skb, NULL, fl, &res, false); > + if (result >= 0) { > +#ifdef CONFIG_NET_CLS_ACT > + switch (result) { > + case TC_ACT_STOLEN: > + case TC_ACT_QUEUED: > + case TC_ACT_TRAP: > + return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN; > + case TC_ACT_SHOT: > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; > + } > +#endif > + cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ? > + TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC; > + } > + return NET_XMIT_SUCCESS; > +} > + > +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch, > + struct sk_buff **to_free) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + struct dualpi2_skb_cb *cb; > + > + if (unlikely(qdisc_qlen(sch) >= sch->limit) || > + unlikely((u64)q->memory_used + skb->truesize > q->memory_limit)) { > + qdisc_qstats_overlimit(sch); > + if (skb_in_l_queue(skb)) > + qdisc_qstats_overlimit(q->l_queue); > + return qdisc_drop(skb, sch, to_free); > + } > + > + if (q->drop_early && must_drop(sch, q, skb)) { > + qdisc_drop(skb, sch, to_free); > + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; > + } > + > + cb = dualpi2_skb_cb(skb); > + cb->ts = ktime_get_ns(); > + q->memory_used += skb->truesize; > + if (q->memory_used > q->max_memory_used) > + q->max_memory_used = q->memory_used; > + > + if (qdisc_qlen(sch) > q->maxq) > + q->maxq = qdisc_qlen(sch); > + > + if (skb_in_l_queue(skb)) { > + /* Only apply the step if a queue is building up */ > + dualpi2_skb_cb(skb)->apply_step = > + skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1; > + /* Keep the overall qdisc stats consistent */ > + ++sch->q.qlen; > + qdisc_qstats_backlog_inc(sch, skb); > + ++q->packets_in_l; > + if (!q->l_head_ts) > + q->l_head_ts = cb->ts; > + return qdisc_enqueue_tail(skb, q->l_queue); > + } > + ++q->packets_in_c; > + if (!q->c_head_ts) > + q->c_head_ts = cb->ts; > + return qdisc_enqueue_tail(skb, sch); > +} > + > +/* By default, dualpi2 will split GSO skbs into independent skbs and enqueue > + * each of those individually. This yields the following benefits, at the > + * expense of CPU usage: > + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the > + * same fate (e.g., the random mark/drop probability is applied individually) > + * - Improved precision of the starvation protection/WRR scheduler at dequeue, > + * as the size of the dequeued packets will be smaller. 
> + */ > +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, > + struct sk_buff **to_free) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + int err; > + > + err = dualpi2_skb_classify(q, skb); > + if (err != NET_XMIT_SUCCESS) { > + if (err & __NET_XMIT_BYPASS) > + qdisc_qstats_drop(sch); > + __qdisc_drop(skb, to_free); > + return err; > + } > + > + if (q->split_gso && skb_is_gso(skb)) { > + netdev_features_t features; > + struct sk_buff *nskb, *next; > + int cnt, byte_len, orig_len; > + int err; > + > + features = netif_skb_features(skb); > + nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); > + if (IS_ERR_OR_NULL(nskb)) > + return qdisc_drop(skb, sch, to_free); > + > + cnt = 1; > + byte_len = 0; > + orig_len = qdisc_pkt_len(skb); > + skb_list_walk_safe(nskb, nskb, next) { > + skb_mark_not_on_list(nskb); > + qdisc_skb_cb(nskb)->pkt_len = nskb->len; > + dualpi2_skb_cb(nskb)->classified = > + dualpi2_skb_cb(skb)->classified; > + dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect; > + err = dualpi2_enqueue_skb(nskb, sch, to_free); > + if (err == NET_XMIT_SUCCESS) { > + /* Compute the backlog adjustment that needs > + * to be propagated in the qdisc tree to reflect > + * all new skbs successfully enqueued. > + */ > + ++cnt; > + byte_len += nskb->len; > + } > + } > + if (err == NET_XMIT_SUCCESS) { > + /* The caller will add the original skb stats to its > + * backlog, compensate this. > + */ > + --cnt; > + byte_len -= orig_len; > + } > + qdisc_tree_reduce_backlog(sch, -cnt, -byte_len); > + consume_skb(skb); > + return err; > + } > + return dualpi2_enqueue_skb(skb, sch, to_free); > +} > + > +/* Select the queue from which the next packet can be dequeued, ensuring that > + * neither queue can starve the other with a WRR scheduler. > + * > + * The sign of the WRR credit determines the next queue, while the size of > + * the dequeued packet determines the magnitude of the WRR credit change. If > + * either queue is empty, the WRR credit is kept unchanged. > + * > + * As the dequeued packet can be dropped later, the caller has to perform the > + * qdisc_bstats_update() calls. 
> + */ > +static struct sk_buff *dequeue_packet(struct Qdisc *sch, > + struct dualpi2_sched_data *q, > + int *credit_change, > + u64 now) > +{ > + struct sk_buff *skb = NULL; > + int c_len; > + > + *credit_change = 0; > + c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue); > + if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) { > + skb = __qdisc_dequeue_head(&q->l_queue->q); > + WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue)); > + if (c_len) > + *credit_change = q->c_protection.wc; > + qdisc_qstats_backlog_dec(q->l_queue, skb); > + /* Keep the global queue size consistent */ > + --sch->q.qlen; > + q->memory_used -= skb->truesize; > + } else if (c_len) { > + skb = __qdisc_dequeue_head(&sch->q); > + WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch)); > + if (qdisc_qlen(q->l_queue)) > + *credit_change = ~((s32)q->c_protection.wl) + 1; > + q->memory_used -= skb->truesize; > + } else { > + dualpi2_reset_c_protection(q); > + return NULL; > + } > + *credit_change *= qdisc_pkt_len(skb); > + qdisc_qstats_backlog_dec(sch, skb); > + return skb; > +} > + > +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb, > + u64 now) > +{ > + u64 qdelay = 0; > + > + if (q->step.in_packets) > + qdelay = qdisc_qlen(q->l_queue); > + else > + qdelay = dualpi2_sojourn_time(skb, now); > + > + if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) { > + if (!dualpi2_skb_cb(skb)->ect) > + /* Drop this non-ECT packet */ > + return 1; > + if (dualpi2_mark(q, skb)) > + ++q->step_marks; > + } > + qdisc_bstats_update(q->l_queue, skb); > + return 0; > +} > + > +static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb, > + struct Qdisc *sch) > +{ > + ++q->deferred_drops.cnt; > + q->deferred_drops.len += qdisc_pkt_len(skb); > + q->memory_used -= skb->truesize; > + consume_skb(skb); > + qdisc_qstats_drop(sch); > +} > + > +static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + struct sk_buff *skb; > + int credit_change; > + u64 now; > + > + now = ktime_get_ns(); > + > + while ((skb = dequeue_packet(sch, q, &credit_change, now))) { > + if (!q->drop_early && must_drop(sch, q, skb)) { > + drop_and_retry(q, skb, sch); > + continue; > + } > + > + if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) { > + qdisc_qstats_drop(q->l_queue); > + drop_and_retry(q, skb, sch); > + continue; > + } > + > + q->c_protection.credit += credit_change; > + qdisc_bstats_update(sch, skb); > + break; > + } > + > + /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0, > + * or HTB crashes. > + */ > + if (q->deferred_drops.cnt && qdisc_qlen(sch)) { > + qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt, > + q->deferred_drops.len); > + q->deferred_drops.cnt = 0; > + q->deferred_drops.len = 0; > + } > + return skb; > +} > + > +static s64 __scale_delta(u64 diff) > +{ > + do_div(diff, 1 << ALPHA_BETA_GRANULARITY); > + return diff; > +} > + > +static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c, > + u64 *qdelay_l) > +{ > + u64 now, qc, ql; > + > + now = ktime_get_ns(); > + qc = READ_ONCE(q->c_head_ts); > + ql = READ_ONCE(q->l_head_ts); > + > + *qdelay_c = qc ? now - qc : 0; > + *qdelay_l = ql ? 
now - ql : 0; > +} > + > +static u32 calculate_probability(struct Qdisc *sch) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + u32 new_prob; > + u64 qdelay_c; > + u64 qdelay_l; > + u64 qdelay; > + s64 delta; > + > + get_queue_delays(q, &qdelay_c, &qdelay_l); > + qdelay = max(qdelay_l, qdelay_c); > + /* Alpha and beta take at most 32b, i.e, the delay difference would > + * overflow for queuing delay differences > ~4.2sec. > + */ > + delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha; > + delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta; > + if (delta > 0) { > + new_prob = __scale_delta(delta) + q->pi2.prob; > + if (new_prob < q->pi2.prob) > + new_prob = MAX_PROB; > + } else { > + new_prob = q->pi2.prob - __scale_delta(~delta + 1); > + if (new_prob > q->pi2.prob) > + new_prob = 0; > + } > + q->last_qdelay = qdelay; > + /* If we do not drop on overload, ensure we cap the L4S probability to > + * 100% to keep window fairness when overflowing. > + */ > + if (!q->drop_overload) > + return min_t(u32, new_prob, MAX_PROB / q->coupling_factor); > + return new_prob; > +} > + > +static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer) > +{ > + struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer); > + > + WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch)); > + > + hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q)); > + return HRTIMER_RESTART; > +} > + > +static struct netlink_range_validation dualpi2_alpha_beta_range = { > + .min = 1, > + .max = ALPHA_BETA_MAX, > +}; > + > +static struct netlink_range_validation dualpi2_wc_range = { > + .min = 0, > + .max = MAX_WC, > +}; > + > +static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = { > + [TCA_DUALPI2_LIMIT] = NLA_POLICY_MIN(NLA_U32, 1), > + [TCA_DUALPI2_MEMORY_LIMIT] = NLA_POLICY_MIN(NLA_U32, 1), > + [TCA_DUALPI2_TARGET] = {.type = NLA_U32}, > + [TCA_DUALPI2_TUPDATE] = NLA_POLICY_MIN(NLA_U32, 1), > + [TCA_DUALPI2_ALPHA] = > + NLA_POLICY_FULL_RANGE(NLA_U32, &dualpi2_alpha_beta_range), > + [TCA_DUALPI2_BETA] = > + NLA_POLICY_FULL_RANGE(NLA_U32, &dualpi2_alpha_beta_range), > + [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32}, > + [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8}, > + [TCA_DUALPI2_COUPLING] = NLA_POLICY_MIN(NLA_U8, 1), > + [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8}, > + [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8}, > + [TCA_DUALPI2_C_PROTECTION] = > + NLA_POLICY_FULL_RANGE(NLA_U8, &dualpi2_wc_range), > + [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8}, > + [TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8}, > +}; > + > +static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt, > + struct netlink_ext_ack *extack) > +{ > + struct nlattr *tb[TCA_DUALPI2_MAX + 1]; > + struct dualpi2_sched_data *q; > + int old_backlog; > + int old_qlen; > + int err; > + > + if (!opt) > + return -EINVAL; > + err = nla_parse_nested(tb, TCA_DUALPI2_MAX, opt, dualpi2_policy, > + extack); > + if (err < 0) > + return err; > + > + q = qdisc_priv(sch); > + sch_tree_lock(sch); > + > + if (tb[TCA_DUALPI2_LIMIT]) { > + u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]); > + > + WRITE_ONCE(sch->limit, limit); > + WRITE_ONCE(q->memory_limit, limit * psched_mtu(qdisc_dev(sch))); > + } > + > + if (tb[TCA_DUALPI2_MEMORY_LIMIT]) > + WRITE_ONCE(q->memory_limit, > + nla_get_u32(tb[TCA_DUALPI2_MEMORY_LIMIT])); > + > + if (tb[TCA_DUALPI2_TARGET]) { > + u64 target = nla_get_u32(tb[TCA_DUALPI2_TARGET]); > + > + WRITE_ONCE(q->pi2.target, target * NSEC_PER_USEC); > + } > + > + if (tb[TCA_DUALPI2_TUPDATE]) { > + u64 tupdate = 
nla_get_u32(tb[TCA_DUALPI2_TUPDATE]); > + > + WRITE_ONCE(q->pi2.tupdate, tupdate * NSEC_PER_USEC); > + } > + > + if (tb[TCA_DUALPI2_ALPHA]) { > + u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]); > + > + WRITE_ONCE(q->pi2.alpha, dualpi2_scale_alpha_beta(alpha)); > + } > + > + if (tb[TCA_DUALPI2_BETA]) { > + u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]); > + > + WRITE_ONCE(q->pi2.beta, dualpi2_scale_alpha_beta(beta)); > + } > + > + if (tb[TCA_DUALPI2_STEP_PACKETS]) { > + bool step_pkt = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]); > + u32 step_th = READ_ONCE(q->step.thresh); > + > + WRITE_ONCE(q->step.in_packets, step_pkt); > + WRITE_ONCE(q->step.thresh, > + step_pkt ? step_th : (step_th * NSEC_PER_USEC)); > + } > + > + if (tb[TCA_DUALPI2_STEP_THRESH]) { > + u32 step_th = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]); > + bool step_pkt = READ_ONCE(q->step.in_packets); > + > + WRITE_ONCE(q->step.thresh, > + step_pkt ? step_th : (step_th * NSEC_PER_USEC)); > + } > + > + if (tb[TCA_DUALPI2_COUPLING]) { > + u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]); > + > + WRITE_ONCE(q->coupling_factor, coupling); > + } > + > + if (tb[TCA_DUALPI2_DROP_OVERLOAD]) > + WRITE_ONCE(q->drop_overload, > + !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD])); > + > + if (tb[TCA_DUALPI2_DROP_EARLY]) > + WRITE_ONCE(q->drop_early, > + !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY])); > + > + if (tb[TCA_DUALPI2_C_PROTECTION]) { > + u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]); > + > + dualpi2_calculate_c_protection(sch, q, wc); > + } > + > + if (tb[TCA_DUALPI2_ECN_MASK]) > + WRITE_ONCE(q->ecn_mask, > + nla_get_u8(tb[TCA_DUALPI2_ECN_MASK])); > + > + if (tb[TCA_DUALPI2_SPLIT_GSO]) > + WRITE_ONCE(q->split_gso, > + !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO])); > + > + old_qlen = qdisc_qlen(sch); > + old_backlog = sch->qstats.backlog; > + while (qdisc_qlen(sch) > sch->limit || > + q->memory_used > q->memory_limit) { > + struct sk_buff *skb = __qdisc_dequeue_head(&sch->q); > + > + q->memory_used -= skb->truesize; > + qdisc_qstats_backlog_dec(sch, skb); > + rtnl_qdisc_drop(skb, sch); > + } > + qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), > + old_backlog - sch->qstats.backlog); > + > + sch_tree_unlock(sch); > + return 0; > +} > + > +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. 
*/ > +static void dualpi2_reset_default(struct Qdisc *sch) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + > + q->sch->limit = 10000; /* Max 125ms at 1Gbps */ > + q->memory_limit = q->sch->limit * psched_mtu(qdisc_dev(sch)); > + > + q->pi2.target = 15 * NSEC_PER_MSEC; > + q->pi2.tupdate = 16 * NSEC_PER_MSEC; > + q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz * 256 */ > + q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.20 Hz * 256 */ > + > + q->step.thresh = 1 * NSEC_PER_MSEC; > + q->step.in_packets = false; > + > + dualpi2_calculate_c_protection(q->sch, q, 10); /* wc=10%, wl=90% */ > + > + q->ecn_mask = INET_ECN_ECT_1; > + q->coupling_factor = 2; /* window fairness for equal RTTs */ > + q->drop_overload = true; /* Preserve latency by dropping */ > + q->drop_early = false; /* PI2 drops on dequeue */ > + q->split_gso = true; > +} > + > +static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt, > + struct netlink_ext_ack *extack) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + int err; > + > + q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops, > + TC_H_MAKE(sch->handle, 1), extack); > + if (!q->l_queue) > + return -ENOMEM; > + > + err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack); > + if (err) > + return err; > + > + q->sch = sch; > + dualpi2_reset_default(sch); > + hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); > + q->pi2.timer.function = dualpi2_timer; > + > + if (opt) { > + err = dualpi2_change(sch, opt, extack); > + > + if (err) > + return err; > + } > + > + hrtimer_start(&q->pi2.timer, next_pi2_timeout(q), > + HRTIMER_MODE_ABS_PINNED); > + return 0; > +} > + > +static u32 convert_ns_to_usec(u64 ns) > +{ > + do_div(ns, NSEC_PER_USEC); > + return ns; > +} > + > +static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + struct nlattr *opts; > + > + opts = nla_nest_start_noflag(skb, TCA_OPTIONS); > + if (!opts) > + goto nla_put_failure; > + > + if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, READ_ONCE(sch->limit)) || > + nla_put_u32(skb, TCA_DUALPI2_MEMORY_LIMIT, > + READ_ONCE(q->memory_limit)) || > + nla_put_u32(skb, TCA_DUALPI2_TARGET, > + convert_ns_to_usec(READ_ONCE(q->pi2.target))) || > + nla_put_u32(skb, TCA_DUALPI2_TUPDATE, > + convert_ns_to_usec(READ_ONCE(q->pi2.tupdate))) || > + nla_put_u32(skb, TCA_DUALPI2_ALPHA, > + dualpi2_unscale_alpha_beta(READ_ONCE(q->pi2.alpha))) || > + nla_put_u32(skb, TCA_DUALPI2_BETA, > + dualpi2_unscale_alpha_beta(READ_ONCE(q->pi2.beta))) || > + nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, > + READ_ONCE(q->step.in_packets) ? 
> + READ_ONCE(q->step.thresh) : > + convert_ns_to_usec(READ_ONCE(q->step.thresh))) || > + nla_put_u8(skb, TCA_DUALPI2_COUPLING, > + READ_ONCE(q->coupling_factor)) || > + nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, > + READ_ONCE(q->drop_overload)) || > + nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, > + READ_ONCE(q->step.in_packets)) || > + nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, > + READ_ONCE(q->drop_early)) || > + nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, > + READ_ONCE(q->c_protection.wc)) || > + nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, READ_ONCE(q->ecn_mask)) || > + nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, READ_ONCE(q->split_gso))) > + goto nla_put_failure; > + > + return nla_nest_end(skb, opts); > + > +nla_put_failure: > + nla_nest_cancel(skb, opts); > + return -1; > +} > + > +static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + struct tc_dualpi2_xstats st = { > + .prob = READ_ONCE(q->pi2.prob), > + .packets_in_c = q->packets_in_c, > + .packets_in_l = q->packets_in_l, > + .maxq = q->maxq, > + .ecn_mark = q->ecn_mark, > + .credit = q->c_protection.credit, > + .step_marks = q->step_marks, > + .memory_used = q->memory_used, > + .max_memory_used = q->max_memory_used, > + .memory_limit = q->memory_limit, > + }; > + u64 qc, ql; > + > + get_queue_delays(q, &qc, &ql); > + st.delay_l = convert_ns_to_usec(ql); > + st.delay_c = convert_ns_to_usec(qc); > + return gnet_stats_copy_app(d, &st, sizeof(st)); > +} > + > +/* Reset both L-queue and C-queue, internal packet counters, PI probability, > + * C-queue protection credit, and timestamps, while preserving current > + * configuration of DUALPI2. > + */ > +static void dualpi2_reset(struct Qdisc *sch) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + > + qdisc_reset_queue(sch); > + qdisc_reset_queue(q->l_queue); > + q->c_head_ts = 0; > + q->l_head_ts = 0; > + q->pi2.prob = 0; > + q->packets_in_c = 0; > + q->packets_in_l = 0; > + q->maxq = 0; > + q->ecn_mark = 0; > + q->step_marks = 0; > + q->memory_used = 0; > + q->max_memory_used = 0; > + dualpi2_reset_c_protection(q); > +} > + > +static void dualpi2_destroy(struct Qdisc *sch) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + > + q->pi2.tupdate = 0; > + hrtimer_cancel(&q->pi2.timer); > + if (q->l_queue) > + qdisc_put(q->l_queue); > + tcf_block_put(q->tcf.block); > +} > + > +static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg) > +{ > + return NULL; > +} > + > +static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid) > +{ > + return 0; > +} > + > +static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent, > + u32 classid) > +{ > + return 0; > +} > + > +static void dualpi2_unbind(struct Qdisc *q, unsigned long cl) > +{ > +} > + > +static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl, > + struct netlink_ext_ack *extack) > +{ > + struct dualpi2_sched_data *q = qdisc_priv(sch); > + > + if (cl) > + return NULL; > + return q->tcf.block; > +} > + > +static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg) > +{ > + unsigned int i; > + > + if (arg->stop) > + return; > + > + /* We statically define only 2 queues */ > + for (i = 0; i < 2; i++) { > + if (arg->count < arg->skip) { > + arg->count++; > + continue; > + } > + if (arg->fn(sch, i + 1, arg) < 0) { > + arg->stop = 1; > + break; > + } > + arg->count++; > + } > +} > + > +/* Minimal class support to handler tc filters */ > +static const struct Qdisc_class_ops dualpi2_class_ops = { > + 
.leaf = dualpi2_leaf, > + .find = dualpi2_find, > + .tcf_block = dualpi2_tcf_block, > + .bind_tcf = dualpi2_bind, > + .unbind_tcf = dualpi2_unbind, > + .walk = dualpi2_walk, > +}; > + > +static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = { > + .id = "dualpi2", > + .cl_ops = &dualpi2_class_ops, > + .priv_size = sizeof(struct dualpi2_sched_data), > + .enqueue = dualpi2_qdisc_enqueue, > + .dequeue = dualpi2_qdisc_dequeue, > + .peek = qdisc_peek_dequeued, > + .init = dualpi2_init, > + .destroy = dualpi2_destroy, > + .reset = dualpi2_reset, > + .change = dualpi2_change, > + .dump = dualpi2_dump, > + .dump_stats = dualpi2_dump_stats, > + .owner = THIS_MODULE, > +}; > + > +static int __init dualpi2_module_init(void) > +{ > + return register_qdisc(&dualpi2_qdisc_ops); > +} > + > +static void __exit dualpi2_module_exit(void) > +{ > + unregister_qdisc(&dualpi2_qdisc_ops); > +} > + > +module_init(dualpi2_module_init); > +module_exit(dualpi2_module_exit); > + > +MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler"); > +MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>"); > +MODULE_AUTHOR("Olga Albisser <olga@albisser.org>"); > +MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>"); > +MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>"); > +MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>"); > + > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_VERSION("1.0"); > -- > 2.34.1 > While I still have some nits... Acked-By: Dave Taht <dave.taht@gmail.com>
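
As a companion to the cover letter's description of the PI controller, the following self-contained userspace sketch shows how alpha and beta drive the base probability that is later squared for the classic queue and coupled for the L4S queue. It uses floating point instead of the kernel's scaled fixed point, invented struct and function names, and the patch defaults (15 ms target, 16 ms update period, alpha ~ 0.16 Hz, beta ~ 3.2 Hz); it is an illustration of the control law, not the in-kernel code.

#include <stdio.h>

/* Simplified floating-point sketch of the periodic PI2 probability
 * update run on every tupdate tick.
 */
struct pi2_state {
	double prob;		/* base PI probability in [0, 1] */
	double last_qdelay;	/* queue delay at the previous update (s) */
};

static void pi2_update(struct pi2_state *st, double qdelay_c, double qdelay_l)
{
	const double target = 0.015;	/* target delay (s) */
	const double alpha = 0.16;	/* integral gain */
	const double beta = 3.2;	/* proportional gain */
	/* The controller acts on the larger of the two queue delays. */
	double qdelay = qdelay_c > qdelay_l ? qdelay_c : qdelay_l;
	double delta;

	/* The integral term corrects the standing error from the target;
	 * the proportional term reacts to queue growth or shrinkage.
	 */
	delta = alpha * (qdelay - target) + beta * (qdelay - st->last_qdelay);
	st->prob += delta;
	if (st->prob < 0.0)
		st->prob = 0.0;
	else if (st->prob > 1.0)
		st->prob = 1.0;
	st->last_qdelay = qdelay;
}

int main(void)
{
	struct pi2_state st = { 0.0, 0.0 };

	/* Feed a constant 25 ms classic delay and 1 ms L4S delay. */
	for (int i = 0; i < 10; i++) {
		pi2_update(&st, 0.025, 0.001);
		printf("update %2d: p = %.4f, classic p^2 = %.6f\n",
		       i, st.prob, st.prob * st.prob);
	}
	return 0;
}
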
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index d206c9592b60..3e74938285d7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -30,6 +30,7 @@ #include <asm/byteorder.h> #include <asm/local.h> +#include <linux/netdev_features.h> #include <linux/percpu.h> #include <linux/rculist.h> #include <linux/workqueue.h> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 25a9a47001cd..69ffe68118b6 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -1210,4 +1210,42 @@ enum { #define TCA_ETS_MAX (__TCA_ETS_MAX - 1) +/* DUALPI2 */ +enum { + TCA_DUALPI2_UNSPEC, + TCA_DUALPI2_LIMIT, /* Packets */ + TCA_DUALPI2_MEMORY_LIMIT, /* Bytes */ + TCA_DUALPI2_TARGET, /* us */ + TCA_DUALPI2_TUPDATE, /* us */ + TCA_DUALPI2_ALPHA, /* Hz scaled up by 256 */ + TCA_DUALPI2_BETA, /* HZ scaled up by 256 */ + TCA_DUALPI2_STEP_THRESH, /* Packets or us */ + TCA_DUALPI2_STEP_PACKETS, /* Whether STEP_THRESH is in packets */ + TCA_DUALPI2_COUPLING, /* Coupling factor between queues */ + TCA_DUALPI2_DROP_OVERLOAD, /* Whether to drop on overload */ + TCA_DUALPI2_DROP_EARLY, /* Whether to drop on enqueue */ + TCA_DUALPI2_C_PROTECTION, /* Percentage */ + TCA_DUALPI2_ECN_MASK, /* L4S queue classification mask */ + TCA_DUALPI2_SPLIT_GSO, /* Split GSO packets at enqueue */ + TCA_DUALPI2_PAD, + __TCA_DUALPI2_MAX +}; + +#define TCA_DUALPI2_MAX (__TCA_DUALPI2_MAX - 1) + +struct tc_dualpi2_xstats { + __u32 prob; /* current probability */ + __u32 delay_c; /* current delay in C queue */ + __u32 delay_l; /* current delay in L queue */ + __s32 credit; /* current c_protection credit */ + __u32 packets_in_c; /* number of packets enqueued in C queue */ + __u32 packets_in_l; /* number of packets enqueued in L queue */ + __u32 maxq; /* maximum queue size */ + __u32 ecn_mark; /* packets marked with ecn*/ + __u32 step_marks; /* ECN marks due to the step AQM */ + __u32 memory_used; /* Meory used of both queues */ + __u32 max_memory_used; /* Maximum used memory */ + __u32 memory_limit; /* Memory limit of both queues */ +}; + #endif diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 8180d0c12fce..f00b5ad92ce2 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -403,6 +403,18 @@ config NET_SCH_ETS If unsure, say N. +config NET_SCH_DUALPI2 + tristate "Dual Queue PI Square (DUALPI2) scheduler" + help + Say Y here if you want to use the Dual Queue Proportional Integral + Controller Improved with a Square scheduling algorithm. + For more information, please see https://tools.ietf.org/html/rfc9332 + + To compile this driver as a module, choose M here: the module + will be called sch_dualpi2. + + If unsure, say N. 
+ menuconfig NET_SCH_DEFAULT bool "Allow override default queue discipline" help diff --git a/net/sched/Makefile b/net/sched/Makefile index 82c3f78ca486..1abb06554057 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o +obj-$(CONFIG_NET_SCH_DUALPI2) += sch_dualpi2.o obj-$(CONFIG_NET_CLS_U32) += cls_u32.o obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c new file mode 100644 index 000000000000..e5d4c3bac093 --- /dev/null +++ b/net/sched/sch_dualpi2.c @@ -0,0 +1,1082 @@ +// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +/* Copyright (C) 2024 Nokia + * + * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> + * Author: Olga Albisser <olga@albisser.org> + * Author: Henrik Steen <henrist@henrist.net> + * Author: Olivier Tilmans <olivier.tilmans@nokia.com> + * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> + * + * DualPI Improved with a Square (dualpi2): + * - Supports congestion controls that comply with the Prague requirements + * in RFC9331 (e.g. TCP-Prague) + * - Supports coupled dual-queue with PI2 as defined in RFC9332 + * - Supports ECN L4S-identifier (IP.ECN==0b*1) + * + * note: Although DCTCP and BBRv3 can use shallow-threshold ECN marks, + * they do not meet the 'Prague L4S Requirements' listed in RFC 9331 + * Section 4, so they can only be used with DualPI2 in a datacenter + * context. + * + * References: + * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332 + * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and + * scalable TCP." in proc. ACM CoNEXT'16, 2016. + */ + +#include <linux/errno.h> +#include <linux/hrtimer.h> +#include <linux/if_vlan.h> +#include <linux/kernel.h> +#include <linux/limits.h> +#include <linux/module.h> +#include <linux/skbuff.h> +#include <linux/types.h> + +#include <net/gso.h> +#include <net/inet_ecn.h> +#include <net/pkt_cls.h> +#include <net/pkt_sched.h> + +/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets + * i.e., twice the maximal snd_cwnd. + * MAX_PROB must be consistent with the RNG in dualpi2_roll(). + */ +#define MAX_PROB U32_MAX + +/* alpha/beta values exchanged over netlink are in units of 256ns */ +#define ALPHA_BETA_SHIFT 8 + +/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later + * computations. Consequently (see and dualpi2_scale_alpha_beta()), their + * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1 + * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to + * control flows whose maximal RTTs can be in usec up to few secs. + */ +#define ALPHA_BETA_MAX ((1U << 31) - 1) + +/* Internal alpha/beta are in units of 64ns. + * This enables to use all alpha/beta values in the allowed range without loss + * of precision due to rounding when scaling them internally, e.g., + * scale_alpha_beta(1) will not round down to 0. 
+ */ +#define ALPHA_BETA_GRANULARITY 6 + +#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY) + +/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */ +#define MAX_WC 100 + +struct dualpi2_sched_data { + struct Qdisc *l_queue; /* The L4S Low latency queue (L-queue) */ + struct Qdisc *sch; /* The Classic queue (C-queue) */ + + /* Registered tc filters */ + struct { + struct tcf_proto __rcu *filters; + struct tcf_block *block; + } tcf; + + struct { /* PI2 parameters */ + u64 target; /* Target delay in nanoseconds */ + u32 tupdate;/* Timer frequency in nanoseconds */ + u32 prob; /* Base PI probability */ + u32 alpha; /* Gain factor for the integral rate response */ + u32 beta; /* Gain factor for the proportional response */ + struct hrtimer timer; /* prob update timer */ + } pi2; + + struct { /* Step AQM (L-queue only) parameters */ + u32 thresh; /* Step threshold */ + bool in_packets;/* Whether the step is in packets or time */ + } step; + + struct { /* C-queue starvation protection */ + s32 credit; /* Credit (sign indicates which queue) */ + s32 init; /* Reset value of the credit */ + u8 wc; /* C-queue weight (between 0 and MAX_WC) */ + u8 wl; /* L-queue weight (MAX_WC - wc) */ + } c_protection; + + /* General dualQ parameters */ + u32 memory_limit; /* Memory limit of both queues */ + u8 coupling_factor;/* Coupling factor (k) between both queues */ + u8 ecn_mask; /* Mask to match packets into L-queue */ + bool drop_early; /* Drop at enqueue instead of dequeue if true */ + bool drop_overload; /* Drop (1) on overload, or overflow (0) */ + bool split_gso; /* Split aggregated skb (1) or leave as is */ + + /* Statistics */ + u64 c_head_ts; /* Enqueue timestamp of the C-queue head */ + u64 l_head_ts; /* Enqueue timestamp of the L-queue head */ + u64 last_qdelay; /* Q delay val at the last probability update */ + u32 packets_in_c; /* Enqueue packet counter of the C-queue */ + u32 packets_in_l; /* Enqueue packet counter of the L-queue */ + u32 maxq; /* Maximum queue size of the C-queue */ + u32 ecn_mark; /* ECN mark pkt counter due to PI probability */ + u32 step_marks; /* ECN mark pkt counter due to step AQM */ + u32 memory_used; /* Memory used of both queues */ + u32 max_memory_used;/* Maximum used memory */ + + struct { /* Deferred drop statistics */ + u32 cnt; /* Packets dropped */ + u32 len; /* Bytes dropped */ + } deferred_drops; +}; + +struct dualpi2_skb_cb { + u64 ts; /* Timestamp at enqueue */ + u8 apply_step:1, /* Can we apply the step threshold */ + classified:2, /* Packet classification results */ + ect:2; /* Packet ECT codepoint */ +}; + +enum dualpi2_classification_results { + DUALPI2_C_CLASSIC = 0, /* C-queue */ + DUALPI2_C_L4S = 1, /* L-queue (scale mark/classic drop) */ + DUALPI2_C_LLLL = 2, /* L-queue (no drops/marks) */ + __DUALPI2_C_MAX /* Keep last*/ +}; + +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb) +{ + qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb)); + return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data; +} + +static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference) +{ + return reference - dualpi2_skb_cb(skb)->ts; +} + +static u64 head_enqueue_time(struct Qdisc *q) +{ + struct sk_buff *skb = qdisc_peek_head(q); + + return skb ? 
dualpi2_skb_cb(skb)->ts : 0; +} + +static u32 dualpi2_scale_alpha_beta(u32 param) +{ + u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING); + + do_div(tmp, NSEC_PER_SEC); + return tmp; +} + +static u32 dualpi2_unscale_alpha_beta(u32 param) +{ + u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING); + + do_div(tmp, MAX_PROB); + return tmp; +} + +static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q) +{ + return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate); +} + +static bool skb_is_l4s(struct sk_buff *skb) +{ + return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S; +} + +static bool skb_in_l_queue(struct sk_buff *skb) +{ + return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC; +} + +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb) +{ + if (INET_ECN_set_ce(skb)) { + q->ecn_mark++; + return true; + } + return false; +} + +static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q) +{ + q->c_protection.credit = q->c_protection.init; +} + +/* This computes the initial credit value and WRR weight for the L queue (wl) + * from the weight of the C queue (wc). + * If wl > wc, the scheduler will start with the L queue when reset. + */ +static void dualpi2_calculate_c_protection(struct Qdisc *sch, + struct dualpi2_sched_data *q, u32 wc) +{ + q->c_protection.wc = wc; + q->c_protection.wl = MAX_WC - wc; + q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) * + ((int)q->c_protection.wc - (int)q->c_protection.wl); + dualpi2_reset_c_protection(q); +} + +static bool dualpi2_roll(u32 prob) +{ + return get_random_u32() <= prob; +} + +/* Packets in the C-queue are subject to a marking probability pC, which is the + * square of the internal PI probability (i.e., have an overall lower mark/drop + * probability). If the qdisc is overloaded, ignore ECT values and only drop. + * + * Note that this marking scheme is also applied to L4S packets during overload. + * Return true if packet dropping is required in C queue + */ +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q, + struct sk_buff *skb, u32 prob, + bool overload) +{ + if (dualpi2_roll(prob) && dualpi2_roll(prob)) { + if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT) + return true; + dualpi2_mark(q, skb); + } + return false; +} + +/* Packets in the L-queue are subject to a marking probability pL given by the + * internal PI probability scaled by the coupling factor. + * + * On overload (i.e., @local_l_prob is >= 100%): + * - if the qdisc is configured to trade losses to preserve latency (i.e., + * @q->drop_overload), apply classic drops first before marking. + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing + * delay, eventually resulting in taildrop behavior once sch->limit is + * reached. + * Return true if packet dropping is required in L queue + */ +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q, + struct sk_buff *skb, + u64 local_l_prob, u32 prob, + bool overload) +{ + if (overload) { + /* Apply classic drop */ + if (!q->drop_overload || + !(dualpi2_roll(prob) && dualpi2_roll(prob))) + goto mark; + return true; + } + + /* We can safely cut the upper 32b as overload==false */ + if (dualpi2_roll(local_l_prob)) { + /* Non-ECT packets could have classified as L4S by filters. */ + if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT) + return true; +mark: + dualpi2_mark(q, skb); + } + return false; +} + +/* Decide whether a given packet must be dropped (or marked if ECT), according + * to the PI2 probability. 
+ * + * Never mark/drop if we have a standing queue of less than 2 MTUs. + */ +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q, + struct sk_buff *skb) +{ + u64 local_l_prob; + u32 prob; + bool overload; + + if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch))) + return false; + + prob = READ_ONCE(q->pi2.prob); + local_l_prob = (u64)prob * q->coupling_factor; + overload = local_l_prob > MAX_PROB; + + switch (dualpi2_skb_cb(skb)->classified) { + case DUALPI2_C_CLASSIC: + return dualpi2_classic_marking(q, skb, prob, overload); + case DUALPI2_C_L4S: + return dualpi2_scalable_marking(q, skb, local_l_prob, prob, + overload); + default: /* DUALPI2_C_LLLL */ + return false; + } +} + +static void dualpi2_read_ect(struct sk_buff *skb) +{ + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb); + int wlen = skb_network_offset(skb); + + switch (skb_protocol(skb, true)) { + case htons(ETH_P_IP): + wlen += sizeof(struct iphdr); + if (!pskb_may_pull(skb, wlen) || + skb_try_make_writable(skb, wlen)) + goto not_ecn; + + cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK; + break; + case htons(ETH_P_IPV6): + wlen += sizeof(struct ipv6hdr); + if (!pskb_may_pull(skb, wlen) || + skb_try_make_writable(skb, wlen)) + goto not_ecn; + + cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK; + break; + default: + goto not_ecn; + } + return; + +not_ecn: + /* Non pullable/writable packets can only be dropped hence are + * classified as not ECT. + */ + cb->ect = INET_ECN_NOT_ECT; +} + +static int dualpi2_skb_classify(struct dualpi2_sched_data *q, + struct sk_buff *skb) +{ + struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb); + struct tcf_result res; + struct tcf_proto *fl; + int result; + + dualpi2_read_ect(skb); + if (cb->ect & q->ecn_mask) { + cb->classified = DUALPI2_C_L4S; + return NET_XMIT_SUCCESS; + } + + if (TC_H_MAJ(skb->priority) == q->sch->handle && + TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) { + cb->classified = TC_H_MIN(skb->priority); + return NET_XMIT_SUCCESS; + } + + fl = rcu_dereference_bh(q->tcf.filters); + if (!fl) { + cb->classified = DUALPI2_C_CLASSIC; + return NET_XMIT_SUCCESS; + } + + result = tcf_classify(skb, NULL, fl, &res, false); + if (result >= 0) { +#ifdef CONFIG_NET_CLS_ACT + switch (result) { + case TC_ACT_STOLEN: + case TC_ACT_QUEUED: + case TC_ACT_TRAP: + return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN; + case TC_ACT_SHOT: + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; + } +#endif + cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ? 
+ TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC; + } + return NET_XMIT_SUCCESS; +} + +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch, + struct sk_buff **to_free) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + struct dualpi2_skb_cb *cb; + + if (unlikely(qdisc_qlen(sch) >= sch->limit) || + unlikely((u64)q->memory_used + skb->truesize > q->memory_limit)) { + qdisc_qstats_overlimit(sch); + if (skb_in_l_queue(skb)) + qdisc_qstats_overlimit(q->l_queue); + return qdisc_drop(skb, sch, to_free); + } + + if (q->drop_early && must_drop(sch, q, skb)) { + qdisc_drop(skb, sch, to_free); + return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS; + } + + cb = dualpi2_skb_cb(skb); + cb->ts = ktime_get_ns(); + q->memory_used += skb->truesize; + if (q->memory_used > q->max_memory_used) + q->max_memory_used = q->memory_used; + + if (qdisc_qlen(sch) > q->maxq) + q->maxq = qdisc_qlen(sch); + + if (skb_in_l_queue(skb)) { + /* Only apply the step if a queue is building up */ + dualpi2_skb_cb(skb)->apply_step = + skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1; + /* Keep the overall qdisc stats consistent */ + ++sch->q.qlen; + qdisc_qstats_backlog_inc(sch, skb); + ++q->packets_in_l; + if (!q->l_head_ts) + q->l_head_ts = cb->ts; + return qdisc_enqueue_tail(skb, q->l_queue); + } + ++q->packets_in_c; + if (!q->c_head_ts) + q->c_head_ts = cb->ts; + return qdisc_enqueue_tail(skb, sch); +} + +/* By default, dualpi2 will split GSO skbs into independent skbs and enqueue + * each of those individually. This yields the following benefits, at the + * expense of CPU usage: + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the + * same fate (e.g., the random mark/drop probability is applied individually) + * - Improved precision of the starvation protection/WRR scheduler at dequeue, + * as the size of the dequeued packets will be smaller. + */ +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, + struct sk_buff **to_free) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + int err; + + err = dualpi2_skb_classify(q, skb); + if (err != NET_XMIT_SUCCESS) { + if (err & __NET_XMIT_BYPASS) + qdisc_qstats_drop(sch); + __qdisc_drop(skb, to_free); + return err; + } + + if (q->split_gso && skb_is_gso(skb)) { + netdev_features_t features; + struct sk_buff *nskb, *next; + int cnt, byte_len, orig_len; + int err; + + features = netif_skb_features(skb); + nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK); + if (IS_ERR_OR_NULL(nskb)) + return qdisc_drop(skb, sch, to_free); + + cnt = 1; + byte_len = 0; + orig_len = qdisc_pkt_len(skb); + skb_list_walk_safe(nskb, nskb, next) { + skb_mark_not_on_list(nskb); + qdisc_skb_cb(nskb)->pkt_len = nskb->len; + dualpi2_skb_cb(nskb)->classified = + dualpi2_skb_cb(skb)->classified; + dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect; + err = dualpi2_enqueue_skb(nskb, sch, to_free); + if (err == NET_XMIT_SUCCESS) { + /* Compute the backlog adjustment that needs + * to be propagated in the qdisc tree to reflect + * all new skbs successfully enqueued. + */ + ++cnt; + byte_len += nskb->len; + } + } + if (err == NET_XMIT_SUCCESS) { + /* The caller will add the original skb stats to its + * backlog, compensate this. + */ + --cnt; + byte_len -= orig_len; + } + qdisc_tree_reduce_backlog(sch, -cnt, -byte_len); + consume_skb(skb); + return err; + } + return dualpi2_enqueue_skb(skb, sch, to_free); +} + +/* Select the queue from which the next packet can be dequeued, ensuring that + * neither queue can starve the other with a WRR scheduler. 
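+ *
+ * The L queue is served while the WRR credit is <= 0 (or while the C queue
+ * is empty); serving the L queue increases the credit by wc * pkt_len,
+ * serving the C queue decreases it by wl * pkt_len.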
+ * + * The sign of the WRR credit determines the next queue, while the size of + * the dequeued packet determines the magnitude of the WRR credit change. If + * either queue is empty, the WRR credit is kept unchanged. + * + * As the dequeued packet can be dropped later, the caller has to perform the + * qdisc_bstats_update() calls. + */ +static struct sk_buff *dequeue_packet(struct Qdisc *sch, + struct dualpi2_sched_data *q, + int *credit_change, + u64 now) +{ + struct sk_buff *skb = NULL; + int c_len; + + *credit_change = 0; + c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue); + if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) { + skb = __qdisc_dequeue_head(&q->l_queue->q); + WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue)); + if (c_len) + *credit_change = q->c_protection.wc; + qdisc_qstats_backlog_dec(q->l_queue, skb); + /* Keep the global queue size consistent */ + --sch->q.qlen; + q->memory_used -= skb->truesize; + } else if (c_len) { + skb = __qdisc_dequeue_head(&sch->q); + WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch)); + if (qdisc_qlen(q->l_queue)) + *credit_change = ~((s32)q->c_protection.wl) + 1; + q->memory_used -= skb->truesize; + } else { + dualpi2_reset_c_protection(q); + return NULL; + } + *credit_change *= qdisc_pkt_len(skb); + qdisc_qstats_backlog_dec(sch, skb); + return skb; +} + +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb, + u64 now) +{ + u64 qdelay = 0; + + if (q->step.in_packets) + qdelay = qdisc_qlen(q->l_queue); + else + qdelay = dualpi2_sojourn_time(skb, now); + + if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) { + if (!dualpi2_skb_cb(skb)->ect) + /* Drop this non-ECT packet */ + return 1; + if (dualpi2_mark(q, skb)) + ++q->step_marks; + } + qdisc_bstats_update(q->l_queue, skb); + return 0; +} + +static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb, + struct Qdisc *sch) +{ + ++q->deferred_drops.cnt; + q->deferred_drops.len += qdisc_pkt_len(skb); + q->memory_used -= skb->truesize; + consume_skb(skb); + qdisc_qstats_drop(sch); +} + +static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + struct sk_buff *skb; + int credit_change; + u64 now; + + now = ktime_get_ns(); + + while ((skb = dequeue_packet(sch, q, &credit_change, now))) { + if (!q->drop_early && must_drop(sch, q, skb)) { + drop_and_retry(q, skb, sch); + continue; + } + + if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) { + qdisc_qstats_drop(q->l_queue); + drop_and_retry(q, skb, sch); + continue; + } + + q->c_protection.credit += credit_change; + qdisc_bstats_update(sch, skb); + break; + } + + /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0, + * or HTB crashes. + */ + if (q->deferred_drops.cnt && qdisc_qlen(sch)) { + qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt, + q->deferred_drops.len); + q->deferred_drops.cnt = 0; + q->deferred_drops.len = 0; + } + return skb; +} + +static s64 __scale_delta(u64 diff) +{ + do_div(diff, 1 << ALPHA_BETA_GRANULARITY); + return diff; +} + +static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c, + u64 *qdelay_l) +{ + u64 now, qc, ql; + + now = ktime_get_ns(); + qc = READ_ONCE(q->c_head_ts); + ql = READ_ONCE(q->l_head_ts); + + *qdelay_c = qc ? now - qc : 0; + *qdelay_l = ql ? 
now - ql : 0; +} + +static u32 calculate_probability(struct Qdisc *sch) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + u32 new_prob; + u64 qdelay_c; + u64 qdelay_l; + u64 qdelay; + s64 delta; + + get_queue_delays(q, &qdelay_c, &qdelay_l); + qdelay = max(qdelay_l, qdelay_c); + /* Alpha and beta take at most 32b, i.e, the delay difference would + * overflow for queuing delay differences > ~4.2sec. + */ + delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha; + delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta; + if (delta > 0) { + new_prob = __scale_delta(delta) + q->pi2.prob; + if (new_prob < q->pi2.prob) + new_prob = MAX_PROB; + } else { + new_prob = q->pi2.prob - __scale_delta(~delta + 1); + if (new_prob > q->pi2.prob) + new_prob = 0; + } + q->last_qdelay = qdelay; + /* If we do not drop on overload, ensure we cap the L4S probability to + * 100% to keep window fairness when overflowing. + */ + if (!q->drop_overload) + return min_t(u32, new_prob, MAX_PROB / q->coupling_factor); + return new_prob; +} + +static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer) +{ + struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer); + + WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch)); + + hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q)); + return HRTIMER_RESTART; +} + +static struct netlink_range_validation dualpi2_alpha_beta_range = { + .min = 1, + .max = ALPHA_BETA_MAX, +}; + +static struct netlink_range_validation dualpi2_wc_range = { + .min = 0, + .max = MAX_WC, +}; + +static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = { + [TCA_DUALPI2_LIMIT] = NLA_POLICY_MIN(NLA_U32, 1), + [TCA_DUALPI2_MEMORY_LIMIT] = NLA_POLICY_MIN(NLA_U32, 1), + [TCA_DUALPI2_TARGET] = {.type = NLA_U32}, + [TCA_DUALPI2_TUPDATE] = NLA_POLICY_MIN(NLA_U32, 1), + [TCA_DUALPI2_ALPHA] = + NLA_POLICY_FULL_RANGE(NLA_U32, &dualpi2_alpha_beta_range), + [TCA_DUALPI2_BETA] = + NLA_POLICY_FULL_RANGE(NLA_U32, &dualpi2_alpha_beta_range), + [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32}, + [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8}, + [TCA_DUALPI2_COUPLING] = NLA_POLICY_MIN(NLA_U8, 1), + [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8}, + [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8}, + [TCA_DUALPI2_C_PROTECTION] = + NLA_POLICY_FULL_RANGE(NLA_U8, &dualpi2_wc_range), + [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8}, + [TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8}, +}; + +static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[TCA_DUALPI2_MAX + 1]; + struct dualpi2_sched_data *q; + int old_backlog; + int old_qlen; + int err; + + if (!opt) + return -EINVAL; + err = nla_parse_nested(tb, TCA_DUALPI2_MAX, opt, dualpi2_policy, + extack); + if (err < 0) + return err; + + q = qdisc_priv(sch); + sch_tree_lock(sch); + + if (tb[TCA_DUALPI2_LIMIT]) { + u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]); + + WRITE_ONCE(sch->limit, limit); + WRITE_ONCE(q->memory_limit, limit * psched_mtu(qdisc_dev(sch))); + } + + if (tb[TCA_DUALPI2_MEMORY_LIMIT]) + WRITE_ONCE(q->memory_limit, + nla_get_u32(tb[TCA_DUALPI2_MEMORY_LIMIT])); + + if (tb[TCA_DUALPI2_TARGET]) { + u64 target = nla_get_u32(tb[TCA_DUALPI2_TARGET]); + + WRITE_ONCE(q->pi2.target, target * NSEC_PER_USEC); + } + + if (tb[TCA_DUALPI2_TUPDATE]) { + u64 tupdate = nla_get_u32(tb[TCA_DUALPI2_TUPDATE]); + + WRITE_ONCE(q->pi2.tupdate, tupdate * NSEC_PER_USEC); + } + + if (tb[TCA_DUALPI2_ALPHA]) { + u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]); + + WRITE_ONCE(q->pi2.alpha, dualpi2_scale_alpha_beta(alpha)); 
+ } + + if (tb[TCA_DUALPI2_BETA]) { + u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]); + + WRITE_ONCE(q->pi2.beta, dualpi2_scale_alpha_beta(beta)); + } + + if (tb[TCA_DUALPI2_STEP_PACKETS]) { + bool step_pkt = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]); + u32 step_th = READ_ONCE(q->step.thresh); + + WRITE_ONCE(q->step.in_packets, step_pkt); + WRITE_ONCE(q->step.thresh, + step_pkt ? step_th : (step_th * NSEC_PER_USEC)); + } + + if (tb[TCA_DUALPI2_STEP_THRESH]) { + u32 step_th = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]); + bool step_pkt = READ_ONCE(q->step.in_packets); + + WRITE_ONCE(q->step.thresh, + step_pkt ? step_th : (step_th * NSEC_PER_USEC)); + } + + if (tb[TCA_DUALPI2_COUPLING]) { + u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]); + + WRITE_ONCE(q->coupling_factor, coupling); + } + + if (tb[TCA_DUALPI2_DROP_OVERLOAD]) + WRITE_ONCE(q->drop_overload, + !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD])); + + if (tb[TCA_DUALPI2_DROP_EARLY]) + WRITE_ONCE(q->drop_early, + !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY])); + + if (tb[TCA_DUALPI2_C_PROTECTION]) { + u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]); + + dualpi2_calculate_c_protection(sch, q, wc); + } + + if (tb[TCA_DUALPI2_ECN_MASK]) + WRITE_ONCE(q->ecn_mask, + nla_get_u8(tb[TCA_DUALPI2_ECN_MASK])); + + if (tb[TCA_DUALPI2_SPLIT_GSO]) + WRITE_ONCE(q->split_gso, + !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO])); + + old_qlen = qdisc_qlen(sch); + old_backlog = sch->qstats.backlog; + while (qdisc_qlen(sch) > sch->limit || + q->memory_used > q->memory_limit) { + struct sk_buff *skb = __qdisc_dequeue_head(&sch->q); + + q->memory_used -= skb->truesize; + qdisc_qstats_backlog_dec(sch, skb); + rtnl_qdisc_drop(skb, sch); + } + qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch), + old_backlog - sch->qstats.backlog); + + sch_tree_unlock(sch); + return 0; +} + +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. 
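+ * They correspond to the scaled constants set below: 41/256 ~= 0.16 Hz for
+ * alpha and 819/256 ~= 3.20 Hz for beta.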
*/ +static void dualpi2_reset_default(struct Qdisc *sch) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + + q->sch->limit = 10000; /* Max 125ms at 1Gbps */ + q->memory_limit = q->sch->limit * psched_mtu(qdisc_dev(sch)); + + q->pi2.target = 15 * NSEC_PER_MSEC; + q->pi2.tupdate = 16 * NSEC_PER_MSEC; + q->pi2.alpha = dualpi2_scale_alpha_beta(41); /* ~0.16 Hz * 256 */ + q->pi2.beta = dualpi2_scale_alpha_beta(819); /* ~3.20 Hz * 256 */ + + q->step.thresh = 1 * NSEC_PER_MSEC; + q->step.in_packets = false; + + dualpi2_calculate_c_protection(q->sch, q, 10); /* wc=10%, wl=90% */ + + q->ecn_mask = INET_ECN_ECT_1; + q->coupling_factor = 2; /* window fairness for equal RTTs */ + q->drop_overload = true; /* Preserve latency by dropping */ + q->drop_early = false; /* PI2 drops on dequeue */ + q->split_gso = true; +} + +static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt, + struct netlink_ext_ack *extack) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + int err; + + q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops, + TC_H_MAKE(sch->handle, 1), extack); + if (!q->l_queue) + return -ENOMEM; + + err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack); + if (err) + return err; + + q->sch = sch; + dualpi2_reset_default(sch); + hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); + q->pi2.timer.function = dualpi2_timer; + + if (opt) { + err = dualpi2_change(sch, opt, extack); + + if (err) + return err; + } + + hrtimer_start(&q->pi2.timer, next_pi2_timeout(q), + HRTIMER_MODE_ABS_PINNED); + return 0; +} + +static u32 convert_ns_to_usec(u64 ns) +{ + do_div(ns, NSEC_PER_USEC); + return ns; +} + +static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + struct nlattr *opts; + + opts = nla_nest_start_noflag(skb, TCA_OPTIONS); + if (!opts) + goto nla_put_failure; + + if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, READ_ONCE(sch->limit)) || + nla_put_u32(skb, TCA_DUALPI2_MEMORY_LIMIT, + READ_ONCE(q->memory_limit)) || + nla_put_u32(skb, TCA_DUALPI2_TARGET, + convert_ns_to_usec(READ_ONCE(q->pi2.target))) || + nla_put_u32(skb, TCA_DUALPI2_TUPDATE, + convert_ns_to_usec(READ_ONCE(q->pi2.tupdate))) || + nla_put_u32(skb, TCA_DUALPI2_ALPHA, + dualpi2_unscale_alpha_beta(READ_ONCE(q->pi2.alpha))) || + nla_put_u32(skb, TCA_DUALPI2_BETA, + dualpi2_unscale_alpha_beta(READ_ONCE(q->pi2.beta))) || + nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, + READ_ONCE(q->step.in_packets) ? 
+ READ_ONCE(q->step.thresh) : + convert_ns_to_usec(READ_ONCE(q->step.thresh))) || + nla_put_u8(skb, TCA_DUALPI2_COUPLING, + READ_ONCE(q->coupling_factor)) || + nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, + READ_ONCE(q->drop_overload)) || + nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, + READ_ONCE(q->step.in_packets)) || + nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, + READ_ONCE(q->drop_early)) || + nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, + READ_ONCE(q->c_protection.wc)) || + nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, READ_ONCE(q->ecn_mask)) || + nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, READ_ONCE(q->split_gso))) + goto nla_put_failure; + + return nla_nest_end(skb, opts); + +nla_put_failure: + nla_nest_cancel(skb, opts); + return -1; +} + +static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + struct tc_dualpi2_xstats st = { + .prob = READ_ONCE(q->pi2.prob), + .packets_in_c = q->packets_in_c, + .packets_in_l = q->packets_in_l, + .maxq = q->maxq, + .ecn_mark = q->ecn_mark, + .credit = q->c_protection.credit, + .step_marks = q->step_marks, + .memory_used = q->memory_used, + .max_memory_used = q->max_memory_used, + .memory_limit = q->memory_limit, + }; + u64 qc, ql; + + get_queue_delays(q, &qc, &ql); + st.delay_l = convert_ns_to_usec(ql); + st.delay_c = convert_ns_to_usec(qc); + return gnet_stats_copy_app(d, &st, sizeof(st)); +} + +/* Reset both L-queue and C-queue, internal packet counters, PI probability, + * C-queue protection credit, and timestamps, while preserving current + * configuration of DUALPI2. + */ +static void dualpi2_reset(struct Qdisc *sch) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + + qdisc_reset_queue(sch); + qdisc_reset_queue(q->l_queue); + q->c_head_ts = 0; + q->l_head_ts = 0; + q->pi2.prob = 0; + q->packets_in_c = 0; + q->packets_in_l = 0; + q->maxq = 0; + q->ecn_mark = 0; + q->step_marks = 0; + q->memory_used = 0; + q->max_memory_used = 0; + dualpi2_reset_c_protection(q); +} + +static void dualpi2_destroy(struct Qdisc *sch) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + + q->pi2.tupdate = 0; + hrtimer_cancel(&q->pi2.timer); + if (q->l_queue) + qdisc_put(q->l_queue); + tcf_block_put(q->tcf.block); +} + +static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg) +{ + return NULL; +} + +static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid) +{ + return 0; +} + +static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent, + u32 classid) +{ + return 0; +} + +static void dualpi2_unbind(struct Qdisc *q, unsigned long cl) +{ +} + +static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl, + struct netlink_ext_ack *extack) +{ + struct dualpi2_sched_data *q = qdisc_priv(sch); + + if (cl) + return NULL; + return q->tcf.block; +} + +static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg) +{ + unsigned int i; + + if (arg->stop) + return; + + /* We statically define only 2 queues */ + for (i = 0; i < 2; i++) { + if (arg->count < arg->skip) { + arg->count++; + continue; + } + if (arg->fn(sch, i + 1, arg) < 0) { + arg->stop = 1; + break; + } + arg->count++; + } +} + +/* Minimal class support to handler tc filters */ +static const struct Qdisc_class_ops dualpi2_class_ops = { + .leaf = dualpi2_leaf, + .find = dualpi2_find, + .tcf_block = dualpi2_tcf_block, + .bind_tcf = dualpi2_bind, + .unbind_tcf = dualpi2_unbind, + .walk = dualpi2_walk, +}; + +static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = { + .id = "dualpi2", + .cl_ops = 
&dualpi2_class_ops, + .priv_size = sizeof(struct dualpi2_sched_data), + .enqueue = dualpi2_qdisc_enqueue, + .dequeue = dualpi2_qdisc_dequeue, + .peek = qdisc_peek_dequeued, + .init = dualpi2_init, + .destroy = dualpi2_destroy, + .reset = dualpi2_reset, + .change = dualpi2_change, + .dump = dualpi2_dump, + .dump_stats = dualpi2_dump_stats, + .owner = THIS_MODULE, +}; + +static int __init dualpi2_module_init(void) +{ + return register_qdisc(&dualpi2_qdisc_ops); +} + +static void __exit dualpi2_module_exit(void) +{ + unregister_qdisc(&dualpi2_qdisc_ops); +} + +module_init(dualpi2_module_init); +module_exit(dualpi2_module_exit); + +MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler"); +MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>"); +MODULE_AUTHOR("Olga Albisser <olga@albisser.org>"); +MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>"); +MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>"); +MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>"); + +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION("1.0");
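
For reviewers who want to sanity-check the controller and coupling arithmetic outside the kernel, here is a minimal floating-point user-space sketch. It is not part of the patch: the file and variable names are local to the sketch, the qdisc's fixed-point scaling is ignored, and the default gains and target from dualpi2_reset_default() above are assumed.

/* pi2_sketch.c - standalone illustration only, NOT part of the patch.
 * Floating-point sketch of one PI2 update (cf. calculate_probability())
 * followed by the coupled per-queue probabilities. The qdisc's fixed-point
 * MAX_PROB/alpha-beta scaling is deliberately ignored.
 */
#include <stdio.h>

int main(void)
{
	/* Defaults from dualpi2_reset_default(), gains applied once per
	 * tupdate (16 ms) interval.
	 */
	const double target = 0.015;	/* 15 ms classic target delay */
	const double alpha = 0.16;	/* integral gain */
	const double beta = 3.2;	/* proportional gain */
	const double k = 2.0;		/* coupling factor */

	double p = 0.05;		/* PI output after the last update */
	double qdelay_old = 0.015;	/* queue delay at the last update */
	double qdelay = 0.020;		/* max(C delay, L delay) now: 20 ms */

	/* One PI update: integral term on the error vs. the target,
	 * proportional term on the delay change since the last update.
	 */
	p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old);

	/* Coupling: the classic queue effectively sees p squared (two
	 * independent rolls of p in the qdisc), the L queue sees k * p,
	 * saturated at 100%.
	 */
	double p_c = p * p;
	double p_l = k * p > 1.0 ? 1.0 : k * p;

	printf("p=%.4f p_C=%.4f p_L=%.4f\n", p, p_c, p_l);
	return 0;
}

For the inputs chosen, this should print approximately p=0.0668, p_C=0.0045 and p_L=0.1336: a single controller output keeps the classic drop/mark probability low (squared) while the L queue is marked often enough to balance flow rates across the two queues.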