diff mbox series

[net-next,2/2] fq_codel: implement L4S style ce_threshold_ect1 marking

Message ID 20211014175918.60188-3-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Commit e72aeb9ee0e34c57dc90793d0bf82cab9624d64e
Delegated to: Netdev Maintainers
Headers show
Series net/sched: implement L4S style ce_threshold_ect1 marking | expand

Checks

Context Check Description
netdev/cover_letter success Series has a cover letter
netdev/fixes_present success Fixes tag not required for -next series
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 5 maintainers not CCed: jhs@mojatatu.com linux-wireless@vger.kernel.org xiyou.wangcong@gmail.com johannes@sipsolutions.net jiri@resnulli.us
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 4806 this patch: 4806
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/verify_fixes success No Fixes tag
netdev/checkpatch warning WARNING: line length of 96 exceeds 80 columns WARNING: line length of 98 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 4872 this patch: 4872
netdev/header_inline success No static functions without inline keyword in header files

Commit Message

Eric Dumazet Oct. 14, 2021, 5:59 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.

If enabled, only packets with ECT(1) can be transformed to CE
if their sojourn time is above the ce_threshold.

Note that this new option does not change rules for codel law.
In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
the default when fq_codel qdisc is created), ECT(0) packets can
still get CE if codel law (as governed by limit/target) decides so.

Section 4.3.b of current draft [1] states:

b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
    be used for L4S.  For instance within each queue of an FQ-CoDel
    system, as well as a CoDel AQM, there is typically also ECN
    marking at an immediate (unsmoothed) shallow threshold to support
    use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
    modified so that the shallow threshold is solely applied to
    ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
    packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
    applied; while if there is a flow of ECT(1) packets in the queue,
    the shallower (typically sub-millisecond) threshold is applied.

Tested:

tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec

netperf ... -t TCP_STREAM -- K dctcp

tc -s -d qd sh dev eth1
qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
 Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
 backlog 0b 0p requeues 152013
  maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
  new_flows_len 0 old_flows_len 0

[1] L4S current draft:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
Cc: Tom Henderson <tomh@tomh.org>
Cc: Bob Briscoe <in@bobbriscoe.net>
---
 include/net/codel.h            |  2 ++
 include/net/codel_impl.h       | 18 +++++++++++++++---
 include/uapi/linux/pkt_sched.h |  1 +
 net/mac80211/sta_info.c        |  1 +
 net/sched/sch_fq_codel.c       | 15 +++++++++++----
 5 files changed, 30 insertions(+), 7 deletions(-)

Comments

Toke Høiland-Jørgensen Oct. 14, 2021, 7:54 p.m. UTC | #1
Eric Dumazet <eric.dumazet@gmail.com> writes:

> From: Eric Dumazet <edumazet@google.com>
>
> Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
> Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
>
> If enabled, only packets with ECT(1) can be transformed to CE
> if their sojourn time is above the ce_threshold.
>
> Note that this new option does not change rules for codel law.
> In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
> the default when fq_codel qdisc is created), ECT(0) packets can
> still get CE if codel law (as governed by limit/target) decides so.

The ability to have certain packets receive a shallow marking threshold
and others regular ECN semantics is no doubt useful. However, given that
it is by no means certain how the L4S experiment will pan out (and I for
one remain sceptical that the real-world benefits will turn out to match
the tech demos), I think it's premature to bake the ECT(1) semantics
into UAPI.

So how about tying this behaviour to a configurable skb->mark instead?
That way users can get the shallow marking behaviour for any subset of
packets they want, simply by installing a suitable filter on the
qdisc...

-Toke
Eric Dumazet Oct. 14, 2021, 9:35 p.m. UTC | #2
On Thu, Oct 14, 2021 at 12:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
> > Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
> >
> > If enabled, only packets with ECT(1) can be transformed to CE
> > if their sojourn time is above the ce_threshold.
> >
> > Note that this new option does not change rules for codel law.
> > In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
> > the default when fq_codel qdisc is created), ECT(0) packets can
> > still get CE if codel law (as governed by limit/target) decides so.
>
> The ability to have certain packets receive a shallow marking threshold
> and others regular ECN semantics is no doubt useful. However, given that
> it is by no means certain how the L4S experiment will pan out (and I for
> one remain sceptical that the real-world benefits will turn out to match
> the tech demos), I think it's premature to bake the ECT(1) semantics
> into UAPI.

Chicken and egg problem.
We had fq_codel in linux kernel years before RFC after all :)

>
> So how about tying this behaviour to a configurable skb->mark instead?
> That way users can get the shallow marking behaviour for any subset of
> packets they want, simply by installing a suitable filter on the
> qdisc...

This seems an idea, but do you really expect users installing a sophisticated
filter ? Please provide more details, and cost analysis.
(Having to install a filter is probably more expensive than testing a boolean,
after the sojourn time  has exceeded the threshold)

Given that INET_ECN_set_ce(skb) only operates on ECT(1) and ECT(0),
I guess we could  use a bitmask of two bits so that users can decide
which code points can become CE.
Toke Høiland-Jørgensen Oct. 14, 2021, 11:24 p.m. UTC | #3
Eric Dumazet <edumazet@google.com> writes:

> On Thu, Oct 14, 2021 at 12:54 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <eric.dumazet@gmail.com> writes:
>>
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
>> > Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
>> >
>> > If enabled, only packets with ECT(1) can be transformed to CE
>> > if their sojourn time is above the ce_threshold.
>> >
>> > Note that this new option does not change rules for codel law.
>> > In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
>> > the default when fq_codel qdisc is created), ECT(0) packets can
>> > still get CE if codel law (as governed by limit/target) decides so.
>>
>> The ability to have certain packets receive a shallow marking threshold
>> and others regular ECN semantics is no doubt useful. However, given that
>> it is by no means certain how the L4S experiment will pan out (and I for
>> one remain sceptical that the real-world benefits will turn out to match
>> the tech demos), I think it's premature to bake the ECT(1) semantics
>> into UAPI.
>
> Chicken and egg problem.
> We had fq_codel in linux kernel years before RFC after all :)

Sure, but fq_codel is a self-contained algorithm, it doesn't add new
meanings to bits of the IP header... :)

>> So how about tying this behaviour to a configurable skb->mark instead?
>> That way users can get the shallow marking behaviour for any subset of
>> packets they want, simply by installing a suitable filter on the
>> qdisc...
>
> This seems an idea, but do you really expect users installing a sophisticated
> filter ? Please provide more details, and cost analysis.

Not sure it's that sophisticated; pretty simple to do with tc-u32
(although it's complicated a bit by having to restore the default
hashing behaviour of fq_codel with a second filter). Something like:

# tc qdisc replace dev $DEV handle 1: fq_codel
# tc filter add dev $DEV parent 1: pref 1 protocol ipv6 u32 match u32 00100000 00100000 action skbedit mark 2 continue
# tc filter add dev $DEV parent 1: pref 2 protocol ip u32 match ip dsfield 1 1 action skbedit mark 2 continue
# tc filter add dev $DEV parent 1: handle 1 pref 3 protocol all flow hash keys src,dst,proto,proto-src,proto-dst divisor 1024

or one could write a single BPF program that combines all three to save
some cycles walking the filter chain.

> (Having to install a filter is probably more expensive than testing a
> boolean, after the sojourn time has exceeded the threshold)

No doubt, all other things being equal. But odds are they're not: if
you're already running a BPF filter somewhere in the path, adding the
logic above to an existing filter reduces it back down to a couple of
boolean comparisons, for instance.

But even if it does add a bit of overhead, IMO the flexibility makes up
for this. We can always revisit it if L4S becomes a standards-track RFC
at some point :)

> Given that INET_ECN_set_ce(skb) only operates on ECT(1) and ECT(0),
> I guess we could  use a bitmask of two bits so that users can decide
> which code points can become CE.

That would be an improvement. But if we're doing bitmasks, and since the
code is reading the whole dsfield anyway, why not extend that bitmask to
the whole dsfield?

-Toke
Bob Briscoe Oct. 15, 2021, 12:59 p.m. UTC | #4
Eric,

Because the threshold is in time units, I suggest the condition for 
exceeding it needs to be AND'd with (*backlog > mtu), otherwise you can 
get 100% solid marking at low link rates.

When ce_threshold is for DCs, low link rates are unlikely.
However, given ce_threshold_ect1 is mainly for the Internet, during 
testing with 1ms threshold we encountered solid marking at low link 
rates, so we had to add a 1 packet floor:
https://bobbriscoe.net/projects/latency/dctth_journal_draft20190726.pdf

Sorry to chime in after your patch went to net-next.


Bob


On 14/10/2021 18:59, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
> Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
>
> If enabled, only packets with ECT(1) can be transformed to CE
> if their sojourn time is above the ce_threshold.
>
> Note that this new option does not change rules for codel law.
> In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
> the default when fq_codel qdisc is created), ECT(0) packets can
> still get CE if codel law (as governed by limit/target) decides so.
>
> Section 4.3.b of current draft [1] states:
>
> b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
>      be used for L4S.  For instance within each queue of an FQ-CoDel
>      system, as well as a CoDel AQM, there is typically also ECN
>      marking at an immediate (unsmoothed) shallow threshold to support
>      use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
>      modified so that the shallow threshold is solely applied to
>      ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
>      packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
>      applied; while if there is a flow of ECT(1) packets in the queue,
>      the shallower (typically sub-millisecond) threshold is applied.
>
> Tested:
>
> tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec
>
> netperf ... -t TCP_STREAM -- K dctcp
>
> tc -s -d qd sh dev eth1
> qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
>   Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
>   backlog 0b 0p requeues 152013
>    maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
>    new_flows_len 0 old_flows_len 0
>
> [1] L4S current draft:
> https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
> Cc: Tom Henderson <tomh@tomh.org>
> Cc: Bob Briscoe <in@bobbriscoe.net>
> ---
>   include/net/codel.h            |  2 ++
>   include/net/codel_impl.h       | 18 +++++++++++++++---
>   include/uapi/linux/pkt_sched.h |  1 +
>   net/mac80211/sta_info.c        |  1 +
>   net/sched/sch_fq_codel.c       | 15 +++++++++++----
>   5 files changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/codel.h b/include/net/codel.h
> index a6e428f801350809322aaff08d92904e059c3b5a..5e8b181b76b829d6af3c57809d9bc5f0578dd112 100644
> --- a/include/net/codel.h
> +++ b/include/net/codel.h
> @@ -102,6 +102,7 @@ static inline u32 codel_time_to_us(codel_time_t val)
>    * @interval:	width of moving time window
>    * @mtu:	device mtu, or minimal queue backlog in bytes.
>    * @ecn:	is Explicit Congestion Notification enabled
> + * @ce_threshold_ect1: if ce_threshold only marks ECT(1) packets
>    */
>   struct codel_params {
>   	codel_time_t	target;
> @@ -109,6 +110,7 @@ struct codel_params {
>   	codel_time_t	interval;
>   	u32		mtu;
>   	bool		ecn;
> +	bool		ce_threshold_ect1;
>   };
>   
>   /**
> diff --git a/include/net/codel_impl.h b/include/net/codel_impl.h
> index d289b91dcd65ecdc96dc0c9bf85d4a4be6961022..7af2c3eb3c43c24364519120aad5be77522854a6 100644
> --- a/include/net/codel_impl.h
> +++ b/include/net/codel_impl.h
> @@ -54,6 +54,7 @@ static void codel_params_init(struct codel_params *params)
>   	params->interval = MS2TIME(100);
>   	params->target = MS2TIME(5);
>   	params->ce_threshold = CODEL_DISABLED_THRESHOLD;
> +	params->ce_threshold_ect1 = false;
>   	params->ecn = false;
>   }
>   
> @@ -246,9 +247,20 @@ static struct sk_buff *codel_dequeue(void *ctx,
>   						    vars->rec_inv_sqrt);
>   	}
>   end:
> -	if (skb && codel_time_after(vars->ldelay, params->ce_threshold) &&
> -	    INET_ECN_set_ce(skb))
> -		stats->ce_mark++;
> +	if (skb && codel_time_after(vars->ldelay, params->ce_threshold)) {
> +		bool set_ce = true;
> +
> +		if (params->ce_threshold_ect1) {
> +			/* Note: if skb_get_dsfield() returns -1, following
> +			 * gives INET_ECN_MASK, which is != INET_ECN_ECT_1.
> +			 */
> +			u8 ecn = skb_get_dsfield(skb) & INET_ECN_MASK;
> +
> +			set_ce = (ecn == INET_ECN_ECT_1);
> +		}
> +		if (set_ce && INET_ECN_set_ce(skb))
> +			stats->ce_mark++;
> +	}
>   	return skb;
>   }
>   
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index ec88590b3198441f18cc9def7bd40c48f0bc82a1..6be9a84cccfa79bace1f3f7123d02f484b67a25e 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -840,6 +840,7 @@ enum {
>   	TCA_FQ_CODEL_CE_THRESHOLD,
>   	TCA_FQ_CODEL_DROP_BATCH_SIZE,
>   	TCA_FQ_CODEL_MEMORY_LIMIT,
> +	TCA_FQ_CODEL_CE_THRESHOLD_ECT1,
>   	__TCA_FQ_CODEL_MAX
>   };
>   
> diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
> index 2b5acb37587f7068e2d11fe842ec963a556f1eb1..a39830418434d4bb74d238373f63a4858230fce5 100644
> --- a/net/mac80211/sta_info.c
> +++ b/net/mac80211/sta_info.c
> @@ -513,6 +513,7 @@ struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata,
>   	sta->cparams.target = MS2TIME(20);
>   	sta->cparams.interval = MS2TIME(100);
>   	sta->cparams.ecn = true;
> +	sta->cparams.ce_threshold_ect1 = false;
>   
>   	sta_dbg(sdata, "Allocated STA %pM\n", sta->sta.addr);
>   
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index bb0cd6d3d2c2749d54e26368fb2558beedea85c9..033d65d06eb136ff704cddd3ee950a5c3a5d9831 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -362,6 +362,7 @@ static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {
>   	[TCA_FQ_CODEL_CE_THRESHOLD] = { .type = NLA_U32 },
>   	[TCA_FQ_CODEL_DROP_BATCH_SIZE] = { .type = NLA_U32 },
>   	[TCA_FQ_CODEL_MEMORY_LIMIT] = { .type = NLA_U32 },
> +	[TCA_FQ_CODEL_CE_THRESHOLD_ECT1] = { .type = NLA_U8 },
>   };
>   
>   static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
> @@ -408,6 +409,9 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
>   		q->cparams.ce_threshold = (val * NSEC_PER_USEC) >> CODEL_SHIFT;
>   	}
>   
> +	if (tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1])
> +		q->cparams.ce_threshold_ect1 = !!nla_get_u8(tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1]);
> +
>   	if (tb[TCA_FQ_CODEL_INTERVAL]) {
>   		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
>   
> @@ -544,10 +548,13 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
>   			q->flows_cnt))
>   		goto nla_put_failure;
>   
> -	if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD &&
> -	    nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
> -			codel_time_to_us(q->cparams.ce_threshold)))
> -		goto nla_put_failure;
> +	if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD) {
> +		if (nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
> +				codel_time_to_us(q->cparams.ce_threshold)))
> +			goto nla_put_failure;
> +		if (nla_put_u8(skb, TCA_FQ_CODEL_CE_THRESHOLD_ECT1, q->cparams.ce_threshold_ect1))
> +			goto nla_put_failure;
> +	}
>   
>   	return nla_nest_end(skb, opts);
>
Eric Dumazet Oct. 15, 2021, 2:08 p.m. UTC | #5
On Fri, Oct 15, 2021 at 5:59 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
>
> Eric,
>
> Because the threshold is in time units, I suggest the condition for
> exceeding it needs to be AND'd with (*backlog > mtu), otherwise you can
> get 100% solid marking at low link rates.
>
> When ce_threshold is for DCs, low link rates are unlikely.
> However, given ce_threshold_ect1 is mainly for the Internet, during
> testing with 1ms threshold we encountered solid marking at low link
> rates, so we had to add a 1 packet floor:
> https://bobbriscoe.net/projects/latency/dctth_journal_draft20190726.pdf
>
> Sorry to chime in after your patch went to net-next.
>

What you describe about a minimal backlog was already there with
ce_threshold handling ?

Or is it something exclusive to L4S ?

This deserves a separate patch, if anything :)


>
> Bob
>
>
> On 14/10/2021 18:59, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
> > Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
> >
> > If enabled, only packets with ECT(1) can be transformed to CE
> > if their sojourn time is above the ce_threshold.
> >
> > Note that this new option does not change rules for codel law.
> > In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
> > the default when fq_codel qdisc is created), ECT(0) packets can
> > still get CE if codel law (as governed by limit/target) decides so.
> >
> > Section 4.3.b of current draft [1] states:
> >
> > b.  A scheduler with per-flow queues such as FQ-CoDel or FQ-PIE can
> >      be used for L4S.  For instance within each queue of an FQ-CoDel
> >      system, as well as a CoDel AQM, there is typically also ECN
> >      marking at an immediate (unsmoothed) shallow threshold to support
> >      use in data centres (see Sec.5.2.7 of [RFC8290]).  This can be
> >      modified so that the shallow threshold is solely applied to
> >      ECT(1) packets.  Then if there is a flow of non-ECN or ECT(0)
> >      packets in the per-flow-queue, the Classic AQM (e.g.  CoDel) is
> >      applied; while if there is a flow of ECT(1) packets in the queue,
> >      the shallower (typically sub-millisecond) threshold is applied.
> >
> > Tested:
> >
> > tc qd replace dev eth1 root fq_codel ce_threshold_ect1 50usec
> >
> > netperf ... -t TCP_STREAM -- K dctcp
> >
> > tc -s -d qd sh dev eth1
> > qdisc fq_codel 8022: root refcnt 32 limit 10240p flows 1024 quantum 9212 target 5ms ce_threshold_ect1 49us interval 100ms memory_limit 32Mb ecn drop_batch 64
> >   Sent 14388596616 bytes 9543449 pkt (dropped 0, overlimits 0 requeues 152013)
> >   backlog 0b 0p requeues 152013
> >    maxpacket 68130 drop_overlimit 0 new_flow_count 95678 ecn_mark 0 ce_mark 7639
> >    new_flows_len 0 old_flows_len 0
> >
> > [1] L4S current draft:
> > https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-l4s-arch
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Ingemar Johansson S <ingemar.s.johansson@ericsson.com>
> > Cc: Tom Henderson <tomh@tomh.org>
> > Cc: Bob Briscoe <in@bobbriscoe.net>
> > ---
> >   include/net/codel.h            |  2 ++
> >   include/net/codel_impl.h       | 18 +++++++++++++++---
> >   include/uapi/linux/pkt_sched.h |  1 +
> >   net/mac80211/sta_info.c        |  1 +
> >   net/sched/sch_fq_codel.c       | 15 +++++++++++----
> >   5 files changed, 30 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/net/codel.h b/include/net/codel.h
> > index a6e428f801350809322aaff08d92904e059c3b5a..5e8b181b76b829d6af3c57809d9bc5f0578dd112 100644
> > --- a/include/net/codel.h
> > +++ b/include/net/codel.h
> > @@ -102,6 +102,7 @@ static inline u32 codel_time_to_us(codel_time_t val)
> >    * @interval:       width of moving time window
> >    * @mtu:    device mtu, or minimal queue backlog in bytes.
> >    * @ecn:    is Explicit Congestion Notification enabled
> > + * @ce_threshold_ect1: if ce_threshold only marks ECT(1) packets
> >    */
> >   struct codel_params {
> >       codel_time_t    target;
> > @@ -109,6 +110,7 @@ struct codel_params {
> >       codel_time_t    interval;
> >       u32             mtu;
> >       bool            ecn;
> > +     bool            ce_threshold_ect1;
> >   };
> >
> >   /**
> > diff --git a/include/net/codel_impl.h b/include/net/codel_impl.h
> > index d289b91dcd65ecdc96dc0c9bf85d4a4be6961022..7af2c3eb3c43c24364519120aad5be77522854a6 100644
> > --- a/include/net/codel_impl.h
> > +++ b/include/net/codel_impl.h
> > @@ -54,6 +54,7 @@ static void codel_params_init(struct codel_params *params)
> >       params->interval = MS2TIME(100);
> >       params->target = MS2TIME(5);
> >       params->ce_threshold = CODEL_DISABLED_THRESHOLD;
> > +     params->ce_threshold_ect1 = false;
> >       params->ecn = false;
> >   }
> >
> > @@ -246,9 +247,20 @@ static struct sk_buff *codel_dequeue(void *ctx,
> >                                                   vars->rec_inv_sqrt);
> >       }
> >   end:
> > -     if (skb && codel_time_after(vars->ldelay, params->ce_threshold) &&
> > -         INET_ECN_set_ce(skb))
> > -             stats->ce_mark++;
> > +     if (skb && codel_time_after(vars->ldelay, params->ce_threshold)) {
> > +             bool set_ce = true;
> > +
> > +             if (params->ce_threshold_ect1) {
> > +                     /* Note: if skb_get_dsfield() returns -1, following
> > +                      * gives INET_ECN_MASK, which is != INET_ECN_ECT_1.
> > +                      */
> > +                     u8 ecn = skb_get_dsfield(skb) & INET_ECN_MASK;
> > +
> > +                     set_ce = (ecn == INET_ECN_ECT_1);
> > +             }
> > +             if (set_ce && INET_ECN_set_ce(skb))
> > +                     stats->ce_mark++;
> > +     }
> >       return skb;
> >   }
> >
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index ec88590b3198441f18cc9def7bd40c48f0bc82a1..6be9a84cccfa79bace1f3f7123d02f484b67a25e 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> > @@ -840,6 +840,7 @@ enum {
> >       TCA_FQ_CODEL_CE_THRESHOLD,
> >       TCA_FQ_CODEL_DROP_BATCH_SIZE,
> >       TCA_FQ_CODEL_MEMORY_LIMIT,
> > +     TCA_FQ_CODEL_CE_THRESHOLD_ECT1,
> >       __TCA_FQ_CODEL_MAX
> >   };
> >
> > diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
> > index 2b5acb37587f7068e2d11fe842ec963a556f1eb1..a39830418434d4bb74d238373f63a4858230fce5 100644
> > --- a/net/mac80211/sta_info.c
> > +++ b/net/mac80211/sta_info.c
> > @@ -513,6 +513,7 @@ struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata,
> >       sta->cparams.target = MS2TIME(20);
> >       sta->cparams.interval = MS2TIME(100);
> >       sta->cparams.ecn = true;
> > +     sta->cparams.ce_threshold_ect1 = false;
> >
> >       sta_dbg(sdata, "Allocated STA %pM\n", sta->sta.addr);
> >
> > diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> > index bb0cd6d3d2c2749d54e26368fb2558beedea85c9..033d65d06eb136ff704cddd3ee950a5c3a5d9831 100644
> > --- a/net/sched/sch_fq_codel.c
> > +++ b/net/sched/sch_fq_codel.c
> > @@ -362,6 +362,7 @@ static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {
> >       [TCA_FQ_CODEL_CE_THRESHOLD] = { .type = NLA_U32 },
> >       [TCA_FQ_CODEL_DROP_BATCH_SIZE] = { .type = NLA_U32 },
> >       [TCA_FQ_CODEL_MEMORY_LIMIT] = { .type = NLA_U32 },
> > +     [TCA_FQ_CODEL_CE_THRESHOLD_ECT1] = { .type = NLA_U8 },
> >   };
> >
> >   static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
> > @@ -408,6 +409,9 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
> >               q->cparams.ce_threshold = (val * NSEC_PER_USEC) >> CODEL_SHIFT;
> >       }
> >
> > +     if (tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1])
> > +             q->cparams.ce_threshold_ect1 = !!nla_get_u8(tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1]);
> > +
> >       if (tb[TCA_FQ_CODEL_INTERVAL]) {
> >               u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
> >
> > @@ -544,10 +548,13 @@ static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
> >                       q->flows_cnt))
> >               goto nla_put_failure;
> >
> > -     if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD &&
> > -         nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
> > -                     codel_time_to_us(q->cparams.ce_threshold)))
> > -             goto nla_put_failure;
> > +     if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD) {
> > +             if (nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
> > +                             codel_time_to_us(q->cparams.ce_threshold)))
> > +                     goto nla_put_failure;
> > +             if (nla_put_u8(skb, TCA_FQ_CODEL_CE_THRESHOLD_ECT1, q->cparams.ce_threshold_ect1))
> > +                     goto nla_put_failure;
> > +     }
> >
> >       return nla_nest_end(skb, opts);
> >
>
> --
> ________________________________________________________________
> Bob Briscoe                               http://bobbriscoe.net/
>
Neal Cardwell Oct. 15, 2021, 3:49 p.m. UTC | #6
On Fri, Oct 15, 2021 at 10:08 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Oct 15, 2021 at 5:59 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
> >
> > Eric,
> >
> > Because the threshold is in time units, I suggest the condition for
> > exceeding it needs to be AND'd with (*backlog > mtu), otherwise you can
> > get 100% solid marking at low link rates.
> >
> > When ce_threshold is for DCs, low link rates are unlikely.
> > However, given ce_threshold_ect1 is mainly for the Internet, during
> > testing with 1ms threshold we encountered solid marking at low link
> > rates, so we had to add a 1 packet floor:
> > https://bobbriscoe.net/projects/latency/dctth_journal_draft20190726.pdf
> >
> > Sorry to chime in after your patch went to net-next.
> >
>
> What you describe about a minimal backlog was already there with
> ce_threshold handling ?

For my education, do you have a pointer to where the ce_threshold
marking logic has a minimum backlog size requirement in packets or
bytes? AFAICT the ce_threshold marking in include/net/codel_impl.h
happens regardless of the current size of the backlog.

> Or is it something exclusive to L4S ?

I don't think it's exclusive to L4S. I think Bob is raising a general
issue about improving ECN marking based on ce_threshold. My
interpretation of Bob's point is that there is sort of a quantization
issue at very low link speeds, where the serialization delay for a
packet is at or above the ce_threshold delay. In such cases it seems
there can be behavior where the bottleneck marks every packet CE all
the time, causing any ECN-based algorithm (even DCTCP) to suffer poor
utilization.

I suppose with a fixed-speed link the operator could adjust the
ce_threshold based on the serialization delays implied by the link
speed, but perhaps in general this is infeasible due to variable-speed
(e.g., radio) links.

I guess perhaps this could be reproduced/tested with DCTCP (using
ECT(0)), a ce_threshold of 1ms (for ECT(0)), and an emulated
bottleneck link speed with a serialization delay well above 1ms (so a
link speed well below 12Mbps).

> This deserves a separate patch, if anything :)

Agreed, in the Linux development model this would make sense as a
separate patch, since it is conceptually separate and there do not
need to be any dependencies between the two changes. :-)

neal
Jonathan Morton Oct. 16, 2021, 7:39 a.m. UTC | #7
> On 15 Oct, 2021, at 2:24 am, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> 
>>>> Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
>>>> Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
>>>> 
>>>> If enabled, only packets with ECT(1) can be transformed to CE
>>>> if their sojourn time is above the ce_threshold.
>>>> 
>>>> Note that this new option does not change rules for codel law.
>>>> In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
>>>> the default when fq_codel qdisc is created), ECT(0) packets can
>>>> still get CE if codel law (as governed by limit/target) decides so.
>>> 
>>> The ability to have certain packets receive a shallow marking threshold
>>> and others regular ECN semantics is no doubt useful. However, given that
>>> it is by no means certain how the L4S experiment will pan out (and I for
>>> one remain sceptical that the real-world benefits will turn out to match
>>> the tech demos), I think it's premature to bake the ECT(1) semantics
>>> into UAPI.
>> 
>> Chicken and egg problem.
>> We had fq_codel in linux kernel years before RFC after all :)
> 
> Sure, but fq_codel is a self-contained algorithm, it doesn't add new
> meanings to bits of the IP header... :)

I'll be blunter:

In its original (and currently stable) form, fq_codel is RFC-compliant.  It conforms, in particular, to RFC-3168 (ECN).  There's a relatively low threshold for adding RFC-compliant network algorithms to Linux, and it is certainly not required to have a published RFC specifically describing each qdisc's operating principles before it can be upstreamed.  It just so happens that fq_codel (and some other notable algorithms such as CUBIC) proved sufficiently useful in practice to warrant post-hoc documentation in RFC form.

However, this patch adds an option which, when enabled, makes fq_codel *non-compliant* with RFC-3168, specifically the requirement to treat ECT(0) and ECT(1) identically, unless conforming to another published RFC which permits different behaviour.

There is a path via RFC-8311 to experiment with alternative ECN semantics in this way, but the way ECT(1) is used by L4S is specifically mentioned as requiring a published RFC for public deployments.  The L4S Internet Drafts have *just failed* an IETF WGLC, which means they are *not* advancing to publication as RFCs in their current form.  The primary reason for this failure is L4S' fundamental incompatibility with existing Internet traffic, despite its stated goal of general Internet deployment.  It is my considered opinion, indeed, that moving *away* from ECT(1) as the L4S identifier is the best option for improving that compatibility.

I believe there is a much higher threshold required for adding such things to publicly maintained versions of Linux (as opposed to privately maintained experimental versions).

- Jonathan Morton
Bob Briscoe Oct. 17, 2021, 12:42 a.m. UTC | #8
Eric, (thanks Neal)

On 15/10/2021 16:49, Neal Cardwell wrote:
> On Fri, Oct 15, 2021 at 10:08 AM Eric Dumazet <edumazet@google.com> wrote:
>> On Fri, Oct 15, 2021 at 5:59 AM Bob Briscoe <ietf@bobbriscoe.net> wrote:
>>> Eric,
>>>
>>> Because the threshold is in time units, I suggest the condition for
>>> exceeding it needs to be AND'd with (*backlog > mtu), otherwise you can
>>> get 100% solid marking at low link rates.
>>>
>>> When ce_threshold is for DCs, low link rates are unlikely.
>>> However, given ce_threshold_ect1 is mainly for the Internet, during
>>> testing with 1ms threshold we encountered solid marking at low link
>>> rates, so we had to add a 1 packet floor:
>>> https://bobbriscoe.net/projects/latency/dctth_journal_draft20190726.pdf
>>>
>>> Sorry to chime in after your patch went to net-next.
>>>
>> What you describe about a minimal backlog was already there with
>> ce_threshold handling ?
> For my education, do you have a pointer to where the ce_threshold
> marking logic has a minimum backlog size requirement in packets or
> bytes? AFAICT the ce_threshold marking in include/net/codel_impl.h
> happens regardless of the current size of the backlog.

[BB] When I checked before my original posting, the only check for 
single packet backlog was within should_drop() here:
https://elixir.bootlin.com/linux/latest/source/include/net/codel_impl.h#L125
However, whether or not that causes should_drop() to return false, 
codel_dequeue() still always falls through to the ce_threshold marking 
after end:
https://elixir.bootlin.com/linux/latest/source/include/net/codel_impl.h#L249


>
>> Or is it something exclusive to L4S ?
> I don't think it's exclusive to L4S. I think Bob is raising a general
> issue about improving ECN marking based on ce_threshold. My
> interpretation of Bob's point is that there is sort of a quantization
> issue at very low link speeds, where the serialization delay for a
> packet is at or above the ce_threshold delay. In such cases it seems
> there can be behavior where the bottleneck marks every packet CE all
> the time, causing any ECN-based algorithm (even DCTCP) to suffer poor
> utilization.
>
> I suppose with a fixed-speed link the operator could adjust the
> ce_threshold based on the serialization delays implied by the link
> speed, but perhaps in general this is infeasible due to variable-speed
> (e.g., radio) links.
>
> I guess perhaps this could be reproduced/tested with DCTCP (using
> ECT(0)), a ce_threshold of 1ms (for ECT(0)), and an emulated
> bottleneck link speed with a serialization delay well above 1ms (so a
> link speed well below 12Mbps).

[BB] Yes.

>
>> This deserves a separate patch, if anything :)
> Agreed, in the Linux development model this would make sense as a
> separate patch, since it is conceptually separate and there do not
> need to be any dependencies between the two changes. :-)

[BB] Sure. We'll see to it.

The (loose/indirect) dependency I saw was just that ce_threshold_ect1 
opens up the possibility of using the ce_threshold on the public 
Internet not just in DCs. So low rate links become a certainty, rather 
than a mere theoretical possibility.



Bob

>
> neal
Bob Briscoe Oct. 17, 2021, 11:22 a.m. UTC | #9
On 16/10/2021 08:39, Jonathan Morton wrote:
>> On 15 Oct, 2021, at 2:24 am, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>>>>> Add TCA_FQ_CODEL_CE_THRESHOLD_ECT1 boolean option to select Low Latency,
>>>>> Low Loss, Scalable Throughput (L4S) style marking, along with ce_threshold.
>>>>>
>>>>> If enabled, only packets with ECT(1) can be transformed to CE
>>>>> if their sojourn time is above the ce_threshold.
>>>>>
>>>>> Note that this new option does not change rules for codel law.
>>>>> In particular, if TCA_FQ_CODEL_ECN is left enabled (this is
>>>>> the default when fq_codel qdisc is created), ECT(0) packets can
>>>>> still get CE if codel law (as governed by limit/target) decides so.
>>>> The ability to have certain packets receive a shallow marking threshold
>>>> and others regular ECN semantics is no doubt useful. However, given that
>>>> it is by no means certain how the L4S experiment will pan out (and I for
>>>> one remain sceptical that the real-world benefits will turn out to match
>>>> the tech demos), I think it's premature to bake the ECT(1) semantics
>>>> into UAPI.
>>> Chicken and egg problem.
>>> We had fq_codel in linux kernel years before RFC after all :)
>> Sure, but fq_codel is a self-contained algorithm, it doesn't add new
>> meanings to bits of the IP header... :)
> I'll be blunter:
>
> In its original (and currently stable) form, fq_codel is RFC-compliant.  It conforms, in particular, to RFC-3168 (ECN).  There's a relatively low threshold for adding RFC-compliant network algorithms to Linux, and it is certainly not required to have a published RFC specifically describing each qdisc's operating principles before it can be upstreamed.  It just so happens that fq_codel (and some other notable algorithms such as CUBIC) proved sufficiently useful in practice to warrant post-hoc documentation in RFC form.
>
> However, this patch adds an option which, when enabled, makes fq_codel *non-compliant* with RFC-3168, specifically the requirement to treat ECT(0) and ECT(1) identically, unless conforming to another published RFC which permits different behaviour.
>
> There is a path via RFC-8311 to experiment with alternative ECN semantics in this way, but the way ECT(1) is used by L4S is specifically mentioned as requiring a published RFC for public deployments.  The L4S Internet Drafts have *just failed* an IETF WGLC, which means they are *not* advancing to publication as RFCs in their current form.

[BB] Clarification of IETF process: A first Working Group Last Call 
(WGLC) is nearly always the beginning of the end of the IETF's RFC 
publication process. Usually the majority of detailed comments arrive 
during a WGLC. Then the draft has to be fixed, and then it goes either 
directly through to the next stage (in this case, an IETF-wide last 
call), or to another WGLC.

> The primary reason for this failure is L4S' fundamental incompatibility with existing Internet traffic, despite its stated goal of general Internet deployment.

[BB] s/The primary reason /JM's primary objection /
There is no ranking of the reasons for more work being needed.  The WG 
had already developed a way to mitigate this objection. Otherwise, a 
WGLC would not have been started in the first place. Further work on 
this issue is now more likely to be wordsmithing.

I hope this level of brevity was useful for netdev. See tsvwg@ietf.org 
for details.


Bob

> It is my considered opinion, indeed, that moving *away* from ECT(1) as the L4S identifier is the best option for improving that compatibility.
>
> I believe there is a much higher threshold required for adding such things to publicly maintained versions of Linux (as opposed to privately maintained experimental versions).
>
> - Jonathan Morton
Jonathan Morton Oct. 17, 2021, 12:18 p.m. UTC | #10
> On 17 Oct, 2021, at 2:22 pm, Bob Briscoe <ietf@bobbriscoe.net> wrote:
> 
>> I'll be blunter:
>> 
>> In its original (and currently stable) form, fq_codel is RFC-compliant.  It conforms, in particular, to RFC-3168 (ECN).  There's a relatively low threshold for adding RFC-compliant network algorithms to Linux, and it is certainly not required to have a published RFC specifically describing each qdisc's operating principles before it can be upstreamed.  It just so happens that fq_codel (and some other notable algorithms such as CUBIC) proved sufficiently useful in practice to warrant post-hoc documentation in RFC form.
>> 
>> However, this patch adds an option which, when enabled, makes fq_codel *non-compliant* with RFC-3168, specifically the requirement to treat ECT(0) and ECT(1) identically, unless conforming to another published RFC which permits different behaviour.
>> 
>> There is a path via RFC-8311 to experiment with alternative ECN semantics in this way, but the way ECT(1) is used by L4S is specifically mentioned as requiring a published RFC for public deployments.  The L4S Internet Drafts have *just failed* an IETF WGLC, which means they are *not* advancing to publication as RFCs in their current form.
> 
> [BB] Clarification of IETF process: A first Working Group Last Call (WGLC) is nearly always the beginning of the end of the IETF's RFC publication process. Usually the majority of detailed comments arrive during a WGLC. Then the draft has to be fixed, and then it goes either directly through to the next stage (in this case, an IETF-wide last call), or to another WGLC.

Further clarification: this is already the second WGLC for L4S.  The one two years previously (at Montreal) yielded a number of major technical objections, which remained unresolved as of this latest WGLC.

>> The primary reason for this failure is L4S' fundamental incompatibility with existing Internet traffic, despite its stated goal of general Internet deployment.
> 
> [BB] s/The primary reason /JM's primary objection /
> There is no ranking of the reasons for more work being needed.  The WG had already developed a way to mitigate this objection. Otherwise, a WGLC would not have been started in the first place. Further work on this issue is now more likely to be wordsmithing.

Given that the objections cited by the TSVWG Chairs were technical in nature, and related specifically to the incompatibility between L4S and existing conventional traffic, it is clear to me that wordsmithing will *not* be sufficient to render L4S publishable in RFC form, nor deployable at Internet scale.  

To quote David Black, one of the aforementioned Chairs and also an author of RFC-8311:

> Two overall conclusions are that a) the WGLC has been productive, and shows significant continuing support for L4S, and b) the L4S drafts should be revised to address the WGLC concerns raised.   The WG chairs strongly suggest that the revisions include limiting the scope and impact of initial L4S experiments on RFC 3168 functionality (both existing usage and potential deployment) to ensure that the L4S experiments are safe to perform on the Internet, paying particular attention to potential impacts on networks and users that are not participating in the L4S experiments.

It is my recommendation to netdev to stay out of this ongoing mess, by rejecting this patch.

 - Jonathan Morton
Dave Taht Oct. 18, 2021, 11:42 a.m. UTC | #11
Normally I would comment in-line on the patches but was somehow
unsubscribed to netdev as of oct 11. I'm glad more eyeballs
are on this, finally, however.

1) I would prefer this series bake in the l4s and bbrv2 trees a while,
be tested on vms, containers, on cloudy substrates, and home routers.

Note, I just said series. pie, fq_pie, and cake all can use
rfc3168-style ecn. Worse, fq_codel is also used in the wifi stack, but
not as a qdisc.

I think a safer and simpler path forward for existing assumptions and
callers for rfc3168 -style is that the existing INET_ECN_set_ce and
related calls be reworked to exclude ect1.

(as seen here) https://code.woboq.org/linux/linux/include/net/codel_impl.h.html#182

and the existing ce_threshold parameter enabled by default (at some
acceptible threshold) but only on detecting ect_1. That eliminates
the change to the uapi, and more correctly supports both standards.

Not having the ce_threshold param exposed currently in wifi and not
knowing how to make l4s-style signalling (at least vs tcp prague,
maybe not against bbrv2), it seems safest to make ect_1 - > drop reno
style on wifi, presently.

2) I don't know what is supposed to happen with GRO/GSO packets. This
is actually a long standing confusion of mine in the present day
linux stack - if a GRO packet (say an IW10 burst) is marked - do all
the packets get the marking? or just one? Is it driver or hardware
dependent? (software GRO is the bane of my existence)

3) My other long standing confusion is "triggered" by how we do
statistics keeping for packets - ce_mark gets overloaded by this
patch,
and we end up tracking two very different instances of the same idea
in the kernel statistics (much like the confusion along the path).
Given the frequency of marking in the l4s case is much higher than the
rfc3168 case, I'd advocate for a separate, 64 bit statistic.

Please note that in general I find the "packets" stat in the kernel,
given gso/gro, kind of hard to deal with in the first place,
"bytes_marked"
would be a better statistic to track than packets.
Gorry Fairhurst Oct. 18, 2021, 7:43 p.m. UTC | #12
Your  summary seems way off the mark to me, and I suggest you ask Wes Eddy for an actual summary of the current status ... 

Gorry


> On 18 Oct 2021, at 16:43, Jonathan Morton <chromatix99@gmail.com> wrote:
> 
> 
>> 
>>> On 17 Oct, 2021, at 2:22 pm, Bob Briscoe <ietf@bobbriscoe.net> wrote:
>>> 
>>> I'll be blunter:
>>> 
>>> In its original (and currently stable) form, fq_codel is RFC-compliant.  It conforms, in particular, to RFC-3168 (ECN).  There's a relatively low threshold for adding RFC-compliant network algorithms to Linux, and it is certainly not required to have a published RFC specifically describing each qdisc's operating principles before it can be upstreamed.  It just so happens that fq_codel (and some other notable algorithms such as CUBIC) proved sufficiently useful in practice to warrant post-hoc documentation in RFC form.
>>> 
>>> However, this patch adds an option which, when enabled, makes fq_codel *non-compliant* with RFC-3168, specifically the requirement to treat ECT(0) and ECT(1) identically, unless conforming to another published RFC which permits different behaviour.
>>> 
>>> There is a path via RFC-8311 to experiment with alternative ECN semantics in this way, but the way ECT(1) is used by L4S is specifically mentioned as requiring a published RFC for public deployments.  The L4S Internet Drafts have *just failed* an IETF WGLC, which means they are *not* advancing to publication as RFCs in their current form.
>> 
>> [BB] Clarification of IETF process: A first Working Group Last Call (WGLC) is nearly always the beginning of the end of the IETF's RFC publication process. Usually the majority of detailed comments arrive during a WGLC. Then the draft has to be fixed, and then it goes either directly through to the next stage (in this case, an IETF-wide last call), or to another WGLC.
> 
> Further clarification: this is already the second WGLC for L4S.  The one two years previously (at Montreal) yielded a number of major technical objections, which remained unresolved as of this latest WGLC.
> 
>>> The primary reason for this failure is L4S' fundamental incompatibility with existing Internet traffic, despite its stated goal of general Internet deployment.
>> 
>> [BB] s/The primary reason /JM's primary objection /
>> There is no ranking of the reasons for more work being needed.  The WG had already developed a way to mitigate this objection. Otherwise, a WGLC would not have been started in the first place. Further work on this issue is now more likely to be wordsmithing.
> 
> Given that the objections cited by the TSVWG Chairs were technical in nature, and related specifically to the incompatibility between L4S and existing conventional traffic, it is clear to me that wordsmithing will *not* be sufficient to render L4S publishable in RFC form, nor deployable at Internet scale.  
> 
> To quote David Black, one of the aforementioned Chairs and also an author of RFC-8311:
> 
>> Two overall conclusions are that a) the WGLC has been productive, and shows significant continuing support for L4S, and b) the L4S drafts should be revised to address the WGLC concerns raised.   The WG chairs strongly suggest that the revisions include limiting the scope and impact of initial L4S experiments on RFC 3168 functionality (both existing usage and potential deployment) to ensure that the L4S experiments are safe to perform on the Internet, paying particular attention to potential impacts on networks and users that are not participating in the L4S experiments.
> 
> It is my recommendation to netdev to stay out of this ongoing mess, by rejecting this patch.
> 
> - Jonathan Morton
diff mbox series

Patch

diff --git a/include/net/codel.h b/include/net/codel.h
index a6e428f801350809322aaff08d92904e059c3b5a..5e8b181b76b829d6af3c57809d9bc5f0578dd112 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -102,6 +102,7 @@  static inline u32 codel_time_to_us(codel_time_t val)
  * @interval:	width of moving time window
  * @mtu:	device mtu, or minimal queue backlog in bytes.
  * @ecn:	is Explicit Congestion Notification enabled
+ * @ce_threshold_ect1: if ce_threshold only marks ECT(1) packets
  */
 struct codel_params {
 	codel_time_t	target;
@@ -109,6 +110,7 @@  struct codel_params {
 	codel_time_t	interval;
 	u32		mtu;
 	bool		ecn;
+	bool		ce_threshold_ect1;
 };
 
 /**
diff --git a/include/net/codel_impl.h b/include/net/codel_impl.h
index d289b91dcd65ecdc96dc0c9bf85d4a4be6961022..7af2c3eb3c43c24364519120aad5be77522854a6 100644
--- a/include/net/codel_impl.h
+++ b/include/net/codel_impl.h
@@ -54,6 +54,7 @@  static void codel_params_init(struct codel_params *params)
 	params->interval = MS2TIME(100);
 	params->target = MS2TIME(5);
 	params->ce_threshold = CODEL_DISABLED_THRESHOLD;
+	params->ce_threshold_ect1 = false;
 	params->ecn = false;
 }
 
@@ -246,9 +247,20 @@  static struct sk_buff *codel_dequeue(void *ctx,
 						    vars->rec_inv_sqrt);
 	}
 end:
-	if (skb && codel_time_after(vars->ldelay, params->ce_threshold) &&
-	    INET_ECN_set_ce(skb))
-		stats->ce_mark++;
+	if (skb && codel_time_after(vars->ldelay, params->ce_threshold)) {
+		bool set_ce = true;
+
+		if (params->ce_threshold_ect1) {
+			/* Note: if skb_get_dsfield() returns -1, following
+			 * gives INET_ECN_MASK, which is != INET_ECN_ECT_1.
+			 */
+			u8 ecn = skb_get_dsfield(skb) & INET_ECN_MASK;
+
+			set_ce = (ecn == INET_ECN_ECT_1);
+		}
+		if (set_ce && INET_ECN_set_ce(skb))
+			stats->ce_mark++;
+	}
 	return skb;
 }
 
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index ec88590b3198441f18cc9def7bd40c48f0bc82a1..6be9a84cccfa79bace1f3f7123d02f484b67a25e 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -840,6 +840,7 @@  enum {
 	TCA_FQ_CODEL_CE_THRESHOLD,
 	TCA_FQ_CODEL_DROP_BATCH_SIZE,
 	TCA_FQ_CODEL_MEMORY_LIMIT,
+	TCA_FQ_CODEL_CE_THRESHOLD_ECT1,
 	__TCA_FQ_CODEL_MAX
 };
 
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 2b5acb37587f7068e2d11fe842ec963a556f1eb1..a39830418434d4bb74d238373f63a4858230fce5 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -513,6 +513,7 @@  struct sta_info *sta_info_alloc(struct ieee80211_sub_if_data *sdata,
 	sta->cparams.target = MS2TIME(20);
 	sta->cparams.interval = MS2TIME(100);
 	sta->cparams.ecn = true;
+	sta->cparams.ce_threshold_ect1 = false;
 
 	sta_dbg(sdata, "Allocated STA %pM\n", sta->sta.addr);
 
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index bb0cd6d3d2c2749d54e26368fb2558beedea85c9..033d65d06eb136ff704cddd3ee950a5c3a5d9831 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -362,6 +362,7 @@  static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {
 	[TCA_FQ_CODEL_CE_THRESHOLD] = { .type = NLA_U32 },
 	[TCA_FQ_CODEL_DROP_BATCH_SIZE] = { .type = NLA_U32 },
 	[TCA_FQ_CODEL_MEMORY_LIMIT] = { .type = NLA_U32 },
+	[TCA_FQ_CODEL_CE_THRESHOLD_ECT1] = { .type = NLA_U8 },
 };
 
 static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
@@ -408,6 +409,9 @@  static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt,
 		q->cparams.ce_threshold = (val * NSEC_PER_USEC) >> CODEL_SHIFT;
 	}
 
+	if (tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1])
+		q->cparams.ce_threshold_ect1 = !!nla_get_u8(tb[TCA_FQ_CODEL_CE_THRESHOLD_ECT1]);
+
 	if (tb[TCA_FQ_CODEL_INTERVAL]) {
 		u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
 
@@ -544,10 +548,13 @@  static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
 			q->flows_cnt))
 		goto nla_put_failure;
 
-	if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD &&
-	    nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
-			codel_time_to_us(q->cparams.ce_threshold)))
-		goto nla_put_failure;
+	if (q->cparams.ce_threshold != CODEL_DISABLED_THRESHOLD) {
+		if (nla_put_u32(skb, TCA_FQ_CODEL_CE_THRESHOLD,
+				codel_time_to_us(q->cparams.ce_threshold)))
+			goto nla_put_failure;
+		if (nla_put_u8(skb, TCA_FQ_CODEL_CE_THRESHOLD_ECT1, q->cparams.ce_threshold_ect1))
+			goto nla_put_failure;
+	}
 
 	return nla_nest_end(skb, opts);