
[net-next,2/2] net_sched: sch_fq: add the ability to offload pacing

Message ID 20240930152304.472767-3-edumazet@google.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net: prepare pacing offload support

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 48 this patch: 48
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 3 maintainers not CCed: xiyou.wangcong@gmail.com jiri@resnulli.us jhs@mojatatu.com
netdev/build_clang success Errors and warnings before: 102 this patch: 102
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 14 this patch: 13
netdev/checkpatch warning WARNING: line length of 102 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet Sept. 30, 2024, 3:23 p.m. UTC
From: Jeffrey Ji <jeffreyji@google.com>

Some network devices have the ability to offload EDT (Earliest
Departure Time) which is the model used for TCP pacing and FQ packet
scheduler.

Some of them implement the timing wheel mechanism described in
https://saeed.github.io/files/carousel-sigcomm17.pdf
with an associated 'timing wheel horizon'.

This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ
packet scheduler.

Its value is capped by the device max_pacing_offload_horizon,
added in the prior patch.

It allows FQ to let packets within the pacing offload horizon
be delivered to the device, which will handle the needed
delay without host involvement.

Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_fq.c             | 33 +++++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

Comments

Willem de Bruijn Sept. 30, 2024, 5:33 p.m. UTC | #1
Eric Dumazet wrote:
> From: Jeffrey Ji <jeffreyji@google.com>
> 
> Some network devices have the ability to offload EDT (Earliest
> Departure Time) which is the model used for TCP pacing and FQ packet
> scheduler.
> 
> Some of them implement the timing wheel mechanism described in
> https://saeed.github.io/files/carousel-sigcomm17.pdf
> with an associated 'timing wheel horizon'.
> 
> This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ
> packet scheduler.
> 
> Its value is capped by the device max_pacing_offload_horizon,
> added in the prior patch.
> 
> It allows FQ to let packets within the pacing offload horizon
> be delivered to the device, which will handle the needed
> delay without host involvement.
> 
> Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

> @@ -1100,6 +1105,17 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
>  		WRITE_ONCE(q->horizon_drop,
>  			   nla_get_u8(tb[TCA_FQ_HORIZON_DROP]));
>  
> +	if (tb[TCA_FQ_OFFLOAD_HORIZON]) {
> +		u64 offload_horizon = (u64)NSEC_PER_USEC *
> +				      nla_get_u32(tb[TCA_FQ_OFFLOAD_HORIZON]);
> +
> +		if (offload_horizon <= qdisc_dev(sch)->max_pacing_offload_horizon) {
> +			WRITE_ONCE(q->offload_horizon, offload_horizon);

Do we expect that an administrator will ever set the offload
horizon different from the device horizon?

It might be useful to have a wildcard value that means "match
hardware ability"?

Both here and in the device, realistic values will likely always be
MSEC scale?

> +		} else {
> +			NL_SET_ERR_MSG_MOD(extack, "invalid offload_horizon");
> +			err = -EINVAL;
> +		}
> +	}
>  	if (!err) {
>  
>  		sch_tree_unlock(sch);
Eric Dumazet Sept. 30, 2024, 5:55 p.m. UTC | #2
On Mon, Sep 30, 2024 at 7:33 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Eric Dumazet wrote:
> > From: Jeffrey Ji <jeffreyji@google.com>
> >
> > Some network devices have the ability to offload EDT (Earliest
> > Departure Time) which is the model used for TCP pacing and FQ packet
> > scheduler.
> >
> > Some of them implement the timing wheel mechanism described in
> > https://saeed.github.io/files/carousel-sigcomm17.pdf
> > with an associated 'timing wheel horizon'.
> >
> > This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ
> > packet scheduler.
> >
> > Its value is capped by the device max_pacing_offload_horizon,
> > added in the prior patch.
> >
> > It allows FQ to let packets within the pacing offload horizon
> > be delivered to the device, which will handle the needed
> > delay without host involvement.
> >
> > Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> > @@ -1100,6 +1105,17 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> >               WRITE_ONCE(q->horizon_drop,
> >                          nla_get_u8(tb[TCA_FQ_HORIZON_DROP]));
> >
> > +     if (tb[TCA_FQ_OFFLOAD_HORIZON]) {
> > +             u64 offload_horizon = (u64)NSEC_PER_USEC *
> > +                                   nla_get_u32(tb[TCA_FQ_OFFLOAD_HORIZON]);
> > +
> > +             if (offload_horizon <= qdisc_dev(sch)->max_pacing_offload_horizon) {
> > +                     WRITE_ONCE(q->offload_horizon, offload_horizon);
>
> Do we expect that an administrator will ever set the offload
> horizon different from the device horizon?

We want to be able to eventually deal with firmware/hardware bugs,
like lack of backpressure on the timer wheel, which probably has some
kind of capacity limit.

I think it is much better to let the admin choose: possibly
disabling the whole thing, or enabling it for a small horizon like
2500 ns.

>
> It might be useful to have a wildcard value that means "match
> hardware ability"?

"ip link" will show the device max capability.
Same story for the gso_max_size attribute: we do not automatically
set it to dev->tso_max_size.

I do not think we have a precedent for a qdisc/link attribute where
the kernel automatically caps the user choice with the device
capability.

>
> Both here and in the device, realistic values will likely always be
> MSEC scale?

msec granularity proved to be not good enough for the TCP stack; we
already moved to usec.

The fast path compares in ns units; storing the value in ns removes
per-packet multiplies.
Willem de Bruijn Sept. 30, 2024, 6:17 p.m. UTC | #3
Eric Dumazet wrote:
> On Mon, Sep 30, 2024 at 7:33 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > Eric Dumazet wrote:
> > > From: Jeffrey Ji <jeffreyji@google.com>
> > >
> > > Some network devices have the ability to offload EDT (Earliest
> > > Departure Time) which is the model used for TCP pacing and FQ packet
> > > scheduler.
> > >
> > > Some of them implement the timing wheel mechanism described in
> > > https://saeed.github.io/files/carousel-sigcomm17.pdf
> > > with an associated 'timing wheel horizon'.
> > >
> > > This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ
> > > packet scheduler.
> > >
> > > Its value is capped by the device max_pacing_offload_horizon,
> > > added in the prior patch.
> > >
> > > It allows FQ to let packets within the pacing offload horizon
> > > be delivered to the device, which will handle the needed
> > > delay without host involvement.
> > >
> > > Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > > @@ -1100,6 +1105,17 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt,
> > >               WRITE_ONCE(q->horizon_drop,
> > >                          nla_get_u8(tb[TCA_FQ_HORIZON_DROP]));
> > >
> > > +     if (tb[TCA_FQ_OFFLOAD_HORIZON]) {
> > > +             u64 offload_horizon = (u64)NSEC_PER_USEC *
> > > +                                   nla_get_u32(tb[TCA_FQ_OFFLOAD_HORIZON]);
> > > +
> > > +             if (offload_horizon <= qdisc_dev(sch)->max_pacing_offload_horizon) {
> > > +                     WRITE_ONCE(q->offload_horizon, offload_horizon);
> >
> > Do we expect that an administrator will ever set the offload
> > horizon different from the device horizon?
> 
> We want to be able to eventually deal with firmware/hardware bugs,
> like lack of backpressure on the timer wheel, which probably has some
> kind of capacity limit.
> 
> I think it is much better to let the admin choose: possibly
> disabling the whole thing, or enabling it for a small horizon like
> 2500 ns.
> 
> >
> > It might be useful to have a wildcard value that means "match
> > hardware ability"?
> 
> "ip link" will show the device max capability.
> Same story for the gso_max_size attribute: we do not automatically
> set it to dev->tso_max_size.
> 
> I do not think we have a precedent for a qdisc/link attribute where
> the kernel automatically caps the user choice with the device
> capability.
>
> >
> > Both here and in the device, realistic values will likely always be
> > MSEC scale?
> 
> msec granularity proved to be not good enough for the TCP stack; we
> already moved to usec.
> 
> The fast path compares in ns units; storing the value in ns removes
> per-packet multiplies.

Ack on all points. Thanks Eric.
Willem de Bruijn Sept. 30, 2024, 6:38 p.m. UTC | #4
Eric Dumazet wrote:
> From: Jeffrey Ji <jeffreyji@google.com>
> 
> Some network devices have the ability to offload EDT (Earliest
> Departure Time) which is the model used for TCP pacing and FQ packet
> scheduler.
> 
> Some of them implement the timing wheel mechanism described in
> https://saeed.github.io/files/carousel-sigcomm17.pdf
> with an associated 'timing wheel horizon'.
> 
> This patch adds the TCA_FQ_OFFLOAD_HORIZON attribute to the FQ
> packet scheduler.
> 
> Its value is capped by the device max_pacing_offload_horizon,
> added in the prior patch.
> 
> It allows FQ to let packets within the pacing offload horizon
> be delivered to the device, which will handle the needed
> delay without host involvement.
> 
> Signed-off-by: Jeffrey Ji <jeffreyji@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Willem de Bruijn <willemb@google.com>

Patch

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a3cd0c2dc9956f8c873f35c7b33b2bcf93feb2f1..25a9a47001cdde59cf052ea658ba1ac26f4c34e8 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -836,6 +836,8 @@  enum {
 
 	TCA_FQ_WEIGHTS,		/* Weights for each band */
 
+	TCA_FQ_OFFLOAD_HORIZON, /* dequeue paced packets within this horizon immediately (us units) */
+
 	__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 19a49af5a9e527ed0371a3bb96e0113755375eac..aeabf45c9200c4aea75fb6c63986e37eddfea5f9 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -111,6 +111,7 @@  struct fq_perband_flows {
 struct fq_sched_data {
 /* Read mostly cache line */
 
+	u64		offload_horizon;
 	u32		quantum;
 	u32		initial_quantum;
 	u32		flow_refill_delay;
@@ -299,7 +300,7 @@  static void fq_gc(struct fq_sched_data *q,
 }
 
 /* Fast path can be used if :
- * 1) Packet tstamp is in the past.
+ * 1) Packet tstamp is in the past, or within the pacing offload horizon.
  * 2) FQ qlen == 0   OR
  *   (no flow is currently eligible for transmit,
  *    AND fast path queue has less than 8 packets)
@@ -314,7 +315,7 @@  static bool fq_fastpath_check(const struct Qdisc *sch, struct sk_buff *skb,
 	const struct fq_sched_data *q = qdisc_priv(sch);
 	const struct sock *sk;
 
-	if (fq_skb_cb(skb)->time_to_send > now)
+	if (fq_skb_cb(skb)->time_to_send > now + q->offload_horizon)
 		return false;
 
 	if (sch->q.qlen != 0) {
@@ -595,15 +596,18 @@  static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 	unsigned long sample;
 	struct rb_node *p;
 
-	if (q->time_next_delayed_flow > now)
+	if (q->time_next_delayed_flow > now + q->offload_horizon)
 		return;
 
 	/* Update unthrottle latency EWMA.
 	 * This is cheap and can help diagnosing timer/latency problems.
 	 */
 	sample = (unsigned long)(now - q->time_next_delayed_flow);
-	q->unthrottle_latency_ns -= q->unthrottle_latency_ns >> 3;
-	q->unthrottle_latency_ns += sample >> 3;
+	if ((long)sample > 0) {
+		q->unthrottle_latency_ns -= q->unthrottle_latency_ns >> 3;
+		q->unthrottle_latency_ns += sample >> 3;
+	}
+	now += q->offload_horizon;
 
 	q->time_next_delayed_flow = ~0ULL;
 	while ((p = rb_first(&q->delayed)) != NULL) {
@@ -687,7 +691,7 @@  static struct sk_buff *fq_dequeue(struct Qdisc *sch)
 		u64 time_next_packet = max_t(u64, fq_skb_cb(skb)->time_to_send,
 					     f->time_next_packet);
 
-		if (now < time_next_packet) {
+		if (now + q->offload_horizon < time_next_packet) {
 			head->first = f->next;
 			f->time_next_packet = time_next_packet;
 			fq_flow_set_throttled(q, f);
@@ -925,6 +929,7 @@  static const struct nla_policy fq_policy[TCA_FQ_MAX + 1] = {
 	[TCA_FQ_HORIZON_DROP]		= { .type = NLA_U8 },
 	[TCA_FQ_PRIOMAP]		= NLA_POLICY_EXACT_LEN(sizeof(struct tc_prio_qopt)),
 	[TCA_FQ_WEIGHTS]		= NLA_POLICY_EXACT_LEN(FQ_BANDS * sizeof(s32)),
+	[TCA_FQ_OFFLOAD_HORIZON]	= { .type = NLA_U32 },
 };
 
 /* compress a u8 array with all elems <= 3 to an array of 2-bit fields */
@@ -1100,6 +1105,17 @@  static int fq_change(struct Qdisc *sch, struct nlattr *opt,
 		WRITE_ONCE(q->horizon_drop,
 			   nla_get_u8(tb[TCA_FQ_HORIZON_DROP]));
 
+	if (tb[TCA_FQ_OFFLOAD_HORIZON]) {
+		u64 offload_horizon = (u64)NSEC_PER_USEC *
+				      nla_get_u32(tb[TCA_FQ_OFFLOAD_HORIZON]);
+
+		if (offload_horizon <= qdisc_dev(sch)->max_pacing_offload_horizon) {
+			WRITE_ONCE(q->offload_horizon, offload_horizon);
+		} else {
+			NL_SET_ERR_MSG_MOD(extack, "invalid offload_horizon");
+			err = -EINVAL;
+		}
+	}
 	if (!err) {
 
 		sch_tree_unlock(sch);
@@ -1183,6 +1199,7 @@  static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 		.bands = FQ_BANDS,
 	};
 	struct nlattr *opts;
+	u64 offload_horizon;
 	u64 ce_threshold;
 	s32 weights[3];
 	u64 horizon;
@@ -1199,6 +1216,9 @@  static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	horizon = READ_ONCE(q->horizon);
 	do_div(horizon, NSEC_PER_USEC);
 
+	offload_horizon = READ_ONCE(q->offload_horizon);
+	do_div(offload_horizon, NSEC_PER_USEC);
+
 	if (nla_put_u32(skb, TCA_FQ_PLIMIT,
 			READ_ONCE(sch->limit)) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT,
@@ -1224,6 +1244,7 @@  static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_TIMER_SLACK,
 			READ_ONCE(q->timer_slack)) ||
 	    nla_put_u32(skb, TCA_FQ_HORIZON, (u32)horizon) ||
+	    nla_put_u32(skb, TCA_FQ_OFFLOAD_HORIZON, (u32)offload_horizon) ||
 	    nla_put_u8(skb, TCA_FQ_HORIZON_DROP,
 		       READ_ONCE(q->horizon_drop)))
 		goto nla_put_failure;