[net,v1] net: taprio offload: enforce qdisc to netdev queue mapping

Message ID 20210511171829.17181-1-yannick.vignon@oss.nxp.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net,v1] net: taprio offload: enforce qdisc to netdev queue mapping

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 121 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/header_inline success Link

Commit Message

Yannick Vignon May 11, 2021, 5:18 p.m. UTC
From: Yannick Vignon <yannick.vignon@nxp.com>

Even though the taprio qdisc is designed for multiqueue devices, all the
queues still point to the same top-level taprio qdisc. This works and is
probably required for software taprio, but at least with offload taprio,
it has an undesirable side effect: because the whole qdisc is run when a
packet has to be sent, it allows packets in a best-effort class to be
processed in the context of a task sending higher priority traffic. If
there are packets left in the qdisc after that first run, the NET_TX
softirq is raised and gets executed immediately in the same process
context. As with any other softirq, it runs up to 10 times and for up to
2ms, during which the calling process is waiting for the sendmsg call (or
similar) to return. In my use case, that calling process is a real-time
task scheduled to send a packet every 2ms, so the long sendmsg calls are
leading to missed timeslots.

By attaching each netdev queue to its own qdisc, as it is done with
the "classic" mq qdisc, each traffic class can be processed independently
without touching the other classes. A high-priority process can then send
packets without getting stuck in the sendmsg call anymore.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---

This patch fixes an issue I observed while verifying the behavior of the
taprio qdisc in a real-time networking situation.
I am wondering if implementing separate taprio qdiscs for the software
and accelerated cases wouldn't be a better solution, but that would
require changes to the iproute2 package as well, and would break
backwards compatibility.

---
 net/sched/sch_taprio.c | 85 ++++++++++++++++++++++--------------------
 1 file changed, 45 insertions(+), 40 deletions(-)

Comments

Jakub Kicinski May 14, 2021, 3:32 p.m. UTC | #1
On Tue, 11 May 2021 19:18:29 +0200 Yannick Vignon wrote:
> From: Yannick Vignon <yannick.vignon@nxp.com>
> 
> Even though the taprio qdisc is designed for multiqueue devices, all the
> queues still point to the same top-level taprio qdisc. This works and is
> probably required for software taprio, but at least with offload taprio,
> it has an undesirable side effect: because the whole qdisc is run when a
> packet has to be sent, it allows packets in a best-effort class to be
> processed in the context of a task sending higher priority traffic. If
> there are packets left in the qdisc after that first run, the NET_TX
> softirq is raised and gets executed immediately in the same process
> context. As with any other softirq, it runs up to 10 times and for up to
> 2ms, during which the calling process is waiting for the sendmsg call (or
> similar) to return. In my use case, that calling process is a real-time
> task scheduled to send a packet every 2ms, so the long sendmsg calls are
> leading to missed timeslots.
> 
> By attaching each netdev queue to its own qdisc, as it is done with
> the "classic" mq qdisc, each traffic class can be processed independently
> without touching the other classes. A high-priority process can then send
> packets without getting stuck in the sendmsg call anymore.
> 
> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> ---
> 
> This patch fixes an issue I observed while verifying the behavior of the
> taprio qdisc in a real-time networking situation.
> I am wondering if implementing separate taprio qdiscs for the software
> and accelerated cases wouldn't be a better solution, but that would
> require changes to the iproute2 package as well, and would break
> backwards compatibility.

You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
CCing them now. Comments, anyone?

This looks like a very drastic change. Are you expecting the qdisc will
always be bypassed?

After a 1 minute look it seems like taprio is using device queues in
strict priority fashion. Maybe a different model is needed, but a qdisc
with:

enqueue()
{
	WARN_ONCE(1)
}

really doesn't look right to me.

Quoting the rest of the patch below for the benefit of those on CC.

> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 5c91df52b8c2..0bfb03052429 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -438,6 +438,11 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>  	struct Qdisc *child;
>  	int queue;
>  
> +	if (unlikely(FULL_OFFLOAD_IS_ENABLED(q->flags))) {
> +		WARN_ONCE(1, "Trying to enqueue skb into the root of a taprio qdisc configured with full offload\n");
> +		return qdisc_drop(skb, sch, to_free);
> +	}
> +
>  	queue = skb_get_queue_mapping(skb);
>  
>  	child = q->qdiscs[queue];
> @@ -529,23 +534,7 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
>  
>  static struct sk_buff *taprio_peek_offload(struct Qdisc *sch)
>  {
> -	struct taprio_sched *q = qdisc_priv(sch);
> -	struct net_device *dev = qdisc_dev(sch);
> -	struct sk_buff *skb;
> -	int i;
> -
> -	for (i = 0; i < dev->num_tx_queues; i++) {
> -		struct Qdisc *child = q->qdiscs[i];
> -
> -		if (unlikely(!child))
> -			continue;
> -
> -		skb = child->ops->peek(child);
> -		if (!skb)
> -			continue;
> -
> -		return skb;
> -	}
> +	WARN_ONCE(1, "Trying to peek into the root of a taprio qdisc configured with full offload\n");
>  
>  	return NULL;
>  }
> @@ -654,27 +643,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
>  
>  static struct sk_buff *taprio_dequeue_offload(struct Qdisc *sch)
>  {
> -	struct taprio_sched *q = qdisc_priv(sch);
> -	struct net_device *dev = qdisc_dev(sch);
> -	struct sk_buff *skb;
> -	int i;
> -
> -	for (i = 0; i < dev->num_tx_queues; i++) {
> -		struct Qdisc *child = q->qdiscs[i];
> -
> -		if (unlikely(!child))
> -			continue;
> -
> -		skb = child->ops->dequeue(child);
> -		if (unlikely(!skb))
> -			continue;
> -
> -		qdisc_bstats_update(sch, skb);
> -		qdisc_qstats_backlog_dec(sch, skb);
> -		sch->q.qlen--;
> -
> -		return skb;
> -	}
> +	WARN_ONCE(1, "Trying to dequeue from the root of a taprio qdisc configured with full offload\n");
>  
>  	return NULL;
>  }
> @@ -1759,6 +1728,37 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
>  	return taprio_change(sch, opt, extack);
>  }
>  
> +static void taprio_attach(struct Qdisc *sch)
> +{
> +	struct taprio_sched *q = qdisc_priv(sch);
> +	struct net_device *dev = qdisc_dev(sch);
> +	unsigned int ntx;
> +
> +	/* Attach underlying qdisc */
> +	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
> +		struct Qdisc *qdisc = q->qdiscs[ntx];
> +		struct Qdisc *old;
> +
> +		if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> +			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> +			old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
> +			if (ntx < dev->real_num_tx_queues)
> +				qdisc_hash_add(qdisc, false);
> +		} else {
> +			old = dev_graft_qdisc(qdisc->dev_queue, sch);
> +			qdisc_refcount_inc(sch);
> +		}
> +		if (old)
> +			qdisc_put(old);
> +	}
> +
> +	/* access to the child qdiscs is not needed in offload mode */
> +	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> +		kfree(q->qdiscs);
> +		q->qdiscs = NULL;
> +	}
> +}
> +
>  static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
>  					     unsigned long cl)
>  {
> @@ -1785,8 +1785,12 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
>  	if (dev->flags & IFF_UP)
>  		dev_deactivate(dev);
>  
> -	*old = q->qdiscs[cl - 1];
> -	q->qdiscs[cl - 1] = new;
> +	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> +		*old = dev_graft_qdisc(dev_queue, new);
> +	} else {
> +		*old = q->qdiscs[cl - 1];
> +		q->qdiscs[cl - 1] = new;
> +	}
>  
>  	if (new)
>  		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> @@ -2020,6 +2024,7 @@ static struct Qdisc_ops taprio_qdisc_ops __read_mostly = {
>  	.change		= taprio_change,
>  	.destroy	= taprio_destroy,
>  	.reset		= taprio_reset,
> +	.attach		= taprio_attach,
>  	.peek		= taprio_peek,
>  	.dequeue	= taprio_dequeue,
>  	.enqueue	= taprio_enqueue,
Vinicius Costa Gomes May 14, 2021, 8:40 p.m. UTC | #2
Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 11 May 2021 19:18:29 +0200 Yannick Vignon wrote:
>> From: Yannick Vignon <yannick.vignon@nxp.com>
>> 
>> Even though the taprio qdisc is designed for multiqueue devices, all the
>> queues still point to the same top-level taprio qdisc. This works and is
>> probably required for software taprio, but at least with offload taprio,
>> it has an undesirable side effect: because the whole qdisc is run when a
>> packet has to be sent, it allows packets in a best-effort class to be
>> processed in the context of a task sending higher priority traffic. If
>> there are packets left in the qdisc after that first run, the NET_TX
>> softirq is raised and gets executed immediately in the same process
>> context. As with any other softirq, it runs up to 10 times and for up to
>> 2ms, during which the calling process is waiting for the sendmsg call (or
>> similar) to return. In my use case, that calling process is a real-time
>> task scheduled to send a packet every 2ms, so the long sendmsg calls are
>> leading to missed timeslots.
>> 
>> By attaching each netdev queue to its own qdisc, as it is done with
>> the "classic" mq qdisc, each traffic class can be processed independently
>> without touching the other classes. A high-priority process can then send
>> packets without getting stuck in the sendmsg call anymore.
>> 
>> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
>> ---
>> 
>> This patch fixes an issue I observed while verifying the behavior of the
>> taprio qdisc in a real-time networking situation.
>> I am wondering if implementing separate taprio qdiscs for the software
>> and accelerated cases wouldn't be a better solution, but that would
>> require changes to the iproute2 package as well, and would break
>> backwards compatibility.
>
> You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
> CCing them now. Comments, anyone?

I guess I should suggest myself as maintainer, to reduce chances of this
happening again.

>
> This looks like a very drastic change. Are you expecting the qdisc will
> always be bypassed?

It will be bypassed only when running in full offload mode.

And it's kind of by design, in offload mode, the idea was: configure the
netdev traffic class to queue mapping, send the schedule to the hardware
and stay out of the way.

But as per Yannick's report, it seems that taprio doesn't stay enough
out of the way.

>
> After a 1 minute look it seems like taprio is using device queues in
> strict priority fashion. Maybe a different model is needed, but a qdisc
> with:
>
> enqueue()
> {
> 	WARN_ONCE(1)
> }
>
> really doesn't look right to me.

This patch takes the "stay out of the way" idea to the extreme. I kind
of like it / am not opposed to it; if I had had this idea a couple of
years ago, perhaps I would have used this same approach.

I am now wondering whether this idea locks us out of anything.

Anyway, a nicer alternative would exist if we had a way to tell the core
"this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
after init() runs.

>
> Quoting the rest of the patch below for the benefit of those on CC.
>
>> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
>> index 5c91df52b8c2..0bfb03052429 100644
>> --- a/net/sched/sch_taprio.c
>> +++ b/net/sched/sch_taprio.c
>> @@ -438,6 +438,11 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>>  	struct Qdisc *child;
>>  	int queue;
>>  
>> +	if (unlikely(FULL_OFFLOAD_IS_ENABLED(q->flags))) {
>> +		WARN_ONCE(1, "Trying to enqueue skb into the root of a taprio qdisc configured with full offload\n");
>> +		return qdisc_drop(skb, sch, to_free);
>> +	}
>> +
>>  	queue = skb_get_queue_mapping(skb);
>>  
>>  	child = q->qdiscs[queue];
>> @@ -529,23 +534,7 @@ static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
>>  
>>  static struct sk_buff *taprio_peek_offload(struct Qdisc *sch)
>>  {
>> -	struct taprio_sched *q = qdisc_priv(sch);
>> -	struct net_device *dev = qdisc_dev(sch);
>> -	struct sk_buff *skb;
>> -	int i;
>> -
>> -	for (i = 0; i < dev->num_tx_queues; i++) {
>> -		struct Qdisc *child = q->qdiscs[i];
>> -
>> -		if (unlikely(!child))
>> -			continue;
>> -
>> -		skb = child->ops->peek(child);
>> -		if (!skb)
>> -			continue;
>> -
>> -		return skb;
>> -	}
>> +	WARN_ONCE(1, "Trying to peek into the root of a taprio qdisc configured with full offload\n");
>>  
>>  	return NULL;
>>  }
>> @@ -654,27 +643,7 @@ static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
>>  
>>  static struct sk_buff *taprio_dequeue_offload(struct Qdisc *sch)
>>  {
>> -	struct taprio_sched *q = qdisc_priv(sch);
>> -	struct net_device *dev = qdisc_dev(sch);
>> -	struct sk_buff *skb;
>> -	int i;
>> -
>> -	for (i = 0; i < dev->num_tx_queues; i++) {
>> -		struct Qdisc *child = q->qdiscs[i];
>> -
>> -		if (unlikely(!child))
>> -			continue;
>> -
>> -		skb = child->ops->dequeue(child);
>> -		if (unlikely(!skb))
>> -			continue;
>> -
>> -		qdisc_bstats_update(sch, skb);
>> -		qdisc_qstats_backlog_dec(sch, skb);
>> -		sch->q.qlen--;
>> -
>> -		return skb;
>> -	}
>> +	WARN_ONCE(1, "Trying to dequeue from the root of a taprio qdisc configured with full offload\n");
>>  
>>  	return NULL;
>>  }
>> @@ -1759,6 +1728,37 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
>>  	return taprio_change(sch, opt, extack);
>>  }
>>  
>> +static void taprio_attach(struct Qdisc *sch)
>> +{
>> +	struct taprio_sched *q = qdisc_priv(sch);
>> +	struct net_device *dev = qdisc_dev(sch);
>> +	unsigned int ntx;
>> +
>> +	/* Attach underlying qdisc */
>> +	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
>> +		struct Qdisc *qdisc = q->qdiscs[ntx];
>> +		struct Qdisc *old;
>> +
>> +		if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
>> +			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
>> +			old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
>> +			if (ntx < dev->real_num_tx_queues)
>> +				qdisc_hash_add(qdisc, false);
>> +		} else {
>> +			old = dev_graft_qdisc(qdisc->dev_queue, sch);
>> +			qdisc_refcount_inc(sch);
>> +		}
>> +		if (old)
>> +			qdisc_put(old);
>> +	}
>> +
>> +	/* access to the child qdiscs is not needed in offload mode */
>> +	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
>> +		kfree(q->qdiscs);
>> +		q->qdiscs = NULL;
>> +	}
>> +}
>> +
>>  static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
>>  					     unsigned long cl)
>>  {
>> @@ -1785,8 +1785,12 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
>>  	if (dev->flags & IFF_UP)
>>  		dev_deactivate(dev);
>>  
>> -	*old = q->qdiscs[cl - 1];
>> -	q->qdiscs[cl - 1] = new;
>> +	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
>> +		*old = dev_graft_qdisc(dev_queue, new);
>> +	} else {
>> +		*old = q->qdiscs[cl - 1];
>> +		q->qdiscs[cl - 1] = new;
>> +	}
>>  
>>  	if (new)
>>  		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
>> @@ -2020,6 +2024,7 @@ static struct Qdisc_ops taprio_qdisc_ops __read_mostly = {
>>  	.change		= taprio_change,
>>  	.destroy	= taprio_destroy,
>>  	.reset		= taprio_reset,
>> +	.attach		= taprio_attach,
>>  	.peek		= taprio_peek,
>>  	.dequeue	= taprio_dequeue,
>>  	.enqueue	= taprio_enqueue,
>


Cheers,
Jakub Kicinski May 14, 2021, 9:01 p.m. UTC | #3
On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> > You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
> > CCing them now. Comments, anyone?  
> 
> I guess I should suggest myself as maintainer, to reduce chances of this
> happening again.

Yes, please.

> > This looks like a very drastic change. Are you expecting the qdisc will
> > always be bypassed?  
> 
> Only when running in full offload mode it will be bypassed.
> 
> And it's kind of by design, in offload mode, the idea was: configure the
> netdev traffic class to queue mapping, send the schedule to the hardware
> and stay out of the way.
> 
> But as per Yannick's report, it seems that taprio doesn't stay enough
> out of the way.
> 
> > After a 1 minute look it seems like taprio is using device queues in
> > strict priority fashion. Maybe a different model is needed, but a qdisc
> > with:
> >
> > enqueue()
> > {
> > 	WARN_ONCE(1)
> > }
> >
> > really doesn't look right to me.  
> 
> This patch takes the "stay out of the way" to the extreme, I kind of
> like it/I am not opposed to it, if I had this idea a couple of years
> ago, perhaps I would have used this same approach.

Sorry for my ignorance, but for TXTIME is the hardware capable of
reordering or the user is supposed to know how to send packets?

My biggest problem with this patch is that unless the application is
very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
servicing the queue when the application sends - the qdisc will not 
be bypassed, right?

> I am now thinking if this idea locks us out of anything.
> 
> Anyway, a nicer alternative would exist if we had a way to tell the core
> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
> after init() runs.

I don't think calling enqueue() and dequeue() is a problem. The problem
is that RT process does unrelated work.

> > Quoting the rest of the patch below for the benefit of those on CC.
Vinicius Costa Gomes May 14, 2021, 11:47 p.m. UTC | #4
Jakub Kicinski <kuba@kernel.org> writes:

> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> > You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
>> > CCing them now. Comments, anyone?  
>> 
>> I guess I should suggest myself as maintainer, to reduce chances of this
>> happening again.
>
> Yes, please.
>
>> > This looks like a very drastic change. Are you expecting the qdisc will
>> > always be bypassed?  
>> 
>> Only when running in full offload mode it will be bypassed.
>> 
>> And it's kind of by design, in offload mode, the idea was: configure the
>> netdev traffic class to queue mapping, send the schedule to the hardware
>> and stay out of the way.
>> 
>> But as per Yannick's report, it seems that taprio doesn't stay enough
>> out of the way.
>> 
>> > After a 1 minute look it seems like taprio is using device queues in
>> > strict priority fashion. Maybe a different model is needed, but a qdisc
>> > with:
>> >
>> > enqueue()
>> > {
>> > 	WARN_ONCE(1)
>> > }
>> >
>> > really doesn't look right to me.  
>> 
>> This patch takes the "stay out of the way" to the extreme, I kind of
>> like it/I am not opposed to it, if I had this idea a couple of years
>> ago, perhaps I would have used this same approach.
>
> Sorry for my ignorance, but for TXTIME is the hardware capable of
> reordering or the user is supposed to know how to send packets?

At least the hardware that I am familiar with doesn't reorder packets.

For TXTIME, we have ETF (the qdisc), which re-orders packets. The way
things work when taprio and ETF are used together is something like
this: taprio only has enough knowledge about TXTIME to drop packets that
would be transmitted outside their "transmission window" (e.g. if the
transmission window for traffic class 0 is only from 10 to 50 and the
TXTIME for a packet is 60, this packet is "invalid" and is dropped). And
then, when the packet is enqueued to the "child" ETF, it's re-ordered
and sent to the driver.

And this is something that this patch breaks: the ability to drop
those invalid packets (I really wouldn't like to do this verification
inside our drivers). Thanks for noticing this.

>
> My biggest problem with this patch is that unless the application is
> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
> servicing the queue when the application sends - the qdisc will not 
> be bypassed, right?
>
>> I am now thinking if this idea locks us out of anything.
>> 
>> Anyway, a nicer alternative would exist if we had a way to tell the core
>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>> after init() runs.
>
> I don't think calling enqueue() and dequeue() is a problem. The problem
> is that RT process does unrelated work.

That is true. But this seems like a much bigger (or at least more
"core") issue.


Cheers,
Yannick Vignon May 17, 2021, 5:06 p.m. UTC | #5
On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> 
>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
>>>> CCing them now. Comments, anyone?
>>>
>>> I guess I should suggest myself as maintainer, to reduce chances of this
>>> happening again.
>>
>> Yes, please.
>>
>>>> This looks like a very drastic change. Are you expecting the qdisc will
>>>> always be bypassed?
>>>
>>> Only when running in full offload mode it will be bypassed.
>>>
>>> And it's kind of by design, in offload mode, the idea was: configure the
>>> netdev traffic class to queue mapping, send the schedule to the hardware
>>> and stay out of the way.
>>>
>>> But as per Yannick's report, it seems that taprio doesn't stay enough
>>> out of the way.
>>>
>>>> After a 1 minute looks it seems like taprio is using device queues in
>>>> strict priority fashion. Maybe a different model is needed, but a qdisc
>>>> with:
>>>>
>>>> enqueue()
>>>> {
>>>> 	WARN_ONCE(1)
>>>> }
>>>>
>>>> really doesn't look right to me.
>>>

My idea was to follow the logic of the other qdiscs dealing with 
hardware multiqueue, namely mq and mqprio. Those do not have any 
enqueue/dequeue callbacks, but instead define an attach callback to map 
the child qdiscs to the HW queues. However, for taprio all those 
callbacks are already defined by the time we choose between software and 
full-offload, so the WARN_ONCE was more out of extra caution in case I 
missed something. If my understanding is correct however, it would 
probably make sense to put a BUG() instead, since those code paths 
should never trigger with this patch.

OTOH what did bother me a bit is that because I needed an attach 
callback for the full-offload case, I ended up duplicating some code 
from qdisc_graft in the attach callback, so that the software case would 
continue behaving as is.

Those complexities could be removed by pulling out the full-offload case 
into its own qdisc, but as I said it has other drawbacks.

>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>> like it/I am not opposed to it, if I had this idea a couple of years
>>> ago, perhaps I would have used this same approach.
>>
>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>> reordering or the user is supposed to know how to send packets?
> 
> At least the hardware that I am familiar with doesn't reorder packets.
> 
> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
> things work when taprio and ETF are used together is something like
> this: taprio only has enough knowledge about TXTIME to drop packets that
> would be transmitted outside their "transmission window" (e.g. for
> traffic class 0 the transmission window is only for 10 to 50, the TXTIME
> for a packet is 60, this packet is "invalid" and is dropped). And then
> when the packet is enqueued to the "child" ETF, it's re-ordered and then
> sent to the driver.
> 
> And this is something that this patch breaks, the ability of dropping
> those invalid packets (I really wouldn't like to do this verification
> inside our drivers). Thanks for noticing this.
> 

Hmm, indeed, I missed that check (we don't use ETF currently). I'm not 
sure of the best way forward, but here are a few thoughts:
. The problem only arises for full-offload taprio, not for the software 
or TxTime-assisted cases.
. I'm not sure mixing taprio(full-offload) with etf(no-offload) is very 
useful, at least with small gate intervals: it's likely you will miss 
your window when trying to send a packet at exactly the right time in 
software (I am usually testing taprio with a 2ms period and a 4µs 
interval for the RT stream).
. That leaves the case of taprio(full-offload) with etf(offload). Right 
now with the current stmmac driver config, a packet whose tstamp is 
outside its gate interval will be sent on the next interval (and block 
the queue).
. The stmmac hardware supports an expiryTime, currently unsupported in 
the stmmac driver, which I think could be used to drop packets whose 
tstamps are wrong (the packet would be dropped once the tstamp 
"expires"). We'd need to add an API for configuration though, and it 
should be noted that the stmmac config for this is global to the MAC, 
not per-queue (so a config through sch-etf would affect all queues).
. In general using taprio(full-offload) with etf(offload) will incur a 
small latency penalty: you need to post the packet before the ETF qdisc 
wakes up (plus some margin), and the ETF qdisc must wake up before the 
tx stamp (plus some margin). If possible (number of streams/apps < 
number of hw queues), it would be better to just use 
taprio(full-offload) alone, since the app will need to post the packet 
before the gate opens (so plus one margin, not 2).


>>
>> My biggest problem with this patch is that unless the application is
>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>> servicing the queue when the application sends - the qdisc will not
>> be bypassed, right?

See above; unless I'm mistaken, the "root" qdisc is never
enqueued/dequeued for multi-queue aware qdiscs.

>>> I am now thinking if this idea locks us out of anything.
>>>
>>> Anyway, a nicer alternative would exist if we had a way to tell the core
>>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>>> after init() runs.
>>

Again, I don't think enqueue/dequeue are called unless the HW queues 
point to the root qdisc. But this does raise an interesting point: the 
"scheduling" issue I observed was on the dequeue side, when all the 
queues were dequeued within the RT process context. If we could point 
the enqueue side to the taprio qdisc and the dequeue side to the child 
qdiscs, that would probably work (but I fear that would be a significant 
change in the way the qdisc code works).

>> I don't think calling enqueue() and dequeue() is a problem. The problem
>> is that RT process does unrelated work.
> 
> That is true. But this seems like a much bigger (or at least more
> "core") issue.
> 
> 
> Cheers,
>
Michael Walle May 17, 2021, 5:34 p.m. UTC | #6
Am 2021-05-17 19:06, schrieb Yannick Vignon:
> On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> 
>>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 
>>>>> years :/
>>>>> CCing them now. Comments, anyone?
>>>> 
>>>> I guess I should suggest myself as maintainer, to reduce chances of 
>>>> this
>>>> happening again.
>>> 
>>> Yes, please.
>>> 
>>>>> This looks like a very drastic change. Are you expecting the qdisc 
>>>>> will
>>>>> always be bypassed?
>>>> 
>>>> Only when running in full offload mode it will be bypassed.
>>>> 
>>>> And it's kind of by design, in offload mode, the idea was: configure 
>>>> the
>>>> netdev traffic class to queue mapping, send the schedule to the 
>>>> hardware
>>>> and stay out of the way.
>>>> 
>>>> But as per Yannick's report, it seems that taprio doesn't stay 
>>>> enough
>>>> out of the way.
>>>> 
>>>>> After a 1 minute looks it seems like taprio is using device queues 
>>>>> in
>>>>> strict priority fashion. Maybe a different model is needed, but a 
>>>>> qdisc
>>>>> with:
>>>>> 
>>>>> enqueue()
>>>>> {
>>>>> 	WARN_ONCE(1)
>>>>> }
>>>>> 
>>>>> really doesn't look right to me.
>>>> 
> 
> My idea was to follow the logic of the other qdiscs dealing with
> hardware multiqueue, namely mq and mqprio. Those do not have any
> enqueue/dequeue callbacks, but instead define an attach callback to
> map the child qdiscs to the HW queues. However, for taprio all those
> callbacks are already defined by the time we choose between software
> and full-offload, so the WARN_ONCE was more out of extra caution in
> case I missed something. If my understanding is correct however, it
> would probably make sense to put a BUG() instead, since those code
> paths should never trigger with this patch.
> 
> OTOH what did bother me a bit is that because I needed an attach
> callback for the full-offload case, I ended up duplicating some code
> from qdisc_graft in the attach callback, so that the software case
> would continue behaving as is.
> 
> Those complexities could be removed by pulling out the full-offload
> case into its own qdisc, but as I said it has other drawbacks.
> 
>>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>>> like it/I am not opposed to it, if I had this idea a couple of years
>>>> ago, perhaps I would have used this same approach.
>>> 
>>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>>> reordering or the user is supposed to know how to send packets?
>> 
>> At least the hardware that I am familiar with doesn't reorder packets.
>> 
>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>> things work when taprio and ETF are used together is something like
>> this: taprio only has enough knowledge about TXTIME to drop packets 
>> that
>> would be transmitted outside their "transmission window" (e.g. for
>> traffic class 0 the transmission window is only for 10 to 50, the 
>> TXTIME
>> for a packet is 60, this packet is "invalid" and is dropped). And then
>> when the packet is enqueued to the "child" ETF, it's re-ordered and 
>> then
>> sent to the driver.
>> 
>> And this is something that this patch breaks, the ability of dropping
>> those invalid packets (I really wouldn't like to do this verification
>> inside our drivers). Thanks for noticing this.

Is this really how taprio should behave? I mean, should the frame
really be dropped by taprio if the TXTIME is outside the window? Why
would taprio bother with TXTIME at all?

> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not
> sure of the best way forward, but here are a few thoughts:
> . The problem only arises for full-offload taprio, not for the
> software or TxTime-assisted cases.
> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is
> very useful, at least with small gate intervals: it's likely you will
> miss your window when trying to send a packet at exactly the right
> time in software (I am usually testing taprio with a 2ms period and a
> 4µs interval for the RT stream).
> . That leaves the case of taprio(full-offload) with etf(offload).
> Right now with the current stmmac driver config, a packet whose tstamp
> is outside its gate interval will be sent on the next interval (and
> block the queue).

If both are offloaded, are the h/w queues reordered if there is a
new frame with an earlier TXTIME? Or will the queue be blocked by a
frame with a later transmission time?

TBH I've never understood why the ETF qdisc will drop frames with
TXTIME in the past. I mean it makes sense with hardware offloading. But
without it, how can you make sure the kernel will wake up the queue
at just the right time to be able to send it? You can juggle the delta
parameter, but on the one hand you don't want to send too early, and on
the other hand the frame will likely be dropped if delta is too small.
What am I missing here?

> . The stmmac hardware supports an expiryTime, currently unsupported in
> the stmmac driver, which I think could be used to drop packets whose
> tstamps are wrong (the packet would be dropped once the tstamp
> "expires"). We'd need to add an API for configuration though, and it
> should be noted that the stmmac config for this is global to the MAC,
> not per-queue (so a config through sch-etf would affect all queues).
> . In general using taprio(full-offload) with etf(offload) will incur a
> small latency penalty: you need to post the packet before the ETF
> qdisc wakes up (plus some margin), and the ETF qdisc must wake up
> before the tx stamp (plus some margin). If possible (number of
> streams/apps < number of hw queues), it would be better to just use
> taprio(full-offload) alone, since the app will need to post the packet
> before the gate opens (so plus one margin, not 2).
> 
> 
>>> 
>>> My biggest problem with this patch is that unless the application is
>>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>>> servicing the queue when the application sends - the qdisc will not
>>> be bypassed, right?
> 
> See above, unless I'm mistaken the "root" qdisc is never
> enqueued/dequeued for multi-queue aware qdiscs.
> 
>>>> I am now thinking if this idea locks us out of anything.
>>>> 
>>>> Anyway, a nicer alternative would exist if we had a way to tell the 
>>>> core
>>>> "this qdisc should be bypassed" (i.e. don't call 
>>>> enqueue()/dequeue())
>>>> after init() runs.
>>> 
> 
> Again, I don't think enqueue/dequeue are called unless the HW queues
> point to the root qdisc. But this does raise an interesting point: the
> "scheduling" issue I observed was on the dequeue side, when all the
> queues were dequeued within the RT process context. If we could point
> the enqueue side to the taprio qdisc and the dequeue side to the child
> qdiscs, that would probably work (but I fear that would be a
> significant change in the way the qdisc code works).
> 
>>> I don't think calling enqueue() and dequeue() is a problem. The 
>>> problem
>>> is that RT process does unrelated work.
>> 
>> That is true. But this seems like a much bigger (or at least more
>> "core") issue.
>> 
>> 
>> Cheers,
>>
Vinicius Costa Gomes May 17, 2021, 9:38 p.m. UTC | #7
Michael Walle <michael@walle.cc> writes:

>>> 
>>> At least the hardware that I am familiar with doesn't reorder packets.
>>> 
>>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>>> things work when taprio and ETF are used together is something like
>>> this: taprio only has enough knowledge about TXTIME to drop packets 
>>> that
>>> would be transmitted outside their "transmission window" (e.g. for
>>> traffic class 0 the transmission window is only for 10 to 50, the 
>>> TXTIME
>>> for a packet is 60, this packet is "invalid" and is dropped). And then
>>> when the packet is enqueued to the "child" ETF, it's re-ordered and 
>>> then
>>> sent to the driver.
>>> 
>>> And this is something that this patch breaks, the ability of dropping
>>> those invalid packets (I really wouldn't like to do this verification
>>> inside our drivers). Thanks for noticing this.
>
> Is this really how the taprio should behave? I mean, should the frame
> really be dropped by taprio if TXTIME is outside of the window? Why
> would taprio bother with TXTIME at all?

Yeah, I understand what you are saying, I also didn't like the idea of
having TXTIME knowledge inside taprio. This trade off was made because
the HW we have is very... particular about the launchtime, if the
launchtime for a packet ends after the "close" time for that queue, that
queue can end up stuck (for multiple seconds in some extreme cases).

The concern is that hardware from other vendors could be just as particular.

Also, what helped convince me was that txtime is a feature of the
socket/skb, and if taprio (or whatever else) can use it to make the
driver's life easier, then let's use it.

>
>> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not
>> sure of the best way forward, but here are a few thoughts:
>> . The problem only arises for full-offload taprio, not for the
>> software or TxTime-assisted cases.
>> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is
>> very useful, at least with small gate intervals: it's likely you will
>> miss your window when trying to send a packet at exactly the right
>> time in software (I am usually testing taprio with a 2ms period and a
>> 4µs interval for the RT stream).
>> . That leaves the case of taprio(full-offload) with etf(offload).
>> Right now with the current stmmac driver config, a packet whose tstamp
>> is outside its gate interval will be sent on the next interval (and
>> block the queue).
>
> If both are offloaded, are the h/w queues reordered if there is a
> new frame with an earlier TXTIME? Or will the queue be blocked by a
> frame with a later transmission time?

Even if offloaded, ETF will re-order packets. In a bit more detail, ETF
re-orders packets whose launchtime is more than "delta" nanoseconds in
the future.

>
> TBH I've never understood why the ETF qdisc will drop frames with
> TXTIME in the past. I mean it makes sense with hardware offloading. But
> without it, how can you make sure the kernel will wake up the queue
> at just the right time to be able to send it? You can juggle the delta
> parameter, but on the one hand you don't want to send too early, and on
> the other hand the frame will likely be dropped if delta is too small.
> What am I missing here?

I don't think you are missing anything.

Just some background, ETF was written thinking mostly about it being
offloaded, the non-offloaded/"software" mode was a best-effort
implementation of that idea.

That is, if you have use cases where non-offloaded ETF would work better
if it didn't drop "late" packets, I would be happy to hear more.


Cheers,
Vinicius Costa Gomes May 17, 2021, 10:42 p.m. UTC | #8
Yannick Vignon <yannick.vignon@oss.nxp.com> writes:

> On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> 
>>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
>>>>> CCing them now. Comments, anyone?
>>>>
>>>> I guess I should suggest myself as maintainer, to reduce chances of this
>>>> happening again.
>>>
>>> Yes, please.
>>>
>>>>> This looks like a very drastic change. Are you expecting the qdisc will
>>>>> always be bypassed?
>>>>
>>>> Only when running in full offload mode it will be bypassed.
>>>>
>>>> And it's kind of by design, in offload mode, the idea was: configure the
>>>> netdev traffic class to queue mapping, send the schedule to the hardware
>>>> and stay out of the way.
>>>>
>>>> But as per Yannick's report, it seems that taprio doesn't stay enough
>>>> out of the way.
>>>>
>>>>> After a 1 minute looks it seems like taprio is using device queues in
>>>>> strict priority fashion. Maybe a different model is needed, but a qdisc
>>>>> with:
>>>>>
>>>>> enqueue()
>>>>> {
>>>>> 	WARN_ONCE(1)
>>>>> }
>>>>>
>>>>> really doesn't look right to me.
>>>>
>
> My idea was to follow the logic of the other qdiscs dealing with 
> hardware multiqueue, namely mq and mqprio. Those do not have any 
> enqueue/dequeue callbacks, but instead define an attach callback to map 
> the child qdiscs to the HW queues. However, for taprio all those 
> callbacks are already defined by the time we choose between software and 
> full-offload, so the WARN_ONCE was more out of extra caution in case I 
> missed something. If my understanding is correct however, it would 
> probably make sense to put a BUG() instead, since those code paths 
> should never trigger with this patch.
>
> OTOH what did bother me a bit is that because I needed an attach 
> callback for the full-offload case, I ended up duplicating some code 
> from qdisc_graft in the attach callback, so that the software case would 
> continue behaving as is.
>
> Those complexities could be removed by pulling out the full-offload case 
> into its own qdisc, but as I said it has other drawbacks.
>
>>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>>> like it/I am not opposed to it, if I had this idea a couple of years
>>>> ago, perhaps I would have used this same approach.
>>>
>>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>>> reordering or the user is supposed to know how to send packets?
>> 
>> At least the hardware that I am familiar with doesn't reorder packets.
>> 
>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>> things work when taprio and ETF are used together is something like
>> this: taprio only has enough knowledge about TXTIME to drop packets that
>> would be transmitted outside their "transmission window" (e.g. for
>> traffic class 0 the transmission window is only for 10 to 50, the TXTIME
>> for a packet is 60, this packet is "invalid" and is dropped). And then
>> when the packet is enqueued to the "child" ETF, it's re-ordered and then
>> sent to the driver.
>> 
>> And this is something that this patch breaks, the ability of dropping
>> those invalid packets (I really wouldn't like to do this verification
>> inside our drivers). Thanks for noticing this.
>> 
>
> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not 
> sure of the best way forward, but here are a few thoughts:
> . The problem only arises for full-offload taprio, not for the software 
> or TxTime-assisted cases.
> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is very 
> useful, at least with small gate intervals: it's likely you will miss 
> your window when trying to send a packet at exactly the right time in 
> software (I am usually testing taprio with a 2ms period and a 4µs 
> interval for the RT stream).
> . That leaves the case of taprio(full-offload) with etf(offload). Right 
> now with the current stmmac driver config, a packet whose tstamp is 
> outside its gate interval will be sent on the next interval (and block 
> the queue).

This is the case that is a bit problematic with our hardware. (full taprio
offload + ETF offload).

> . The stmmac hardware supports an expiryTime, currently unsupported in 
> the stmmac driver, which I think could be used to drop packets whose 
> tstamps are wrong (the packet would be dropped once the tstamp 
> "expires"). We'd need to add an API for configuration though, and it 
> should be noted that the stmmac config for this is global to the MAC, 
> not per-queue (so a config through sch-etf would affect all queues).
> . In general using taprio(full-offload) with etf(offload) will incur a 
> small latency penalty: you need to post the packet before the ETF qdisc 
> wakes up (plus some margin), and the ETF qdisc must wake up before the 
> tx stamp (plus some margin). If possible (number of streams/apps < 
> number of hw queues), it would be better to just use 
> taprio(full-offload) alone, since the app will need to post the packet 
> before the gate opens (so plus one margin, not 2).

It really depends on the workload, and how the schedule is organized,
but yeah, that might be possible (for some cases :-)).

>
>
>>>
>>> My biggest problem with this patch is that unless the application is
>>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>>> servicing the queue when the application sends - the qdisc will not
>>> be bypassed, right?
>
> See above, unless I'm mistaken the "root" qdisc is never 
> enqueued/dequeued for multi-queue aware qdiscs.
>

That's true, mq and mqprio don't have enqueue()/dequeue(), but I think
that's more a detail of their implementation than a rule (that no
multiqueue-aware root qdisc should implement enqueue()/dequeue()).

That is, from my point of view, there's nothing wrong in having a root
qdisc that's also a shaper/scheduler.

>>>> I am now thinking if this idea locks us out of anything.
>>>>
>>>> Anyway, a nicer alternative would exist if we had a way to tell the core
>>>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>>>> after init() runs.
>>>
>
> Again, I don't think enqueue/dequeue are called unless the HW queues 
> point to the root qdisc. But this does raise an interesting point: the 
> "scheduling" issue I observed was on the dequeue side, when all the 
> queues were dequeued within the RT process context. If we could point 
> the enqueue side to the taprio qdisc and the dequeue side to the child 
> qdiscs, that would probably work (but I fear that would be a significant 
> change in the way the qdisc code works).

I am wondering if there's a simpler solution. Right now (as you pointed
out) taprio traverses all queues during dequeue(); that's the problem.

What I am thinking is if doing something like:

     queue_index = sch->dev_queue - netdev_get_tx_queue(dev, 0);

To get the queue "index" and then only dequeuing packets for that queue,
would solve the issue. (A bit ugly, I know).

I just wanted to write the idea down to see if anyone else could find
any big issues with it. I will try to play with it a bit, if no-one
beats me to it.

>
>>> I don't think calling enqueue() and dequeue() is a problem. The problem
>>> is that RT process does unrelated work.
>> 
>> That is true. But this seems like a much bigger (or at least more
>> "core") issue.
>> 
>> 
>> Cheers,
>> 
>


Cheers,
Yannick Vignon May 18, 2021, 2:34 p.m. UTC | #9
On 5/18/2021 12:42 AM, Vinicius Costa Gomes wrote:
> Yannick Vignon <yannick.vignon@oss.nxp.com> writes:
> 
>> On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>
>>>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 years :/
>>>>>> CCing them now. Comments, anyone?
>>>>>
>>>>> I guess I should suggest myself as maintainer, to reduce chances of this
>>>>> happening again.
>>>>
>>>> Yes, please.
>>>>
>>>>>> This looks like a very drastic change. Are you expecting the qdisc will
>>>>>> always be bypassed?
>>>>>
>>>>> Only when running in full offload mode it will be bypassed.
>>>>>
>>>>> And it's kind of by design, in offload mode, the idea was: configure the
>>>>> netdev traffic class to queue mapping, send the schedule to the hardware
>>>>> and stay out of the way.
>>>>>
>>>>> But as per Yannick's report, it seems that taprio doesn't stay enough
>>>>> out of the way.
>>>>>
>>>>>> After a 1 minute looks it seems like taprio is using device queues in
>>>>>> strict priority fashion. Maybe a different model is needed, but a qdisc
>>>>>> with:
>>>>>>
>>>>>> enqueue()
>>>>>> {
>>>>>> 	WARN_ONCE(1)
>>>>>> }
>>>>>>
>>>>>> really doesn't look right to me.
>>>>>
>>
>> My idea was to follow the logic of the other qdiscs dealing with
>> hardware multiqueue, namely mq and mqprio. Those do not have any
>> enqueue/dequeue callbacks, but instead define an attach callback to map
>> the child qdiscs to the HW queues. However, for taprio all those
>> callbacks are already defined by the time we choose between software and
>> full-offload, so the WARN_ONCE was more out of extra caution in case I
>> missed something. If my understanding is correct however, it would
>> probably make sense to put a BUG() instead, since those code paths
>> should never trigger with this patch.
>>
>> OTOH what did bother me a bit is that because I needed an attach
>> callback for the full-offload case, I ended up duplicating some code
>> from qdisc_graft in the attach callback, so that the software case would
>> continue behaving as is.
>>
>> Those complexities could be removed by pulling out the full-offload case
>> into its own qdisc, but as I said it has other drawbacks.
>>
>>>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>>>> like it/I am not opposed to it, if I had this idea a couple of years
>>>>> ago, perhaps I would have used this same approach.
>>>>
>>>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>>>> reordering or the user is supposed to know how to send packets?
>>>
>>> At least the hardware that I am familiar with doesn't reorder packets.
>>>
>>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>>> things work when taprio and ETF are used together is something like
>>> this: taprio only has enough knowledge about TXTIME to drop packets that
>>> would be transmitted outside their "transmission window" (e.g. for
>>> traffic class 0 the transmission window is only for 10 to 50, the TXTIME
>>> for a packet is 60, this packet is "invalid" and is dropped). And then
>>> when the packet is enqueued to the "child" ETF, it's re-ordered and then
>>> sent to the driver.
>>>
>>> And this is something that this patch breaks, the ability of dropping
>>> those invalid packets (I really wouldn't like to do this verification
>>> inside our drivers). Thanks for noticing this.
>>>
>>
>> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not
>> sure of the best way forward, but here are a few thoughts:
>> . The problem only arises for full-offload taprio, not for the software
>> or TxTime-assisted cases.
>> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is very
>> useful, at least with small gate intervals: it's likely you will miss
>> your window when trying to send a packet at exactly the right time in
>> software (I am usually testing taprio with a 2ms period and a 4µs
>> interval for the RT stream).
>> . That leaves the case of taprio(full-offload) with etf(offload). Right
>> now with the current stmmac driver config, a packet whose tstamp is
>> outside its gate interval will be sent on the next interval (and block
>> the queue).
> 
> This is the case that is a bit problematic with our hardware. (full taprio
> offload + ETF offload).

Based on your previous comment to Michael, this is starting to look a 
lot like a work-around for a hardware bug. Would moving it to the 
corresponding driver really be out of the question?
Note: there are currently only 4 drivers implementing ETF (including 2 
from Intel), so validating their behavior with late packets would likely 
be doable if needed.

>> . The stmmac hardware supports an expiryTime, currently unsupported in
>> the stmmac driver, which I think could be used to drop packets whose
>> tstamps are wrong (the packet would be dropped once the tstamp
>> "expires"). We'd need to add an API for configuration though, and it
>> should be noted that the stmmac config for this is global to the MAC,
>> not per-queue (so a config through sch-etf would affect all queues).
>> . In general using taprio(full-offload) with etf(offload) will incur a
>> small latency penalty: you need to post the packet before the ETF qdisc
>> wakes up (plus some margin), and the ETF qdisc must wake up before the
>> tx stamp (plus some margin). If possible (number of streams/apps <
>> number of hw queues), it would be better to just use
>> taprio(full-offload) alone, since the app will need to post the packet
>> before the gate opens (so plus one margin, not 2).
> 
> It really depends on the workload, and how the schedule is organized,
> but yeah, that might be possible (for some cases :-)).
> 
>>
>>
>>>>
>>>> My biggest problem with this patch is that unless the application is
>>>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>>>> servicing the queue when the application sends - the qdisc will not
>>>> be bypassed, right?
>>
>> See above, unless I'm mistaken the "root" qdisc is never
>> enqueued/dequeued for multi-queue aware qdiscs.
>>
> 
> That's true, mq and mqprio don't have enqueue()/dequeue(), but I think
> that's more a detail of their implementation than a rule (that no
> multiqueue-aware root qdisc should implement enqueue()/dequeue()).
> 
> That is, from my point of view, there's nothing wrong in having a root
> qdisc that's also a shaper/scheduler.
> 
>>>>> I am now thinking if this idea locks us out of anything.
>>>>>
>>>>> Anyway, a nicer alternative would exist if we had a way to tell the core
>>>>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>>>>> after init() runs.
>>>>
>>
>> Again, I don't think enqueue/dequeue are called unless the HW queues
>> point to the root qdisc. But this does raise an interesting point: the
>> "scheduling" issue I observed was on the dequeue side, when all the
>> queues were dequeued within the RT process context. If we could point
>> the enqueue side to the taprio qdisc and the dequeue side to the child
>> qdiscs, that would probably work (but I fear that would be a significant
>> change in the way the qdisc code works).
> 
> I am wondering if there's a simpler solution, right now (as you pointed
> out) taprio traverses all queues during dequeue(), that's the problem.
> 
> What I am thinking is if doing something like:
> 
>       sch->dev_queue - netdev_get_tx_queue(dev, 0);
> 
> To get the queue "index" and then only dequeuing packets for that queue,
> would solve the issue. (A bit ugly, I know).
> 
> I just wanted to write the idea down to see if anyone else could find
> any big issues with it. I will try to play with it a bit, if no-one
> beats me to it.

I looked into it this morning, but I'm not sure how that would work.
With the current code, when qdisc_run() is executed sch->dev_queue will
always point to queue 0 (sch being the taprio qdisc), so you'd need my
proposed patch or something similar for sch to point to the child
qdiscs, but at that point you'd also be bypassing the taprio qdisc on
enqueue.

If removing the TxTime check inside taprio_enqueue() remains difficult,
I am wondering if keeping my patch with the change below would work (not
tested):
***********************************************************************
@@ -4206,7 +4206,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, 
struct net_device *sb_dev)
                 skb_dst_force(skb);

         txq = netdev_core_pick_tx(dev, skb, sb_dev);
-       q = rcu_dereference_bh(txq->qdisc);
+       if (dev->qdisc->enqueue)
+               q = dev->qdisc;
+       else
+               q = rcu_dereference_bh(txq->qdisc);

         trace_net_dev_queue(skb);
         if (q->enqueue) {
***********************************************************************
One risk though for multiqueue-aware qdiscs that define an enqueue (only 
taprio is in that case today) is that the child qdisc selected by 
enqueue would not match the txq from netdev_core_pick_tx, so in the end 
we would call qdisc_run on the wrong qdisc...



>>
>>>> I don't think calling enqueue() and dequeue() is a problem. The problem
>>>> is that RT process does unrelated work.
>>>
>>> That is true. But this seems like a much bigger (or at least more
>>> "core") issue.
>>>
>>>
>>> Cheers,
>>>
>>
> 
> 
> Cheers,
>
Yannick Vignon July 9, 2021, 9:41 a.m. UTC | #10
On 5/18/2021 4:34 PM, Yannick Vignon wrote:
> On 5/18/2021 12:42 AM, Vinicius Costa Gomes wrote:
>> Yannick Vignon <yannick.vignon@oss.nxp.com> writes:
>>
>>> On 5/15/2021 1:47 AM, Vinicius Costa Gomes wrote:
>>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>>
>>>>> On Fri, 14 May 2021 13:40:58 -0700 Vinicius Costa Gomes wrote:
>>>>>> Jakub Kicinski <kuba@kernel.org> writes:
>>>>>>> You haven't CCed anyone who worked on this Qdisc in the last 2 
>>>>>>> years :/
>>>>>>> CCing them now. Comments, anyone?
>>>>>>
>>>>>> I guess I should suggest myself as maintainer, to reduce chances 
>>>>>> of this
>>>>>> happening again.
>>>>>
>>>>> Yes, please.
>>>>>
>>>>>>> This looks like a very drastic change. Are you expecting the 
>>>>>>> qdisc will
>>>>>>> always be bypassed?
>>>>>>
>>>>>> Only when running in full offload mode it will be bypassed.
>>>>>>
>>>>>> And it's kind of by design, in offload mode, the idea was: 
>>>>>> configure the
>>>>>> netdev traffic class to queue mapping, send the schedule to the 
>>>>>> hardware
>>>>>> and stay out of the way.
>>>>>>
>>>>>> But as per Yannick's report, it seems that taprio doesn't stay enough
>>>>>> out of the way.
>>>>>>
>>>>>>> After a 1 minute looks it seems like taprio is using device 
>>>>>>> queues in
>>>>>>> strict priority fashion. Maybe a different model is needed, but a 
>>>>>>> qdisc
>>>>>>> with:
>>>>>>>
>>>>>>> enqueue()
>>>>>>> {
>>>>>>>     WARN_ONCE(1)
>>>>>>> }
>>>>>>>
>>>>>>> really doesn't look right to me.
>>>>>>
>>>
>>> My idea was to follow the logic of the other qdiscs dealing with
>>> hardware multiqueue, namely mq and mqprio. Those do not have any
>>> enqueue/dequeue callbacks, but instead define an attach callback to map
>>> the child qdiscs to the HW queues. However, for taprio all those
>>> callbacks are already defined by the time we choose between software and
>>> full-offload, so the WARN_ONCE was more out of extra caution in case I
>>> missed something. If my understanding is correct however, it would
>>> probably make sense to put a BUG() instead, since those code paths
>>> should never trigger with this patch.
>>>
>>> OTOH what did bother me a bit is that because I needed an attach
>>> callback for the full-offload case, I ended up duplicating some code
>>> from qdisc_graft in the attach callback, so that the software case would
>>> continue behaving as is.
>>>
>>> Those complexities could be removed by pulling out the full-offload case
>>> into its own qdisc, but as I said it has other drawbacks.
>>>
>>>>>> This patch takes the "stay out of the way" to the extreme, I kind of
>>>>>> like it/I am not opposed to it, if I had this idea a couple of years
>>>>>> ago, perhaps I would have used this same approach.
>>>>>
>>>>> Sorry for my ignorance, but for TXTIME is the hardware capable of
>>>>> reordering or the user is supposed to know how to send packets?
>>>>
>>>> At least the hardware that I am familiar with doesn't reorder packets.
>>>>
>>>> For TXTIME, we have ETF (the qdisc) that re-order packets. The way
>>>> things work when taprio and ETF are used together is something like
>>>> this: taprio only has enough knowledge about TXTIME to drop packets 
>>>> that
>>>> would be transmitted outside their "transmission window" (e.g. for
>>>> traffic class 0 the transmission window is only for 10 to 50, the 
>>>> TXTIME
>>>> for a packet is 60, this packet is "invalid" and is dropped). And then
>>>> when the packet is enqueued to the "child" ETF, it's re-ordered and 
>>>> then
>>>> sent to the driver.
>>>>
>>>> And this is something that this patch breaks, the ability of dropping
>>>> those invalid packets (I really wouldn't like to do this verification
>>>> inside our drivers). Thanks for noticing this.
>>>>
>>>
>>> Hmm, indeed, I missed that check (we don't use ETF currently). I'm not
>>> sure of the best way forward, but here are a few thoughts:
>>> . The problem only arises for full-offload taprio, not for the software
>>> or TxTime-assisted cases.
>>> . I'm not sure mixing taprio(full-offload) with etf(no-offload) is very
>>> useful, at least with small gate intervals: it's likely you will miss
>>> your window when trying to send a packet at exactly the right time in
>>> software (I am usually testing taprio with a 2ms period and a 4µs
>>> interval for the RT stream).
>>> . That leaves the case of taprio(full-offload) with etf(offload). Right
>>> now with the current stmmac driver config, a packet whose tstamp is
>>> outside its gate interval will be sent on the next interval (and block
>>> the queue).
>>
>> This is the case that is a bit problematic with our hardware. (full 
>> taprio
>> offload + ETF offload).
> 
> Based on your previous comment to Michael, this is starting to look a 
> lot like a work-around for a hardware bug. Would moving it to the 
> corresponding driver really be out of the question?
> Note: there are currently only 4 drivers implementing ETF (including 2 
> from Intel), so validating their behavior with late packets would likely 
> be doable if needed.
> 
>>> . The stmmac hardware supports an expiryTime, currently unsupported in
>>> the stmmac driver, which I think could be used to drop packets whose
>>> tstamps are wrong (the packet would be dropped once the tstamp
>>> "expires"). We'd need to add an API for configuration though, and it
>>> should be noted that the stmmac config for this is global to the MAC,
>>> not per-queue (so a config through sch-etf would affect all queues).
>>> . In general using taprio(full-offload) with etf(offload) will incur a
>>> small latency penalty: you need to post the packet before the ETF qdisc
>>> wakes up (plus some margin), and the ETF qdisc must wake up before the
>>> tx stamp (plus some margin). If possible (number of streams/apps <
>>> number of hw queues), it would be better to just use
>>> taprio(full-offload) alone, since the app will need to post the packet
>>> before the gate opens (so plus one margin, not 2).
>>
>> It really depends on the workload, and how the schedule is organized,
>> but yeah, that might be possible (for some cases :-)).
>>
>>>
>>>
>>>>>
>>>>> My biggest problem with this patch is that unless the application is
>>>>> very careful that WARN_ON_ONCE(1) will trigger. E.g. if softirq is
>>>>> servicing the queue when the application sends - the qdisc will not
>>>>> be bypassed, right?
>>>
>>> See above, unless I'm mistaken the "root" qdisc is never
>>> enqueued/dequeued for multi-queue aware qdiscs.
>>>
>>
>> That's true, mq and mqprio don't have enqueue()/dequeue(), but I think
>> that's more a detail of their implementation than a rule (that no
>> multiqueue-aware root qdisc should implement enqueue()/dequeue()).
>>
>> That is, from my point of view, there's nothing wrong in having a root
>> qdisc that's also a shaper/scheduler.
>>
>>>>>> I am now thinking if this idea locks us out of anything.
>>>>>>
>>>>>> Anyway, a nicer alternative would exist if we had a way to tell 
>>>>>> the core
>>>>>> "this qdisc should be bypassed" (i.e. don't call enqueue()/dequeue())
>>>>>> after init() runs.
>>>>>
>>>
>>> Again, I don't think enqueue/dequeue are called unless the HW queues
>>> point to the root qdisc. But this does raise an interesting point: the
>>> "scheduling" issue I observed was on the dequeue side, when all the
>>> queues were dequeued within the RT process context. If we could point
>>> the enqueue side to the taprio qdisc and the dequeue side to the child
>>> qdiscs, that would probably work (but I fear that would be a significant
>>> change in the way the qdisc code works).
>>
>> I am wondering if there's a simpler solution, right now (as you pointed
>> out) taprio traverses all queues during dequeue(), that's the problem.
>>
>> What I am thinking is if doing something like:
>>
>>       sch->dev_queue - netdev_get_tx_queue(dev, 0);
>>
>> To get the queue "index" and then only dequeuing packets for that queue,
>> would solve the issue. (A bit ugly, I know).
>>
>> I just wanted to write the idea down to see if anyone else could find
>> any big issues with it. I will try to play with it a bit, if no-one
>> beats me to it.
> 
> I looked into it this morning, but I'm not sure how that would work. 
> With the current code, when qdisc_run() is executed, sch->dev_queue 
> always points to queue 0 (sch being the taprio qdisc), so you'd need my 
> proposed patch or something similar for sch to point to the child 
> qdiscs, but at that point you'd also be bypassing the taprio qdisc on 
> enqueue.
> 
> If removing the TxTime check inside taprio_queue remains difficult, I am 
> wondering if keeping my patch with the change below would work (not 
> tested):
> ***********************************************************************
> @@ -4206,7 +4206,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>                  skb_dst_force(skb);
> 
>          txq = netdev_core_pick_tx(dev, skb, sb_dev);
> -       q = rcu_dereference_bh(txq->qdisc);
> +       if (dev->qdisc->enqueue)
> +               q = dev->qdisc;
> +       else
> +               q = rcu_dereference_bh(txq->qdisc);
> 
>          trace_net_dev_queue(skb);
>          if (q->enqueue) {
> ***********************************************************************
> One risk, though, for multiqueue-aware qdiscs that define an enqueue() 
> (today only taprio does) is that the child qdisc selected by enqueue() 
> would not match the txq returned by netdev_core_pick_tx(), so in the 
> end we would call qdisc_run() on the wrong qdisc...
> 

I haven't had much time to work on this, but the "naive" approach I 
described just above didn't work as is.

More importantly, I thought this patch was still under discussion, yet 
it seems it was merged into the netdev tree on May 13th, even before 
this thread started...
Did I miss something in the submission process?
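
For context, full offload is what the flags 0x2 bit requests at configuration time; the patch below only changes behavior when that bit is set. A sketch of such a setup (the interface name, priority map and schedule values are illustrative, not taken from this thread, and the NIC driver must implement the tc-taprio offload hooks):

```shell
# Illustrative only: request taprio full offload (flags 0x2) on eth0
# with 3 traffic classes and a simple 1 ms cycle.
tc qdisc replace dev eth0 parent root handle 100 taprio \
	num_tc 3 \
	map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
	queues 1@0 1@1 2@2 \
	base-time 1000000000 \
	sched-entry S 01 300000 \
	sched-entry S 02 300000 \
	sched-entry S 04 400000 \
	flags 0x2
```

With this patch applied, in full offload mode the per-TXQ children are grafted directly onto the TX queues by taprio_attach(), so they show up under the queues in `tc qdisc show dev eth0` and the root taprio enqueue()/dequeue() are no longer exercised on the hot path.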

> 
>>>
>>>>> I don't think calling enqueue() and dequeue() is a problem. The 
>>>>> problem
>>>>> is that RT process does unrelated work.
>>>>
>>>> That is true. But this seems like a much bigger (or at least more
>>>> "core") issue.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>
>>
>>
>> Cheers,
>>
>
Patch

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 5c91df52b8c2..0bfb03052429 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -438,6 +438,11 @@  static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	struct Qdisc *child;
 	int queue;
 
+	if (unlikely(FULL_OFFLOAD_IS_ENABLED(q->flags))) {
+		WARN_ONCE(1, "Trying to enqueue skb into the root of a taprio qdisc configured with full offload\n");
+		return qdisc_drop(skb, sch, to_free);
+	}
+
 	queue = skb_get_queue_mapping(skb);
 
 	child = q->qdiscs[queue];
@@ -529,23 +534,7 @@  static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
 
 static struct sk_buff *taprio_peek_offload(struct Qdisc *sch)
 {
-	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
-	struct sk_buff *skb;
-	int i;
-
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		struct Qdisc *child = q->qdiscs[i];
-
-		if (unlikely(!child))
-			continue;
-
-		skb = child->ops->peek(child);
-		if (!skb)
-			continue;
-
-		return skb;
-	}
+	WARN_ONCE(1, "Trying to peek into the root of a taprio qdisc configured with full offload\n");
 
 	return NULL;
 }
@@ -654,27 +643,7 @@  static struct sk_buff *taprio_dequeue_soft(struct Qdisc *sch)
 
 static struct sk_buff *taprio_dequeue_offload(struct Qdisc *sch)
 {
-	struct taprio_sched *q = qdisc_priv(sch);
-	struct net_device *dev = qdisc_dev(sch);
-	struct sk_buff *skb;
-	int i;
-
-	for (i = 0; i < dev->num_tx_queues; i++) {
-		struct Qdisc *child = q->qdiscs[i];
-
-		if (unlikely(!child))
-			continue;
-
-		skb = child->ops->dequeue(child);
-		if (unlikely(!skb))
-			continue;
-
-		qdisc_bstats_update(sch, skb);
-		qdisc_qstats_backlog_dec(sch, skb);
-		sch->q.qlen--;
-
-		return skb;
-	}
+	WARN_ONCE(1, "Trying to dequeue from the root of a taprio qdisc configured with full offload\n");
 
 	return NULL;
 }
@@ -1759,6 +1728,37 @@  static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 	return taprio_change(sch, opt, extack);
 }
 
+static void taprio_attach(struct Qdisc *sch)
+{
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx;
+
+	/* Attach underlying qdisc */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct Qdisc *qdisc = q->qdiscs[ntx];
+		struct Qdisc *old;
+
+		if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+			old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+			if (ntx < dev->real_num_tx_queues)
+				qdisc_hash_add(qdisc, false);
+		} else {
+			old = dev_graft_qdisc(qdisc->dev_queue, sch);
+			qdisc_refcount_inc(sch);
+		}
+		if (old)
+			qdisc_put(old);
+	}
+
+	/* access to the child qdiscs is not needed in offload mode */
+	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+		kfree(q->qdiscs);
+		q->qdiscs = NULL;
+	}
+}
+
 static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
 					     unsigned long cl)
 {
@@ -1785,8 +1785,12 @@  static int taprio_graft(struct Qdisc *sch, unsigned long cl,
 	if (dev->flags & IFF_UP)
 		dev_deactivate(dev);
 
-	*old = q->qdiscs[cl - 1];
-	q->qdiscs[cl - 1] = new;
+	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+		*old = dev_graft_qdisc(dev_queue, new);
+	} else {
+		*old = q->qdiscs[cl - 1];
+		q->qdiscs[cl - 1] = new;
+	}
 
 	if (new)
 		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
@@ -2020,6 +2024,7 @@  static struct Qdisc_ops taprio_qdisc_ops __read_mostly = {
 	.change		= taprio_change,
 	.destroy	= taprio_destroy,
 	.reset		= taprio_reset,
+	.attach		= taprio_attach,
 	.peek		= taprio_peek,
 	.dequeue	= taprio_dequeue,
 	.enqueue	= taprio_enqueue,