[RFC,v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

Message ID 1616404156-11772-1-git-send-email-linyunsheng@huawei.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,v3] net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Guessed tree name to be net-next
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cc_maintainers warning 1 maintainers not CCed: atenart@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 4511 this patch: 4511
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 4751 this patch: 4751
netdev/header_inline success

Commit Message

Yunsheng Lin March 22, 2021, 9:09 a.m. UTC
Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for a lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().
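
For context, the pre-patch TCQ_F_NOLOCK path in __dev_xmit_skb()
enqueues unconditionally; a simplified excerpt (matching the
pre-patch context lines in the diff below) looks like this:

	if (q->flags & TCQ_F_NOLOCK) {
		/* No bypass check: even an empty qdisc goes through
		 * a full enqueue and dequeue cycle.
		 */
		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
		qdisc_run(q);

		if (unlikely(to_free))
			kfree_skb_list(to_free);
		return rc;
	}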

This patch calls sch_direct_xmit() to transmit the skb directly to
the driver for an empty lockless qdisc too, which avoids the
enqueue and dequeue operations. qdisc->empty is set to false
whenever an skb is enqueued, see pfifo_fast_enqueue(), and is set
to true when dequeuing returns NULL, see pfifo_fast_dequeue().

There is a data race between enqueue/dequeue and the setting of
qdisc->empty, so qdisc->empty is only used as a hint; we still need
to call sch_may_need_requeuing() to check whether the queue is
really empty and whether there is a requeued skb, which has higher
priority than the current skb.

Performance in the ip_forward test increases by about 10% with this
patch.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Hi, Vladimir and Ahmad
	Please give this a test to see whether there are any
out-of-order packets with this patch, which removes the priv->lock
added in RFC v2.

There is a data race, shown below:

      CPU1                                   CPU2
qdisc_run_begin(q)                            .
        .                                q->enqueue()
sch_may_need_requeuing()                      .
    return true                               .
        .                                     .
        .                                     .
    q->enqueue()                              .

When the above happens, the skb enqueued by CPU1 is dequeued after
the skb enqueued by CPU2, because sch_may_need_requeuing() returns
true. Without qdisc bypass, CPU1 would have a better chance of
queueing its skb before CPU2.

This patch does not handle the above data race, because I view it
as similar to the following case:

Even if CPU1 and CPU2 simultaneously write skbs to two sockets that
both feed the same qdisc, there is no guarantee which skb will hit
the qdisc first, because many factors such as interrupts, softirqs,
cache misses and scheduling affect that.

So I hope the above data race will not cause problems for Vladimir
and Ahmad.
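
For reference, a minimal sketch of how the q->empty hint is
maintained on the dequeue side (simplified from pfifo_fast_dequeue()
in net/sched/sch_generic.c as of this patch; not the complete
function). The hint only flips to true after a dequeue attempt finds
nothing, which is why the bypass path re-checks the real queue state
via sch_may_need_requeuing():

	static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
	{
		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
		struct sk_buff *skb = NULL;
		int band;

		/* scan the priority bands in order */
		for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
			struct skb_array *q = band2list(priv, band);

			if (__skb_array_empty(q))
				continue;

			skb = __skb_array_consume(q);
		}

		if (likely(skb))
			qdisc_update_stats_at_dequeue(qdisc, skb);
		else
			/* hint only: a concurrent enqueue may have
			 * already added an skb by the time this runs
			 */
			WRITE_ONCE(qdisc->empty, true);

		return skb;
	}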
---
 include/net/pkt_sched.h   |  1 +
 include/net/sch_generic.h |  1 -
 net/core/dev.c            | 22 ++++++++++++++++++++++
 net/sched/sch_generic.c   | 11 +++++++++++
 4 files changed, 34 insertions(+), 1 deletion(-)

Comments

Vladimir Oltean March 22, 2021, 8 p.m. UTC | #1
Hi Yunsheng,

On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> [...]
> 
> Hi, Vladimir and Ahmad
> 	Please give this a test to see whether there are any
> out-of-order packets with this patch, which removes the priv->lock
> added in RFC v2.
> 
> [...]
> 
> So I hope the above data race will not cause problems for Vladimir
> and Ahmad.

Preliminary results on my test setup look fine, but please allow me to
run the canfdtest overnight, since as you say, races are still
theoretically possible.
Ahmad Fatoum March 23, 2021, 6:37 a.m. UTC | #2
Hi,

On 22.03.21 10:09, Yunsheng Lin wrote:
> [...]
> 
> Hi, Vladimir and Ahmad
> 	Please give this a test to see whether there are any
> out-of-order packets with this patch, which removes the priv->lock
> added in RFC v2.

Overnight test (10h, 64 mil frames) didn't see any out-of-order frames
between 2 FlexCANs on a dual core machine:

Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>

No performance measurements taken.

> [...]
Yunsheng Lin March 23, 2021, 11:34 a.m. UTC | #3
On 2021/3/23 14:37, Ahmad Fatoum wrote:
> Hi,
> 
> On 22.03.21 10:09, Yunsheng Lin wrote:
>> [...]
>>
>> Hi, Vladimir and Ahmad
>> 	Please give this a test to see whether there are any
>> out-of-order packets with this patch, which removes the priv->lock
>> added in RFC v2.
> 
> Overnight test (10h, 64 mil frames) didn't see any out-of-order frames
> between 2 FlexCANs on a dual core machine:
> 
> Tested-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
> 
> No performance measurements taken.

Thanks for the testing.
And I have done the performance measurements.

L3 forwarding testing improves from 1.09Mpps to 1.21Mpps, still
about a 10% improvement.

pktgen + dummy netdev:

 threads  without+this_patch   with+this_patch      delta
    1       2.56Mpps            3.11Mpps             +21%
    2       3.76Mpps            4.31Mpps             +14%
    4       5.51Mpps            5.53Mpps             +0.3%
    8       2.81Mpps            2.72Mpps             -3%
   16       2.24Mpps            2.22Mpps             -0.8%

Vladimir Oltean March 23, 2021, 11:39 a.m. UTC | #4
On Mon, Mar 22, 2021 at 10:00:33PM +0200, Vladimir Oltean wrote:
> Hi Yunsheng,
> 
> On Mon, Mar 22, 2021 at 05:09:16PM +0800, Yunsheng Lin wrote:
> > [...]
> > 
> > Hi, Vladimir and Ahmad
> > 	Please give this a test to see whether there are any
> > out-of-order packets with this patch, which removes the priv->lock
> > added in RFC v2.
> > 
> > [...]
> > 
> > So I hope the above data race will not cause problems for Vladimir
> > and Ahmad.
> 
> Preliminary results on my test setup look fine, but please allow me to
> run the canfdtest overnight, since as you say, races are still
> theoretically possible.

I haven't found any issues during the overnight test and until now.

Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan

Patch

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index f5c1bee..5715ddf 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -122,6 +122,7 @@  void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc);
 bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		     struct net_device *dev, struct netdev_queue *txq,
 		     spinlock_t *root_lock, bool validate);
+bool sch_may_need_requeuing(struct Qdisc *q);
 
 void __qdisc_run(struct Qdisc *q);
 
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f7a6e14..e08cc77 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -161,7 +161,6 @@  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 	if (qdisc->flags & TCQ_F_NOLOCK) {
 		if (!spin_trylock(&qdisc->seqlock))
 			return false;
-		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index be941ed..317180a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3796,9 +3796,31 @@  static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
+		if (q->flags & TCQ_F_CAN_BYPASS && READ_ONCE(q->empty) &&
+		    qdisc_run_begin(q)) {
+			if (sch_may_need_requeuing(q)) {
+				rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+				__qdisc_run(q);
+				qdisc_run_end(q);
+
+				goto no_lock_out;
+			}
+
+			qdisc_bstats_cpu_update(q, skb);
+
+			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
+			    !READ_ONCE(q->empty))
+				__qdisc_run(q);
+
+			qdisc_run_end(q);
+			return NET_XMIT_SUCCESS;
+		}
+
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		WRITE_ONCE(q->empty, false);
 		qdisc_run(q);
 
+no_lock_out:
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
 		return rc;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 44991ea..2145fdad 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -146,6 +146,8 @@  static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	}
 	if (lock)
 		spin_unlock(lock);
+
+	WRITE_ONCE(q->empty, false);
 	__netif_schedule(q);
 }
 
@@ -273,6 +275,15 @@  static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	return skb;
 }
 
+bool sch_may_need_requeuing(struct Qdisc *q)
+{
+	if (likely(skb_queue_empty(&q->gso_skb) &&
+		   !q->ops->peek(q)))
+		return false;
+
+	return true;
+}
+
 /*
  * Transmit possibly several skbs, and handle the return status as
  * required. Owning running seqcount bit guarantees that